Submitting Data to NeMO - Overview

Data submitted to NeMO falls into three categories:

  • Public - data to be immediately distributed openly and freely to the wider research community,
  • Embargo - data to be held back, or embargoed, until a specific date, at which point it will be released openly and freely to the wider research community,
  • Restricted - Controlled access data to be distributed only to an approved group of users due to consent restrictions, e.g. human data. Often restricted datasets contain a combination of private (raw reads, alignments) and public (counts, peaks) datatypes. In such instances, users will submit all data to the Restricted area and we will work with them to obtain consent for releasing any files publicly.

In this document:

 

Requesting a NeMO Aspera Account

New data submitters can register for an account at nemoarchive.org/register. Upon doing so, please send an email to nemo@som.umaryland.edu with the following information:

  • requested username
  • grant under which data was generated
  • PI of lab in which data was generated
  • whether data is public, for embargo or restricted*

*Restricted data submitters are asked to request account usernames with an appended "-restricted", e.g. doe-restricted. If you will be submitting both public and controlled access data, you will need to create two accounts, one without and one with the "-restricted" tag.

Someone from the NeMO team will follow up with you regarding any other questions about your data or the submission process.

 

Submitting a File Manifest

All submissions to NeMO Archives, whether public or private, begin with upload of a tab delimited (tsv) file manifest including MD5 checksums through the User Dashboard available to logged in users at https://nemoarchive.org/login. More instructions on submission are below. This enables us to collect some basic information in order to organize and process your data properly for release. This is not a comprehensive metadata collection and should be straightforward to populate.

Detection and validation of a properly formatted manifest will trigger a message providing the Aspera submission path for your data, see below.

Manifest format

File manifest template in tsv format

Excel-friendly manifest file with field descriptions and controlled vocabularies - Please contact NeMO to discuss addition of new terms to the any of the controlled vocabularies.

All manifest files must contain the following fields. Do not delete headers for unused, optional columns:

  • File name - must be unique, described below
  • Sample ID
  • Program (e.g. BICCN)
  • Sub-program (e.g. U19 Zeng)
  • Lab
  • Species
  • Modality
  • Technique
  • Subspecimen type
  • Data type
  • File type
  • Access
  • Checksum - described below
  • Anatomical site (optional for current release cycle)
  • Counts data only fields (optional):
    • Counts pipeline
  • Aligned data only fields (optional):
    • Read aligner
    • Genome build
    • Gene set release
  • BRAIN initiative only fields (optional, but highly recommended):
    • BCDC Project name
    • BCDC data collection
  • Controlled access only fields (optional):
    • Usage (e.g. Limited/GRU/NA)
    • Institutional Certification ID
    • Donor id/name
    • Tissue provider

The manifest file must contain one row, including an MD5 checksum, for every file included in the submission. If you're submitting a tarball, the manifest should contain one row for every file within the tarball. For submission of multiple, or chunked, tarballs, there is no need for a manifest per tarball, all component files can be included in a single manifest.

The first column should contain the file name only, with no path information. File name must be unique. In the case of sample-specific bundles, for example mex bundles containing a genes, features and barcodes file, prepend the sample id to your file name. We will strip that off during processing and place the three files in a sample id directory, as expected by many downstream tools. As an example, three files are submitted to NeMO with the following sample names:

sample1_genes.tsv
sample1_features.tsv
sample1_barcodes.tsv

These files will be released as three files within the sample1 subdirectory:

sample1/genes.tsv
sample1/features.tsv
sample1/barcodes.tsv

Important:
If populating the manifest in Excel or Numbers, save as a tab delimited file. The manifest file can have any prefix, however the base filename must be manifest.tsv.  

Manifest submission and validation

To submit your manifest for validation, log into NeMO using your NeMO Aspera credentials. Once logged in, select Dashboard in the top navigation. This will take you to a page with links to make a submission, or view your current submissions (described below).

Upon upload, file manifests will be immediately validated for:

  • complete header row
  • presence of checksums and all other required fields
  • proper controlled vocabularies
  • detection of unexpected file extensions
  • presence of all components of various bundle types

Validation failure in any row(s) will result in an error message displayed on the submission page, accompanied by an invalid manifest validation email with additional details about the error, if applicable. Once you have corrected the errors, your updated validation file is submitted in the same way and undergoes the same validation.

Validation success will result in notification, typically within minutes, via email, of the Aspera path to which you will submit your data, as described below. Be sure to submit to the provided path, as any other submissions will be ignored and routinely deleted.

 

Submitting Data

NeMO accepts both raw and derived data types of the file extensions listed here. If you plan to submit data extensions not currently on this list, please let us know ahead of time to avoid manifest upload and/or processing delays, by emailing nemo@som.umaryland.edu.

Depending on the size of your submission, it may be preferable to submit data files individually or within a tarball. Either is fine. We don't pre-define any specific directory structure for your submission, however we ask that you do not submit tarballs within a tarball, as that will delay processing. Files will be reorganized during ingest based on metadata provided in the manifest.

Uploading using Aspera

In order to submit data to NeMO, you will first need to download and install the IBM Aspera Command Line Interface, available from the IBM website. You will find the download link under Featured client software > IBM Aspera Command Line Interface. Select the most recent release for your operating system. You will need to create an IBM account if you do not already have one. Installation instructions are available here.

Once installed, the ascp utility will be available for use at the command line. Have your NeMO Aspera credentials handy, as commands to initiate an upload will result in a prompt for your password.

Uploading with the Aspera client uses the following syntax:

$ ascp [-l <Maximum download speed>] [-k 2] [-m <Minimum download speed>] [-Q] [-T] \
<path to local file or directory> <username>@aspera.nemoarchive.org \
    provided/path/to/NeMO/directory/

All parameters between the [ ... ] brackets are optional parameters described here:

l - Allows for the user to set a maximum upload/download speed that aspera should attempt to stay at or below for the duration of the transfer. A speed in Megabits must be provided with this flag.
k - Allow for resumable data transmission in case an interruption occurs.
m - Allows for the user to set a minimum upload/download sped that aspera should attempt to stay at or above for the duration of the transfer. A speed in Megabits must be provided with this flag.
Q - Turns adaptive rate on. Adaptive rate controls the speed of aspera with a goal of not dominating the bandwidth available. Very useful on busy networks that may have other transfers ongoing.
T - Turns encryption off. Turning encryption off will allow for a maximum throughput transfer but should not be provided if data being uploaded is sensitive.

In the following example, my_data.tar.gz is being transferred to NeMO,

$ ascp -l 100M -k 2 -QT /home/user/my_data/my_data.tar.gz user123@aspera.nemoarchive.org:v34kltf7/

where v34kltf7 is the directory name that was provided to user123 upon manifest validation. Uploading a directory of files would not require any changes as Aspera will recognize that a folder is being transferred and will recursively step through ensuring all files found in the directory are transferred.

$ ascp -l 100M -k 2 -QT /home/user/my_data/ user123@aspera.nemoarchive.org:v34kltf7/

It is not possible to delete or update data using the Aspera CLI, however submitting a file of the same name will overwrite the previous file. Should you wish to delete data, reach out to a NeMO team member at nemo@som.umaryland.edu.

NEW: Once your data upload has completed, please run one more Aspera upload of a single, empty txt file titled 'DONE'. This will prompt our system to begin scanning your submission.

 

Processing Workflow

 

submit_dataflow

Upon successful bundling of data to the release area (either public of restricted), submitters will receive a "Successful ingest" email containing your validated manifest with additional fields capturing final file location and NeMO identifiers for individual files. This email will also list any directories or files skipped due to not being listed in the manifest, for your review.

 

Restricted Data

For restricted, or controlled access, data submissions, the validation and ingest process is the same.
As described above, NeMO Aspera accounts for restricted data submissions should be created with an appended "-restricted", e.g. doe-restricted. Restricted accounts also need to be set to mount to the proper area on our restricted server. A NeMO team member will reach out, distinctly from the automated account approval email, once your account is ready for restricted data submission, so please plan accordingly for this minimal extra time required prior to data submission.

Restricted Data Release documentation outlines steps required to make restricted datasets available to approved users or the public, as applicable.

 

Tracking Submissions

Submitters can track the progress of data submissions from manifest validation through release to the NeMO Portal. To track submission, log into the NeMO website using your NeMO Aspera credentials. Once logged in, select Dashboard in the top navigation, and select Submission status.
This image shows the status for an individual manifest. Clicking on any of the completed, green checkmarks provides additional details: date, time and explanation of the completed step.

tracking database

Incomplete steps are shown in grey, failed steps in red. Failed steps should be resolved internally by the NeMO team, and we will reach out if you need to resubmit all or a portion of your data.

If you are a project manager, a PI or other role responsible for tracking submissions, and would like to have access to the tracking dashboard for one or a number of labs or submitters, please create a NeMO Aspera account at nemoarchive.org/register and email nemo@som.umaryland.edu explaining your role and tracking request.

 

Reporting to NIH

Depending on funding source, data submitters may be required to submit quarterly proof of data submissions. At this time, NIH is accepting the "Successful ingest" email received once data is successfully bundled as a submission receipt for an individual manifest.

Questions or Comments? Please contact us.