Content from Data Tidiness


Last updated on 2025-04-16

Estimated time: 20 minutes

Overview

Questions

  • What metadata should I collect?
  • How should I structure my sequencing data and metadata?
  • Which ethical considerations involve microbiome data and metadata?

Objectives

  • Think about and understand the types of metadata a sequencing experiment will generate.
  • Understand the importance of metadata and potential metadata standards.
  • Explore common formatting challenges in spreadsheet data.
  • Consider confidentiality issues that may arise from storing and sharing the metadata.

When we think about the data for a sequencing project, we often start by thinking about the sequencing data that we get back from the sequencing center, but just as important, if not more so, is the data you have generated about the sequences before they ever go to the sequencing center. Data about the data are often called the metadata. The sequence data itself is useless without the information about what you sequenced.

Discussion 1

With the person next to you, discuss:

What kinds of data and information have you generated before you sent your DNA/RNA off for sequencing?

Types of files and information you have generated:

  • Spreadsheet or tabular data with the data from your experiment and whatever you were measuring for your study.
  • Lab notebook notes about how you conducted those experiments.
  • Spreadsheet or tabular data about the samples you sent for sequencing. Sequencing centers often have a particular format they need, with the name of the sample, DNA concentration, and other information.
  • Lab notebook notes about how you prepared the DNA/RNA for sequencing and what type of sequencing you are doing, e.g., paired-end Illumina HiSeq.

There will likely be other ideas here too. Was this more information and data than you were expecting?

All the data and information discussed can be considered metadata, i.e., data about the data. We want to follow a few guidelines for metadata.

Notes about your experiment


Notes about your experiment, including how you prepared your samples for sequencing, should be in your lab notebook, whether a physical lab notebook or an electronic one. For guidelines on good lab notebooks, see the Howard Hughes Medical Institute’s “Making the Right Moves: A Practical Guide to Scientific Management for Postdocs and New Faculty”, particularly the section on Data Management and Laboratory Notebooks.

Including dates on your lab notebook pages, on the samples themselves, and in any records about those samples helps you associate everything with each other later. Using dates also helps create unique identifiers, because even if you process the same sample twice, you do not usually do it on the same day; and if you do, you are aware of it and can give the two runs names like A and B.

Unique identifiers

Unique identifiers are unique names for a sample or set of sequencing data: names that exist for that data and nothing else. Having these unique names makes the data much easier to track later.
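One simple way to build unique identifiers is to combine a project code, the processing date, and a sample number. The Python sketch below shows the idea; the project code `CCL` and the numbering scheme are made-up examples, not a standard, so substitute whatever convention your lab agrees on.

```python
from datetime import date

def make_sample_id(project, sample_number, sampling_date, replicate=""):
    """Build a unique, sortable sample ID like 'CCL_2017-05-10_S03'.

    The project code and numbering scheme here are invented examples;
    use whatever convention your lab agrees on and document it in the README.
    """
    suffix = f"S{sample_number:02d}{replicate}"
    return f"{project}_{sampling_date.isoformat()}_{suffix}"

print(make_sample_id("CCL", 3, date(2017, 5, 10)))       # CCL_2017-05-10_S03
print(make_sample_id("CCL", 3, date(2017, 5, 10), "B"))  # same sample, second processing
```

Because the date is part of the name, processing the same sample on a different day automatically yields a different identifier.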

Metadata: Data about the experiment


Data about the experiment is usually collected in spreadsheets, for example in Excel. Alongside the spreadsheet, it is convenient to create a text file called README. This file holds information about the research project, how the samples were generated, and how to read the metadata spreadsheet. This information is especially valuable when working on a team, where people may join the project at different stages of the experiment, and it helps your collaborators easily understand what the data are about.

Metadata standards

The kind of data you collect depends on your experiment, and often there are guidelines in the form of metadata standards. Many fields have particular ways of structuring their metadata so that it is consistent and can be reused across the field. The Digital Curation Center maintains a list of metadata standards, and some that are particularly relevant for genomics data are available from the Genomics Standards Consortium. In particular, assembly quality and estimates of genome completeness and contamination are part of the standards for Metagenome-Assembled Genomes (MAGs).

Cornell University provides a useful guide and a template file for writing README-style metadata in case there are no metadata standards for your type of data. When you submit data to an organization, it may give you a file with specifications about the metadata needed for its platform. MG-RAST, for example, provides such a file.

Discussion 2. What information would you write in your README file?

Suppose that in your field, there are no metadata standards yet. Think about the minimum amount of information that someone would need to be able to work with your data without talking to you. What type of information would you put in your README file?

Some examples of clarifications that should be written in the README are:

  • Date format (mm-dd-yyyy or dd-mm-yyyy, for example).
  • Meaning of abbreviations.
  • Meaning of, or pattern followed to construct, the unique IDs of samples.
  • Details about the methodology.
  • Contact information for the people who performed the collection and/or experiments.
  • Meaning of each variable name.

Discussion 3: Ethical considerations in microbiome studies

Knowledge about microbiomes is part of humanity’s global heritage and must be governed by ethical principles such as do good, do no harm, respect, and act justly. When studying the human microbiome, there are particular concerns. For example, if you discovered an infection such as HIV in a blood microbiome sample, would you inform the participants? What if they did not want to know? Following these principles, what other ethical considerations can you think of for human microbiomes?

  1. Respect: Ask participants to sign explicit prior informed consent.
  2. Do good: Share the data, but protect participants’ privacy.
  3. Do no harm: Establish a policy about communicating results.
  4. Do no harm: Consider the invasiveness of sampling and minimize the risk.
  5. Act justly: Favor diversity of subjects and justice.

More information about these topics is developed by McGuire et al. 2008 in the paper “Ethical, legal, and social considerations in conducting the Human Microbiome Project” and by Lange et al. 2022 in “Microbiome ethics, guiding principles for microbiome research, use and knowledge management”.

Structuring data in spreadsheets

Independent of the type of data you are collecting, there are standard ways to enter that data into the spreadsheet to make it easier to analyze later. We often enter data in a way that is easy for us as humans to read and work with, because we are human! Computers, however, need data structured in a way that they can use. So to use this data in a computational workflow, we need to think like computers when we use spreadsheets.

The cardinal rules of using spreadsheet programs for data:

  • Leave the raw data raw - do not change it!
  • Put each observation in its own row. An observation is each of our samples, the subjects for which we store information in the spreadsheet.
  • Put all your variables in columns. The variables are the different pieces of information that we have about our sample (its genotype, phenotype, treatment, etc.).
  • Make column names explanatory but without spaces. Use ‘-’, ‘_’, or camel case instead of a space. For instance, ‘library-prep-method’ or ‘LibraryPrep’ is better than ‘library preparation method’ or ‘prep’, because computers interpret spaces in particular ways.
  • Do not combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that is the only way you will want to be able to use or sort that data. For example, instead of having a column with species and strain name (e.g., E. coli K12) you would have one column with the species name (E. coli) and another with the strain name (K12). Depending on the type of analysis you want to do, you may even separate the genus and species names into distinct columns.
  • Export the cleaned data to a text-based format like CSV (comma-separated values). This format ensures that anyone can use the data, and it is required by most data repositories.
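As a minimal sketch of the last two rules, the Python snippet below splits a combined species-and-strain cell into two columns and exports the result as CSV. The sample names and the file name are invented for illustration.

```python
import csv

# Hypothetical messy rows where species and strain share one cell:
messy = [{"sample": "S1", "organism": "E. coli K12"},
         {"sample": "S2", "organism": "E. coli O157:H7"}]

tidy = []
for row in messy:
    # Split at the last space: everything before it is the species,
    # everything after it is the strain.
    species, _, strain = row["organism"].rpartition(" ")
    tidy.append({"sample": row["sample"], "species": species, "strain": strain})

# Export to a text-based format that any program (and repository) can read.
with open("samples_tidy.csv", "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["sample", "species", "strain"])
    writer.writeheader()
    writer.writerows(tidy)
```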

Messy spreadsheet

Discussion 4. Spreadsheet organization.

This is an example of spreadsheet data generated about a sequencing experiment. With the person next to you, for about 2 minutes, discuss some problems with the spreadsheet data shown above. You can look at the image or download the file to your computer via this link and open it in a spreadsheet reader like Excel.

There are a few potential errors to be on the lookout for in your own data and in data from collaborators or the Internet. If you are aware of these errors and of their possible negative effects on downstream data analysis and result interpretation, it may motivate you and your project members to try and avoid them. Making small changes to how you format your data in spreadsheets can greatly impact efficiency and reliability when it comes to data cleaning and analysis.

  • Using multiple tables
  • Using multiple tabs
  • Not filling in zeros
  • Using problematic null values
  • Using formatting to convey information
  • Using formatting to make the datasheet look pretty
  • Placing comments or units in cells
  • Entering more than one piece of information in a cell
  • Using problematic field names
  • Using special characters in data
  • Inclusion of metadata in the data table
  • Date formatting

You can keep exploring these issues with the metadata file in the Data Carpentry Ecology spreadsheet lesson; not all of them are present in this example. Discuss with the group what you found. Some problems include not all datasets having the same columns, datasets split into their own tables, color used to encode information, inconsistent column names, and spaces in some column names. Here is a “clean” version of the same spreadsheet:

Cleaned spreadsheet Download the file using right-click (PC)/command-click (Mac).
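Some of these problems can be caught automatically once the data are out of the spreadsheet. As a sketch, the Python function below flags column names with spaces or special characters; the example headers are hypothetical, and the character list is only a starting point.

```python
def check_column_names(columns):
    """Flag column names that commonly cause trouble downstream."""
    problems = []
    for name in columns:
        if " " in name:
            problems.append(f"'{name}': contains a space")
        # A small, non-exhaustive set of characters that often break tools:
        if any(ch in name for ch in "()µ/#%"):
            problems.append(f"'{name}': contains special characters")
    return problems

# Hypothetical headers mixing good and problematic names:
issues = check_column_names(["sample_id", "library prep method", "Volume (µL)"])
for issue in issues:
    print(issue)
```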

Further notes on data tidiness


Data organization at this point of your experiment will help facilitate your analysis later, as well as prepare your data and notes for data deposition, which is now often required by journals and funding agencies. If this is a collaborative project, as most projects now are, this is also information that collaborators will need to interpret your data and results, and it is very useful for communication and efficiency.

Fear not! If you have already started your project and it is not set up this way, there are still opportunities to make updates. One of the biggest challenges is tabular data that is not formatted so computers can use it or has inconsistencies that make it hard to analyze.

When working in the command line, it is problematic to put spaces in the names of directories and files.

More practice on how to structure data is outlined in our Data Carpentry Ecology spreadsheet lesson

Tools like OpenRefine can help you clean your data.

Key Points

  • Assigning and keeping track of appropriate unique identifiers must be a well-thought-out process.
  • Metadata is key for you and others to work with your data.
  • Tabular data needs to be structured to work with it effectively.
  • Human microbiome data requires informed consent and confidentiality.

Content from Planning for NGS Projects


Last updated on 2025-04-16

Estimated time: 20 minutes

Overview

Questions

  • How do I plan and organize a genome sequencing project?
  • What information does a sequencing facility need?
  • What are the guidelines for data storage?

Objectives

  • Understand the data we send to and get back from a sequencing center.
  • Make decisions about how (if) data will be stored, archived, shared, etc.

Large datasets


There are a variety of ways to work with a large sequencing dataset. You may be a novice who has not used bioinformatics tools beyond doing BLAST searches. You may have bioinformatics experience with other data types and are working with high-throughput (NGS) sequence data for the first time. In the most important ways, the methods and approaches we need in bioinformatics are the same ones we need at the bench or in the field - planning, documenting, and organizing are the key to good reproducible science.

Discussion 1

Before we go any further, here are some important questions. If you are learning at a workshop, please discuss these questions with your neighbor.

Working with sequence data

What challenges do you think you’ll face (or have already faced) in working with a large sequence dataset?
Where/how will you (did you) analyze your data - what software, what computer(s)? What is your strategy for saving and sharing your sequence files?
How can you be sure that your raw data has not been unintentionally corrupted?

With large datasets, it is hard to have enough storage space for your data and your results, and it is hard to anticipate how much disk space and processing time you will need for every step of your pipelines, including intermediate files and results.
It is also hard to identify errors when you have too many files or too many observations in a spreadsheet.
Some programs may not work, or may perform poorly, with many files. Data can be protected by removing write permissions and by keeping copies.

Sending samples to the facility


The first step in sending your sample for sequencing will be to complete a form documenting the metadata for the facility. Take a look at the following example submission spreadsheet.

Sample submission sheet

Download the file using right-click (PC)/command-click (Mac). This file is a tab-delimited text file. Try opening it with Excel or another spreadsheet program.

Exercise 1: Identifying errors

  1. What are some errors you can spot in the data? Typos, missing data, inconsistencies?
  2. What improvements could be made to the choices in naming?
  3. What are some errors in the spreadsheet that would be difficult to spot? Is there any way you can test this?

Errors:

  • The sequential order of well_position changes.
  • The format of client_sample_id changes; it cannot have spaces, slashes, or non-standard ASCII characters.
  • The capitalization of the replicate column changes.
  • The volume and concentration column headers have unusual (not allowed) characters.
  • The decimal accuracy of the volume, concentration, and RIN columns changes.
  • The prep_date and ship_date formats are different, and prep_date has multiple formats.
  • Are there others not mentioned?

Improvements in naming:

  • Shorten client_sample_id names, and maybe just call them “names”. For example, use “wt” for “wild-type”. Also, they are all “1hr”, so that is superfluous information.
  • The prep_date and ship_date might not be needed.
  • Use “microliters” for “Volume (µL)”, etc.

Errors hard to spot:

  • No space between “wild” and “type”, repeated barcode numbers, missing data, duplicate names.
  • Find these by sorting or counting.
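The counting approach can also be done programmatically. A minimal Python sketch, with made-up barcode values, counts how often each value occurs and reports the repeats:

```python
from collections import Counter

# Hypothetical barcode column copied from a submission sheet:
barcodes = ["AGTCAA", "AGTCCG", "AGTCAA", "ATGTCA", "CCGTCC"]

counts = Counter(barcodes)
duplicates = [barcode for barcode, n in counts.items() if n > 1]
print("Repeated barcodes:", duplicates)  # barcodes are expected to be unique
```

The same pattern works for any column that should contain unique values, such as sample names.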

Retrieving sample sequencing data from the facility


When the data come back from the sequencing facility, you will receive some documentation (metadata) as well as the sequence files themselves. Download and examine the following example file - here provided as a text file and Excel file:

Exercise 2: Exploring sequencing metadata

  1. How are these samples organized?
  2. If you wanted to relate file names to the sample names submitted above (e.g., wild type), could you do so?
  3. What do the _R1/_R2 extensions mean in the file names?
  4. What does the ‘.gz’ extension on the filenames indicate?
  5. What is the total file size - what challenges in downloading and sharing these data might exist?
  1. Samples are organized by sample_id
  2. To relate filenames, use the sample_id and do a VLOOKUP on the submission sheet
  3. The _R1/_R2 extensions mean “Read 1” and “Read 2” of each sample
  4. The ‘.gz’ extension means it is a compressed “gzip” type format to save disk space
  5. The size of all the files combined is 1113.60 GB (over a terabyte!). To transfer files this large, you should validate the file size following the transfer. Absolute file integrity checks following transfers and methods for faster file transfers are possible but beyond the scope of this lesson.
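A common way to check file integrity after a transfer is to compare checksums computed before and after. A sketch in Python, using only the standard library:

```python
import hashlib

def sha256sum(path, chunk_size=1024 * 1024):
    """Stream a (possibly huge) file through SHA-256 without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compute the checksum on the source machine, transfer the file, and compute
# it again on the destination; identical digests mean the file arrived intact.
```

Many facilities provide checksum files (often MD5) alongside the data for exactly this comparison.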

Storing data


The raw data you get back from the sequencing center is the foundation of your sequencing analysis. You need to keep this data so that you can always come back to it if there are any questions or if you need to re-run an analysis or try a new approach.

Guidelines for storing data

  • Store the data in a place that is accessible to you and other members of your lab. At a minimum, you and the head of your lab should have access.
  • Store the data in a place that is redundantly backed up. It should be backed up in two locations in different physical areas.
  • Leave the raw data raw. You will be working with this data, but you don’t want to modify this stored copy of the original data. If you modify the data, you’ll never be able to access those original files. We will cover how to avoid accidentally changing files in a later lesson in this workshop (see File Permissions).
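One way to leave the raw data raw is to remove write permission from the files once they arrive, so they cannot be modified by accident. A sketch in Python (the file name is hypothetical; the File Permissions material covers the equivalent shell commands):

```python
import os
import stat

def make_read_only(path):
    """Clear all write bits so the file cannot be modified accidentally."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~stat.S_IWUSR & ~stat.S_IWGRP & ~stat.S_IWOTH)

# Hypothetical raw data file:
open("raw_reads.fastq", "w").close()
make_read_only("raw_reads.fastq")
# For a regular (non-root) user, opening this file for writing now fails.
```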

Some data storage solutions

A local high-performance computing center or data storage facility on your campus or within your organization is an ideal location. Get in touch with the people who support those facilities to ask for information.

If you don’t have access to such resources, you can back your data up on hard drives. Keep two backups, and store the hard drives in different physical locations.

You can also use cloud resources: you put your information in the cloud, so you won’t lose it even if you lose your computer. Options for cloud storage include Amazon S3, Microsoft Azure, Google Cloud, and others. The Open Science Framework is a free option for storing files up to 5 GB. See more in the lesson “Introduction to Cloud Computing for Genomics”.

Apart from these cloud resources aimed specifically at storage, other cloud services give you computing capacity for data processing and analysis larger than that of a common personal computer, like the Amazon Web Services instances we will use during this workshop.

Summary


Before data analysis has begun, there are already many potential areas for errors and omissions. Keeping organized, and keeping a critical eye can help catch mistakes.

One of Data Carpentry’s goals is to help you achieve competency in working with bioinformatics. This aim means you can accomplish routine tasks under normal conditions in an acceptable amount of time. While an expert might be able to get to a solution on instinct alone - taking your time, using Google or another Internet search engine, and asking for help are all valid ways of solving your problems. As you complete the lessons, you’ll be able to use all those methods more efficiently.

Where to go from here?

More reading about core competencies

L. Welch, F. Lewitter, R. Schwartz, C. Brooksbank, P. Radivojac, B. Gaeta and M. Schneider, ‘Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies’, PLoS Comput Biol, vol. 10, no. 3, p. e1003496, 2014.

Key Points

  • Data being sent to a sequencing center also needs to be structured so you can use it.
  • Raw sequencing data should be kept raw somewhere, so you can always go back to the original files.

Content from Examining Data on the NCBI SRA Database


Last updated on 2025-04-16

Estimated time: 20 minutes

Overview

Questions

  • How do I access public sequencing data?

Objectives

  • Be aware that public genomic data is available.
  • Understand how to access and download this data.

Public data


In our experiments, we usually think about the sequencing data we generate ourselves. However, almost all analyses use reference data, and you may want to compare your results with, or annotate your data using, publicly available data. You may also want to do a full project or set of analyses using publicly available data. Public data is a great and essential resource for genomic data analysis.

When you come to publish a paper including your sequencing data, most journals and funders require that you place your data in a public repository. It helps to prepare for this early! Sharing your data makes it more likely that your work will be re-used and cited.

There are many repositories for public data. Some model organisms or fields have specific databases, and there are others for particular data types. Two of the most comprehensive public repositories are provided by the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EMBL-EBI). The NCBI’s Sequence Read Archive (SRA) is the database we will be using for this lesson, but the EMBL-EBI’s European Nucleotide Archive (ENA) is also useful. The general process is similar for any database.

Accessing the original archived data


The sequencing dataset (from Okie et al. 2020) adapted for this workshop was obtained from the NCBI Sequence Read Archive, which is a large (~27 petabasepairs / 2.7 × 10^16 base pairs as of April 2019) repository for next-generation sequence data. Like many NCBI databases, it is complex, and mastering its use is beyond the scope of this lesson. Papers will often have a direct link (perhaps in the supplemental information) to where their SRA dataset can be found. We are only using a small part of the Okie et al. 2020 dataset, so no such direct link exists for it.

Using the SRA Run Selector


See the figures below to determine how data accession is provided within the original paper.

The next image shows the study’s title, “Genomic adaptations in information processing underpin trophic strategy in a whole-ecosystem nutrient enrichment experiment”, as well as the authors.

Screenshot of the cover page of the article named: Genomic adaptations in information processing underpin trophic strategy in a whole-ecosystem nutrient enrichment experiment

The image below shows an excerpt from the paper that includes information on locating the sequence data. In this case, this text occurs just before the reference section. In the Data availability section, the image says that data and metadata have been submitted to the Sequence Read Archive (SRA) at NCBI and are accessible through the BioProject PRJEB22811. Notice that the metadata records that the year was 2017, the place was Cuatro Cienegas Lagunita, the experiment was a fertilization experiment, and the author is the J. Craig Venter Institute.

Screenshot of the section of the article called Additional file. It shows the following text: Supplementary files: Source data 1. Data on the metagenomic traits and concentrations of seston chlorophyll a, phosphorus, nitrogen, and carbon in water samples from Lagunitas pond, Cuatro ciénegas, Mexico. Data availability: Raw sequence data and metadata have been submitted to the NCBI Sequence Read Archive, accessible through BioProject PRJEB228811. The following dataset was generated: Author(s): J Craig Venter Institute, Year: 2017, Dataset title: Cuatro Ciénegas Lagunita Fertilization Experiment, Database, and Identifier: NCBI BioProject, PREJB22811, Dataset URL

Follow the next steps to access the data in the SRA using the information in this section.

  1. The paper references “PRJEB22811” as a “BioProject” at NCBI. Go to the NCBI website and search for “PRJEB22811”.

  2. You will be shown a link to the “Cuatro Cienegas Lagunita Fertilization Experiment” BioProject. Click on it.

  3. Once on the BioProject page, scroll down to the table under Project Data.

  4. This table says there are 40 links to the SRA Experiments of this project. Click on the number 40.

  5. Now you are on the NCBI SRA site with the 40 samples of this project. This site is NCBI’s new cloud-based SRA interface. At the top of the page is a Send to dropdown menu; click on it, select Run Selector, and click Go. This will take you to the SRA Run Selector, where you will be presented with a page for the overall BioProject accession PRJEB22811 - this is a collection of all the experimental data.

  6. Notice that this page has three sections: “Common Fields”, “Select”, and “Found 40 Items”. The sections “Select” and “Found 40 Items” are shown in the next image. “Select” contains information about the run size and the data and metadata tables. “Found 40 Items” is a table where each row contains ID numbers, an alias name, the size, and links to the data for one sample. Within “Found 40 Items”, click on the first Run number (column “Run”, row “1”). Screenshot of the sections Select and Found 40 Items

  7. This will take you to the Run Browser page. Take a few minutes to examine some of the descriptions on the page. In the image, we see the SRA entry ERR2143758. The metadata tab displays the run’s quality and GC content, among other information. Screenshot of details for the selected run. It shows the details of the sequence file, a quality graph, the metadata, the Biosample details, and BioProject details.

  8. Use the browser’s back button to go back to the ‘previous page’. As shown in the figure below, the second section of the page (“Select”) has the Total row showing you the current number of “Runs”, “Bytes”, and “Bases” in the dataset to date. On 2012-06-27, there were 40 runs, 9.86 GBytes of data, and 19.61 Gbases. Screenshot of the section Select, with two rows with the number of runs, bytes, bases, and buttons to download the Metadata and Accession List, one row for Total and one row for Selected.

  9. Click on the “Metadata” button to download the file “SraRunTable.txt” and save it on your computer.

  10. Review the SraRunTable in a spreadsheet program. Using your favorite spreadsheet program, open the SraRunTable.txt file. If prompted by the spreadsheet software, be aware that the SRA Run Selector provides a comma-separated file.

Delimiters


The fields in a table are usually separated (or delimited) by commas or tabs, so the files are named with the .csv (comma-separated values) and .tsv (tab-separated values) extensions, respectively. But since both are plain text files, you may also find them with the .txt extension, just like our SraRunTable.txt.
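If you are unsure what delimiter a text file uses, Python's csv module can guess it from a sample of the content. The sketch below writes a tiny stand-in for SraRunTable.txt (the Bases value is invented) and detects its delimiter:

```python
import csv

# A tiny stand-in for SraRunTable.txt; the Bases value is invented.
with open("SraRunTable.txt", "w") as handle:
    handle.write("Run,Bases,BioProject\n"
                 "ERR2143758,235674402,PRJEB22811\n")

with open("SraRunTable.txt") as handle:
    dialect = csv.Sniffer().sniff(handle.read(1024))
    handle.seek(0)
    rows = list(csv.reader(handle, dialect))

print(repr(dialect.delimiter))  # ',' despite the .txt extension
print(rows[0])                  # the header row
```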

Discussion 1

Discuss with the person next to you:

  1. What was the sequencing platform used for this experiment?
  2. What samples in the experiment contain paired end sequencing data?
  3. What other kind of metadata is available?
  1. The Illumina sequencing platform was used, shown in the column “Platform”. The column “Instrument” shows which type of Illumina sequencer was used, in this case, Illumina MiSeq.
  2. The “LibraryLayout” column shows that all samples contain paired-end data.
  3. Technology and instruments are good examples of the types of metadata that can exist for a sequenced biological sample. There is technical information, like “Assay Type” and “DATASTORE filetype”, information about the sequences like “Bases” and biological metadata like “environment_(biome)” and “potassium_ppm”.

After answering the questions, you should avoid saving any changes you might have made to this file. We don’t want to make any changes. If you were to save this file, make sure you save it as a plain .txt file. Remember to keep raw things raw.
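The same questions can be answered programmatically once SraRunTable.txt is loaded. The sketch below uses two in-memory rows with the columns discussed above; the values mirror the lesson's answers, but the second run accession is invented, and nothing is read from the real file.

```python
from collections import Counter

# Two illustrative rows with columns from the SRA Run Selector metadata;
# the second Run accession is invented for the example.
rows = [
    {"Run": "ERR2143758", "Platform": "ILLUMINA",
     "Instrument": "Illumina MiSeq", "LibraryLayout": "PAIRED"},
    {"Run": "ERR2143759", "Platform": "ILLUMINA",
     "Instrument": "Illumina MiSeq", "LibraryLayout": "PAIRED"},
]

print(Counter(row["Platform"] for row in rows))               # which platforms appear
print(all(row["LibraryLayout"] == "PAIRED" for row in rows))  # True: all paired-end
```

For the real table, `csv.DictReader` over SraRunTable.txt would produce rows of exactly this shape.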

Discussion 2: Exploring the European Nucleotide Archive

Navigate to the ENA and search for the BioProject “PRJEB22811”. Explore the ENA Browser and discuss with your neighbor the differences between the ENA Browser and the SRA Run Selector.

Downloading reads

For downloading the reads, there are mainly two options:

  • One by one: Go to the Run Browser page of each sample, navigate to the FASTA/FASTQ download tab, and click on the FASTQ button.
  • Complete dataset: In the SRA Run Selector for the BioProject, go to the Select section and click on the Accession List button. This will download a text file, SRR_Acc_List.txt, that you can use to download the reads in bulk with the SRA Toolkit, a command-line software package, which is outside the scope of this lesson.
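Although running the SRA Toolkit is outside the scope of this lesson, the bulk download boils down to one fasterq-dump call per accession in SRR_Acc_List.txt. The Python sketch below only builds and prints the commands rather than executing them; the second accession in the stand-in list is invented, and `--split-files` is fasterq-dump's flag for writing paired reads to separate files.

```python
# Write a tiny stand-in accession list (the second accession is invented):
with open("SRR_Acc_List.txt", "w") as handle:
    handle.write("ERR2143758\nERR2143759\n")

with open("SRR_Acc_List.txt") as handle:
    accessions = [line.strip() for line in handle if line.strip()]

# One fasterq-dump command per run; printed here instead of executed.
commands = [["fasterq-dump", "--split-files", acc] for acc in accessions]
for cmd in commands:
    print(" ".join(cmd))
```

With the SRA Toolkit installed, these command lists could be passed to `subprocess.run` to perform the actual downloads.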

Where to learn more


About the Sequence Read Archive

  • You can learn more about the SRA by reading the SRA Documentation
  • The best way to transfer a large SRA dataset is by using the SRA Toolkit

References


Jordan G Okie, Amisha T Poret-Peterson, et al. Genomic adaptations in information processing underpin trophic strategy in a whole-ecosystem nutrient enrichment experiment. eLife; 2020. DOI: 10.7554/eLife.49816 Paper.
Data on NCBI SRA: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJEB22811&o=acc_s%3Aa
Data on EMBL-EBI ENA: https://www.ebi.ac.uk/ena/browser/view/PRJEB22811

Key Points

  • Public data repositories are a great source of genomic data.
  • You will likely need to deposit your data in a public repository when you publish.