Summary and Setup

Data Carpentry aims to teach researchers basic concepts, skills, and tools for working with data so they can get more done quickly and with less pain. This workshop uses Data Carpentry’s approach to teach data management and analysis for metagenomics research, including: best practices for organizing bioinformatics projects and data, use of command-line utilities, use of command-line tools to analyze sequence quality, use of RStudio and R libraries to compare diversity between samples, and connecting to and using cloud computing. This workshop is designed to be taught over two full days of instruction.

Would you be interested in teaching these materials? We have a Slack channel where we will be happy to help you!

Frequently Asked Questions

Read our FAQ to learn more about Data Carpentry’s Metagenomics workshop as an Instructor or a workshop host.

Getting Started

This lesson assumes that learners have no prior experience with the tools covered in the workshop.

However, learners are expected to have some familiarity with biological concepts, including the concept of DNA sequencing, nucleotide abbreviations, genome, microbiome, and taxonomy. Participants should bring their own laptops and plan to participate actively.

To get started, follow the directions in the Setup tab to get access to the required software and data for this workshop.

Data

This workshop uses data from the study Genomic adaptations in information processing underpins trophic strategy in a whole-ecosystem nutrient enrichment experiment, by Jordan G Okie et al. In this study, the authors compared a microbial community in its natural, oligotrophic, phosphorus-deficient environment, a pond in the Cuatro Ciénegas Basin (CCB), with the same microbial community under a fertilization treatment.

All of the data used in this workshop can be downloaded from Zenodo. More information about these data is available on the Data page.

Workshop Overview

Lesson	Overview	Estimated time
Project Organization and Management	Learn how to structure your metadata, organize and document your metagenomics data and bioinformatics workflow, and access data on the NCBI sequence read archive (SRA) database.	1:30 hr
Introduction to the Command Line	Learn to navigate your file system, create, copy, move, and remove files and directories, and automate repetitive tasks using scripts and wildcards.	4:00 hr
Introduction to R	Use RStudio to manage several data types and structures.	1:00 hr
Data Processing and Visualization for Metagenomics	Use command-line tools to perform quality control, metagenomic assembly, metagenomic binning, taxonomic assignment, and diversity exploration.	6:30 hr

Lessons Reference

The content of this page and three of the lessons presented in this workshop are adapted from lessons on the Data Carpentry Genomics Workshop.

Teaching Platform

This workshop is designed to be run on pre-imaged Amazon Web Services (AWS) instances. All the software and data used in the workshop are hosted on an Amazon Machine Image (AMI). If you want to run your own instance of the server used for this workshop, follow the directions in the Setup tab.

Citation

Please cite as:

Claudia Zirión Martínez; Diego Garfias Gallegos; Tania Vanessa Arellano Fernández; Aarón Espinosa Jaime; Edder D Bustos Díaz; José Abel Lovaco Flores; Luis Gerardo Tejero Gómez; J Abraham Avelar Rivas; Nelly Sélem (March 2023). A Data Carpentry-Style Metagenomics Workshop.

In a Data Carpentry Workshop

This workshop is designed to be run on pre-imaged Amazon Web Services (AWS) instances (remote computers with all the required programs and files, which you will access from your own computer). Except for a spreadsheet program and an internet browser, all of the command-line software and data used in the workshop are hosted on an Amazon Machine Image (AMI). If you are signed up to take a Metagenomics Data Carpentry Workshop, you do not need to worry about setting up an AMI instance. The Carpentries staff will create an instance for you, which will be free. This applies to both self-organized and centrally-organized workshops. Your Instructor will provide instructions for connecting to the AMI instance at the workshop.

If you are in a Carpentries workshop, you do not even need to install a Bash terminal; the RStudio terminal provided in the AWS AMI is enough to run all the commands in the lesson. Instead of connecting by ssh, you can use the RStudio terminal on the AMI.

This lesson requires a working spreadsheet program. If you don’t have a spreadsheet program already, you can use LibreOffice. It’s a free, open-source spreadsheet program.

Running the lesson by yourself (Not in a Data Carpentry Workshop)

Required software

If you are not in a Data Carpentry Workshop, the software you need is listed in the table below. Follow the instructions in Option A or Option B to have access to these programs.

Software	Version used in Conda	Manual	Available for	Description
FastQC	0.11.9	Help	Linux, macOS, Windows	Quality control tool for high-throughput sequence data.
Trimmomatic	0.39	GitHub	Linux, macOS, Windows	A flexible read trimming tool for Illumina NGS data.
Kraken	2.1.2	GitHub	Linux, macOS	A tool for taxonomic assignment of metagenomic reads.
KronaTools	2.8.1	GitHub	Linux, macOS, Windows	A tool for taxonomic visualization in hierarchical pie charts.
MaxBin2	2.2.7	SourceForge	Linux, macOS	Tool for reconstructing metagenome-assembled genomes (MAGs).
SPAdes	3.15.2	GitHub	Linux, macOS	Genome and metagenome assembler.
Kraken-biom	1.2.0	GitHub	Linux, macOS, Windows	Tool to convert Kraken reports into R-readable files.
CheckM-genome	1.2.1	Wiki	Linux, macOS, Windows	Tool to check completeness and contamination in MAGs.

Option A: Using the lessons with Amazon Web Services (AWS)

Follow these instructions on creating an Amazon instance. Use the AMI ami-0f58e878fa70cc201 named The Carpentries Lab Metagenomics v1.0 listed on the Community AMIs page. Please note that you must set your location as N. Virginia to access this community AMI. You can change your location in the upper right corner of the main AWS menu bar. The cost of using this AMI for a few days, with the t2.medium instance type, is very low (about USD $2.00 per user per day). Data Carpentry has no control over AWS pricing structure and provides this cost estimate without guarantees. Please read AWS documentation on pricing for up-to-date information.
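As a rough illustration of the estimate above, the following sketch multiplies the approximate USD $2.00 per user per day by a number of users and days. The function name estimate_cost and the example numbers are hypothetical, and actual AWS charges may differ from this estimate.

```shell
# Rough cost sketch based on the approximate USD 2.00 per user per day quoted above.
# estimate_cost is a hypothetical helper, not part of the lesson or of any AWS tool.
estimate_cost() {
    # $1 = number of users, $2 = number of days
    rate_per_user_day=2   # approximate USD, taken from the estimate in the text
    echo $(( $1 * $2 * rate_per_user_day ))
}

# Example: 20 learners for a two-day workshop
echo "Estimated cost: about USD $(estimate_cost 20 2)"
```

Treat the result only as a planning figure and check the AWS pricing documentation before launching instances.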

If you’re an Instructor or Maintainer or want to contribute to these lessons, don’t hesitate to contact us at team@carpentries.org, and we will start instances for you.

On this instance, you can use the terminal available in RStudio, so users won’t need to install their own terminals or use ssh (see Instructor Notes). If you nevertheless prefer that users install their own terminals, directions for Windows, macOS, and Linux are included below in the Option B section. For Windows, you will need to install Git Bash, PuTTY, or the Windows Subsystem for Linux (WSL) with Ubuntu.

Option B: Following the lessons on your local machine

If you trust that your computer is powerful enough and want to have all the programs installed, you can follow the whole workshop without using an AWS remote machine. To do this, you will need to install all of the software used in the workshop and obtain a copy of the dataset. Instructions for doing this are below.

Data

The data used in this workshop are available on Zenodo. Please read the Zenodo page linked below for information about the data and access to the data files. Because this workshop works with real data, be aware that file sizes for the data are large.

More information about these data will be presented in the first episode of the Data processing and visualization for metagenomics lesson.

Install a Bash terminal

Windows

Download the Git for Windows installer. Run the installer and follow the steps below:
- Click on “Next” four times (two times if you’ve previously installed Git). You don’t need to change anything in the information, location, components, and start menu screens.
- Select “Use the nano editor by default” and click on “Next”.
- Keep “Use Git from the Windows Command Prompt” selected and click on “Next”. If you forget to do this, the programs that you need for the workshop will not work properly. If this happens, rerun the installer and select the appropriate option.
- Select “Use bundled OpenSSH” and click on “Next”.
- Select “Use the OpenSSL Library” and click “Next”.
- Keep “Checkout Windows-style, commit Unix-style line endings” selected and click on “Next”.
- Select “Use Windows’ default console window” and click on “Next”.
- Select “Default (fast-forward on merge)” and click on “Next”.
- Select “None” (Do not use a credential helper) and click on “Next”.
- Select “Enable file system caching” and click on “Next”.
- Ignore “Configuring experimental options” and click on “Install”.
- Click on “Install”.
- Click on “Finish”.
- If your “HOME” environment variable is not set (or you don’t know what this is):
  - Open command prompt (Open Start Menu, then type cmd and press [Enter])
  - Type the following line into the command prompt window exactly as shown: setx HOME "%USERPROFILE%"
  - Press [Enter], and you should see SUCCESS: Specified value was saved.
  - Quit the command prompt by typing exit and then pressing [Enter]
- See the video tutorial for an example of how to install Git on Windows 11.
An alternative option is to install PuTTY by going to the installation page. For most newer computers, click on putty-64bit-X.XX-installer.msi to download the 64-bit version. If you have an older laptop, you may need to get the 32-bit version putty-X.XX-installer.msi. If you aren’t sure whether you need the 64 or 32-bit version, you can check your laptop version by following the instructions here. Once the installer is downloaded, double-click on it, and PuTTY should install.
Another alternative option is to use the Windows Subsystem for Linux (WSL). This option is available for Windows 10 and Windows 11; detailed instructions are available here. See the video tutorial for an example of how to install WSL with Ubuntu 22.04 on Windows 11.

macOS

The default shell in some versions of macOS is Bash, and Bash is available in all versions, so no need to install anything. You access Bash from the Terminal Application (found in /Applications/Utilities). See how to open the terminal in the video tutorial. You can keep the terminal in your dock for this workshop.

Linux

The default shell is usually Bash, and there is usually no need to install anything. To see if your default shell is Bash, type echo $SHELL in a terminal and press Enter. If the message printed does not end with /bash, then your default is something else, and you can run Bash by typing bash.
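The check described above can also be written as a short script. This is only a sketch: is_bash is a hypothetical helper name, and it inspects just the path stored in the SHELL variable.

```shell
# Hypothetical helper: succeeds when a shell path (such as the value of $SHELL)
# ends with /bash.
is_bash() {
    case "$1" in
        */bash|bash) return 0 ;;
        *) return 1 ;;
    esac
}

# ${SHELL:-} avoids an error if SHELL happens to be unset
if is_bash "${SHELL:-}"; then
    echo "Your default shell is Bash"
else
    echo "Your default shell is ${SHELL:-unknown}; type bash to start Bash"
fi
```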

Install Miniconda3

These instructions assume familiarity with the command line and with installation in general. There are different operating systems and many different versions of operating systems and environments, so these may not work on your computer. If an installation doesn’t work for you, please refer to the user guide for the tool listed in the table above. If you have difficulties with the installations or find better ways to install things in your operating system, please raise an Issue to let us know.

To make a Conda environment, you first need to install Conda. We recommend installing the Miniconda3 version. Miniconda is a minimal installer that includes Conda, the package manager, and its dependencies, and it simplifies the installation process. Please first install Miniconda3 (installation instructions below) and then proceed to the installation of the environment.

Linux

To install miniconda3, see the video tutorial

MacOSX

In a terminal type:

BASH

$ curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
$ bash Miniconda3-latest-MacOSX-x86_64.sh

Then, follow the instructions that you are prompted with on the screen to install Miniconda3.
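The commands above download the installer for Intel (x86_64) Macs. As a hedged sketch, the snippet below picks an installer name from the machine architecture reported by uname -m. The helper miniconda_installer is hypothetical, and the arm64 file name for Apple Silicon is an assumption that does not come from this lesson; the metagenomics environment itself may still require the x86_64 build. The snippet only prints the commands so you can review them before running anything.

```shell
# Hypothetical helper: map a CPU architecture to a macOS Miniconda3 installer name.
miniconda_installer() {
    case "$1" in
        x86_64) echo "Miniconda3-latest-MacOSX-x86_64.sh" ;;  # Intel, as used in the lesson
        arm64)  echo "Miniconda3-latest-MacOSX-arm64.sh" ;;   # Apple Silicon (assumption)
        *)      return 1 ;;                                   # anything else is not handled
    esac
}

arch="$(uname -m)"
if installer="$(miniconda_installer "$arch")"; then
    # Print the commands instead of running them
    echo "curl -O https://repo.anaconda.com/miniconda/$installer"
    echo "bash $installer"
else
    echo "No macOS installer known for architecture: $arch"
fi
```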

WSL

See the video tutorial, installing Miniconda3 on WSL Ubuntu

Install the metagenomics environment

Once your Miniconda3 is ready, follow these instructions to install and activate the metagenomics environment.

Linux: Option 1 (recommended)

The easiest way to install the environment is to use the specifications file for Linux Ubuntu 22.04, which has the exact versions of each tool in this environment. You can use the spec file as follows:

BASH

$ conda create --name metagenomics --file spec-file-Ubuntu22.txt

More information about how to use environments and spec files is available in the Conda documentation.

Linux: Option 2

Another way to create an environment is with a yml file. This environment can be modified by adding or deleting tools in the file metagenomics-Ubuntu22.yml.

In Ubuntu 22.04, copy this file metagenomics-Ubuntu22.yml to your computer and follow the instructions in the video tutorial

MacOSX

Finding compatible versions of all the dependencies of each package in the metagenomics environment has been difficult on macOS. On macOS Monterey, the MaxBin2 package can be installed, but it does not fully work when run. Copy the file metagenomics-macOS.yml to your computer and run:

BASH

$ conda env create -f metagenomics-macOS.yml

WSL

On Windows Subsystem for Linux (WSL) with Ubuntu 22.04, the MaxBin2 package is incompatible with the checkm-genome package, so we left it out of the metagenomics environment and gave it its own environment, defined in the file metagenomics-maxbin.yml. The file for the metagenomics environment is metagenomics-WSLUbuntu.yml. See the video tutorial.

BASH

$ conda env create -f metagenomics-maxbin.yml
$ conda env create -f metagenomics-WSLUbuntu.yml
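Whichever option you used, you can check the result after activating the environment, for example with conda activate metagenomics (assuming the environment is named metagenomics, as in the spec-file command for Linux). The sketch below reports which workshop tools are on your PATH. The helper check_tools is hypothetical, and the executable names are assumptions about what the Conda packages install; MaxBin2 is left out of the list because on WSL it lives in a separate environment.

```shell
# Hypothetical helper: report which of the given executables are on PATH.
# Returns a non-zero status if at least one of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names are assumptions; adjust them if your installation differs.
if check_tools fastqc trimmomatic kraken2 ktImportTaxonomy metaspades.py kraken-biom checkm; then
    echo "All listed tools were found."
else
    echo "Some tools were not found; is the metagenomics environment active?"
fi
```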

Execute some remaining installation scripts

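Replace dcuser with your own username and run all these lines:

BASH

# Update the Krona taxonomy database
bash /home/dcuser/.miniconda3/envs/metagenomics/opt/krona/updateTaxonomy.sh
# Download and unpack the NCBI taxonomy dump
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzf taxdump.tar.gz
# Copy the taxonomy files into the .taxonkit directory in your home directory
mkdir -p /home/dcuser/.taxonkit
cp names.dmp nodes.dmp delnodes.dmp merged.dmp /home/dcuser/.taxonkit
# Remove the downloaded and unpacked files that are no longer needed
rm *dmp readme.txt taxdump.tar.gz gc.prt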
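After running the lines above, you may want to confirm that the four taxonomy files reached the .taxonkit directory. The following is a minimal sketch: check_taxdump is a hypothetical helper, and its default location assumes the files were copied to .taxonkit inside your home directory.

```shell
# Hypothetical helper: verify that the four taxonomy files exist and are not empty.
# $1 = directory to check (defaults to ~/.taxonkit)
check_taxdump() {
    dir="${1:-$HOME/.taxonkit}"
    status=0
    for f in names.dmp nodes.dmp delnodes.dmp merged.dmp; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f"
            status=1
        fi
    done
    return "$status"
}

if check_taxdump; then
    echo "Taxonomy files are in place."
fi
```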

Install R and RStudio

R and RStudio are two separate pieces of software:

- R is a programming language that is especially powerful for data exploration, visualization, and statistical analysis.
- RStudio is an integrated development environment (IDE) that makes using R easier. In this course, we use RStudio to interact with R.

Mac OS X

Download R from the CRAN website.
Select the .pkg file for the latest R version
Double-click on the downloaded file to install R
It is also a good idea to install XQuartz (needed by some packages)
Go to the RStudio download page
Under Installers, select RStudio x.yy.zzz - Mac OS X 10.6+ (64-bit) (where x, y, and z represent version numbers)
Double-click the file to install RStudio
Once it’s installed, open RStudio to ensure it works and you don’t get any error messages.

Windows

Download R from the CRAN website.
Run the .exe file that was just downloaded
Go to the RStudio download page
Under Installers select RStudio x.yy.zzz - Windows Vista/7/8/10 (where x, y, and z represent version numbers)
Double-click the file to install it
Once it’s installed, open RStudio to ensure it works and you don’t get any error messages.

Linux

Follow the instructions for your distribution from CRAN. They provide information to get the most recent version of R for common distributions. For most distributions, you could use your package manager (e.g., for Debian/Ubuntu, run sudo apt-get install r-base, and for Fedora, sudo yum install R). However, we don’t recommend this approach, as the versions provided this way are usually out of date. In any case, make sure you have at least R 3.3.1.
Go to the RStudio download page
Under Installers, select the version that matches your distribution, and install it with your preferred method (e.g., on Debian/Ubuntu, run sudo dpkg -i rstudio-x.yy.zzz-amd64.deb at the terminal).
Once it’s installed, open RStudio to ensure it works and you don’t get any error messages.
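To check the requirement of at least R 3.3.1 mentioned above from a terminal, a sketch like the following can help. The helper version_ge is hypothetical; it relies on sort -V (version sort), which is available in GNU coreutils and in recent macOS releases, and it assumes the first line of R --version has the usual form "R version x.y.z ...".

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to version $2.
# After a version sort, the smaller of the two versions comes first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v R >/dev/null 2>&1; then
    # Third field of the first line, e.g. "R version 4.3.1 (2023-06-16) ..."
    installed="$(R --version | head -n 1 | awk '{print $3}')"
    if version_ge "$installed" 3.3.1; then
        echo "R $installed is recent enough"
    else
        echo "R $installed is older than 3.3.1; please update"
    fi
else
    echo "R was not found on PATH"
fi
```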

Install R libraries

Software	Version	Manual	Description
phyloseq	1.39.1	GitHub	Explore, manipulate and analyze microbiome profiles with R
ggplot2	3.3.6	GitHub	System for declaratively creating graphics, based on The Grammar of Graphics

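Type these commands in your R console. phyloseq is distributed through Bioconductor rather than CRAN, so it is installed with BiocManager:

R

> install.packages("BiocManager")
> BiocManager::install("phyloseq")
> install.packages("ggplot2")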