Metagenome Assembly
Last updated on 2025-04-06 | Edit this page
Estimated time: 40 minutes
Overview
Questions
- Why should genomic data be assembled?
- What is the difference between reads and contigs?
- How can we assemble a metagenome?
Objectives
- Understand what an assembly is.
- Run a metagenomics assembly workflow.
- Use an environment in a bioinformatic pipeline.
Assembling reads
The assembly process groups reads into contigs and contigs into scaffolds to obtain (ideally) the sequence of a whole chromosome. There are many programs devoted to genome and metagenome assembly, some of the main strategies they use are Greedy extension, OLC, and De Bruijn charts. Contrary to metabarcoding, shotgun metagenomics needs an assembly step, which does not mean that metabarcoding never uses an assembly step but sometimes is unnecessary.

MetaSPAdes is an NGS de novo assembler for assembling large and complex metagenomics data, and it is one of the most used and recommended. It is part of the SPAdes toolkit, which contains several assembly pipelines.
Some of the problems faced by metagenomics assembly are:
- Differences in coverage between the genomes due to the differences in abundance in the sample.
- The fact that different species often share conserved regions.
- The presence of several strains of a single species in the community.
SPAdes already deals with the non-uniform coverage problem in its algorithm, so it is helpful for the assembly of simple communities, but the metaSPAdes algorithm deals with the other problems as well, allowing it to assemble metagenomes from complex communities.
The process of (metagenomics) assembly can take a long time, and if the connection to the server drops, the process is killed, and the process needs to restart. To avoid this, we can create a screen session.
Screen sessions
Starting a new session
A ‘session’ can be considered a new window or screen: you might open a terminal to do one thing on the computer and then open a new terminal to work on another task at the command line. You can start a session and give it a descriptive name:
This steps creates a session with the name ‘assembly’.
As you work, this session will stay active until you close it. Even if you log out or work on something else, the jobs you start in this session will run until completion.
Detach session (process keeps running in the background)
You can detach from a session by pressing control + a
followed by d
(for detach) on your keyboard. If you
reconnect to your machine, you will have to reconnect to your session to
see how it went.
Additional session commands
Seeing active sessions If you disconnect from your
session or from your ssh, you will need to reconnect to an existing
screen
session. You can see a list of existing
sessions:
Reconnecting to a session To reconnect to an existing session:
The -r
option = ‘resume a detached screen session’
Kill a session To end a session, type
exit
after reconnecting to the session:
Let’s see if our program is installed correctly:
OUTPUT
SPAdes genome assembler v3.15.0 [metaSPAdes mode]
Usage: spades.py [options] -o <output_dir>
Basic options:
-o <output_dir> directory to store all the resulting files (required)
--iontorrent this flag is required for IonTorrent data
--test runs SPAdes on a toy dataset
-h, --help prints this usage message
-v, --version prints version
Input data:
--12 <filename> file with interlaced forward and reverse paired-end reads
-1 <filename> file with forward paired-end reads
-2 <filename> file with reverse paired-end reads
MetaSPAdes is a metagenomics assembler
The help we just saw tells us how to run metaspades.py
.
We are going to use the most straightforward options, just specifying
our forward paired-end reads with -1
and reverse paired-end
reads with -2
, and the output directory where we want our
results to be stored.
BASH
$ cd ~/dc_workshop/data/trimmed_fastq
$ metaspades.py -1 JC1A_R1.trim.fastq.gz -2 JC1A_R2.trim.fastq.gz -o ../../results/assembly_JC1A
Now that it is running we should detach our screen with
control + a
d
and wait for a few minutes while
it running. And then attach the screen with
screen -r assembly
to see how it went.
When the run is finished, it shows this message:
BASH
======= SPAdes pipeline finished.
SPAdes log can be found here: /home/dcuser/dc_workshop/results/assembly_JC1A/spades.log
Thank you for using SPAdes!
Now we can kill the screen with exit
and look at our
results in the main screen.
Now, let’s go to the output files:
OUTPUT
assembly_graph_after_simplification.gfa corrected/ K55/ scaffolds.fasta
assembly_graph.fastg dataset.info misc/ scaffolds.paths
assembly_graph_with_scaffolds.gfa first_pe_contigs.fasta params.txt spades.log
before_rr.fasta input_dataset.yaml pipeline_state/ strain_graph.gfa
contigs.fasta K21/ run_spades.sh tmp/
contigs.paths K33/ run_spades.yaml
As we can see, MetaSPAdes gave us a lot of files. The ones with the
assembly are the contigs.fasta
and the
scaffolds.fasta
. Also, we found three K
folders: K21, K33, and K55; this contains the individual result
files for an assembly with k-mers equal to those numbers: 21, 33, and
55. The best-assembled results are the ones that are displayed outside
these k-folders. The folder corrected
hold the corrected
reads with the SPAdes algorithm. Moreover, the file
assembly_graph_with_scaffolds.gfa
have the information
needed to visualize our assembly by different means, like programs such
as Bandage.
The contigs are just made from assembled reads, but the scaffolds are the result from a subsequent process in which the contigs are ordered, oriented, and connected with Ns.
We can recognize which sample our assembly outputs corresponds to
because they are inside the assembly results folder:
assembly_JC1A/
. However, the files within it do not have
the sample ID. If we need the files out of their folder, it is
beneficial to rename them.
Exercise 1: Rename all files in a folder (needed in the next episode)
Add the prefix JC1A
(the sample ID) separated by a
_
to the beginning of the names of all the contents in the
assembly_JC1A/
directory. Remember that many solutions are
possible.
-
$ mv * JC1A_
-
$ mv * JC1A_*
-
$ for name in *; do mv $name JC1A_; done
$ for name in *; do mv $name JC1A_$name; done
- No, this option is going to give you as error
mv: target 'JC1A_' is not a directory
This is becausemv
has two options:mv file_1 file_2
mv file_1, file_2, ..... file_n directory
When a list of files is passed tomv
, themv
expects the last parameters to be a directory.
Here,*
gives you a list of all the files in the directory. The last parameter isJC1A_
(whichmv
expects to be a directory). - No. Again, every file is sent to the same file.
- No, every file is sent to the same file JC1A_
- Yes, this is one of the possible solutions.
¿Do you have another solution?
Exercise 2: Compare two fasta files from the assembly output
You want to know how many contigs and scaffolds result from the
assembly. Use contigs.fasta
and
scaffolds.fasta
files and sort the commands to create
correct code lines.
Do they have the same number of lines? Why?
Hint: You can use the following commands:
grep
, |
(pipe), -l
,
">"
, wc
, filename.fasta
Quality of assemblies
You can use several metrics to know the quality of your assemblies. MetaQuast is a program that gives you these metrics for metagenome assemblies in an interactive report and text files and plots.
Key Points
- Assembly groups reads into contigs.
- De Bruijn Graphs use Kmers to assembly cleaned reads.
- Program screen allows you to keep open remote sessions.
- MetaSPAdes is a metagenomes assembler.
- Assemblers take FastQ files as input and produce a Fasta file as output.