Secondary metabolite biosynthetic gene cluster identification
Last updated on 2026-02-20 | Edit this page
Overview
Questions
- How can I annotate known BGC?
- Which kind of analysis antiSMASH can perform?
- Which file extension accepts antiSMASH?
Objectives
- Understand antiSMASH applications.
- Perform a Minimal antiSMASH run analysis.
- Explore several Streptococcus genomes by identifying the BGCs presence and the types of secondary metabolites produced.
Introduction
Within microbial genomes, we can find some specific regions that take part in the biosynthesis of secondary metabolites, these sections are known as Biosynthetic Gene Clusters, which are relevant due to the possible applications that they may have, for example: antimicrobials, antitumors, cholesterol-lowering, immunosuppressants, antiprotozoal, antihelminth, antiviral and anti-aging activities.
What is a natural Product?
antiSMASH is a pipeline based on profile hidden Markov models that allow us to identify the gene clusters contained within the genome sequences that encode secondary metabolites of all known broad chemical classes.
antiSMASH input files
antiSMASH pipeline can work with three different file formats
GenBank, FASTA and EMBL. Both
GenBank and EMBL formats include genome
annotations, while a FASTA file just comprises the
nucleotides of each genomic contig.
Running antiSMASH
The command line usage of antiSMASH is detailed in the following repositories:
In summary, you will need to use your genome as the input. Then,
antiSMASH will create an output folder for each of your genomes. Within
this folder, you will find a single .gbk file for each of
the detected Biosynthetic Gene Clusters (we will use these files for
subsequent analyses) and a .html file, among others. By
opening the .html file, you can explore the antiSMASH
annotations.
You can run antiSMASH in two ways Minimal and Full-featured run, as follows:
| Run type | Command |
|---|---|
| Minimal run | antismash genome.gbk |
| Full-featured run | antismash –cb-general –cb-knownclusters –cb-subclusters –asf –pfam2go –smcog-trees genome.gbk |
Exercise 1: antiSMASH scope
If you run an Streptococcus agalactiae annotated genome using antiSMASH, what results do you expect?
The substances that a cluster produces
A prediction of the metabolites that the clusters can produce
A prediction of genes that belong to a biosynthetic cluster
- False. antiSMASH is not an algorithm devoted to substance
prediction.
- False. Although antiSMASH does have some information about
metabolites produced by similar clusters, this is not its main
purpose.
- True. antiSMASH compares domains and performs a prediction of the genes that belong to biosynthetic gene clusters.
Run your own antiSMASH analysis
First, activate the GenomeMining conda environment:
Second, run the antiSMASH command shown earlier in this lesson on the
data .gbk or .fasta files. The command can be
executed either in one single file, all the files contained within a
folder or in a specific list of files. Here we show how it can be
performed in these three different cases:
Running antiSMASH in a single file
Choose the annotated file ´agalactiae_A909_prokka.gbk´
BASH
$ mkdir -p ~/pan_workshop/results/antismash
$ cd ~/pan_workshop/results/antismash
$ antismash --genefinding-tool=none ~/pan_workshop/results/annotated/Streptococcus_agalactiae_A909_prokka/Streptococcus_agalactiae_A909_prokka.gbk
To see the antiSMASH generated outcomes do the following:
OUTPUT
.
├── CP000114.1.region001.gbk
├── CP000114.1.region002.gbk
├── css
├── images
├── index.html
├── js
├── regions.js
├── Streptococcus_agalactiae_A909.gbk
├── Streptococcus_agalactiae_A909.json
├── Streptococcus_agalactiae_A909.zip
└── svg
Visualizing antiSMASH results
In order to see the results after an antiSMASH run, we need to access
to the index.html file. Where is this file located?
OUTPUT
~/pan_workshop/results/antismash/Streptococcus_agalactiae_A909_prokka
As outcomes you should get a folder comprised mainly by the following files:
-
.gbkfiles For each Biosynthetic Gene cluster region found. -
.jsonfile To know the input file name, the antiSMASH used version and the regions data (id,sequence_data). -
index.htmlfile To visualize the outcomes from the analysis.
In order to access these results, we can use scp protocol to download the directory in your local computer.
If using scp , on your local machine, open a GIT bash
terminal in the destiny folder and execute the
following command:
BASH
$ scp -r user@bioinformatica.matmor.unam.mx:~/pan_workshop/results/antismash/S*A909.prokka/* ~/Downloads/.
If using R-studio then in the left panel, choose the “more” option, and “export” your file to your local computer. Decompress the Streptococcus_agalactiae_A909.prokka.zip file.
Another way to download the data to your computer, you first compress the folder and download the compressed file from JupyterHub
BASH
$ cd ~/pan_workshop/results/antismash
$ zip -r Streptococcus_agalactiae_A909_prokka.zip Streptococcus_agalactiae_A909_prokka
$ ls
In the JupyterHub, navigate to the pan_workshop/results/antismash/ folder, select the file we just created, and press the download button at the top
Once in your local machine, in the directory,
Streptococcus_agalactiae_A909.prokka, open the index.html
file on your local web browser.
Understanding the output
The visualization of the results includes many different options. To understand all the possibilities, we suggest reading the following tutorial:
Briefly, in the “Overview” page ´.HTML´ you can find all the regions found within every analyzed record/contig (antiSMASH inputs), and summarized all this information in five main features:
- Region: The region number.
- Type: Type of the product detected by antiSMASH.
- From,To: The region’s location (nucleotides).
- Most similar known cluster: The closest compound from th MIBiG database.
- Similarity: Percentage of genes within the closest known compound that have significant BLAST hit (The last two columns containing comparisons to the MIBiG database will only be shown if antiSMASH was run with the KnownClusterBlast option ´–cc-mibig´).
antiSMASH web services
antiSMASH can also be used in an online server in the antiSMASH website: You will be asked to give your email. Then, the results will be sent to you and you will be allowed to donwload a folder with the annotations.
Exercises and discussion
Exercise 2: NCBI and antiSMASH webserver
Run antiSMASH web server over S. agalactiae A909. First, explore the NCBI assembly database to obtain the accession. Get the id of your results.
- Go to NCBI and search S. agalactiae A909.
- Choose the assembly database and copy the GenBank sequence ID at the bottom of the site.
- Go to antiSMASH
- Choose the accession from NCBI and paste
CP000114.1 - Run antiSMASH and paste into the collaborative document your results
id example
bacteria-cbd13a47-8095-495f-957c-dcf58c261529
Exercise 3: (*Optional exercise ) Run antiSMASH over thermophilus
Try Download and annotate S. thermophilus genomes strains LMD-9 (MiBiG), and strain CIRM-BIA 65 reference genome. With the following information, generate the script to run antiSMASH only on the S. thermophilus annotated genomes.
Guide to solve this problem First download genomes with genome-download. Remember to activate the conda environment!
Download the annotated genomes, and then the genbank files. 0.
ncbi-genome-download --formats genbank --genera "Streptococcus thermophilus" -S "LMD-9" -o thermophilus bacteria
ncbi-genome-download --formats fasta --genera "Streptococcus thermophilus" -S "CIRM-BIA 65" -n bacteria
Then, move the genomes to the directory
~/pan_workshop/results/annotated/S*.gbk
First, we need to start the for cycle:
for mygenome in ~/pan_workshop/results/annotated/S*.gbk
note that we are using the namemygenomeas the variable name in the for cycle.Then you need to use the reserved word
doto start the cycle.Then you have to call antiSMASH over your variable
mygenome. Remember the$before your variable to indicate to bash that now you are using the value of the variable.Finally, use the reserved word
doneto finish the cycle.
BASH
$ cd ~/pan_workshop/results/antismash
$ head Streptococcus_agalactiae_A909.prokka/CP000114.1.region001.gbk
$ grep LOCUS Streptococcus_agalactiae_A909.prokka/CP000114.1.region001.gbk
Locus contains the information of the characteristics of the sequence
In this case the size of the region is 42196 bp
Discussion 1: BGC classes in Streptococcus
What BGC classes does the genus Streptococcus have ?
In antiSMASH website output of A909 strain we see clusters of two classes, T3PKS and arylpolyene. Nevertheless, this is only one Streptococcus in order to infer all BGC classes in the genus we need to run antiSMASH in more genomes.
- antiSMASH is a bioinformatic tool capable of identifying, annotating and analysing secondary metabolite BGC
- antiSMASH can be used as a web-based tool or as stand-alone command-line tool
- The file extensions accepted by antiSMASH are GenBank, FASTA and EMBL