Secondary metabolite biosynthetic gene cluster identification

Last updated on 2026-02-20 | Edit this page

Estimated time: 30 minutes

Overview

Questions

How can I annotate known BGC?
Which kind of analysis antiSMASH can perform?
Which file extension accepts antiSMASH?

Objectives

Understand antiSMASH applications.
Perform a Minimal antiSMASH run analysis.
Explore several Streptococcus genomes by identifying the BGCs presence and the types of secondary metabolites produced.

Introduction

Within microbial genomes, we can find some specific regions that take part in the biosynthesis of secondary metabolites, these sections are known as Biosynthetic Gene Clusters, which are relevant due to the possible applications that they may have, for example: antimicrobials, antitumors, cholesterol-lowering, immunosuppressants, antiprotozoal, antihelminth, antiviral and anti-aging activities.

What is a natural Product?

antiSMASH is a pipeline based on profile hidden Markov models that allow us to identify the gene clusters contained within the genome sequences that encode secondary metabolites of all known broad chemical classes.

antiSMASH input files

antiSMASH pipeline can work with three different file formats GenBank, FASTA and EMBL. Both GenBank and EMBL formats include genome annotations, while a FASTA file just comprises the nucleotides of each genomic contig.

Running antiSMASH

The command line usage of antiSMASH is detailed in the following repositories:

In summary, you will need to use your genome as the input. Then, antiSMASH will create an output folder for each of your genomes. Within this folder, you will find a single .gbk file for each of the detected Biosynthetic Gene Clusters (we will use these files for subsequent analyses) and a .html file, among others. By opening the .html file, you can explore the antiSMASH annotations.

You can run antiSMASH in two ways Minimal and Full-featured run, as follows:

Run type	Command
Minimal run	antismash genome.gbk
Full-featured run	antismash –cb-general –cb-knownclusters –cb-subclusters –asf –pfam2go –smcog-trees genome.gbk

Challenge

Exercise 1: antiSMASH scope

If you run an Streptococcus agalactiae annotated genome using antiSMASH, what results do you expect?

The substances that a cluster produces
A prediction of the metabolites that the clusters can produce
A prediction of genes that belong to a biosynthetic cluster

Show me the solution

False. antiSMASH is not an algorithm devoted to substance prediction.
False. Although antiSMASH does have some information about metabolites produced by similar clusters, this is not its main purpose.
True. antiSMASH compares domains and performs a prediction of the genes that belong to biosynthetic gene clusters.

Run your own antiSMASH analysis

First, activate the GenomeMining conda environment:

BASH

$ conda deactivate
$ conda activate /miniconda3/envs/GenomeMining_Global

Second, run the antiSMASH command shown earlier in this lesson on the data .gbk or .fasta files. The command can be executed either in one single file, all the files contained within a folder or in a specific list of files. Here we show how it can be performed in these three different cases:

Running antiSMASH in a single file

Choose the annotated file ´agalactiae_A909_prokka.gbk´

BASH

$ mkdir -p ~/pan_workshop/results/antismash  
$ cd ~/pan_workshop/results/antismash  
$ antismash --genefinding-tool=none ~/pan_workshop/results/annotated/Streptococcus_agalactiae_A909_prokka/Streptococcus_agalactiae_A909_prokka.gbk

To see the antiSMASH generated outcomes do the following:

BASH

$ tree -L 1 ~/pan_workshop/results/antismash/Streptococcus_agalactiae_A909_prokka

OUTPUT

.
├── CP000114.1.region001.gbk
├── CP000114.1.region002.gbk
├── css
├── images
├── index.html
├── js
├── regions.js
├── Streptococcus_agalactiae_A909.gbk
├── Streptococcus_agalactiae_A909.json
├── Streptococcus_agalactiae_A909.zip
└── svg

Running antiSMASH over a list of files

Now, imagine that you want to run antiSMASH over all Streptococcus agalactiae annotated genomes. Use a for cycle and * as a wildcard to run antiSMASH over all files that start with “S” in the annotated directory.

BASH

$ for gbk_file in ~/pan_workshop/results/annotated/*/S*.gbk
> do
>	antismash --genefinding-tool=none $gbk_file
> done

Visualizing antiSMASH results

In order to see the results after an antiSMASH run, we need to access to the index.html file. Where is this file located?

BASH

$ cd Streptococcus_agalactiae_A909_prokka
$ pwd

OUTPUT

~/pan_workshop/results/antismash/Streptococcus_agalactiae_A909_prokka

As outcomes you should get a folder comprised mainly by the following files:

.gbk files For each Biosynthetic Gene cluster region found.
.json file To know the input file name, the antiSMASH used version and the regions data (id,sequence_data).
index.html file To visualize the outcomes from the analysis.

In order to access these results, we can use scp protocol to download the directory in your local computer.

If using scp , on your local machine, open a GIT bash terminal in the destiny folder and execute the following command:

BASH

$ scp -r user@bioinformatica.matmor.unam.mx:~/pan_workshop/results/antismash/S*A909.prokka/* ~/Downloads/.

If using R-studio then in the left panel, choose the “more” option, and “export” your file to your local computer. Decompress the Streptococcus_agalactiae_A909.prokka.zip file.

Another way to download the data to your computer, you first compress the folder and download the compressed file from JupyterHub

BASH

$ cd ~/pan_workshop/results/antismash
$ zip -r Streptococcus_agalactiae_A909_prokka.zip Streptococcus_agalactiae_A909_prokka
$ ls

In the JupyterHub, navigate to the pan_workshop/results/antismash/ folder, select the file we just created, and press the download button at the top

Once in your local machine, in the directory, Streptococcus_agalactiae_A909.prokka, open the index.html file on your local web browser.

Understanding the output

The visualization of the results includes many different options. To understand all the possibilities, we suggest reading the following tutorial:

Briefly, in the “Overview” page ´.HTML´ you can find all the regions found within every analyzed record/contig (antiSMASH inputs), and summarized all this information in five main features:

Region: The region number.
Type: Type of the product detected by antiSMASH.
From,To: The region’s location (nucleotides).
Most similar known cluster: The closest compound from th MIBiG database.
Similarity: Percentage of genes within the closest known compound that have significant BLAST hit (The last two columns containing comparisons to the MIBiG database will only be shown if antiSMASH was run with the KnownClusterBlast option ´–cc-mibig´).

antiSMASH web services

antiSMASH can also be used in an online server in the antiSMASH website: You will be asked to give your email. Then, the results will be sent to you and you will be allowed to donwload a folder with the annotations.

Exercises and discussion

Challenge

Exercise 2: NCBI and antiSMASH webserver

Run antiSMASH web server over S. agalactiae A909. First, explore the NCBI assembly database to obtain the accession. Get the id of your results.

Show me the solution

Go to NCBI and search S. agalactiae A909.
Choose the assembly database and copy the GenBank sequence ID at the bottom of the site.
Go to antiSMASH
Choose the accession from NCBI and paste CP000114.1
Run antiSMASH and paste into the collaborative document your results id example bacteria-cbd13a47-8095-495f-957c-dcf58c261529

Challenge

Exercise 3: (Optional exercise ) Run antiSMASH over thermophilus*

Try Download and annotate S. thermophilus genomes strains LMD-9 (MiBiG), and strain CIRM-BIA 65 reference genome. With the following information, generate the script to run antiSMASH only on the S. thermophilus annotated genomes.

BASH

 done  
  antismash --genefinding-tool=none $__  
 for ___ in  ____________  
do

Show me the solution

Guide to solve this problem First download genomes with genome-download. Remember to activate the conda environment!

Download the annotated genomes, and then the genbank files. 0. ncbi-genome-download --formats genbank --genera "Streptococcus thermophilus" -S "LMD-9" -o thermophilus bacteria ncbi-genome-download --formats fasta --genera "Streptococcus thermophilus" -S "CIRM-BIA 65" -n bacteria Then, move the genomes to the directory ~/pan_workshop/results/annotated/S*.gbk

First, we need to start the for cycle: for mygenome in ~/pan_workshop/results/annotated/S*.gbk
note that we are using the name mygenome as the variable name in the for cycle.
Then you need to use the reserved word do to start the cycle.
Then you have to call antiSMASH over your variable mygenome. Remember the $ before your variable to indicate to bash that now you are using the value of the variable.
Finally, use the reserved word done to finish the cycle.

BASH

 for variable in ~/pan_workshop/results/annotated/t*.gbk  
do  
	antismash --genefinding-tool=none $mygenome  
done

Challenge

Exercise 4: The size of a region

Sort the structure of the next commands that attempt to know the size of a region:

SOURCE ORGANISM LOCUS
head grep

BASH

 $ _____  region.gbk  
 $ _____ __________ region.gbk

Apply your solution to get the size of the first region of S. agalactiae A909

Show me the solution

BASH

$ cd ~/pan_workshop/results/antismash  
$ head Streptococcus_agalactiae_A909.prokka/CP000114.1.region001.gbk
$ grep LOCUS Streptococcus_agalactiae_A909.prokka/CP000114.1.region001.gbk

Locus contains the information of the characteristics of the sequence

BASH

LOCUS   	CP000114.1         	42196 bp	DNA 	linear   UNK 13-JUN-2022

In this case the size of the region is 42196 bp

Challenge

Discussion 1: BGC classes in Streptococcus

What BGC classes does the genus Streptococcus have ?

Show me the solution

In antiSMASH website output of A909 strain we see clusters of two classes, T3PKS and arylpolyene. Nevertheless, this is only one Streptococcus in order to infer all BGC classes in the genus we need to run antiSMASH in more genomes.

Key Points

antiSMASH is a bioinformatic tool capable of identifying, annotating and analysing secondary metabolite BGC
antiSMASH can be used as a web-based tool or as stand-alone command-line tool
The file extensions accepted by antiSMASH are GenBank, FASTA and EMBL

Secondary metabolite biosynthetic gene cluster identification

Overview

Questions

Objectives

Introduction

antiSMASH input files

Running antiSMASH

Exercise 1: antiSMASH scope

Show me the solution

Run your own antiSMASH analysis

BASH

Running antiSMASH in a single file

BASH

BASH

OUTPUT

Running antiSMASH over a list of files

BASH

Visualizing antiSMASH results

BASH

OUTPUT

BASH

BASH

Understanding the output

antiSMASH web services

Exercises and discussion

Exercise 2: NCBI and antiSMASH webserver

Show me the solution

Exercise 3: (*Optional exercise ) Run antiSMASH over thermophilus

BASH

Show me the solution

BASH

Exercise 4: The size of a region

BASH

Show me the solution

BASH

BASH

Discussion 1: BGC classes in Streptococcus

Show me the solution

Exercise 3: (Optional exercise ) Run antiSMASH over thermophilus*