Secondary metabolite biosynthetic gene cluster identification

Last updated on 2026-02-20 | Edit this page

Estimated time: 30 minutes

Overview

Questions

  • How can I annotate known BGC?
  • Which kind of analysis antiSMASH can perform?
  • Which file extension accepts antiSMASH?

Objectives

  • Understand antiSMASH applications.
  • Perform a Minimal antiSMASH run analysis.
  • Explore several Streptococcus genomes by identifying the BGCs presence and the types of secondary metabolites produced.

Introduction


Within microbial genomes, we can find some specific regions that take part in the biosynthesis of secondary metabolites, these sections are known as Biosynthetic Gene Clusters, which are relevant due to the possible applications that they may have, for example: antimicrobials, antitumors, cholesterol-lowering, immunosuppressants, antiprotozoal, antihelminth, antiviral and anti-aging activities.

What is a natural Product?

antiSMASH is a pipeline based on profile hidden Markov models that allow us to identify the gene clusters contained within the genome sequences that encode secondary metabolites of all known broad chemical classes.

antiSMASH input files


antiSMASH pipeline can work with three different file formats GenBank, FASTA and EMBL. Both GenBank and EMBL formats include genome annotations, while a FASTA file just comprises the nucleotides of each genomic contig.

Running antiSMASH


The command line usage of antiSMASH is detailed in the following repositories:

In summary, you will need to use your genome as the input. Then, antiSMASH will create an output folder for each of your genomes. Within this folder, you will find a single .gbk file for each of the detected Biosynthetic Gene Clusters (we will use these files for subsequent analyses) and a .html file, among others. By opening the .html file, you can explore the antiSMASH annotations.

You can run antiSMASH in two ways Minimal and Full-featured run, as follows:

Run type Command
Minimal run antismash genome.gbk
Full-featured run antismash –cb-general –cb-knownclusters –cb-subclusters –asf –pfam2go –smcog-trees genome.gbk
Challenge

Exercise 1: antiSMASH scope

If you run an Streptococcus agalactiae annotated genome using antiSMASH, what results do you expect?

  1. The substances that a cluster produces

  2. A prediction of the metabolites that the clusters can produce

  3. A prediction of genes that belong to a biosynthetic cluster

  1. False. antiSMASH is not an algorithm devoted to substance prediction.
  2. False. Although antiSMASH does have some information about metabolites produced by similar clusters, this is not its main purpose.
  3. True. antiSMASH compares domains and performs a prediction of the genes that belong to biosynthetic gene clusters.

Run your own antiSMASH analysis


First, activate the GenomeMining conda environment:

BASH

$ conda deactivate
$ conda activate /miniconda3/envs/GenomeMining_Global

Second, run the antiSMASH command shown earlier in this lesson on the data .gbk or .fasta files. The command can be executed either in one single file, all the files contained within a folder or in a specific list of files. Here we show how it can be performed in these three different cases:

Running antiSMASH in a single file

Choose the annotated file ´agalactiae_A909_prokka.gbk´

BASH

$ mkdir -p ~/pan_workshop/results/antismash  
$ cd ~/pan_workshop/results/antismash  
$ antismash --genefinding-tool=none ~/pan_workshop/results/annotated/Streptococcus_agalactiae_A909_prokka/Streptococcus_agalactiae_A909_prokka.gbk    

To see the antiSMASH generated outcomes do the following:

BASH

$ tree -L 1 ~/pan_workshop/results/antismash/Streptococcus_agalactiae_A909_prokka

OUTPUT

.
├── CP000114.1.region001.gbk
├── CP000114.1.region002.gbk
├── css
├── images
├── index.html
├── js
├── regions.js
├── Streptococcus_agalactiae_A909.gbk
├── Streptococcus_agalactiae_A909.json
├── Streptococcus_agalactiae_A909.zip
└── svg

Running antiSMASH over a list of files

Now, imagine that you want to run antiSMASH over all Streptococcus agalactiae annotated genomes. Use a for cycle and * as a wildcard to run antiSMASH over all files that start with “S” in the annotated directory.

BASH

$ for gbk_file in ~/pan_workshop/results/annotated/*/S*.gbk
> do
>	antismash --genefinding-tool=none $gbk_file
> done

Visualizing antiSMASH results


In order to see the results after an antiSMASH run, we need to access to the index.html file. Where is this file located?

BASH

$ cd Streptococcus_agalactiae_A909_prokka
$ pwd

OUTPUT

~/pan_workshop/results/antismash/Streptococcus_agalactiae_A909_prokka

As outcomes you should get a folder comprised mainly by the following files:

  • .gbk files For each Biosynthetic Gene cluster region found.
  • .json file To know the input file name, the antiSMASH used version and the regions data (id,sequence_data).
  • index.html file To visualize the outcomes from the analysis.

In order to access these results, we can use scp protocol to download the directory in your local computer.

If using scp , on your local machine, open a GIT bash terminal in the destiny folder and execute the following command:

BASH

$ scp -r user@bioinformatica.matmor.unam.mx:~/pan_workshop/results/antismash/S*A909.prokka/* ~/Downloads/.

If using R-studio then in the left panel, choose the “more” option, and “export” your file to your local computer. Decompress the Streptococcus_agalactiae_A909.prokka.zip file.

Another way to download the data to your computer, you first compress the folder and download the compressed file from JupyterHub

BASH

$ cd ~/pan_workshop/results/antismash
$ zip -r Streptococcus_agalactiae_A909_prokka.zip Streptococcus_agalactiae_A909_prokka
$ ls

In the JupyterHub, navigate to the pan_workshop/results/antismash/ folder, select the file we just created, and press the download button at the top

Once in your local machine, in the directory, Streptococcus_agalactiae_A909.prokka, open the index.html file on your local web browser.

Understanding the output


The visualization of the results includes many different options. To understand all the possibilities, we suggest reading the following tutorial:

Briefly, in the “Overview” page ´.HTML´ you can find all the regions found within every analyzed record/contig (antiSMASH inputs), and summarized all this information in five main features:

  • Region: The region number.
  • Type: Type of the product detected by antiSMASH.
  • From,To: The region’s location (nucleotides).
  • Most similar known cluster: The closest compound from th MIBiG database.
  • Similarity: Percentage of genes within the closest known compound that have significant BLAST hit (The last two columns containing comparisons to the MIBiG database will only be shown if antiSMASH was run with the KnownClusterBlast option ´–cc-mibig´).

antiSMASH web services


antiSMASH can also be used in an online server in the antiSMASH website: You will be asked to give your email. Then, the results will be sent to you and you will be allowed to donwload a folder with the annotations.

Exercises and discussion

Challenge

Exercise 2: NCBI and antiSMASH webserver

Run antiSMASH web server over S. agalactiae A909. First, explore the NCBI assembly database to obtain the accession. Get the id of your results.

  1. Go to NCBI and search S. agalactiae A909.
  2. Choose the assembly database and copy the GenBank sequence ID at the bottom of the site.
  3. Go to antiSMASH
  4. Choose the accession from NCBI and paste CP000114.1
  5. Run antiSMASH and paste into the collaborative document your results id example bacteria-cbd13a47-8095-495f-957c-dcf58c261529
Challenge

Exercise 3: (*Optional exercise ) Run antiSMASH over thermophilus

Try Download and annotate S. thermophilus genomes strains LMD-9 (MiBiG), and strain CIRM-BIA 65 reference genome. With the following information, generate the script to run antiSMASH only on the S. thermophilus annotated genomes.

BASH

 done  
  antismash --genefinding-tool=none $__  
 for ___ in  ____________  
do  

Guide to solve this problem First download genomes with genome-download. Remember to activate the conda environment!

Download the annotated genomes, and then the genbank files. 0. ncbi-genome-download --formats genbank --genera "Streptococcus thermophilus" -S "LMD-9" -o thermophilus bacteria ncbi-genome-download --formats fasta --genera "Streptococcus thermophilus" -S "CIRM-BIA 65" -n bacteria Then, move the genomes to the directory ~/pan_workshop/results/annotated/S*.gbk

  1. First, we need to start the for cycle: for mygenome in ~/pan_workshop/results/annotated/S*.gbk
    note that we are using the name mygenome as the variable name in the for cycle.

  2. Then you need to use the reserved word do to start the cycle.

  3. Then you have to call antiSMASH over your variable mygenome. Remember the $ before your variable to indicate to bash that now you are using the value of the variable.

  4. Finally, use the reserved word done to finish the cycle.

BASH

 for variable in ~/pan_workshop/results/annotated/t*.gbk  
do  
	antismash --genefinding-tool=none $mygenome  
done   
Challenge

Exercise 4: The size of a region

Sort the structure of the next commands that attempt to know the size of a region:

SOURCE ORGANISM LOCUS
head grep

BASH

 $ _____  region.gbk  
 $ _____ __________ region.gbk  

Apply your solution to get the size of the first region of S. agalactiae A909

BASH

$ cd ~/pan_workshop/results/antismash  
$ head Streptococcus_agalactiae_A909.prokka/CP000114.1.region001.gbk
$ grep LOCUS Streptococcus_agalactiae_A909.prokka/CP000114.1.region001.gbk

Locus contains the information of the characteristics of the sequence

BASH

LOCUS   	CP000114.1         	42196 bp	DNA 	linear   UNK 13-JUN-2022

In this case the size of the region is 42196 bp

Challenge

Discussion 1: BGC classes in Streptococcus

What BGC classes does the genus Streptococcus have ?

In antiSMASH website output of A909 strain we see clusters of two classes, T3PKS and arylpolyene. Nevertheless, this is only one Streptococcus in order to infer all BGC classes in the genus we need to run antiSMASH in more genomes.

Key Points
  • antiSMASH is a bioinformatic tool capable of identifying, annotating and analysing secondary metabolite BGC
  • antiSMASH can be used as a web-based tool or as stand-alone command-line tool
  • The file extensions accepted by antiSMASH are GenBank, FASTA and EMBL