Homologous BGC Clusterization

Last updated on 2026-02-20 | Edit this page

Overview

Questions

How can I identify Gene Cluster Families?
How can I predict the production of similar metabolites
How can I clusterize BGCs into groups that produce similar metabolites?
How can I compare the metabolic capability of different bacterial lineages?

Objectives

Understand the purpose of using BiG-SLiCE and BiG-FAM
Learn the inputs required to run BiG-SLiCE and BiG-FAM
Run an example with our Streptococcus data on both softwares
Understand the obtained results

BiG-SLiCE and BiG-FAM a set of tools to compare the microbial metabolic diversity

Around the curriculum, we have been learning about the metabolic capability of bacteria lineages encoded in BGCs. Before we start this lesson, let’s remember that the products of these clusters are biomolecules that are (mostly) not essential for life. They take the role of “tools” that offer some ecological and physiological advantage to the bacterial lineage producing it.

The presence/absence of a set of BGCs in a genome, can be associated to the ecological niches and (micro)environments which the lineage faces.

The counterpart of this can be taken as how biosynthetically diverse a bacterial lineage can be due to its environment.

Three biomolecules are depicted in three different colors; blue, red and purple. These molecules are present in BGCs of diverse bacterial lineages and in turn grouped into Gene Cluster Families (GCFs). GF1 contains three domains related to the blue molecule and three from the red molecule. GF2 contains four domains of each of the molecules, blue, red and purple. GCF3 and GCF4 contain each three domains associated with the purple biomolecule but different from each other. Every GCF produces a different metabolite, here represented as weapons or tools.

To answer some of this questions, we need a way to compare the metabolic capabilities of a set of diverse bacterial lineages. One way to do this is to compare the state (identity) of homologous BGCs. The homologous clusters that share a similar domain architecture and in consequence will produce a similar metabolite, are grouped into a bigger group that we call Gene Cluster Families (GCFs).

There are some programs capable of clustering BGCs in GCFs, e.g.ClusterFinder and DeepBGC. BiG-SLICE is a software that has been designed to do this work in less time, even when dealing with large datasets (1.2 million BGCs). This characteristic makes it a good tool to compare the metabolic diversity of several bacterial lineages.

The input for BiG-SLiCE

Firstly, activate the conda environment, where BiG-SLiCE has been installed as well as all the softwares required for its usage:

BASH

$ conda activate GenomeMining_Global

You will have now a (GenomeMining) label at the beginning of the prompt line.

If we seek for help using the --help flag of bigslice, we can get a first glance at the input that we need:

BASH

$ bigslice --help | head -n20

OUTPUT

usage: bigslice [-i <folder_path>] [--resume] [--complete] [--threshold T]
            	[--threshold_pct <N>] [--query <folder_path>]
            	[--query_name <name>] [--run_id <id>] [--n_ranks N_RANKS]
            	[-t <N>] [--hmmscan_chunk_size <N>] [--subpfam_chunk_size <N>]
            	[--extraction_chunk_size EXTRACTION_CHUNK_SIZE] [--scratch]
            	[-h] [--program_db_folder PROGRAM_DB_FOLDER] [--version]
            	<output_folder_path>

                        	_________
 ___	_____|\____|\____|\ \____	\
|   \  |   	} 	} 	}  ___)_ _/__
| >  | _--/--|/----|/----|/__/  __||  __|
|   < | ||  __/ (  (| | \___/  /__/|  _|
| >  || || |_ | _)  ) |_ | |\  \__ | |__
|____/|_| \___/|___/|___||_| \____||____| [ Version 1.1.0 ]

Biosynthetic Gene clusters - Super Linear Clustering Engine
(https://github.com/medema-group/bigslice)

positional arguments:

BiG-SLiCE is asking for a folder as the input. It is important to highlight that the bones of this folder are the BGCs from a group of genomes that has been obtained by antiSMASH. In its GitHub page, we can search for an input folder template. The folder needs three main components:

datasets.tsv: This is a file that needs to have this exact name. This must be a tabular separated file where the information of each dataset (BGCs from a group of genomes) must be specified.

dataset_n: Each dataset will be composed by the BGCs of different genomes. We can give BiG-SLiCE n groups of BGCs. The separation of the BGCs in different datasets is due to the different characteristics that the bacterial lineages can have. For example, if we have a set of bacterial populations that came from maize roots and other obtained from tomato roots, we can put them into dataset_1 and dataset_2 respectively.
**taxonomy_n.tsv:**This is also a tabular separated file that will carry the taxonomic information of each of the genomes where the input BGCs were found and the path to reach these BGCs inside the input-folder:
1. Genome folder name (ends with ‘/’)
2. Kingdom / Domain name
3. Class name
4. Order name
5. Family name
6. Genus name
7. Species name
8. Organism / Strain name

Here, we have an example of the structure of the input-folder:

Creating the input-folder

We will create a bigslice folder inside the results directory.

BASH

$ cd ~/gm_workshop/results/genome_mining
$ mkdir bigslice
$ cd bigslice

Now that we know what input BiG-SLiCE requires, we will do our input-folder step by step. First, let’s remember how many genomes we have:

BASH

$ ls -F ../antismash

OUTPUT

agalactiae_18RS21/  agalactiae_A909/	agalactiae_COH1/
agalactiae_515/ 	agalactiae_CJB111/  agalactiae_H36B/
athermophilus_LMD-9/  athermophilus_LMG_18311/

We have six Streptococcus genomes from the original paper, and two
public genomes. We will allocate their respective BGC .gbks in two datasets, one for the genomes from Tettelin et al. paper, and the other for the public ones.

BASH

$ mkdir input-folder
$ mkdir input-folder/dataset_1
$ mkdir input-folder/dataset_2
$ mkdir input-folder/taxonomy
$ tree

OUTPUT

.
└── input-folder
	├── dataset_1
	├── dataset_2
	└── taxonomy

4 directories, 0 files

Now we will create the folders for the BGCs of the dataset_1 genomes. We will take advantage of the strain name of our Streptococcus agalactiae lineages to create their folders:

BASH

$ for i in 18RS21 COH1 515 H36B A909 CJB111;
  do mkdir input-folder/dataset_1/Streptococcus_agalactiae_$i;
  done
$ tree -F

OUTPUT

.
└── input-folder/
	├── dataset_1/
	│   ├── Streptococcus_agalactiae_18RS21/
	│   ├── Streptococcus_agalactiae_515/
	│   ├── Streptococcus_agalactiae_A909/
	│   ├── Streptococcus_agalactiae_CJB111/
	│   ├── Streptococcus_agalactiae_COH1/
	│   └── Streptococcus_agalactiae_H36B/
	├── dataset_2/
	└── taxonomy/

10 directories, 0 files

We will do the same for the public genomes:

BASH

$ for i in LMD-9 LMG_18311;
  do mkdir input-folder/dataset_2/Streptococcus_thermophilus_$i;
  done
$ tree -F

BASH

.
└── input-folder/
	├── dataset_1/
	│   ├── Streptococcus_agalactiae_18RS21/
	│   ├── Streptococcus_agalactiae_515/
	│   ├── Streptococcus_agalactiae_A909/
	│   ├── Streptococcus_agalactiae_CJB111/
	│   ├── Streptococcus_agalactiae_COH1/
	│   └── Streptococcus_agalactiae_H36B/
	├── dataset_2/
	│   ├── Streptococcus_thermophilus_LMD-9/
	│   └── Streptococcus_thermophilus_LMG_18311/
	└── taxonomy/

12 directories, 0 files

Next, we will copy the BGCs .gbk files from the dataset_1 genomes from each AntiSMASH output directory to its respective folder .

BASH

$ ls input-folder/dataset_1/ | while read line;
  do cp ../antismash/output/$line/*region*.gbk input-folder/dataset_1/$line;
  done
$ tree -F

OUTPUT

.
└── input-folder/
	├── dataset_1/
	│   ├── Streptococcus_agalactiae_18RS21/
	│   │   ├── AAJO01000016.1.region001.gbk
	│   │   ├── AAJO01000043.1.region001.gbk
	│   │   └── AAJO01000226.1.region001.gbk
	│   ├── Streptococcus_agalactiae_515/
	│   │   ├── AAJP01000027.1.region001.gbk
	│   │   └── AAJP01000037.1.region001.gbk
	│   ├── Streptococcus_agalactiae_A909/
	│   │   ├── CP000114.1.region001.gbk
	│   │   └── CP000114.1.region002.gbk
	│   ├── Streptococcus_agalactiae_CJB111/
	│   │   ├── AAJQ01000010.1.region001.gbk
	│   │   └── AAJQ01000025.1.region001.gbk
	│   ├── Streptococcus_agalactiae_COH1/
	│   │   ├── AAJR01000002.1.region001.gbk
	│   │   └── AAJR01000044.1.region001.gbk
	│   └── Streptococcus_agalactiae_H36B/
	│   	├── AAJS01000020.1.region001.gbk
	│   	└── AAJS01000117.1.region001.gbk
	├── dataset_2/
	│   ├── Streptococcus_thermophilus_LMD-9/
	│   └── Streptococcus_thermophilus_LMG_18311/
	└── taxonomy/

12 directories, 13 files

Again, we will do the same for the BGCs .gbks of the dataset_2:

BASH

$ ls input-folder/dataset_2/ | while read line;
  do cp ../antismash/output/$line/*region*.gbk input-folder/dataset_2/$line;
  done
$ tree -F

BASH

.
└── input-folder/
	├── dataset_1/
	│   ├── Streptococcus_agalactiae_18RS21/
	│   │   ├── AAJO01000016.1.region001.gbk
	│   │   ├── AAJO01000043.1.region001.gbk
	│   │   └── AAJO01000226.1.region001.gbk
	│   ├── Streptococcus_agalactiae_515/
	│   │   ├── AAJP01000027.1.region001.gbk
	│   │   └── AAJP01000037.1.region001.gbk
	│   ├── Streptococcus_agalactiae_A909/
	│   │   ├── CP000114.1.region001.gbk
	│   │   └── CP000114.1.region002.gbk
	│   ├── Streptococcus_agalactiae_CJB111/
	│   │   ├── AAJQ01000010.1.region001.gbk
	│   │   └── AAJQ01000025.1.region001.gbk
	│   ├── Streptococcus_agalactiae_COH1/
	│   │   ├── AAJR01000002.1.region001.gbk
	│   │   └── AAJR01000044.1.region001.gbk
	│   └── Streptococcus_agalactiae_H36B/
	│   	├── AAJS01000020.1.region001.gbk
	│   	└── AAJS01000117.1.region001.gbk
	├── dataset_2/
	│   ├── Streptococcus_thermophilus_LMD-9/
	│   │   ├── CP000419.1.region001.gbk
	│   │   ├── CP000419.1.region002.gbk
	│   │   ├── CP000419.1.region003.gbk
	│   │   ├── CP000419.1.region004.gbk
	│   │   └── CP000419.1.region005.gbk
	│   └── Streptococcus_thermophilus_LMG_18311/
	│   	├── NC_006448.1.region001.gbk
	│   	├── NC_006448.1.region002.gbk
	│   	├── NC_006448.1.region003.gbk
	│   	├── NC_006448.1.region004.gbk
	│   	├── NC_006448.1.region005.gbk
	│   	└── NC_006448.1.region006.gbk
	└── taxonomy/

12 directories, 24 files

Challenge

Exercise 1. Input-folder structure

As we have seen, the structure of the input folder for BiG-SLiCE is quite difficult to get. Imagine “Sekiro” wants to copy the directory structure to use it for future BiG-SLiCE inputs. Consider the following directory structure:

OUTPUT

  └── input_folder/               	
   	├── datasets.tsv      	
   	├── dataset_1/
   	|   ├── genome_1A/
   	|   |   ├── genome_1A.region001.gbk
   	|   |   └── genome_1A.region002.gbk
   	|   ├── genome_1B/
   	|   |   ├── genome_1B.region001.gbk
   	|   |   └── genome_1B.region002.gbk
   	├── dataset_2/
   	|   ├── genome_2A/
   	|   |   ├── genome_2A.region001.gbk
   	|   |   └── genome_2A.region002.gbk
   	|   ├── genome_2B/
   	|   |   ├── genome_2B.region001.gbk
   	|   |   └── genome_2B.region002.gbk
   	└── taxonomy/
       	├── taxonomy_dataset_1.tsv
       	└── taxonomy_dataset_2.tsv

Which would be the commands to copy the directory structure without the files (just the folders)?
a)

BASH

$ cp -r input_folder input_folder_copy
$ cd input_folder_copy
$ rm -r

BASH

$ mv -r input_folder input_folder_copy
$ rm input_folder_copy/dataset_1/genome_1A/* input_folder_copy/dataset_1/genome_1B/*
$ rm input_folder_copy/dataset_2/genome_2A/* input_folder_copy/dataset_2/genome_2B/*
$ rm input_folder_copy/taxonomy/*

BASH

$ cp -r input_folder input_folder_copy
$ rm input_folder_copy/dataset_1/genome_1A/* input_folder_copy/dataset_1/genome_1B/*
$ rm input_folder_copy/dataset_2/genome_2A/* input_folder_copy/dataset_2/genome_2B/*
$ rm input_folder_copy/taxonomy/*

BASH

$ cp -r input_folder input_folder_copy
$ rm input_folder_copy/dataset_1/genome_1A/* /genome_1B/*
$ rm input_folder_copy/dataset_2/genome_2A/* /genome_2B/*
$ rm input_folder_copy/taxonomy/*

Show me the solution

First we need to copy the whole directory to make an input_folder_copy. After that we want to erase the files inside the directories dataset_1/genome_1A, dataset_1/genome_1B, dataset_2/genome_2A, dataset2/genome_2B and taxonomy/
c)

BASH

$ cp -r input_folder input_folder_copy
$ rm input_folder_copy/dataset_1/genome_1A/* input_folder_copy/dataset_1/genome_1B/*
$ rm input_folder_copy/dataset_2/genome_2A/* input_folder_copy/dataset_2/genome_2B/*
$ rm input_folder_copy/taxonomy/*

datasets.tsv file - The silmaril of BiG-SLiCE

The next step is to create the main file datasets.tsv. First we put the first lines to this file. In each of these .tsv files, we can put lines that begin with #. This will not be read by the code, and it can be of help to know which information is located in those files. We will use the command echo with the -e flag that will enable us to put tab separations between the text:

BASH

$ cd input-folder
$ echo -e "#Dataset name""\t""Path to folder""\t""Path to taxonomy""\t""Description" > datasets.tsv
$ more datasets.tsv

OUTPUT

#Dataset name   Path to folder  Path to taxonomy      Description

We will use echo One more time to put the information of our two datasets. But this time, we will indicate bash that we want to insert this new line in the existing datasets.tsv with the >> option.

BASH

$ echo -e "dataset_1""\t""dataset_1/""\t""taxonomy/dataset_1_taxonomy.tsv""\t""Tetteling genomes" >> datasets.tsv
$ echo -e "dataset_2""\t""dataset_2/""\t""taxonomy/dataset_2_taxonomy.tsv""\t""Public genomes" >> datasets.tsv
$ more datasets.tsv

OUTPUT

#Dataset name   Path to folder  Path to taxonomy    	Description
dataset_1   	dataset_1/  	taxonomy/dataset_1_taxonomy.tsv Tetteling genomes
dataset_2   	dataset_2/  	taxonomy/dataset_2_taxonomy.tsv Public genomes

The main file of the input-folder is finished!

The taxonomy of the input-folder

We also need to specify the taxonomic assignation of each of the input genomes BGC’s. We will build one dataset_taxonomy.tsv file for each one of our datasets, i.e. two in total. Firstly, create the header of each one of these files using echo:

BASH

$ echo -e "#Genome folder""\t"Kingdom"\t"Phylum"\t"Class"\t"Order"\t"Family"\t"Genus"\t""Species Organism" > taxonomy/dataset_1_taxonomy.tsv
$ echo -e "#Genome folder""\t"Kingdom"\t"Phylum"\t"Class"\t"Order"\t"Family"\t"Genus"\t""Species Organism" > taxonomy/dataset_2_taxonomy.tsv

$ more taxonomy/dataset_1_taxonomy.tsv

OUTPUT

#Genome folder  Kingdom Phylum  Class   Order   Family  Genus   Species Organism

There are several ways to obtain the taxonomic data. One of them is searching for each lineage taxonomic identification on NCBI. There is also a way to get this information from our .gbk files:

BASH

$ head -n20 ../../annotated/Streptococcus_agalactiae_18RS21.gbk

OUTPUT

LOCUS   	AAJO01000169.1      	2501 bp	DNA 	linear   UNK
DEFINITION  Streptococcus agalactiae 18RS21
ACCESSION   AAJO01000169.1
KEYWORDS	.
SOURCE  	Streptococcus agalactiae 18RS21.
  ORGANISM  Streptococcus agalactiae 18RS21
        	Bacteria; Terrabacteria group; Firmicutes; Bacilli;
        	Lactobacillales; Streptococcaceae; Streptococcus; Streptococcus
        	agalactiae.
FEATURES         	Location/Qualifiers
 	source      	1..2501
                 	/mol_type="genomic DNA"
                 	/db_xref="taxon:342613"
                 	/genome_md5=""
                 	/project="nselem35_342613"
                 	/genome_id="342613.37"
                 	/organism="Streptococcus agalactiae 18RS21"
 	CDS         	1..213
                 	/db_xref="SEED:fig|342613.37.peg.1973"
                 	/db_xref="GO:0003735"

We will use again echo to write this information into our dataset_taxonomy.tsv. This is going to be a little long because we have 8 genomes and we will need to write 8 echo lines, but bear with us. We are almost there!

BASH

$ echo -e "Streptococcus_agalactiae_18RS21/""\t"Bacteria"\t"Firmicutes"\t"Bacilli"\t"Lactobacillales"\t"Streptococcaceae"\t"Streptococcus"\t""Streptococcus agalactiae 18RS21" >> taxonomy/dataset_1_taxonomy.tsv

$ echo -e "Streptococcus_agalactiae_515/""\t"Bacteria"\t"Firmicutes"\t"Bacilli"\t"Lactobacillales"\t"Streptococcaceae"\t"Streptococcus"\t""Streptococcus agalactiae 515" >> taxonomy/dataset_1_taxonomy.tsv

$ echo -e "Streptococcus_agalactiae_A909/""\t"Bacteria"\t"Firmicutes"\t"Bacilli"\t"Lactobacillales"\t"Streptococcaceae"\t"Streptococcus"\t""Streptococcus agalactiae A909" >> taxonomy/dataset_1_taxonomy.tsv

$ echo -e "Streptococcus_agalactiae_CJB111/""\t"Bacteria"\t"Firmicutes"\t"Bacilli"\t"Lactobacillales"\t"Streptococcaceae"\t"Streptococcus"\t""Streptococcus agalactiae CJB111" >> taxonomy/dataset_1_taxonomy.tsv

$ echo -e "Streptococcus_agalactiae_COH1/""\t"Bacteria"\t"Firmicutes"\t"Bacilli"\t"Lactobacillales"\t"Streptococcaceae"\t"Streptococcus"\t""Streptococcus agalactiae COH1" >> taxonomy/dataset_1_taxonomy.tsv

$ echo -e "Streptococcus_agalactiae_H36B/""\t"Bacteria"\t"Firmicutes"\t"Bacilli"\t"Lactobacillales"\t"Streptococcaceae"\t"Streptococcus"\t""Streptococcus agalactiae H36B" >> taxonomy/dataset_1_taxonomy.tsv

$ more taxonomy/dataset_1_taxonomy.tsv

OUTPUT

#Genome folder   Kingdom Phylum  Class   Order   Family  Genus   Species Organism
Streptococcus_agalactiae_18RS21/    	Bacteria    	Firmicutes  	Bacilli Lactobacillales Strept
ococcaceae  	Streptococcus   Streptococcus agalactiae 18RS21
Streptococcus_agalactiae_515/   Bacteria    	Firmicutes  	Bacilli Lactobacillales Streptococcace
ae  	Streptococcus   Streptococcus agalactiae 515
Streptococcus_agalactiae_A909/  Bacteria    	Firmicutes  	Bacilli Lactobacillales Streptococcace
ae  	Streptococcus   Streptococcus agalactiae A909
Streptococcus_agalactiae_CJB111/    	Bacteria    	Firmicutes  	Bacilli Lactobacillales Strept
ococcaceae  	Streptococcus   Streptococcus agalactiae CJB111
Streptococcus_agalactiae_COH1/  Bacteria    	Firmicutes  	Bacilli Lactobacillales Streptococcace
ae  	Streptococcus   Streptococcus agalactiae COH1
Streptococcus_agalactiae_H36B/  Bacteria    	Firmicutes  	Bacilli Lactobacillales Streptococcace
ae  	Streptococcus   Streptococcus agalactiae H36B

We will do the same for the dataset_2_taxonomy.tsv file:

BASH

$ echo -e "Streptococcus_thermophilus_LMD-9/""\t"Bacteria"\t"Firmicutes"\t"Bacilli"\t"Lactobacillales"\t"Streptococcaceae"\t"Streptococcus"\t""Streptococcus thermophilus LMD-9" >> taxonomy/dataset_2_taxonomy.tsv

$ echo -e "Streptococcus_thermophilus_LMG_18311/""\t"Bacteria"\t"Firmicutes"\t"Bacilli"\t"Lactobacillales"\t"Streptococcaceae"\t"Streptococcus"\t""Streptococcus thermophilus LMG 18311" >> taxonomy/dataset_2_taxonomy.tsv

$ more taxonomy/dataset_2_taxonomy.tsv

OUTPUT

#Genome folder   Kingdom Phylum  Class   Order   Family  Genus   Species Organism
Streptococcus_thermophilus_LMD-9/   	Bacteria    	Firmicutes  	Bacilli Lactobacillales Strept
ococcaceae  	Streptococcus   Streptococcus thermophilus LMD-9
Streptococcus_thermophilus_LMG_18311/   Bacteria    	Firmicutes  	Bacilli Lactobacillales Strept
ococcaceae  	Streptococcus   Streptococcus thermophilus LMG 18311

Now, we have the organization of our input-folder complete

BASH

$ cd ..
$ tree -F input-folder/

OUTPUT

input-folder/
├── dataset_1/
│   ├── Streptococcus_agalactiae_18RS21/
│   │   ├── AAJO01000016.1.region001.gbk
│   │   ├── AAJO01000043.1.region001.gbk
│   │   └── AAJO01000226.1.region001.gbk
│   ├── Streptococcus_agalactiae_515/
│   │   ├── AAJP01000027.1.region001.gbk
│   │   └── AAJP01000037.1.region001.gbk
│   ├── Streptococcus_agalactiae_A909/
│   │   ├── CP000114.1.region001.gbk
│   │   └── CP000114.1.region002.gbk
│   ├── Streptococcus_agalactiae_CJB111/
│   │   ├── AAJQ01000010.1.region001.gbk
│   │   └── AAJQ01000025.1.region001.gbk
│   ├── Streptococcus_agalactiae_COH1/
│   │   ├── AAJR01000002.1.region001.gbk
│   │   └── AAJR01000044.1.region001.gbk
│   └── Streptococcus_agalactiae_H36B/
│   	├── AAJS01000020.1.region001.gbk
│   	└── AAJS01000117.1.region001.gbk
├── dataset_2/
│   ├── Streptococcus_thermophilus_LMD-9/
│   │   ├── CP000419.1.region001.gbk
│   │   ├── CP000419.1.region002.gbk
│   │   ├── CP000419.1.region003.gbk
│   │   ├── CP000419.1.region004.gbk
│   │   └── CP000419.1.region005.gbk
│   └── Streptococcus_thermophilus_LMG_18311/
│   	├── NC_006448.1.region001.gbk
│   	├── NC_006448.1.region002.gbk
│   	├── NC_006448.1.region003.gbk
│   	├── NC_006448.1.region004.gbk
│   	├── NC_006448.1.region005.gbk
│   	└── NC_006448.1.region006.gbk
├── datasets.tsv
└── taxonomy/
	├── dataset_1_taxonomy.tsv
	└── dataset_2_taxonomy.tsv

11 directories, 27 files

Challenge

Exercise 2. Taxonomic information

If you use the head -n20 on one of your .gbk files, you will obtain an output like this

OUTPUT

LOCUS   	AAJO01000169.1      	2501 bp	DNA 	linear   UNK
DEFINITION  Streptococcus agalactiae 18RS21
ACCESSION   AAJO01000169.1
KEYWORDS	.
SOURCE  	Streptococcus agalactiae 18RS21.
 ORGANISM  Streptococcus agalactiae 18RS21
       	Bacteria; Terrabacteria group; Firmicutes; Bacilli;
       	Lactobacillales; Streptococcaceae; Streptococcus; Streptococcus
       	agalactiae.
FEATURES         	Location/Qualifiers
	source      	1..2501
                	/mol_type="genomic DNA"
                	/db_xref="taxon:342613"
                	/genome_md5=""
                	/project="nselem35_342613"
                	/genome_id="342613.37"
                	/organism="Streptococcus agalactiae 18RS21"
	CDS         	1..213
                	/db_xref="SEED:fig|342613.37.peg.1973"
                	/db_xref="GO:0003735"

As you can see, in the gbk file we can find the taxonomic information that we need to fill the taxonomic information of each genome. Fill in the blank spaces to complete the command that would get the organism information from the .gbk file.

$ grep ___________ -m1 ____________ Streptococcus_agalactiae_18RS21.gbk

-B 3 -A 3 -a ‘Streptococcus’ ‘ORGANISM’ ‘gbk’

To obtain the next result:

OUTPUT

 ORGANISM  Streptococcus agalactiae 18RS21
       	Bacteria; Terrabacteria group; Firmicutes; Bacilli;
       	Lactobacillales; Streptococcaceae; Streptococcus; Streptococcus
       	agalactiae.

Key Points

You can use the grep --help command to get information about the available options for the grep command

Show me the solution

We need to use the grep command to search for the word “ORGANISM” in the file. Once located we just want to get the first instance of the word that is why we use the -m1 flag. Then we want the 3 next lines of the file after the instance, because that’s where the information is. So the final command would be:

BASH

$ grep 'ORGANISM' -m1 -A3 Streptococcus_agalactiae_18RS21.gbk

Running BiG-SLiCE

We are ready to run BiG-SLICE. After we type the next line of command, it will take close to 3 minutes long to end the process:

BASH

$ bigslice -i input-folder output-bigslice

OUTPUT

...
...
BiG-SLiCE run complete!

In order to see the results, we need to download the output-bigslice to our local computer. We will use the scp command to accomplish this.

Open another terminal and without connecting it to the remote computer (i.e. in your own computer), move to a directory where you know you can save the folder. In this example we will copy it to the Documents folder of our computer.

BASH

$ cd Documents/
$ scp -r virtual-computerbetterlab@###.###.###.##/home/virtual-User/dc_workshop/results/bigslice/output-bigslice .
$ ls -F

OUTPUT

output-bigslice/

The BiG-SLiCE prepared the code so that in each output folder we will obtain the output files and some scripts that will generate a way to visualize the data.

BASH

$ ls output-bigslice/

OUTPUT

app  LICENSE.txt  requirements.txt  result  start_server.sh

The file requirements.txt contains the names of two softwares that need to be installed to use the visualization: SQLite3 and flask. We will install these with the next command:

BASH

$ pip install -r output-bigslice/requirements.txt

OUTPUT

...
...
Successfully installed pysqlite3

After the installation, we can use the start_server.sh to generate a

BASH

$ bash output-bigslice/start_server.sh

OUTPUT

 * Serving Flask app "/mnt/d/documentos/sideprojects/sc-mineria/bigSlice/output-bigslice/app/run.py"
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)

The result looks a little intimidating, but if you obtained a result that looks like the above one do not worry. Next, we need to open an internet browser. In a new tab type the next line.

http://localhost:5000/

As a result, we will obtain a web-page that looks like this:

Here we have a summary of the obtained results. In the left part, we see a set of 7 fields. 3 of them have the information that we generated: Summary, Datasets, and Runs. The other ones are for information regarding the software, one of them is to give feedback to the developers. You can acces to the Datasets and Runs page also from the options that we see in the main part of this summary page. The first part of this summary describes the datasets that we provided as input. The second one is the information of the process (when we ran the program) that BiG-SLiCE carry out with our data.

If we click into the Runs field on the left or the run-0001 blue button on the Runs main section, we will jump into the page that gives us the information regarding the BGCs GCFs that were obtained:

In this new page we have two bar-plots that give us information regarding if the BGCs that we sumministrated were classified as fragmented or complete by AntiSMASH (according to the AntiSMASH algorithm, a BGC can be classified as fragmented if it is found at the edge of a contig), and how many families were obtained with singletons (composed of just one BGC) and of 2-3 BGCs.

If we go to the bottom of this page, we will found a table of all the GCFs with information of the number of BGCs that compose them, the representative BGC class inside each family, and which genus has more BGCs that belong to each family.

Click the + symbol of the GCF_7, and then on the View button that is displayed. This will take us to a page with statistics of the BGCs that are part of this family.

Inside this section, we can found a great amount of information regarding the GCF_7 BGC’s. If we jump to section called Members, we can click on the view BGC arrowers button to get a visualization of the domains that are part of each of the genes of these BGCs.

As we can see, this tool can be used to clusterize homologous BGCs that share a structure. With this knowledge, we can try some hypothesis or even build new ones in the light of this information.

Discussion

Considering the results in the runs tab, we can see that there are a lot of Gene Cluster Families that have only one cluster associated to it (GCF_1, GCF_2, GCF_4, GCF_5, GCF_6, GCF_8, GCF_9, GCF_13, and GCF_15).
What does this tell us about the metabolites diversity of the Streptococcus sample?

Show me the solution

We can see that there are some unique gene cluster families, which means that there are an organism that has that unique gene cluster and therefore it has some metabolite that the others don´t and that may be advantageous in its environment. Since there are many singleton Gene Cluster Families, we can say that there is a big diversity from the sample because many of the organisms in the sample have unique “abilities” (metabolites).

Discussion

Suppose we get the following results from a sample of 6 organisms.

|——————-+——————————————————–| | GCF | #core members | |:—————–:+:——————————————————:| | GCF_1 | 6 | |——————-+——————————————————–| | GCF_2 | 5 | |——————-+——————————————————–| | GCF_3 | 5 | |——————-+——————————————————–| | GCF_4 | 6 | |——————-+——————————————————–|

What can we infer about the metabolite diversity?

Show me the solution

As most of the organisms have a gene cluster in the gene cluster family, we can infer that most of them have the same secondary metabolites, which means that there is little metabolite diversity between them.

BiG-FAM

To demonstrate the functionality of BiG-SLICE, the authors gathered close to 1.2 million of BGCs. All this information, they put it on a great database that they called BiG-FAM. When you get to the main page, you will face a structure that looks like the one we saw on the BiG-SLICE results:

One of the options on the left is Query. If we click here, we will move to a new page where we can insert an antiSMASH job ID. This will run a comparison of all the BGCs found on that antiSMASH run on a single genome, against the immense database.

We will use the antiSMASH job ID from the analysis made on Streptococcus agalactiae A909. The antiSMASH job ID can be found in the URL direction of this page in the next format: taxon-aaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee.

We will use the job ID that we obtained:

bacteria-f931deb5-509b-4f28-a210-212456c5139b

And click the submit option.

This will generate a result page that indicates with which BGCs from the database, the BGCs from the query genome are related.

Key Points

BiG-SLiCE and BiG-FAM are softwares that are useful to compare the metabolic diversity of bacterial lineages between each other and against a big database
An input-folder containing the BGCs from antiSMASH and the taxonomic information of each genome is needed to run BiG-SLiCE
The results from the antiSMASH web-tool are needed to run BiG-FAM
Gene Cluster Families can help us to compare the metabolic capabilities of a set of bacterial lineages
We can use BiG-FAM to compare a BGC against the whole database and predict its Gene Cluster Family