Introduction to Genome Mining


Figure 1

Complete genome mining pipeline. Starting from a single genome, this example identifies its BGCs and compares them with BGCs from related genomes.

Figure 2

Natural products can be produced by bacteria, fungi, plants and animals

Figure 3

Example of a BGC arrangement

Figure 4

NRPS animation of fakeomycin

Figure 5

MIBiG layout of the chloramphenicol gene cluster from _Streptomyces venezuelae_, comprising 17 genes

Secondary metabolite biosynthetic gene cluster identification


Genome Mining Databases


Figure 1

MIBiG website homepage highlighting the search tool

Figure 2

MIBiG website query page

Figure 3

MIBiG website displaying the results from the simple search Streptococcus

Figure 4

antiSMASH website homepage

Figure 5

antiSMASH website query page

Figure 6

antiSMASH website displaying the results from the simple search Streptococcus

Figure 7

A dataframe variable holding the content of the predicted Streptococcus BGCs

Figure 8

The content of the occurrences grouped by the species column
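The grouping shown in this figure can be approximated without any plotting or dataframe library. The following is a minimal sketch, assuming each record is a (species, BGC type) pair; the records and the function name `count_by_species` are hypothetical and only illustrate the idea of counting occurrences per species.

```python
from collections import Counter, defaultdict

# Hypothetical records (species, predicted BGC type); illustrative values only.
records = [
    ("S. agalactiae", "T3PKS"),
    ("S. agalactiae", "arylpolyene"),
    ("S. agalactiae", "T3PKS"),
    ("S. thermophilus", "lanthipeptide"),
    ("S. thermophilus", "T3PKS"),
]


def count_by_species(rows):
    """Count how many times each BGC type occurs within each species."""
    counts = defaultdict(Counter)
    for species, bgc_type in rows:
        counts[species][bgc_type] += 1
    # Convert to plain dictionaries so the result is easy to inspect.
    return {species: dict(per_type) for species, per_type in counts.items()}


occurrences = count_by_species(records)
```

The nested dictionary has one entry per species and can be turned into the rows of a heatmap, one column per BGC type.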

Figure 9

Visualization of the BGC content as a heatmap.

Figure 10

Visualization of the BGC content of S. agalactiae on a scatterplot

Figure 11

Visualization of the BGC content on a scatterplot

Figure 12

Filtered heatmap

BGC Similarity Networks


Figure 1

BiG-SCAPE output as displayed in the web page, showing the overview page.
On the left is the Run Information, indicating the date and time at which the analysis started and finished, as well as the parameters of the run.
Next is the Input Data, specifying the total number of genomes and the total number of BGCs, in this example 8 and 23, respectively.
There are two pie charts, one showing the BGCs per genome and the other the BGCs per class.
On the right is the Network Overview, which allows selection between the mix and the individual BGC classes.
For the mix, the overview displays the Number of Families, the Average number of BGCs per family, the Max number of BGCs in a family, and the Families with MIBiG Reference BGCs.

Figure 2

BiG-SCAPE output as displayed in the web page.
The overview page shows the clustered heatmap of the presence/absence of the GCFs in each genome.
Each class is organized as a column at the base of the table, and the genomes are organized as rows at the right side of the table.
Presence is depicted in red with 1 and absence in beige with 0.
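The presence/absence table behind such a heatmap can be sketched as follows. This is a minimal illustration, assuming the input is a mapping from genome name to the set of GCFs found in it; the genome and GCF names, and the function `presence_absence`, are hypothetical and not part of the BiG-SCAPE output format.

```python
def presence_absence(genome_to_gcfs):
    """Build a 0/1 matrix with genomes as rows and GCFs as columns."""
    # Column order: every GCF seen in any genome, sorted for reproducibility.
    gcfs = sorted({gcf for found in genome_to_gcfs.values() for gcf in found})
    matrix = {
        genome: [1 if gcf in found else 0 for gcf in gcfs]
        for genome, found in genome_to_gcfs.items()
    }
    return gcfs, matrix


# Hypothetical input; illustrative values only.
example = {
    "genome_A": {"GCF_1", "GCF_3"},
    "genome_B": {"GCF_2"},
    "genome_C": {"GCF_1", "GCF_2", "GCF_3"},
}
columns, table = presence_absence(example)
```

Each row of `table` holds 1 where the GCF is present in that genome and 0 where it is absent, matching the red and beige cells of the figure.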

Figure 3

BiG-SCAPE similarity network of the complete mix of BGCs obtained from the run.
A network is drawn for each GCF, and each dot represents a BGC.
In this example there are 23 BGCs in total, of which 6 are singletons, with 28 links and 11 families.
Clicking on any of the dots shows its GCF on the right side and allows you to explore it further.
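To make the relation between links, families, and singletons concrete, the sketch below groups BGCs into the connected components of a link graph. This is a deliberate simplification for illustration, not BiG-SCAPE's own clustering procedure; the BGC names, the links, and the function `connected_families` are hypothetical.

```python
from collections import defaultdict


def connected_families(bgcs, links):
    """Group BGCs into connected components of an undirected link graph."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    families = []
    for start in bgcs:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from `start`.
        seen.add(start)
        stack = [start]
        family = []
        while stack:
            node = stack.pop()
            family.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        families.append(sorted(family))
    return families


# Hypothetical toy network; illustrative values only.
bgcs = ["bgc1", "bgc2", "bgc3", "bgc4", "bgc5"]
links = [("bgc1", "bgc2"), ("bgc2", "bgc3")]
families = connected_families(bgcs, links)
singletons = [family for family in families if len(family) == 1]
```

In this toy network three BGCs are linked into one group and the two unlinked BGCs are singletons, which is the same kind of summary the figure reports for the real run.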

Figure 4

BiG-SCAPE output displaying a tree of phylogenetic distances among the BGCs that make up a GCF.
The example shows GCF 10, which has six members.
Each BGC is represented in the tree by an arrow diagram of the genes, and of the protein domains within the genes, of that cluster.

Homologous BGC Clusterization


Figure 1

Three biomolecules are depicted in three different colors: blue, red and purple. These molecules are present in BGCs of diverse bacterial lineages, which are in turn grouped into Gene Cluster Families (GCFs). GCF1 contains three domains related to the blue molecule and three from the red molecule. GCF2 contains four domains of each of the molecules, blue, red and purple. GCF3 and GCF4 each contain three domains associated with the purple biomolecule, but these differ between the two families. Every GCF produces a different metabolite, here represented as weapons or tools.

Figure 2

Example of a TSV table composed of four columns and two rows.
The first row contains the title of each column: # Dataset name, Path to folder, Path to taxonomy, Description.
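A table with this layout can be generated programmatically. The sketch below uses the column titles named in the caption and writes a header row plus one data row, tab-separated. The dataset name, paths, and description are hypothetical placeholders, and the function `build_datasets_tsv` is only an illustration.

```python
import csv
import io

# Column titles as listed in the caption.
COLUMNS = ["# Dataset name", "Path to folder", "Path to taxonomy", "Description"]


def build_datasets_tsv(rows):
    """Return the text of a tab-separated table: header row plus data rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical values; replace them with the real dataset name and paths.
example_row = [
    "dataset_1",
    "dataset_1/",
    "taxonomy/dataset_1_taxonomy.tsv",
    "Example dataset",
]
tsv_text = build_datasets_tsv([example_row])
```

Writing `tsv_text` to a file gives a two-row table like the one in the figure; building it in memory first makes it easy to check before saving.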

Figure 3

Example of the structure of the input-folder.
The content of the directory input-folder is displayed in a tree-like format, listing the files and directories inside it.

Figure 4

BiG-SLiCE web page output displaying the results obtained from the example run.
A left panel presents the information generated, organized in seven tabs: Summary, Datasets, Runs, Reports, Help, Feedback, and About.
The rest of the page is a Summary of all datasets and runs included in the output.
First come the Datasets provided as input, organized as a table with five columns: name, #genomes, total, with taxonomy, and description.
Next comes the information about the Runs, also organized as a table, with the following columns: name, started, finished, status, threshold, #BGCs, and #GCFs.

Figure 5

BiG-SLiCE web page output displaying the information obtained from Run-0001.
First, an Overview of the run is shown, indicating its Status, when it Started and Finished, as well as the Parameters, the HMM models, and the Input.
Next, the Statistics of the run are shown as two bar plots.
The left one plots the average distance to the GCF models, while the right one shows the number of BGCs in the GCFs.

Figure 6

BiG-SLiCE web page output displaying detailed information regarding the BGCs from GCF_7.
The Arrower shows a gene arrow visualization of the domains in each of the genes of the BGCs belonging to GCF_7.

Figure 7

BiG-FAM main page showing an introduction as well as a graphical representation of the database.
A left panel displays the available options: Home, GCFs, BGCs, Query, Statistics, Help, Feedback, and About.

Figure 8

BiG-FAM query page with the option for inserting an antiSMASH job and submitting it.
Below it, the page reports how much of the job is Pending, Processing, and Processed.

Figure 9

BiG-FAM result page showing first an overview of the job: the query that was submitted, its status, and the times at which it started and finished.
Next comes a table listing the BGCs from the database that are related to the query BGCs, with seven columns: query BGC, class, length (kb), completeness, best hit, distance, and details.

Finding Variation on Genomic Vicinities


Figure 1

CORASON's workflow for sorting BGCs phylogenetically. Given a query gene in a reference BGC and an annotated genomic database,
CORASON first searches for homologues of the query gene and filters out all genomic vicinities not related to the reference BGC.
Then CORASON infers a phylogenetic tree and calculates the frequency of occurrence of each gene family from the reference BGC.
If a new query gene is selected from the same reference BGC, CORASON visualizes a new phylogeny with families containing the same molecular modifications.
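The frequency of occurrence mentioned in this caption can be illustrated with a small calculation. This is a minimal sketch, assuming each retained genomic vicinity is summarized as the set of gene families it contains; it is not CORASON's implementation, and the family names other than cpsG, the vicinities, and the function `family_frequencies` are hypothetical.

```python
def family_frequencies(reference_families, vicinities):
    """Fraction of vicinities in which each reference gene family occurs."""
    total = len(vicinities)
    if total == 0:
        return {family: 0.0 for family in reference_families}
    return {
        family: sum(family in vicinity for vicinity in vicinities) / total
        for family in reference_families
    }


# Hypothetical input; illustrative values only.
reference = ["cpsG", "famB", "famC"]
vicinities = [
    {"cpsG", "famB", "famC"},
    {"cpsG", "famB"},
    {"cpsG"},
    {"cpsG"},
]
frequencies = family_frequencies(reference, vicinities)
```

The query gene family occurs in every retained vicinity by construction, so its frequency is 1.0, while accessory families score lower.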

Figure 2

CORASON phylogenetic SVG reconstruction using cpsG as the query gene and _S. agalactiae_ 1000006 as the query cluster.
At the bottom, the frequency of occurrence of each gene family from the reference BGC is displayed,
each with a different color.

Figure 3

CORASON phylogenetic SVG reconstruction using cpsG as the query gene and _S. agalactiae_ 1000006 as the query cluster.
At the bottom, the frequency of occurrence of each gene family from the reference BGC is displayed,
each with a different color.

Evolutionary Genome Mining


Figure 1

a) EvoMining expansion-and-recruitment pipeline. A group of gray stacked cylinders represents the genomes in a database (DB).
Homologues and expansions of the seed enzymes from the enzyme DB, represented as an orange arrow, are searched for by blastp in the genome DB.
The outcome is integrated as the expanded enzyme families (EFs) within the genome DB.
Bidirectional best hits (BBH) of seed enzymes, red arrows, are marked as conserved metabolism.
The EFs are amplified after being compared against a DB of natural product (NP) biosynthetic enzymes, represented by a blue cylinder, to find recruitments, defined as enzymes of the family that are part of a MIBiG BGC.
b) The genome DB, represented by the gray stacked cylinders, is searched as previously described.
Additionally, antiSMASH predictions, cyan arrows, can be added by the user.
antiSMASH enzyme predictions that are also marked in red are defined as transition enzymes, purple arrows.
c) EvoMining phylogenetic reconstruction and visualization. On the left side, a phylogenetic reconstruction of an EF is shown.
On the right side, the EvoMining tree displays the EvoMining predictions (green), which are the extra copies closer to enzymes recruited into BGCs (blue) than to conserved metabolic enzymes (red).
antiSMASH-predicted enzymes are represented in cyan, transition enzymes in black, and extra copies that are neither antiSMASH nor EvoMining predictions are left in gray.
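The coloring rule in panel c can be expressed as a small decision function. This is a rough sketch under a simplifying assumption: closeness is given as precomputed numeric distances from one extra copy to labeled enzymes, whereas EvoMining itself works on the phylogenetic tree. The enzyme names, distances, and the function `classify_extra_copy` are hypothetical.

```python
def nearest(distances, labels, wanted):
    """Smallest distance to any enzyme carrying the wanted label, or None."""
    candidates = [d for name, d in distances.items() if labels.get(name) == wanted]
    return min(candidates) if candidates else None


def classify_extra_copy(distances, labels):
    """Label an extra copy by comparing its nearest recruited and conserved enzymes."""
    to_recruited = nearest(distances, labels, "recruited")  # blue in the figure
    to_conserved = nearest(distances, labels, "conserved")  # red in the figure
    if to_recruited is None or to_conserved is None:
        return "undetermined"
    if to_recruited < to_conserved:
        return "evomining_hit"  # green in the figure
    return "other_extra_copy"  # gray in the figure


# Hypothetical labels and distances; illustrative values only.
labels = {"enzA": "conserved", "enzB": "recruited", "enzC": "conserved"}
close_to_recruited = {"enzA": 0.8, "enzB": 0.2, "enzC": 0.9}
close_to_conserved = {"enzA": 0.1, "enzB": 0.6, "enzC": 0.7}

first = classify_extra_copy(close_to_recruited, labels)
second = classify_extra_copy(close_to_conserved, labels)
```

A copy nearer to a recruited enzyme than to any conserved one is flagged as a hit; everything else stays an unclassified extra copy.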

Figure 2

EvoMining phylogenetic reconstruction providing evolutionary insights into the metabolic origin
and the fate of members of diverse EFs from the Streptococcus example.
Seed enzymes are labeled in orange. The most conserved copies or central metabolism copies are marked in red.
Enzyme copies recruited into specialized metabolism, contained in MIBiG, are labeled in blue.
Enzyme copies that are closer to blue enzyme recruitments than to red conserved enzymes are labeled in green
and represent EvoMining Hits. Extra copies with an unknown metabolic fate are shown in gray.

Figure 3

Select the path to download the file

Figure 4

Select the path to download the file

Figure 5

MicroReact visualization of the EvoMining run for the Streptococcus example.
On the left is a bar chart with the EFs on the X axis and the number of entries on the Y axis.
On the right is the EvoMining phylogenetic tree, using the same color code as the chart.
To the right of the tree, the legend indicates the colors by metabolism: central metabolism enzymes in red, expansion enzymes in gray, recruited enzymes contained in MIBiG in blue, secondary metabolism enzymes (EvoMining hits) in green, and seed enzymes in orange.
Below is the metadata from the run, organized in a table with five columns: Id, metabolism, genome, function, and copies.

GATOR-GC: Genomic Assessment Tool for Orthologous Regions and Gene Clusters


Figure 1

GATOR methods

Figure 2

GATOR methods

Figure 3

Heatmap resulting from the cpsG GATOR-GC analysis

Figure 4

Conservation plot for the first window

Figure 5

Neighborhood plot (window_1_neighborhood) for the first window

Metabolomics workshop


Figure 1

GNPS output can be visualized directly on the GNPS web page or with other visualization tools such as [Cytoscape](https://cytoscape.org/)

Figure 2

Create an account in GNPS

Figure 3

MZmine 3, an MS data analysis platform

Figure 4

Data collection from *Streptomyces* sp. H-KF8, and *Streptomyces* sp. Vc74B-19.

Figure 5

Data collection from *Streptomyces* sp. H-KF8, and *Streptomyces* sp. Vc74B-19.

Figure 6

Load batch file

Figure 7

Load batch file

Figure 8

Load batch file

Figure 9

Load batch file

Figure 10

Load batch file

Figure 11

Datasets

Figure 12

Chromatogram

Figure 13

Chromatogram

Figure 14

Chromatogram

Figure 15

Chromatogram

Figure 16

Chromatogram

Figure 17

Feature list

Figure 18

Feature list

Figure 19

Blank subtraction

Figure 20

Blank subtraction

Figure 21

Blank subtraction

Figure 22

Blank subtraction

Figure 23

Export files

Figure 24

Export files

Figure 25

Export files

Figure 26

Export files

Figure 27

FBMN

Figure 28

FBMN

Figure 29

FBMN

Figure 30

FBMN

Figure 31

FBMN

Figure 32

FBMN

Figure 33

FBMN

Figure 34

FBMN

Figure 35

FBMN

Figure 36

FBMN

Figure 37

FBMN

Figure 38

FBMN

Figure 39

FBMN

Figure 40

FBMN

Figure 41

FBMN

Figure 42

FBMN

Figure 43

FBMN

Figure 44

FBMN

Figure 45

FBMN

Figure 46

FBMN

Figure 47

FBMN

Figure 48

FBMN

Figure 49

FBMN

Figure 50

FBMN

Figure 51

FBMN

Figure 52

FBMN

Figure 53

FBMN

Figure 54

FBMN

Figure 55

FBMN

Figure 56

FBMN

Other Resources