Next-generation sequencing, Databases, Ontology and Tool Development

Bioinformatics & Systems Biology

The theoretical subgroup is involved in a wide range of projects exploiting next-generation sequencing (NGS) technologies, primarily on the Illumina platform but also increasingly on newer long-read platforms such as Pacific Biosciences.

Genome sequencing and application to breeding research

Genome sequencing projects range from de novo sequencing of the genomes and/or transcriptomes of species and the analysis of genetic variation in natural populations or mutant lines, to the analysis of differential gene expression (RNA-Seq) and epigenetic datasets. We are involved in several NGS-based projects led by the Usadel group, and additionally collaborate with other groups at RWTH and internationally. Our experience in NGS projects has led to the development of novel algorithms and widely adopted tools for the processing of NGS datasets, such as Trimmomatic and RobiNA.

We mostly focus on plant species producing high-value plant natural products and on the analysis of Solanaceous plants in general.

A sample of current projects:

  • Solanum pennellii (first genome version published)
  • Solanum lycopersicoides (ERA-CAPs funded Regulatome)

We are also developing new approaches for the functional annotation of genes, building on our experience within the MapMan project, and especially the Mercator functional annotation pipeline. In addition, we are evaluating new methods of transcript correlation as a source of protein information that is complementary to established sequence-based methods.

Building on genome sequencing, we develop applications that determine genotypes, e.g. for GWAS analysis, using standard tools whose output we feed into novel visualization tools.

Omics Data Visualization and Analysis

To interpret and visualize plant omics data, biological knowledge needs to be available in a structured form. The MapMan tool was developed for this purpose. It uses its own redundancy-reduced ontology, initially featuring a few hundred categories, to facilitate visualization and statistical analysis of omics data for the model plant Arabidopsis.

The software tool and the underlying ontology were subsequently developed further: the ontology was extended to support more functional processes and categories (~2500) and more than 20 crop species in addition to Arabidopsis. Recently the knowledge from the Aramemnon database was integrated. These ontologies and tools are widely used in the international community and are being linked into the Arabidopsis Information Portal, PLAZA (Ghent), the Biology Array Resource (BAR; Toronto) and others. A database with a web-based curator tool was established to track protein annotation information and related publications. Building on this database, a pipeline termed Mercator was created to automatically assign functional classes to plant proteins (Lohse et al., 2014).

Currently we are revisiting the MapMan structure and are compiling a new ALL-SEQUENCED-PLANT MapMan release. Initially this will focus on primary and secondary metabolism.

The visualisation and analysis tools were extended by integrating additional modules that allow quick, compact visualization of omics data sets on one page (PAGEMAN, Usadel et al., 2006) and comparison of omics data sets across species (Usadel et al., 2009). To provide up-front support in the initial processing of raw data, we programmed ROBIN and ROBINA, graphical user interface tools that streamline microarray and RNA-Seq data evaluation (Lohse et al., 2010; Lohse et al., 2013).

We are currently extending these tools to include reactions in the annotations as well, in order to gain direct insight into metabolic capabilities.

Our goal is to generate novel biological hypotheses from omics data with the help of mathematical and computer science tools. Ideally, such approaches, when applied to e.g. nutrient starvation samples, would aid in identifying novel candidate genes which could be tested using e.g. T-DNA knock-out lines. We had previously shown that correlative approaches can be used for this purpose (in collaboration with Dr. Steinhauser et al., 2004). Furthermore, we have used correlation, partial correlation and simple regression-based methods to investigate e.g. the response to multiple perturbations of central metabolism in Arabidopsis accessions (Sulpice et al., 2009) or in seed development (Vasilevski et al., 2012). Additionally, we used regression techniques to build a predictive model integrating inputs from the clock, light and C signalling during diurnal cycles (Usadel et al., 2008).
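To illustrate the difference between plain and partial correlation mentioned above, the following is a minimal sketch in Python, not the code used in the cited studies. It uses the textbook first-order partial correlation formula for three variables. The function names and the simulated data (a shared driver `z` acting on two otherwise unrelated variables `x` and `y`) are hypothetical and chosen only for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Hypothetical toy data: z drives both x and y, which are otherwise unrelated.
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(500)]
x = [zi + 0.3 * rng.gauss(0, 1) for zi in z]
y = [zi + 0.3 * rng.gauss(0, 1) for zi in z]

r_plain = pearson(x, y)            # high, because of the shared driver z
r_partial = partial_corr(x, y, z)  # close to zero once z is accounted for
print(f"plain: {r_plain:.2f}, partial: {r_partial:.2f}")
```

In a network context, an edge between `x` and `y` would be suggested by the plain correlation but not by the partial correlation, which is why partial correlation can help to separate direct from indirect associations.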

In extending these approaches, our initial approach was mainly network-centric and focused on Bayesian and (partial) correlation-based network reconstruction. However, whilst working with real-world transcriptomic data, we uncovered an artefact which introduced spurious correlations when using the popular RMA (Robust Multi Array) algorithm. This artefact resulted from an overly strong normalization between arrays in certain cases. As a prerequisite step for network building, we therefore carefully modified the RMA algorithm, maintaining many of its beneficial properties but removing most of the artificial correlation introduced by this normalization (Giorgi et al., 2010). After having also improved the first steps in array and NGS analysis (see above: ROBIN, ROBINA, Trimmomatic), we revisited network reconstruction approaches.

When benchmarking network reconstruction algorithms, we were faced with the problem that the algorithms which perform well on synthetic benchmarks did not seem to yield the most biologically meaningful networks (as judged by e.g. functional categories). We therefore modified our benchmarking strategy to focus on real plant omics data, resorting to cross-validation techniques to assess the robustness of the reconstructions. We combined cross-validation with inherent variable selection techniques (e.g. the lasso, Vasilevski et al., 2012). This allows us to restrict the connections between nodes. We are now using reverse genetics to assess associations between metabolites and transcripts predicted by the lasso.
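The combination of cross-validation and lasso variable selection described above can be sketched as follows. This is a minimal, self-contained illustration rather than the published implementation: it fits the lasso by plain coordinate descent without an intercept (so it assumes roughly centred data), and all function names, the penalty grid and the simulated "metabolite versus transcripts" data are hypothetical.

```python
import random


def soft_threshold(value, lam):
    """Shrink value towards zero by lam (the lasso proximal step)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_fit(X, y, lam, n_iter=50):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b for the starting point b = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def cv_error(X, y, lam, k=5):
    """Mean squared prediction error of the lasso under k-fold cross-validation."""
    n = len(X)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        beta = lasso_fit([X[i] for i in train], [y[i] for i in train], lam)
        for i in test:
            pred = sum(b * v for b, v in zip(beta, X[i]))
            total += (y[i] - pred) ** 2
    return total / n


# Hypothetical toy data: one metabolite depends on transcripts 0 and 3 only.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(60)]
y = [2.0 * row[0] - 1.5 * row[3] + 0.5 * rng.gauss(0, 1) for row in X]

lams = [0.01, 0.05, 0.1, 0.2, 0.5, 1.0]
best_lam = min(lams, key=lambda lam: cv_error(X, y, lam))
beta = lasso_fit(X, y, best_lam)
selected = [j for j, b in enumerate(beta) if abs(b) > 1e-8]
print("chosen penalty:", best_lam, "selected transcripts:", selected)
```

The transcripts with non-zero coefficients are the candidate connections for the metabolite node. Larger penalties yield sparser networks, and cross-validation provides a data-driven way to choose the penalty.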

Furthermore, we incorporated most of our approaches into a first prototype of a desktop software solution to enable computer-savvy biologists to perform these analyses on their own data. This should allow us to improve our system based on user feedback. As a side project, we were able to predict subcellular localization based on gene expression (Ryngajllo et al., 2011).

We aim to further improve these methods and to develop novel regression-based approaches that take prior knowledge into account. We will continue working on a combination of lasso and elastic net variable selection and partial correlation techniques to orient network edges and thus provide potentially causal links in the absence of time series data. We now apply this to gene finding, together with our genome and transcriptome sequencing expertise.

Here a new focus is on

Valuable plant products 

Mostly within the BioSC, and together with our colleagues from the Research Center Jülich IBG-1, the Heinrich Heine University and the Institute for Biotechnology at RWTH Aachen, we identify genes involved in the production of valuable plant metabolites.