As an evolutionary biologist and a bioinformatician, I'm interested in using bioinformatics approaches to study the evolution of genes, genomes, and organisms. Regarding genome structure and evolution (which forms a major part of our research), I'm particularly interested in the study of gene and genome duplications as well as in the evolution of novel gene functions after duplication. Gene and genome duplication events have been considered important mechanisms for increasing biological complexity or evolving novelty in biology. However, controversy still exists about how, and how fast, duplicated genes evolve new functions and about the importance of whole genome duplications. Although the amount of sequence data that can help us assess the significance of gene and genome duplication keeps growing, mapping and interpreting (large-scale) gene duplication events remains difficult. We believe that whole genome duplications are often an evolutionary dead end, except under certain circumstances, for instance in times of environmental upheaval or changing environmental conditions. For a more comprehensive overview of our research interests, please see our research section.

Yves Van de Peer (YVdP) obtained his PhD in 1996 at the University of Antwerp, Belgium. After a postdoctoral fellowship with Axel Meyer at the University of Konstanz, Germany, he was hired at Ghent University (BE) as Group Leader of VIB (Department of Plant Systems Biology) in 2000 and as an Associate Professor at Ghent University in 2001, and promoted to Full Professor in 2008. YVdP’s research group is considered a genome analysis powerhouse specialized in the study of the structure and evolution of (plant) genomes. Because of their unique expertise and experience in gene prediction, genome annotation, and genome analysis, his research group has been, and still is, involved in many international genome projects.

YVdP is particularly interested in the study of gene and genome duplications as well as in the evolution of novel gene functions after duplication. YVdP has published more than 450 papers, many of them in high-profile journals such as Nature, Nature Genetics, Nature Reviews Genetics, Science, PNAS, Genome Research, and The Plant Cell. YVdP has an H-index > 100 and his work has been cited more than 60,000 times. For many consecutive years, YVdP has been a Highly Cited Researcher. In 2013, YVdP received an ERC Advanced Grant entitled “DOUBLE-UP: The evolutionary significance of genome duplications for natural and artificial organism populations”, and in 2018 another one entitled “DOUBLE-TROUBLE: Replaying the ‘genome duplication’ tape of life: the adaptive potential of polyploidy in a stressful or changing environment”. YVdP is Organizer and Chair of the biennial international Current Opinion Conference on Plant Genome Evolution. This meeting was held in 2011, 2013, 2015, 2017, and 2019. In 2019, YVdP also organized the triennial International Conference on Polyploidy in Ghent, Belgium. YVdP is a member of the Royal Flemish Academy of Belgium for Science and the Arts (KVAB; since 2012) and serves on the Editorial Boards of five international journals (The Plant Journal, PeerJ, Genome Biology and Evolution, Current Plant Biology, Frontiers in Genetics). YVdP is also a part-time professor at the Department of Biochemistry, Genetics and Microbiology at the University of Pretoria, South Africa, and at the College of Horticulture at Nanjing Agricultural University, China.


  1. Rifkin, R. F., Vikram, S., Ramond, J.-B., Rey-Iglesia, A., Brand, T. B., Porraz, G., … Hansen, A. J. (2020). Multi-proxy analyses of a mid-15th century Middle Iron Age Bantu-speaker palaeo-faecal specimen elucidates the configuration of the “ancestral” sub-Saharan African intestinal microbiome. MICROBIOME, 8(1). https://doi.org/10.1186/s40168-020-00832-x
    Background The archaeological incidence of ancient human faecal material provides a rare opportunity to explore the taxonomic composition and metabolic capacity of the ancestral human intestinal microbiome (IM). Here, we report the results of the shotgun metagenomic analyses of an ancient South African palaeo-faecal specimen. Methods Following the recovery of a single desiccated palaeo-faecal specimen from Bushman Rock Shelter in Limpopo Province, South Africa, we applied a multi-proxy analytical protocol to the sample. The extraction of ancient DNA from the specimen and its subsequent shotgun metagenomic sequencing facilitated the taxonomic and metabolic characterisation of this ancient human IM. Results Our results indicate that the distal IM of the Neolithic 'Middle Iron Age' (c. AD 1460) Bantu-speaking individual exhibits features indicative of a largely mixed forager-agro-pastoralist diet. Subsequent comparison with the IMs of the Tyrolean Iceman (Ötzi) and contemporary Hadza hunter-gatherers, Malawian agro-pastoralists and Italians reveals that this IM precedes recent adaptation to 'Western' diets, including the consumption of coffee, tea, chocolate, citrus and soy, and the use of antibiotics, analgesics and also exposure to various toxic environmental pollutants. Conclusions Our analyses reveal some of the causes and means by which current human IMs are likely to have responded to recent dietary changes, prescription medications and environmental pollutants, providing rare insight into human IM evolution following the advent of the Neolithic c. 12,000 years ago.
  2. Roelofs, D., Zwaenepoel, A., Sistermans, T., Nap, J., Kampfraath, A. A., Van de Peer, Y., … Kraaijeveld, K. (2020). Multi-faceted analysis provides little evidence for recurrent whole-genome duplications during hexapod evolution. BMC BIOLOGY, 18. https://doi.org/10.1186/s12915-020-00789-1
    Background: Gene duplication events play an important role in the evolution and adaptation of organisms. Duplicated genes can arise through different mechanisms, including whole-genome duplications (WGDs). Recently, WGD was suggested to be an important driver of evolution, also in hexapod animals. Results: Here, we analyzed 20 high-quality hexapod genomes using whole-paranome distributions of estimated synonymous distances (KS), patterns of within-genome co-linearity, and phylogenomic gene tree-species tree reconciliation methods. We observe an abundance of gene duplicates in the majority of these hexapod genomes, yet we find little evidence for WGD. The majority of gene duplicates seem to have originated through small-scale gene duplication processes. We did detect segmental duplications in six genomes, but these lacked the within-genome co-linearity signature typically associated with WGD, and the age of these duplications did not coincide with particular peaks in KS distributions. Furthermore, statistical gene tree-species tree reconciliation failed to support all but one of the previously hypothesized WGDs. Conclusions: Our analyses therefore provide very limited evidence for WGD having played a significant role in the evolution of hexapods and suggest that alternative mechanisms drive gene duplication events in this group of animals. For instance, we propose that, along with small-scale gene duplication events, episodes of increased transposable element activity could have been an important source for gene duplicates in hexapods.
  3. Bezuidt, O. K. I., Lebre, P. H., Pierneef, R., León-Sobrino, C., Adriaenssens, E. M., Cowan, D. A., … Makhalanyane, T. P. (2020). Phages actively challenge niche communities in Antarctic soils. MSYSTEMS, 5(3). https://doi.org/10.1128/msystems.00234-20
    By modulating the structure, diversity, and trophic outputs of microbial communities, phages play crucial roles in many biomes. In oligotrophic polar deserts, the effects of katabatic winds, constrained nutrients, and low water availability are known to limit microbial activity. Although phages may substantially govern trophic interactions in cold deserts, relatively little is known regarding the precise ecological mechanisms. Here, we provide the first evidence of widespread antiphage innate immunity in Antarctic environments using metagenomic sequence data from hypolith communities as model systems. In particular, immunity systems such as DISARM and BREX are shown to be dominant systems in these communities. Additionally, we show a direct correlation between the CRISPR-Cas adaptive immunity and the metavirome of hypolith communities, suggesting the existence of dynamic host-phage interactions. In addition to providing the first exploration of immune systems in cold deserts, our results suggest that phages actively challenge niche communities in Antarctic polar deserts. We provide evidence suggesting that the regulatory role played by phages in this system is an important determinant of bacterial host interactions in this environment. IMPORTANCE In Antarctic environments, the combination of both abiotic and biotic stressors results in simple trophic levels dominated by microbiomes. Although the past two decades have revealed substantial insights regarding the diversity and structure of microbiomes, we lack mechanistic insights regarding community interactions and how phages may affect these. By providing the first evidence of widespread antiphage innate immunity, we shed light on phage-host dynamics in Antarctic niche communities. Our analyses reveal several antiphage defense systems, including DISARM and BREX, which appear to dominate in cold desert niche communities. 
    In contrast, our analyses revealed that genes which encode antiphage adaptive immunity were underrepresented in these communities, suggesting lower infection frequencies in cold edaphic environments. We propose that by actively challenging niche communities, phages play crucial roles in the diversification of Antarctic communities.
  4. Chen, Y.-C., Li, Z., Zhao, Y.-X., Gao, M., Wang, J.-Y., Liu, K.-W., … Wang, Y.-D. (2020). The Litsea genome and the evolution of the laurel family. NATURE COMMUNICATIONS, 11.
    The laurel family within the Magnoliids has attracted attention owing to its scents, variable inflorescences, and controversial phylogenetic position. Here, we present a chromosome-level assembly of the Litsea cubeba genome, together with low-coverage genomic and transcriptomic data for many other Lauraceae. Phylogenomic analyses show phylogenetic discordance at the position of Magnoliids, suggesting incomplete lineage sorting during the divergence of monocots, eudicots, and Magnoliids. An ancient whole-genome duplication (WGD) event occurred just before the divergence of Laurales and Magnoliales; subsequently, independent WGDs occurred almost simultaneously in the three Lauralean lineages. The phylogenetic relationships within Lauraceae correspond to the divergence of inflorescences, as evidenced by the phylogeny of FUWA, a conserved gene involved in determining panicle architecture in Lauraceae. Monoterpene synthases responsible for production of specific volatile compounds in Lauraceae are functionally verified. Our work sheds light on the evolution of the Lauraceae, the genetic basis for floral evolution and specific scents.
  5. Pu, X., Li, Z., Tian, Y., Gao, R., Hao, L., Hu, Y., … Song, J. (2020). The honeysuckle genome provides insight into the molecular mechanism of carotenoid metabolism underlying dynamic flower coloration. NEW PHYTOLOGIST, 227(3), 930–943. https://doi.org/10.1111/nph.16552
    Lonicera japonica is a wide-spread member of the Caprifoliaceae (honeysuckle) family utilized in traditional medical practices. This twining vine honeysuckle is also a much-sought ornamental, in part due to its dynamic flower coloration, which changes from white to gold during development. The molecular mechanism underlying dynamic flower coloration in L. japonica was elucidated by integrating whole genome sequencing, transcriptomic analysis, and biochemical assays. Here, we report a chromosome-level genome assembly of L. japonica, comprising nine pseudo-chromosomes with a total size of 843.2 Mb. We also provide evidence for a whole genome duplication event in the lineage leading to L. japonica, which occurred after its divergence from Dipsacales and Asterales. Moreover, gene expression analysis not only revealed correlated expression of the relevant biosynthetic genes with carotenoid accumulation, but also suggested a role for carotenoid degradation in L. japonica's dynamic flower coloration. The variation of flower color is consistent with not only the observed carotenoid accumulation pattern, but also with the release of volatile apocarotenoids that presumably serve as pollinator attractants. Beyond novel insights into the evolution and dynamics of flower coloration, the high-quality L. japonica genome sequence also provides a foundation for molecular breeding to improve desired characteristics.
  6. Yau, S., Krasovec, M., Benites, L. F., Rombauts, S., Groussin, M., Vancaester, E., … Piganeau, G. (2020). Virus-host coexistence in phytoplankton through the genomic lens. SCIENCE ADVANCES, 6(14).
    Virus-microbe interactions in the ocean are commonly described by "boom and bust" dynamics, whereby a numerically dominant microorganism is lysed and replaced by a virus-resistant one. Here, we isolated a microalga strain and its infective dsDNA virus whose dynamics are characterized instead by parallel growth of both the microalga and the virus. Experimental evolution of clonal lines revealed that this viral production originates from the lysis of a minority of virus-susceptible cells, which are regenerated from resistant cells. Whole-genome sequencing demonstrated that this resistant-susceptible switch involved a large deletion on one chromosome. Mathematical modeling explained how the switch maintains stable microalga-virus population dynamics consistent with their observed growth pattern. Comparative genomics confirmed an ancient origin of this "accordion" chromosome despite a lack of sequence conservation. Together, our results show how dynamic genomic rearrangements may account for a previously overlooked coexistence mechanism in microalgae-virus interactions.
  7. Welgemoed, T., Pierneef, R., Sterck, L., Van de Peer, Y., Swart, V., Scheepers, K. D., & Berger, D. K. (2020). De novo assembly of transcriptomes from a B73 maize line introgressed with a QTL for resistance to gray leaf spot disease reveals a candidate allele of a lectin receptor-like kinase. FRONTIERS IN PLANT SCIENCE, 11. https://doi.org/10.3389/fpls.2020.00191
    Gray leaf spot (GLS) disease in maize, caused by the fungus Cercospora zeina, is a threat to maize production globally. Understanding the molecular basis for quantitative resistance to GLS is therefore important for food security. We developed a de novo assembly pipeline to identify candidate maize resistance genes. Near-isogenic maize lines with and without a QTL for GLS resistance on chromosome 10 from inbred CML444 were produced in the inbred B73 background. The B73-QTL line showed a 20% reduction in GLS disease symptoms compared to B73 in the field (p = 0.01). B73-QTL leaf samples from this field experiment conducted under GLS disease pressure were RNA sequenced. The reads that did not map to the B73 or C. zeina genomes were expected to contain novel defense genes and were de novo assembled. A total of 141 protein-coding sequences with B73-like or plant annotations were identified from the B73-QTL plants exposed to C. zeina. To determine whether candidate gene expression was induced by C. zeina, the RNAseq reads from C. zeina-challenged and control leaves were mapped to a master assembly of all of the B73-QTL reads, and differential gene expression analysis was conducted. Combining results from both bioinformatics approaches led to the identification of a likely candidate gene, which was a novel allele of a lectin receptor-like kinase named L-RLK-CML that (i) was induced by C. zeina, (ii) was positioned in the QTL region, and (iii) had functional domains for pathogen perception and defense signal transduction. The 817AA L-RLK-CML protein had 53 amino acid differences from its 818AA counterpart in B73. A second "B73-like" allele of L-RLK was expressed at a low level in B73-QTL. Gene copy-specific RT-qPCR confirmed that the l-rlk-cml transcript was the major product induced four-fold by C. zeina. 
    Several other expressed defense-related candidates were identified, including a wall-associated kinase, two glutathione s-transferases, a chitinase, a glucan beta-glucosidase, a plasmodesmata callose-binding protein, several other receptor-like kinases, and components of calcium signaling, vesicular trafficking, and ethylene biosynthesis. This work presents a bioinformatics protocol for gene discovery from de novo assembled transcriptomes and identifies candidate quantitative resistance genes.
  8. Zwaenepoel, A., & Van de Peer, Y. (2020). Model-based detection of whole-genome duplications in a phylogeny. MOLECULAR BIOLOGY AND EVOLUTION.
    Ancient whole-genome duplications (WGDs) leave signatures in comparative genomic data sets that can be harnessed to detect these events of presumed evolutionary importance. Current statistical approaches for the detection of ancient WGDs in a phylogenetic context have two main drawbacks. The first is that unwarranted restrictive assumptions on the ‘background’ gene duplication and loss rates make inferences unreliable in the face of model violations. The second is that most methods can only be used to examine a limited set of a priori selected WGD hypotheses and cannot be used to discover WGDs in a phylogeny. In this study we develop an approach for WGD inference using gene count data that seeks to overcome both issues. We employ a phylogenetic birth-death model that includes WGD in a flexible hierarchical Bayesian approach, and use reversible-jump MCMC to perform Bayesian inference of branch-specific duplication, loss and WGD retention rates across the space of WGD configurations. We evaluate the proposed method using simulations, apply it to data sets from flowering plants and discuss the statistical intricacies of model-based WGD inference.
  9. Shi, T., Rahmani, R. S., Gugger, P. F., Wang, M., Li, H., Zhang, Y., … Chen, J. (2020). Distinct expression and methylation patterns for genes with different fates following a single whole-genome duplication in flowering plants. MOLECULAR BIOLOGY AND EVOLUTION.
    For most sequenced flowering plants, multiple whole-genome duplications (WGDs) are found. Duplicated genes following WGD often have different fates: they can quickly disappear again, be retained for long(er) periods, or subsequently undergo small-scale duplications. However, how different expression, epigenetic regulation, and functional constraints are associated with these different gene fates following a WGD still requires further investigation due to successive WGDs in angiosperms complicating the gene trajectories. In this study, we investigate lotus (Nelumbo nucifera), an angiosperm with a single WGD during the K–Pg boundary. Based on improved intraspecific-synteny identification by a chromosome-level assembly, transcriptome, and bisulfite sequencing, we explore not only the fundamental distinctions in genomic features, expression, and methylation patterns of genes with different fates after a WGD but also the factors that shape post-WGD expression divergence and expression bias between duplicates. We found that after a WGD genes that returned to single copies show the highest levels and breadth of expression, gene body methylation, and intron numbers, whereas the long-retained duplicates exhibit the highest degrees of protein–protein interactions and protein lengths and the lowest methylation in gene flanking regions. For those long-retained duplicate pairs, the degree of expression divergence correlates with their sequence divergence, degree in protein–protein interactions, and expression level, whereas their biases in expression level reflecting subgenome dominance are associated with the bias of subgenome fractionation. Overall, our study on the paleopolyploid nature of lotus highlights the impact of different functional constraints on gene fate and duplicate divergence following a single WGD in plants.
  10. Novikova, P., Brennan, I. G., Booker, W., Mahony, M., Doughty, P., Lemmon, A. R., … Donnellan, S. C. (2020). Polyploidy breaks speciation barriers in Australian burrowing frogs Neobatrachus. PLOS GENETICS, 16(5). https://doi.org/10.1371/journal.pgen.1008769
    Polyploidy has played an important role in evolution across the tree of life but it is still unclear how polyploid lineages may persist after their initial formation. While both common and well-studied in plants, polyploidy is rare in animals and generally less understood. The Australian burrowing frog genus Neobatrachus is comprised of six diploid and three polyploid species and offers a powerful animal polyploid model system. We generated exome-capture sequence data from 87 individuals representing all nine species of Neobatrachus to investigate species-level relationships, the origin and inheritance mode of polyploid species, and the population genomic effects of polyploidy on genus-wide demography. We describe rapid speciation of diploid Neobatrachus species and show that the three independently originated polyploid species have tetrasomic or mixed inheritance. We document higher genetic diversity in tetraploids, resulting from widespread gene flow between the tetraploids, asymmetric inter-ploidy gene flow directed from sympatric diploids to tetraploids, and isolation of diploid species from each other. We also constructed models of ecologically suitable areas for each species to investigate the impact of climate on differing ploidy levels. These models suggest substantial change in suitable areas compared to past climate, which correspond to population genomic estimates of demographic histories. We propose that Neobatrachus diploids may be suffering the early genomic impacts of climate-induced habitat loss, while tetraploids appear to be avoiding this fate, possibly due to widespread gene flow. Finally, we demonstrate that Neobatrachus is an attractive model to study the effects of ploidy on the evolution of adaptation in animals.
  11. Fox, D. T., Soltis, D. E., Soltis, P. S., Ashman, T.-L., & Van de Peer, Y. (2020). Polyploidy : a biological force from cells to ecosystems. TRENDS IN CELL BIOLOGY. https://doi.org/10.1016/j.tcb.2020.06.006
    Polyploidy, resulting from the duplication of the entire genome of an organism or cell, greatly affects genes and genomes, cells and tissues, organisms, and even entire ecosystems. Despite the wide-reaching importance of polyploidy, communication across disciplinary boundaries to identify common themes at different scales has been almost nonexistent. However, a critical need remains to understand commonalities that derive from shared polyploid cellular processes across organismal diversity, levels of biological organization, and fields of inquiry – from biodiversity and biocomplexity to medicine and agriculture. Here, we review the current understanding of polyploidy at the organismal and suborganismal levels, identify shared research themes and elements, and propose new directions to integrate research on polyploidy toward confronting interdisciplinary grand challenges of the 21st century.
  12. Tang, H., Zhang, L., Chen, F., Zhang, X., Chen, F., Ma, H., & Van de Peer, Y. (2020). Nymphaea colorata (Blue-petal water lily). TRENDS IN GENETICS. https://doi.org/10.1016/j.tig.2020.06.004
  13. Li, L., Wang, S., Wang, H., Sahu, S. K., Marin, B., Li, H., … Liu, H. (2020). The genome of Prasinoderma coloniale unveils the existence of a third phylum within green plants. NATURE ECOLOGY & EVOLUTION. https://doi.org/10.1038/s41559-020-1221-7
    Genome analysis of the pico-eukaryotic marine green alga Prasinoderma coloniale CCMP 1413 unveils the existence of a novel phylum within green plants (Viridiplantae), the Prasinodermophyta, which diverged before the split of Chlorophyta and Streptophyta. Structural features of the genome and gene family comparisons revealed an intermediate position of the P. coloniale genome (25.3 Mb) between the extremely compact, small genomes of picoplanktonic Mamiellophyceae (Chlorophyta) and the larger, more complex genomes of early-diverging streptophyte algae. Reconstruction of the minimal core genome of Viridiplantae allowed identification of an ancestral toolkit of transcription factors and flagellar proteins. Adaptations of P. coloniale to its deep-water, oligotrophic environment involved expansion of light-harvesting proteins, reduction of early light-induced proteins, evolution of a distinct type of C4 photosynthesis and carbon-concentrating mechanism, synthesis of the metal-complexing metabolite picolinic acid, and vitamin B1, B7 and B12 auxotrophy. The P. coloniale genome provides first insights into the dawn of green plant evolution.
  14. Zhang, L., Chen, F., Zhang, X., Li, Z., Zhao, Y., Lohaus, R., … Tang, H. (2020). The water lily genome and the early evolution of flowering plants. NATURE, 577(7788), 79–84.
    Water lilies belong to the angiosperm order Nymphaeales. Amborellales, Nymphaeales and Austrobaileyales together form the so-called ANA-grade of angiosperms, which are extant representatives of lineages that diverged the earliest from the lineage leading to the extant mesangiosperms. Here we report the 409-megabase genome sequence of the blue-petal water lily (Nymphaea colorata). Our phylogenomic analyses support Amborellales and Nymphaeales as successive sister lineages to all other extant angiosperms. The N. colorata genome and 19 other water lily transcriptomes reveal a Nymphaealean whole-genome duplication event, which is shared by Nymphaeaceae and possibly Cabombaceae. Among the genes retained from this whole-genome duplication are homologues of genes that regulate flowering transition and flower development. The broad expression of homologues of floral ABCE genes in N. colorata might support a similarly broadly active ancestral ABCE model of floral organ determination in early angiosperms. Water lilies have evolved attractive floral scents and colours, which are features shared with mesangiosperms, and we identified their putative biosynthetic genes in N. colorata. The chemical compounds and biosynthetic genes behind floral scents suggest that they have evolved in parallel to those in mesangiosperms. Because of its unique phylogenetic position, the N. colorata genome sheds light on the early evolution of angiosperms.
  15. Zhang, J., Fu, X.-X., Li, R.-Q., Zhao, X., Liu, Y., Li, M.-H., … Chen, Z.-D. (2020). The hornwort genome and early land plant evolution. NATURE PLANTS, 6(2), 107–118. https://doi.org/10.1038/s41477-019-0588-4
    Hornworts, liverworts and mosses are three early diverging clades of land plants, and together comprise the bryophytes. Here, we report the draft genome sequence of the hornwort Anthoceros angustus. Phylogenomic inferences confirm the monophyly of bryophytes, with hornworts sister to liverworts and mosses. The simple morphology of hornworts correlates with low genetic redundancy in plant body plan, while the basic transcriptional regulation toolkit for plant development has already been established in this early land plant lineage. Although the Anthoceros genome is small and characterized by minimal redundancy, expansions are observed in gene families related to RNA editing, UV protection and desiccation tolerance. The genome of A. angustus bears the signatures of horizontally transferred genes from bacteria and fungi, in particular of genes operating in stress-response and metabolic pathways. Our study provides insight into the unique features of hornworts and their molecular adaptations to live on land.
  16. Sahu, S. K., Liu, M., Yssel, A., Kariba, R., Muthemba, S., Jiang, S., … Liu, H. (2020). Draft genomes of two Artocarpus plants, jackfruit (A. heterophyllus) and breadfruit (A. altilis). GENES, 11(1).
    Two of the most economically important plants in the Artocarpus genus are jackfruit (A. heterophyllus Lam.) and breadfruit (A. altilis (Parkinson) Fosberg). Both species are long-lived trees that have been cultivated for thousands of years in their native regions. Today they are grown throughout tropical to subtropical areas as an important source of starch and other valuable nutrients. There are hundreds of breadfruit varieties that are native to Oceania, of which the most commonly distributed types are seedless triploids. Jackfruit is likely native to the Western Ghats of India and produces one of the largest tree-borne fruit structures (reaching up to 45 kg). To date, there is limited genomic information for these two economically important species. Here, we generated 273 Gb and 227 Gb of raw data from jackfruit and breadfruit, respectively. The high-quality reads from jackfruit were assembled into 162,440 scaffolds totaling 982 Mb with 35,858 genes. Similarly, the breadfruit reads were assembled into 180,971 scaffolds totaling 833 Mb with 34,010 genes. A total of 2822 and 2034 expanded gene families were found in jackfruit and breadfruit, respectively, enriched in pathways including starch and sucrose metabolism, photosynthesis, and others. The copy number of several starch synthesis-related genes was found to be increased in jackfruit and breadfruit compared to closely related species, and the tissue-specific expression might imply their sugar-rich and starch-rich characteristics. Overall, the publication of high-quality genomes for jackfruit and breadfruit provides information about their specific composition and the underlying genes involved in sugar and starch metabolism.
  17. Wang, S., Li, L., Li, H., Sahu, S. K., Wang, H., Xu, Y., … Liu, X. (2020). Genomes of early-diverging streptophyte algae shed light on plant terrestrialization. NATURE PLANTS, 6(2), 95–106.
    Mounting evidence suggests that terrestrialization of plants started in streptophyte green algae, favoured by their dual existence in freshwater and subaerial/terrestrial environments. Here, we present the genomes of Mesostigma viride and Chlorokybus atmophyticus, two sister taxa in the earliest-diverging clade of streptophyte algae dwelling in freshwater and subaerial/terrestrial environments, respectively. We provide evidence that the common ancestor of M. viride and C. atmophyticus (and thus of streptophytes) had already developed traits associated with a subaerial/terrestrial environment, such as embryophyte-type photorespiration, canonical plant phytochrome, several phytohormones and transcription factors involved in responses to environmental stresses, and evolution of cellulose synthase and cellulose synthase-like genes characteristic of embryophytes. Both genomes differed markedly in genome size and structure, and in gene family composition, revealing their dynamic nature, presumably in response to adaptations to their contrasting environments. The ancestor of M. viride possibly lost several genomic traits associated with a subaerial/terrestrial environment following transition to a freshwater habitat.
  18. Li, Z., & Van de Peer, Y. (2020). ’Winter is coming’ : how did polyploid plants survive? MOLECULAR PLANT, 13(1), 4–5.
  19. Wong, G. K.-S., Soltis, D. E., Leebens-Mack, J., Wickett, N. J., Barker, M. S., Van de Peer, Y., … Melkonian, M. (2020). Sequencing and analyzing the transcriptomes of a thousand species across the tree of life for green plants. ANNUAL REVIEW OF PLANT BIOLOGY, 71, 741–765. https://doi.org/10.1146/annurev-arplant-042916-041040
    The 1,000 Plants (1KP) initiative was the first large-scale effort to collect next-generation sequencing (NGS) data across a phylogenetically representative sampling of species for a major clade of life, in this case the Viridiplantae, or green plants. As an international multidisciplinary consortium, we focused on plant evolution and its practical implications. Among the major outcomes were the inference of a reference species tree for green plants by phylotranscriptomic analysis of low-copy genes, a survey of paleopolyploidy (whole-genome duplications) across the Viridiplantae, the inferred evolutionary histories for many gene families and biological processes, the discovery of novel light-sensitive proteins for optogenetic studies in mammalian neuroscience, and elucidation of the genetic network for a complex trait (C4 photosynthesis). Altogether, 1KP demonstrated how value can be extracted from a phylodiverse sequencing data set, providing a template for future projects that aim to generate even more data, including complete de novo genomes, across the tree of life.
  20. Verlinden, H., Sterck, L., Li, J., Li, Z., Yssel, A., Gansemans, Y., … Vanden Broeck, J. (2020). First draft genome assembly of the desert locust, Schistocerca gregaria. F1000Research, 9. https://doi.org/10.12688/f1000research.25148.1
    Background: At the time of publication, the most devastating desert locust crisis in decades is affecting East Africa, the Arabian Peninsula and South-West Asia. The situation is extremely alarming in East Africa, where Kenya, Ethiopia and Somalia face an unprecedented threat to food security and livelihoods. Most of the time, however, locusts do not occur in swarms, but live as relatively harmless solitary insects. The phenotypically distinct solitarious and gregarious locust phases differ markedly in many aspects of behaviour, physiology and morphology, making them an excellent model to study how environmental factors shape behaviour and development. A better understanding of the extreme phenotypic plasticity in desert locusts will offer new, more environmentally sustainable ways of fighting devastating swarms. Methods: High molecular weight DNA derived from two adult males was used for Mate Pair and Paired End Illumina sequencing and PacBio sequencing. A reliable reference genome of Schistocerca gregaria was assembled using the ABySS pipeline; scaffolding was improved using LINKS. Results: In total, 1,316 Gb Illumina reads and 112 Gb PacBio reads were produced and assembled. The resulting draft genome consists of 8,817,834,205 bp organised in 955,015 scaffolds with an N50 of 157,705 bp, making the desert locust genome the largest insect genome sequenced and assembled to date. In total, 18,815 protein-encoding genes are predicted in the desert locust genome, of which 13,646 (72.53%) obtained at least one functional assignment based on similarity to known proteins. Conclusions: The desert locust genome data will contribute greatly to studies of phenotypic plasticity, physiology, neurobiology, molecular ecology, evolutionary genetics and comparative genomics, and will promote the desert locust’s use as a model system. The data will also facilitate the development of novel, more sustainable strategies for preventing or combating swarms of these infamous insects.
  21. Navia, D., Novelli, V. M., Rombauts, S., Freitas-Astua, J., de Mendonca, R. S., Nunes, M. A., … Van de Peer, Y. (2019). Draft genome assembly of the false spider mite Brevipalpus yothersi. MICROBIOLOGY RESOURCE ANNOUNCEMENTS, 8(6).
    The false spider mite Brevipalpus yothersi infests a broad host plant range and has become one of the most economically important species within the genus Brevipalpus. This phytophagous mite inflicts damage by both feeding on plants and transmitting plant viruses. Here, we report the first draft genome sequence of the false spider mite, which is also the first plant virus mite vector to be sequenced. The ~72 Mb genome (sequenced at 42x coverage) encodes ~16,000 predicted protein-coding genes.
  22. Zwaenepoel, A., & Van de Peer, Y. (2019). Inference of ancient whole-genome duplications and the evolution of gene duplication and loss rates. MOLECULAR BIOLOGY AND EVOLUTION, 36(7), 1384–1404.
    Gene tree–species tree reconciliation methods have been employed for studying ancient whole genome duplication (WGD) events across the eukaryotic tree of life. Most approaches have relied on using maximum likelihood trees and the maximum parsimony reconciliation thereof to count duplication events on specific branches of interest in a reference species tree. Such approaches do not account for uncertainty in the gene tree and reconciliation, or do so only heuristically. The effects of these simplifications on the inference of ancient WGDs are unclear. In particular, the effects of variation in gene duplication and loss rates across the species tree have not been considered. Here, we developed a full probabilistic approach for phylogenomic reconciliation based WGD inference, accounting for both gene tree and reconciliation uncertainty using a method based on the principle of amalgamated likelihood estimation. The model and methods are implemented in a maximum likelihood and Bayesian setting and account for variation of duplication and loss rate across the species tree, using methods inspired by phylogenetic divergence time estimation. We applied our newly developed framework to ancient WGDs in land plants and investigated the effects of duplication and loss rate variation on reconciliation and gene count based assessment of these earlier proposed WGDs.
  23. Xu, C.-Q., Liu, H., Zhou, S.-S., Zhang, D.-X., Zhao, W., Wang, S., … Mao, J.-F. (2019). Genome sequence of Malania oleifera, a tree with great value for nervonic acid production. GIGASCIENCE, 8(2).
    BACKGROUND: Malania oleifera, a member of the Olacaceae family, is an IUCN Red Listed tree, endemic and restricted to the Karst region of South West China. This tree's seed is valued for its high content of precious fatty acids (especially nervonic acid). However, studies on its genetic make-up, and fatty acid biogenesis are severely hampered by a lack of molecular and genetic tools. FINDINGS: We generated 51 Gigabases (Gb) and 135 Gb of raw DNA sequences, using PacBio Single-Molecule Real-Time (SMRT) and 10x Genomics sequencing, respectively. A final genome assembly, with a scaffold N50 size of 4.65 Megabases (Mb) and a total length of 1.51 Gb, was obtained by primary assembly based on PacBio long reads plus scaffolding with 10x Genomics reads. Identified repeats constituted ∼82% of the genome, and 24,064 protein-coding genes were predicted with high support. The genome has low heterozygosity and shows no evidence for recent whole genome duplication. Metabolic pathway genes relating to the accumulation of long chain fatty acid were identified and studied in detail. CONCLUSIONS: Here, we provide the first genome assembly and gene annotation for M. oleifera. The availability of these resources will be of great importance for conservation biology, and for the functional genomics of nervonic acid biosynthesis.
  24. Zwaenepoel, A., Li, Z., Lohaus, R., & Van de Peer, Y. (2019). Finding evidence for whole genome duplications : a reappraisal. MOLECULAR PLANT, 12(2), 133–136.
  25. Zwaenepoel, A., & Van de Peer, Y. (2019). wgd-simple command line tools for the analysis of ancient whole genome duplications. BIOINFORMATICS, 35(12), 2153–2155.
    MOTIVATION: Ancient whole genome duplications (WGDs) have been uncovered in almost all major lineages of life on Earth and the search for traces or remnants of such events has become standard practice in most genome analyses. This is especially true for plants, where ancient WGDs are abundant. Common approaches to find evidence for ancient WGDs include the construction of KS distributions and the analysis of intragenomic co-linearity. Despite the increased interest in WGDs and the acknowledgement of their evolutionary importance, user-friendly and comprehensive tools for their analysis are lacking. Here, we present an easy to use command-line tool for KS distribution construction named wgd. The wgd suite provides commonly used KS and co-linearity analysis workflows together with tools for modeling and visualization, rendering these analyses accessible to genomics researchers in a convenient manner. AVAILABILITY & IMPLEMENTATION: wgd is free and open source software implemented in Python and is available at https://github.com/arzwa/wgd. SUPPLEMENTARY INFORMATION: Supplementary methods are available at Bioinformatics online.
  26. Melckenbeeck, I., Audenaert, P., Van Parys, T., Van de Peer, Y., Colle, D., & Pickavet, M. (2019). Optimising orbit counting of arbitrary order by equation selection. BMC BIOINFORMATICS, 20.
    Background: Graphlets are useful for bioinformatics network analysis. Based on the structure of Hočevar and Demšar’s ORCA algorithm, we have created an orbit counting algorithm, named Jesse. This algorithm, like ORCA, uses equations to count the orbits, but unlike ORCA it can count graphlets of any order. To do so, it generates the required internal structures and equations automatically. Many more redundant equations are generated, however, and Jesse’s running time is highly dependent on which of these equations are used. Therefore, this paper aims to investigate which equations are most efficient, and which factors have an effect on this efficiency. Results: With appropriate equation selection, Jesse’s running time may be reduced by a factor of up to 2 in the best case, compared to using randomly selected equations. Which equations are most efficient depends on the density of the graph, but barely on the graph type. At low graph density, equations with terms in their right-hand side with few arguments are more efficient, whereas at high density, equations with terms with many arguments in the right-hand side are most efficient. At a density between 0.6 and 0.7, both types of equations are about equally efficient. Conclusions: Our Jesse algorithm became up to a factor 2 more efficient, by automatically selecting the best equations based on graph density. It was adapted into a Cytoscape App that is freely available from the Cytoscape App Store to ease application by bioinformaticians.
  27. Zhang, Ticao, Qiao, Q., Novikova, P., Wang, Q., Yue, J., Guan, Y., Ming, S., et al. (2019). Genome of Crucihimalaya himalaica, a close relative of Arabidopsis, shows ecological adaptation to high altitude. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 116(14), 7137–7146.
    Crucihimalaya himalaica, a close relative of Arabidopsis and Capsella, grows on the Qinghai–Tibet Plateau (QTP) about 4,000 m above sea level and represents an attractive model system for studying speciation and ecological adaptation in extreme environments. We assembled a draft genome sequence of 234.72 Mb encoding 27,019 genes and investigated its origin and adaptive evolutionary mechanisms. Phylogenomic analyses based on 4,586 single-copy genes revealed that C. himalaica is most closely related to Capsella (estimated divergence 8.8 to 12.2 Mya), whereas both species form a sister clade to Arabidopsis thaliana and Arabidopsis lyrata, from which they diverged between 12.7 and 17.2 Mya. LTR retrotransposons in C. himalaica proliferated shortly after the dramatic uplift and climatic change of the Himalayas from the Late Pliocene to Pleistocene. Compared with closely related species, C. himalaica showed significant contraction and pseudogenization in gene families associated with disease resistance and also significant expansion in gene families associated with ubiquitin-mediated proteolysis and DNA repair. We identified hundreds of genes involved in DNA repair, ubiquitin-mediated proteolysis, and reproductive processes with signs of positive selection. Gene families showing dramatic changes in size and genes showing signs of positive selection are likely candidates for C. himalaica’s adaptation to intense radiation, low temperature, and pathogen-depauperate environments in the QTP. Loss of function at the S-locus, the reason for the transition to self-fertilization of C. himalaica, might have enabled its QTP occupation. Overall, the genome sequence of C. himalaica provides insights into the mechanisms of plant adaptation to extreme environments.
  28. Felix, Carina, Silva Meneses, R., Goncalves, M. F. M., Tilleman, L., Duarte, A. S., Jorrin-Novo, J. V., Van de Peer, Y., et al. (2019). A multi-omics analysis of the grapevine pathogen Lasiodiplodia theobromae reveals that temperature affects the expression of virulence- and pathogenicity-related genes. SCIENTIFIC REPORTS, 9.
    Lasiodiplodia theobromae (Botryosphaeriaceae, Ascomycota) is a plant pathogen and human opportunist whose pathogenicity is modulated by temperature. The molecular effects of temperature on L. theobromae are mostly unknown, so we used a multi-omics approach to understand how temperature affects the molecular mechanisms of pathogenicity. The genome of L. theobromae LA-SOL3 was sequenced (Illumina MiSeq) and annotated. Furthermore, the transcriptome (Illumina TruSeq) and proteome (Orbitrap LC-MS/MS) of LA-SOL3 grown at 25 °C and 37 °C were analysed. Proteins related to pathogenicity (plant cell wall degradation, toxin synthesis, mitogen-activated kinases pathway and proteins involved in the velvet complex) were more abundant when the fungus grew at 25 °C. At 37 °C, proteins related to pathogenicity were less abundant than at 25 °C, while proteins related to cell wall organisation were more abundant. On the other hand, virulence factors involved in human pathogenesis, such as the SSD1 virulence protein, were expressed only at 37 °C. Taken together, our results showed that this species presents a typical phytopathogenic molecular profile that is compatible with a hemibiotrophic lifestyle. We showed that L. theobromae is equipped with the pathogenesis toolbox that enables it to infect not only plants but also animals.
  29. Defoort, J., Van de Peer, Y., & Carretero-Paulet, L. (2019). The evolution of gene duplicates in angiosperms and the impact of protein-protein interactions and the mechanism of duplication. GENOME BIOLOGY AND EVOLUTION, 11(8), 2292–2305.
    Gene duplicates, generated through either whole genome duplication (WGD) or small-scale duplication (SSD), are prominent in angiosperms and are believed to play an important role in adaptation and in generating evolutionary novelty. Previous studies reported contrasting evolutionary and functional dynamics of duplicate genes depending on the mechanism of origin, a behavior that is hypothesized to stem from constraints to maintain the relative dosage balance between the genes concerned and their interaction context. However, the mechanisms ultimately influencing loss and retention of gene duplicates over evolutionary time are not yet fully elucidated. Here, by using a robust classification of gene duplicates in Arabidopsis thaliana, Solanum lycopersicum, and Zea mays, large RNAseq expression compendia and an extensive protein-protein interaction (PPI) network from Arabidopsis, we investigated the impact of PPIs on the differential evolutionary and functional fate of WGD and SSD duplicates. In all three species, retained WGD duplicates show stronger constraints to diverge at the sequence and expression level than SSD ones, a pattern that is also observed for shared PPI partners between Arabidopsis duplicates. PPIs are preferentially distributed among WGD duplicates and specific functional categories. Furthermore, duplicates with PPIs tend to be under stronger constraints to evolve than their counterparts without PPIs regardless of their mechanism of origin. Our results support dosage balance constraint as a specific property of genes involved in biological interactions, including physical PPIs, and suggest that additional factors may be differently influencing the evolution of genes following duplication, depending on the species, time, and mechanism of origin.
  30. Burgess, S. T., Marr, E. J., Bartley, K., Nunn, F. G., Down, R. E., Weaver, R. J., … Nisbet, A. J. (2019). A genomic analysis and transcriptomic atlas of gene expression in Psoroptes ovis reveals feeding- and stage-specific patterns of allergen expression. BMC GENOMICS, 20.
    Background: Psoroptic mange, caused by infestation with the ectoparasitic mite, Psoroptes ovis, is highly contagious, resulting in intense pruritus and represents a major welfare and economic concern for the livestock industry worldwide. Control relies on injectable endectocides and organophosphate dips, but concerns over residues, environmental contamination, and the development of resistance threaten the sustainability of this approach, highlighting interest in alternative control methods. However, development of vaccines and identification of chemotherapeutic targets is hampered by the lack of P. ovis transcriptomic and genomic resources. Results: Building on the recent publication of the P. ovis draft genome, here we present a genomic analysis and transcriptomic atlas of gene expression in P. ovis revealing feeding- and stage-specific patterns of gene expression, including novel multigene families and allergens. Network-based clustering revealed 14 gene clusters demonstrating either single- or multi-stage specific gene expression patterns, with 3075 female-specific, 890 male-specific and 112, 217 and 526 transcripts showing larval-, protonymph- and tritonymph-specific expression, respectively. Detailed analysis of P. ovis allergens revealed stage-specific patterns of allergen gene expression, many of which were also enriched in "fed" mites and tritonymphs, highlighting an important feeding-related allergenicity in this developmental stage. Pair-wise analysis of differential expression between life-cycle stages identified patterns of sex-biased gene expression and also identified novel P. ovis multigene families including known allergens and novel genes with high levels of stage-specific expression. Conclusions: The genomic and transcriptomic atlas described here represents a unique resource for the acarid-research community, whilst the OrcAE platform makes this freely available, facilitating further community-led curation of the draft P. ovis genome.
  31. Burgess, S. T., Marr, E. J., Bartley, K., Nunn, F. G., Down, R. E., Weaver, R. J., … Nisbet, A. J. (2019). A genomic analysis and transcriptomic atlas of gene expression in Psoroptes ovis reveals feeding- and stage-specific patterns of allergen expression. bioRxiv. Cold Spring Harbor Laboratory.
  32. Heydari, M., Miclotte, G., Van de Peer, Y., & Fostier, J. (2019). Illumina error correction near highly repetitive DNA regions improves de novo genome assembly. BMC BIOINFORMATICS, 20.
  33. Yao, Y., Carretero-Paulet, L., & Van de Peer, Y. (2019). Using digital organisms to study the evolutionary consequences of whole genome duplication and polyploidy. PLOS ONE, 14(7).
    The potential role of whole genome duplication (WGD) in evolution is controversial. Whereas some view WGD mainly as detrimental and an evolutionary 'dead end', there is growing evidence that the long-term establishment of polyploidy might be linked to environmental change, stressful conditions, or periods of extinction. However, despite much research, the mechanistic underpinnings of why and how polyploids might be able to outcompete non-polyploids at times of environmental upheaval remain indefinable. Here, we improved our recently developed bio-inspired framework, combining an artificial genome with an agent-based system, to form a population of so-called Digital Organisms (DOs), to examine the impact of WGD on evolution under different environmental scenarios mimicking extinction events of varying strength and frequency. We found that, under stable environments, DOs with non-duplicated genomes formed the majority, if not all, of the population, whereas the numbers of DOs with duplicated genomes increased under dramatically challenging environments. After tracking the evolutionary trajectories of individual genomes in terms of sequence and encoded gene regulatory networks (GRNs), we propose that duplicated GRNs might provide polyploids with better chances to acquire the drastic changes necessary to adapt to challenging conditions, thus endowing DOs with increased adaptive potential under extinction events. In contrast, under stable environments, random mutations might easily render the GRN less well adapted to such environments, a phenomenon that is exacerbated in duplicated, more complex GRNs. We believe that our results provide some additional insights into how genome duplication and polyploidy might help organisms to compete for novel niches and survive ecological turmoil, and confirm the usefulness of our computational simulation in studying the role of WGD in evolution and adaptation, helping to overcome some of the traditional limitations of evolution experiments with model organisms.
  34. Rodrigues, A. S., Chaves, I., Costa, B. V., Lin, Y.-C., Lopes, S., Milhinhos, A., Van de Peer, Y., et al. (2019). Small RNA profiling in Pinus pinaster reveals the transcriptome of developing seeds and highlights differences between zygotic and somatic embryos. SCIENTIFIC REPORTS, 9.
    Regulation of seed development by small non-coding RNAs (sRNAs) is an important mechanism controlling a crucial phase of the life cycle of seed plants. In this work, sRNAs from seed tissues (zygotic embryos and megagametophytes) and from somatic embryos of Pinus pinaster were analysed to identify putative regulators of seed/embryo development in conifers. In total, sixteen sRNA libraries covering several developmental stages were sequenced. We show that embryos and megagametophytes express a large population of 21-nt sRNAs and that substantial amounts of 24-nt sRNAs were also detected, especially in somatic embryos. A total of 215 conserved miRNAs, one third of which are conifer-specific, and 212 high-confidence novel miRNAs were annotated. MIR159, MIR171 and MIR394 families were found in embryos, but were greatly reduced in megagametophytes. Other families, like MIR397 and MIR408, predominated in somatic embryos and megagametophytes, suggesting their expression in somatic embryos is associated with in vitro conditions. Analysis of the predicted miRNA targets suggests that miRNA functions are relevant in several processes including transporter activity at the cotyledon-forming stage, and sulfur metabolism across several developmental stages. An important resource for studying conifer embryogenesis is made available here, which may also provide insightful clues for improving clonal propagation via somatic embryogenesis.
  35. Bossuyt, F., Schulte, L. M., Maex, M., Janssenswillen, S., Novikova, P., Biju, S., … Van Bocxlaer, I. (2019). Multiple independent recruitment of sodefrin precursor-like factors in anuran sexually dimorphic glands. MOLECULAR BIOLOGY AND EVOLUTION, 36(9), 1921–1930.
    Chemical signaling in animals often plays a central role in eliciting a variety of responses during reproductive interactions between males and females. One of the best-known vertebrate courtship pheromone systems is sodefrin precursor-like factors (SPFs), a family of two-domain three-finger proteins with a female-receptivity enhancing function, currently only known from salamanders. The oldest divergence between active components in a single salamander species dates back to the Late Paleozoic, indicating that these proteins potentially gained a pheromone function earlier in amphibian evolution. Here, we combined whole transcriptome sequencing, proteomics, histology, and molecular phylogenetics in a comparative approach to investigate SPF occurrence in male breeding glands across the evolutionary tree of anurans (frogs and toads). Our study shows that multiple families of both terrestrially and aquatically reproducing frogs have substantially increased expression levels of SPFs in male breeding glands. This suggests that multiple anuran lineages make use of SPFs to complement acoustic and visual sexual signaling during courtship. Comparative analyses show that anurans independently recruited these proteins each time the gland location on the male’s body allowed efficient transmission of the secretion to the female’s nares.
  36. Roodt, D., Li, Z., Van de Peer, Y., & Mizrachi, E. (2019). Loss of wood formation genes in monocot genomes. GENOME BIOLOGY AND EVOLUTION, 11(7), 1986–1996.
    Woodiness (secondary xylem derived from vascular cambium) has been gained and lost multiple times in the angiosperms, but has been lost ancestrally in all monocots. Here, we investigate the conservation of genes involved in xylogenesis in fully sequenced angiosperm genomes, hypothesising that monocots have lost some essential orthologs involved in this process. We analysed the conservation of genes preferentially expressed in the developing secondary xylem of two eudicot trees in the sequenced genomes of 26 eudicot and seven monocot species, and the early-diverging angiosperm Amborella trichopoda. We also reconstructed a regulatory model of early vascular cambial cell identity and differentiation and investigated the conservation of orthologs across the angiosperms. Additionally, we analysed the genome of the aquatic seagrass Zostera marina for additional losses of genes otherwise essential to, especially, secondary cell wall formation. Despite almost complete conservation of orthology within the early cambial differentiation gene network, we show a clear pattern of loss of genes preferentially expressed in secondary xylem in the monocots that are highly conserved across eudicot species. Our study provides candidate genes that may have led to the loss of vascular cambium in the monocots, and, by comparing terrestrial angiosperms to an aquatic monocot, highlights genes essential to vasculature on land.
  37. Meysman, P., Saeys, Y., Sabaghian, E., Bittremieux, W., Van de Peer, Y., Goethals, B., & Laukens, K. (2019). Mining the enriched subgraphs for specific vertices in a biological graph. IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 16(5), 1496–1507.
    In this paper, we present a subgroup discovery method to find subgraphs in a graph that are associated with a given set of vertices. The association between a subgraph pattern and a set of vertices is defined by its significant enrichment based on a Bonferroni-corrected hypergeometric probability value. This interestingness measure requires a dedicated pruning procedure to limit the number of subgraph matches that must be calculated. The presented mining algorithm to find associated subgraph patterns in large graphs is therefore designed to efficiently traverse the search space. We demonstrate the operation of this method by applying it on three biological graph data sets and show that we can find associated subgraphs for a biologically relevant set of vertices and that the found subgraphs themselves are biologically interesting.
  38. Yssel, A. E., Kao, S.-M., Van de Peer, Y., & Sterck, L. (2019). ORCAE-AOCC : a centralized portal for the annotation of African orphan crop genomes. GENES, 10(12).
    ORCAE (Online Resource for Community Annotation of Eukaryotes) is a public genome annotation curation resource. ORCAE-AOCC is a branch that is dedicated to the genomes published as part of the African Orphan Crops Consortium (AOCC). The motivation behind the development of the ORCAE platform was to create a knowledge-based website where the research community can make contributions to improve genome annotations. All changes to any given gene model or gene description are stored, and the entire annotation history can be retrieved. Genomes can be set to either “public” or “restricted” mode; anonymous users can browse public genomes but cannot make any changes. Aside from providing a user-friendly interface to view genome annotations, the platform also includes tools and information (such as gene expression evidence) that enable authorized users to edit and validate genome annotations. The ORCAE-AOCC platform will enable various stakeholders from around the world to coordinate their efforts to annotate and study underutilized crops.
  39. Gonçalves, M. F., Nunes, R. B., Tilleman, L., Van de Peer, Y., Deforce, D., Van Nieuwerburgh, F., … Alves, A. (2019). Dual RNA sequencing of Vitis vinifera during Lasiodiplodia theobromae infection unveils host-pathogen interactions. INTERNATIONAL JOURNAL OF MOLECULAR SCIENCES, 20(23).
    Lasiodiplodia theobromae is one of the most aggressive agents of the grapevine trunk disease Botryosphaeria dieback. Through a dual RNA-sequencing approach, this study aimed to give a broader perspective on the infection strategy deployed by L. theobromae, while understanding grapevine response. Approximately 0.05% and 90% of the reads were mapped to the genomes of L. theobromae and Vitis vinifera, respectively. Over 2500 genes were significantly differentially expressed in infected plants after 10 dpi, many of which are involved in the inducible defense mechanisms of grapevines. Gene expression analysis showed changes in the fungal metabolism of phenolic compounds, carbohydrate metabolism, transmembrane transport, and toxin synthesis. These functions are related to the pathogenicity mechanisms involved in plant cell wall degradation and fungal defense against antimicrobial substances produced by the host. Genes encoding for the degradation of plant phenylpropanoid precursors were up-regulated, suggesting that the fungus could evade the host defense response using the phenylpropanoid pathway. The up-regulation of many distinct components of the phenylpropanoid pathway in plants supports this hypothesis. Moreover, genes related to phytoalexin biosynthesis, hormone metabolism, cell wall modification enzymes, and pathogenesis-related proteins seem to be involved in the host responses observed. This study provides additional insights into the molecular mechanisms of L. theobromae and V. vinifera interactions.
  40. Linsmith, G., Rombauts, S., Montanari, S., Deng, C. H., Celton, J.-M., Guérif, P., … Bianco, L. (2019). Pseudo-chromosome-length genome assembly of a double haploid “Bartlett” pear (Pyrus communis L.). GIGASCIENCE, 8(12).
    BACKGROUND: We report an improved assembly and scaffolding of the European pear (Pyrus communis L.) genome (referred to as BartlettDHv2.0), obtained using a combination of Pacific Biosciences RSII long-read sequencing, Bionano optical mapping, chromatin interaction capture (Hi-C), and genetic mapping. The sample selected for sequencing is a double haploid derived from the same "Bartlett" reference pear that was previously sequenced. Sequencing of di-haploid plants makes assembly more tractable in highly heterozygous species such as P. communis. FINDINGS: A total of 496.9 Mb corresponding to 97% of the estimated genome size were assembled into 494 scaffolds. Hi-C data and a high-density genetic map allowed us to anchor and orient 87% of the sequence on the 17 pear chromosomes. Approximately 50% (247 Mb) of the genome consists of repetitive sequences. Gene annotation confirmed the presence of 37,445 protein-coding genes, which is 13% fewer than previously predicted. CONCLUSIONS: We showed that the use of a doubled-haploid plant is an effective solution to the problems presented by high levels of heterozygosity and duplication for the generation of high-quality genome assemblies. We present a high-quality chromosome-scale assembly of the European pear Pyrus communis and demonstrate its high degree of synteny with the genomes of Malus x domestica and Pyrus x bretschneideri.
  41. Tan, M. P., Wong, L. L., Razali, S. A., Afiqah-Aleng, N., Mohd Nor, S. A., Sung, Y. Y., … Danish-Daniel, M. (2019). Applications of next-generation sequencing technologies and computational tools in molecular evolution and aquatic animals conservation studies : a short review. EVOLUTIONARY BIOINFORMATICS, 15.
    Aquatic ecosystems that form major biodiversity hotspots are critically threatened due to environmental and anthropogenic stressors. We believe that, in this genomic era, computational methods can be applied to promote aquatic biodiversity conservation by addressing questions related to the evolutionary history of aquatic organisms at the molecular level. However, huge amounts of genomics data generated can only be discerned through the use of bioinformatics. Here, we examine the applications of next-generation sequencing technologies and bioinformatics tools to study the molecular evolution of aquatic animals and discuss the current challenges and future perspectives of using bioinformatics toward aquatic animal conservation efforts.
  42. Chang, Yue, Liu, H., Liu, M., Liao, X., Sahu, S. K., Fu, Y., Song, B., et al. (2019). The draft genomes of five agriculturally important African orphan crops. GIGASCIENCE, 8(3).
  43. Van de Peer, Y., & Pires, J. C. (2018). Editorial overview: Genome studies and molecular genetics : treasure troves of evolution. CURRENT OPINION IN PLANT BIOLOGY, 42, III–V.
  44. de Jonge, R., Ebert, M. K., Huitt-Roehl, C. R., Pal, P., Suttle, J. C., Spanner, R. E., … Bolton, M. D. (2018). Gene cluster conservation provides insight into cercosporin biosynthesis and extends production to the genus Colletotrichum. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 115(24), E5459–E5466.
    Species in the genus Cercospora cause economically devastating diseases in sugar beet, maize, rice, soybean, and other major food crops. Here, we sequenced the genome of the sugar beet pathogen Cercospora beticola and found it encodes 63 putative secondary metabolite gene clusters, including the cercosporin toxin biosynthesis (CTB) cluster. We show that the CTB gene cluster has experienced multiple duplications and horizontal transfers across a spectrum of plant pathogenic fungi, including the wide-host range Colletotrichum genus as well as the rice pathogen Magnaporthe oryzae. Although cercosporin biosynthesis has been thought to rely on an eight-gene CTB cluster, our phylogenomic analysis revealed gene collinearity adjacent to the established cluster in all CTB cluster-harboring species. We demonstrate that the CTB cluster is larger than previously recognized and includes cercosporin facilitator protein, previously shown to be involved with cercosporin autoresistance, and four additional genes required for cercosporin biosynthesis, including the final pathway enzymes that install the unusual cercosporin methylenedioxy bridge. Lastly, we demonstrate production of cercosporin by Colletotrichum fioriniae, the first known cercosporin producer within this agriculturally important genus. Thus, our results provide insight into the intricate evolution and biology of a toxin critical to agriculture and broaden the production of cercosporin to another fungal genus containing many plant pathogens of important crops worldwide.
  45. Lin, Y.-C., Wang, J., Delhomme, N., Schiffthaler, B., Sundstrom, G., Zuccolo, A., Nystedt, B., et al. (2018). Functional and evolutionary genomic inferences in Populus through genome and population sequencing of American and European aspen. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 115(46), E10970–E10978.
    The Populus genus is one of the major plant model systems, but genomic resources have thus far primarily been available for poplar species, and primarily Populus trichocarpa (Torr. & Gray), which was the first tree with a whole-genome assembly. To further advance evolutionary and functional genomic analyses in Populus, we produced genome assemblies and population genetics resources of two aspen species, Populus tremula L. and Populus tremuloides Michx. The two aspen species have distributions spanning the Northern Hemisphere, where they are keystone species supporting a wide variety of dependent communities and produce a diverse array of secondary metabolites. Our analyses show that the two aspens share a similar genome structure and a highly conserved gene content with P. trichocarpa but display substantially higher levels of heterozygosity. Based on population resequencing data, we observed widespread positive and negative selection acting on both coding and noncoding regions. Furthermore, patterns of genetic diversity and molecular evolution in aspen are influenced by a number of features, such as expression level, coexpression network connectivity, and regulatory variation. To maximize the community utility of these resources, we have integrated all presented data within the PopGenIE web resource (popGenIE.org).
  46. Burgess, S. T., Bartley, K., Nunn, F., Wright, H. W., Hughes, M., Gemmell, M., … Nisbet, A. J. (2018). Draft genome assembly of the poultry red mite, Dermanyssus gallinae. MICROBIOLOGY RESOURCE ANNOUNCEMENTS, 7(18). https://doi.org/10.1128/mra.01221-18
    The poultry red mite, Dermanyssus gallinae, is a major worldwide concern in the egg-laying industry. Here, we report the first draft genome assembly and gene prediction of Dermanyssus gallinae, based on combined PacBio and MinION long-read de novo sequencing. The ∼959-Mb genome is predicted to encode 14,608 protein-coding genes.
  47. De Tiège, A., Van de Peer, Y., Braeckman, J., & Tanghe, K. (2018). The sociobiology of genes : the gene’s eye view as a unifying behavioural-ecological framework for biological evolution. HISTORY AND PHILOSOPHY OF THE LIFE SCIENCES, 40.
    Although classical evolutionary theory, i.e., population genetics and the Modern Synthesis, was already implicitly 'gene-centred', the organism was, in practice, still generally regarded as the individual unit of which a population is composed. The gene-centred approach to evolution only reached a logical conclusion with the advent of the gene-selectionist or gene's eye view in the 1960s and 1970s. Whereas classical evolutionary theory can only work with (genotypically represented) fitness differences between individual organisms, gene-selectionism is capable of working with fitness differences among genes within the same organism and genome. Here, we explore the explanatory potential of 'intra-organismic' and 'intra-genomic' gene-selectionism, i.e., of a behavioural-ecological 'gene's eye view' on genetic, genomic and organismal evolution. First, we give a general outline of the framework and how it complements the (to some extent) still 'organism-centred' approach of classical evolutionary theory. Secondly, we give a more in-depth assessment of its explanatory potential for biological evolution, i.e., for Darwin's 'common descent with modification' or, more specifically, for 'historical continuity or homology with modular evolutionary change' as it has been studied by evolutionary developmental biology (evo-devo) during the last few decades. In contrast with classical evolutionary theory, evo-devo focuses on 'within-organism' developmental processes. Given the capacity of gene-selectionism to adopt an intra-organismal gene's eye view, we outline the relevance of the latter model for evo-devo. Overall, we aim for the conceptual integration between the gene's eye view on the one hand, and more organism-centred evolutionary models (both classical evolutionary theory and evo-devo) on the other.
  48. Novikova, P., Hohmann, N., & Van de Peer, Y. (2018). Polyploid Arabidopsis species originated around recent glaciation maxima. (Y. Van de Peer, Ed.) CURRENT OPINION IN PLANT BIOLOGY, 42, 8–15.
    Polyploidy may provide adaptive advantages and is considered to be important for evolution and speciation. Polyploidy events are found throughout the evolutionary history of plants; however, they do not seem to be uniformly distributed along the time axis. For example, many of the detected ancient whole-genome duplications (WGDs) seem to cluster around the K/Pg boundary (∼66 Mya), which corresponds to a drastic climate change event and a mass extinction. Here, we discuss more recent polyploidy events using Arabidopsis as the most developed plant model at the level of the entire genus. We review the history of the origin of allotetraploid species A. suecica and A. kamchatica, and tetraploid lineages of A. lyrata, A. arenosa and A. thaliana, and discuss potential adaptive advantages. Also, we highlight an association between recent glacial maxima and estimated times of origins of polyploidy in Arabidopsis. Such association might further support a link between polyploidy and environmental challenge, which has been observed now for different time scales and for both ancient and recent polyploids.
  49. Nishiyama, T., Sakayama, H., de Vries, J., Buschmann, H., Saint-Marcoux, D., Ullrich, K. K., Haas, F. B., et al. (2018). The Chara genome : secondary complexity and implications for plant terrestrialization. CELL, 174(2), 448–464.
    Land plants evolved from charophytic algae, among which Charophyceae possess the most complex body plans. We present the genome of Chara braunii; comparison of the genome to those of land plants identified evolutionary novelties for plant terrestrialization and land plant heritage genes. C. braunii employs unique xylan synthases for cell wall biosynthesis, a phragmoplast (cell separation) mechanism similar to that of land plants, and many phytohormones. C. braunii plastids are controlled via land plant-like retrograde signaling, and transcriptional regulation is more elaborate than in other algae. The morphological complexity of this organism may result from expanded gene families, with three cases of particular note: genes effecting tolerance to reactive oxygen species (ROS), LysM receptor-like kinases, and transcription factors (TFs). Transcriptomic analysis of sexual reproductive structures reveals intricate control by TFs, activity of the ROS gene network, and the ancestral use of plant-like storage and stress protection proteins in the zygote.
  50. Phoma, S., Vikram, S., Jansson, J. K., Ansorge, I. J., Cowan, D. A., Van de Peer, Y., & Makhalanyane, T. P. (2018). Agulhas Current properties shape microbial community diversity and potential functionality. SCIENTIFIC REPORTS, 8.
    Understanding the impact of oceanographic features on marine microbial ecosystems remains a major ecological endeavour. Here we assess microbial diversity, community structure and functional capacity along the Agulhas Current system and the Subtropical Front in the South Indian Ocean (SIO). Samples collected from the epipelagic, oxygen minimum and bathypelagic zones were analysed by 16S rRNA gene amplicon and metagenomic sequencing. In contrast to previous studies, we found high taxonomic richness in surface and deep water samples, but generally low richness for OMZ communities. Beta-diversity analysis revealed significant dissimilarity between the three water depths. Most microbial communities were dominated by marine Gammaproteobacteria, with strikingly low levels of picocyanobacteria. Community composition was strongly influenced by specific environmental factors including depth, salinity, and the availability of both oxygen and light. Carbon, nitrogen and sulfur cycling capacity in the SIO was linked to several autotrophic and copiotrophic Alphaproteobacteria and Gammaproteobacteria. Taken together, our data suggest that the environmental conditions in the Agulhas Current system, particularly depth-related parameters, substantially influence microbial community structure. In addition, the capacity for biogeochemical cycling of nitrogen and sulfur is linked primarily to the dominant Gammaproteobacteria taxa, whereas ecologically rare taxa drive carbon cycling.
  51. Burgess, S. T., Bartley, K., Marr, E. J., Wright, H. W., Weaver, R. J., Prickett, J. C., … Nisbet, A. J. (2018). Draft genome assembly of the sheep scab mite, Psoroptes ovis. MICROBIOLOGY RESOURCE ANNOUNCEMENTS, 6(16). https://doi.org/10.1128/genomea.00265-18
    Sheep scab, caused by infestation with Psoroptes ovis, is highly contagious, results in intense pruritus, and represents a major welfare and economic concern. Here, we report the first draft genome assembly and gene prediction of P. ovis based on PacBio de novo sequencing. The ∼63.2-Mb genome encodes 12,041 protein-coding genes.
  52. Van Bel, M., Diels, T., Vancaester, E., Kreft, L., Botzki, A., Van de Peer, Y., Coppens, F., et al. (2018). PLAZA 4.0 : an integrative resource for functional, evolutionary and comparative plant genomics. NUCLEIC ACIDS RESEARCH, 46(D1), D1190–D1196.
    PLAZA (https://bioinformatics.psb.ugent.be/plaza) is a plant-oriented online resource for comparative, evolutionary and functional genomics. The PLAZA platform consists of multiple independent instances focusing on different plant clades, while also providing access to a consistent set of reference species. Each PLAZA instance contains structural and functional gene annotations, gene family data and phylogenetic trees and detailed gene colinearity information. A user-friendly web interface makes the necessary tools and visualizations accessible, specific for each data type. Here we present PLAZA 4.0, the latest iteration of the PLAZA framework. This version consists of two new instances (Dicots 4.0 and Monocots 4.0) providing a large increase in newly available species, and offers access to updated and newly implemented tools and visualizations, helping users with the ever-increasing demands for complex and in-depth analyses. The total number of species across both instances nearly doubles from 37 species in PLAZA 3.0 to 71 species in PLAZA 4.0, with a much broader coverage of crop species (e.g. wheat, palm oil) and species of evolutionary interest (e.g. spruce, Marchantia). The new PLAZA instances can also be accessed by a programming interface through a RESTful web service, thus allowing bioinformaticians to optimally leverage the power of the PLAZA platform.
  53. Van Goethem, M. W., Pierneef, R., Bezuidt, O. K., Van de Peer, Y., Cowan, D. A., & Makhalanyane, T. P. (2018). A reservoir of “historical” antibiotic resistance genes in remote pristine Antarctic soils. MICROBIOME, 6.
    Background: Soil bacteria naturally produce antibiotics as a competitive mechanism, with a concomitant evolution, and exchange by horizontal gene transfer, of a range of antibiotic resistance mechanisms. Surveys of bacterial resistance elements in edaphic systems have originated primarily from human-impacted environments, with relatively little information from remote and pristine environments, where the resistome may comprise the ancestral gene diversity. Methods: We used shotgun metagenomics to assess antibiotic resistance gene (ARG) distribution in 17 pristine and remote Antarctic surface soils within the undisturbed Mackay Glacier region. We also interrogated the phylogenetic placement of ARGs compared to environmental ARG sequences and tested for the presence of horizontal gene transfer elements flanking ARGs. Results: In total, 177 naturally occurring ARGs were identified, most of which encoded single or multi-drug efflux pumps. Resistance mechanisms for the inactivation of aminoglycosides, chloramphenicol and beta-lactam antibiotics were also common. Gram-negative bacteria harboured most ARGs (71%), with fewer genes from Gram-positive Actinobacteria and Bacilli (Firmicutes) (9%), reflecting the taxonomic composition of the soils. Strikingly, the abundance of ARGs per sample had a strong, negative correlation with species richness (r=-0.49, P < 0.05). This result, coupled with a lack of mobile genetic elements flanking ARGs, suggests that these genes are ancient acquisitions of horizontal transfer events. Conclusions: ARGs in these remote and uncontaminated soils most likely represent functional efficient historical genes that have since been vertically inherited over generations. The historical ARGs in these pristine environments carry a strong phylogenetic signal and form a monophyletic group relative to ARGs from other similar environments.
  54. Wan, T., Liu, Z.-M., Li, L.-F., Leitch, A. R., Leitch, I. J., Lohaus, R., Liu, Z.-J., et al. (2018). A genome for gnetophytes and early evolution of seed plants. NATURE PLANTS, 4(2), 82–89.
    Gnetophytes are an enigmatic gymnosperm lineage comprising three genera, Gnetum, Welwitschia and Ephedra, which are morphologically distinct from all other seed plants. Their distinctiveness has triggered much debate as to their origin, evolution and phylogenetic placement among seed plants. To increase our understanding of the evolution of gnetophytes, and their relation to other seed plants, we report here a high-quality draft genome sequence for Gnetum montanum, the first for any gnetophyte. By using a novel genome assembly strategy to deal with high levels of heterozygosity, we assembled >4 Gb of sequence encoding 27,491 protein-coding genes. Comparative analysis of the G. montanum genome with other gymnosperm genomes unveiled some remarkable and distinctive genomic features, such as a diverse assemblage of retrotransposons with evidence for elevated frequencies of elimination rather than accumulation, considerable differences in intron architecture, including both length distribution and proportions of (retro)transposon elements, and distinctive patterns of proliferation of functional protein domains. Furthermore, a few gene families showed Gnetum-specific copy number expansions (for example, cellulose synthase) or contractions (for example, Late Embryogenesis Abundant protein), which could be connected with Gnetum's distinctive morphological innovations associated with their adaptation to warm, mesic environments. Overall, the G. montanum genome enables a better resolution of ancestral genomic features within seed plants, and the identification of genomic characters that distinguish Gnetum from other gymnosperms.
  55. Zwaenepoel, Arthur, Diels, T., Amar, D., Van Parys, T., Shamir, R., Van de Peer, Y., & Tzfadia, O. (2018). MorphDB : prioritizing genes for specialized metabolism pathways and gene ontology categories in plants. FRONTIERS IN PLANT SCIENCE, 9.
    Recent times have seen an enormous growth of "omics" data, of which high-throughput gene expression data are arguably the most important from a functional perspective. Despite huge improvements in computational techniques for the functional classification of gene sequences, common similarity-based methods often fall short of providing full and reliable functional information. Recently, the combination of comparative genomics with approaches in functional genomics has received considerable interest for gene function analysis, leveraging both gene expression based guilt-by-association methods and annotation efforts in closely related model organisms. Besides the identification of missing genes in pathways, these methods also typically enable the discovery of biological regulators (i.e., transcription factors or signaling genes). A previously built guilt-by-association method is MORPH, which was proven to be an efficient algorithm that performs particularly well in identifying and prioritizing missing genes in plant metabolic pathways. Here, we present MorphDB, a resource where MORPH-based candidate genes for large-scale functional annotations (Gene Ontology, MapMan bins) are integrated across multiple plant species. Besides a gene centric query utility, we present a comparative network approach that enables researchers to efficiently browse MORPH predictions across functional gene sets and species, facilitating efficient gene discovery and candidate gene prioritization. MorphDB is available at http://bioinformatics.psb.ugent.be/webtools/morphdb/morphDB/index/. We also provide a toolkit, named "MORPH bulk" (https://github.com/arzwa/morph-bulk), for running MORPH in bulk mode on novel data sets, enabling researchers to apply MORPH to their own species of interest.
  56. Van de Peer, Y. (2018). Size does matter. NATURE PLANTS.
    Chromosome-scale assemblies are quickly becoming the standard for high-quality de novo reference plant genomes. Combining nanopore technology sequencing and optical map information is one way to achieve this.
  57. Heydari, M., Miclotte, G., Van de Peer, Y., & Fostier, J. (2018). BrownieAligner : accurate alignment of Illumina sequencing data to de Bruijn graphs. BMC BIOINFORMATICS, 19.
    Background: Aligning short reads to a reference genome is an important task in many genome analysis pipelines. This task is computationally more complex when the reference genome is provided in the form of a de Bruijn graph instead of a linear sequence string. Results: We present a branch and bound alignment algorithm that uses the seed-and-extend paradigm to accurately align short Illumina reads to a graph. Given a seed, the algorithm greedily explores all branches of the tree until the optimal alignment path is found. To reduce the search space we compute upper bounds to the alignment score for each branch and discard the branch if it cannot improve the best solution found so far. Additionally, by using a two-pass alignment strategy and a higher-order Markov model, paths in the de Bruijn graph that do not represent a subsequence in the original reference genome are discarded from the search procedure. Conclusions: BrownieAligner is applied to both synthetic and real datasets. It generally outperforms other state-of-the-art tools in terms of accuracy, while having similar runtime and memory requirements. Our results show that using the higher-order Markov model in BrownieAligner improves the accuracy, while the branch and bound algorithm reduces runtime. BrownieAligner is written in standard C++11 and released under GPL license. BrownieAligner relies on multithreading to take advantage of multi-core/multi-CPU systems.
  58. Tzfadia, O., Bocobza, S., Defoort, J., Almekias-Siegl, E., Panda, S., Levy, M., Storme, V., et al. (2018). The “TranSeq” 3’-end sequencing method for high-throughput transcriptomics and gene space refinement in plant genomes. PLANT JOURNAL, 96(1), 223–232.
    High-throughput RNA sequencing has proven invaluable not only to explore gene expression but also for both gene prediction and genome annotation. However, RNA sequencing, carried out on tens or even hundreds of samples, requires easy and cost-effective sample preparation methods using minute RNA amounts. Here, we present TranSeq, a high-throughput 3'-end sequencing procedure that requires 10- to 20-fold fewer sequence reads than the current transcriptomics procedures. TranSeq significantly reduces costs and allows a greater increase in size of sample sets analyzed in a single experiment. Moreover, in comparison with other 3'-end sequencing methods reported to date, we demonstrate here the reliability and immediate applicability of TranSeq and show that it not only provides accurate transcriptome profiles but also produces precise expression measurements of specific gene family members possessing high sequence similarity. This is difficult to achieve in standard RNA-seq methods, in which sequence reads cover the entire transcript. Furthermore, mapping TranSeq reads to the reference tomato genome facilitated the annotation of new transcripts improving >45% of the existing gene models. Hence, we anticipate that using TranSeq will boost large-scale transcriptome assays and increase the spatial and temporal resolution of gene expression data, in both model and non-model plant species. Moreover, as already performed for tomato (ITAG3.0; www.solgenomics.net), we strongly advocate its integration into current and future genome annotations.
  59. Khayi, S., Azza, N. E., Gaboun, F., Pirro, S., Badad, O., Claros, M. G., Lightfoot, D. A., et al. (2018). First draft genome assembly of the Argane tree (Argania spinosa). F1000RESEARCH, 7.
    Background: The Argane tree (Argania spinosa L. Skeels) is an endemic tree of southwestern Morocco that plays an important socioeconomic and ecologic role for a dense human population in an arid zone. Several studies confirmed the importance of this species as a food and feed source and as a resource for both pharmaceutical and cosmetic compounds. Unfortunately, the argane tree ecosystem is facing significant threats from environmental changes (global warming, over-population) and over-exploitation. Limited research has been conducted, however, on argane tree genetics and genomics, which hinders its conservation and genetic improvement. Methods: Here, we present a draft genome assembly of A. spinosa. A reliable reference genome of A. spinosa was created using a hybrid de novo assembly approach combining short and long sequencing reads. Results: In total, 144 Gb Illumina HiSeq reads and 7.2 Gb PacBio reads were produced and assembled. The final draft genome comprises 75 327 scaffolds totaling 671 Mb with an N50 of 49 916 kb. The draft assembly is close to the genome size estimated by k-mers distribution and covers 89% of complete and 4.3% of partial Arabidopsis orthologous groups in BUSCO. Conclusion: The A. spinosa genome will be useful for assessing biodiversity leading to efficient conservation of this endangered endemic tree. Furthermore, the genome may enable genome-assisted cultivar breeding, and provide a better understanding of important metabolic pathways and their underlying genes for both cosmetic and pharmacological purposes.
  60. Li, F.-W., Brouwer, P., Carretero-Paulet, L., Cheng, S., de Vries, J., Delaux, P.-M., Eily, A., et al. (2018). Fern genomes elucidate land plant evolution and cyanobacterial symbioses. NATURE PLANTS, 4(7), 460–472.
    Ferns are the closest sister group to all seed plants, yet little is known about their genomes other than that they are generally colossal. Here, we report on the genomes of Azolla filiculoides and Salvinia cucullata (Salviniales) and present evidence for episodic whole-genome duplication in ferns: one at the base of 'core leptosporangiates' and one specific to Azolla. One fern-specific gene that we identified, recently shown to confer high insect resistance, seems to have been derived from bacteria through horizontal gene transfer. Azolla coexists in a unique symbiosis with N2-fixing cyanobacteria, and we demonstrate a clear pattern of cospeciation between the two partners. Furthermore, the Azolla genome lacks genes that are common to arbuscular mycorrhizal and root nodule symbioses, and we identify several putative transporter genes specific to Azolla-cyanobacterial symbiosis. These genomic resources will help in exploring the biotechnological potential of Azolla and address fundamental questions in the evolution of plant life.
  61. Ramsak, Z., Coll, A., Stare, T., Tzfadia, O., Baebler, S., Van de Peer, Y., & Gruden, K. (2018). Network modeling unravels mechanisms of crosstalk between ethylene and salicylate signaling in potato. PLANT PHYSIOLOGY, 178(1), 488–499.
    To develop novel crop breeding strategies, it is crucial to understand the mechanisms underlying the interaction between plants and their pathogens. Network modeling represents a powerful tool that can unravel properties of complex biological systems. In this study, we aimed to use network modeling to better understand immune signaling in potato (Solanum tuberosum). For this, we first built on a reliable Arabidopsis (Arabidopsis thaliana) immune signaling model, extending it with the information from diverse publicly available resources. Next, we translated the resulting prior knowledge network (20,012 nodes and 70,091 connections) to potato and superimposed it with an ensemble network inferred from time-resolved transcriptomics data for potato. We used different network modeling approaches to generate specific hypotheses of potato immune signaling mechanisms. An interesting finding was the identification of a string of molecular events illuminating the ethylene pathway modulation of the salicylic acid pathway through Nonexpressor of PR Genes1 gene expression. Functional validations confirmed this modulation, thus supporting the potential of our integrative network modeling approach for unraveling molecular mechanisms in complex systems. In addition, this approach can ultimately result in improved breeding strategies for potato and other sensitive crops.
  62. Defoort, J., Van de Peer, Y., & Vermeirssen, V. (2018). Function, dynamics and evolution of network motif modules in integrated gene regulatory networks of worm and plant. NUCLEIC ACIDS RESEARCH, 46(13), 6480–6503.
    Gene regulatory networks (GRNs) consist of different molecular interactions that closely work together to establish proper gene expression in time and space. Especially in higher eukaryotes, many questions remain on how these interactions collectively coordinate gene regulation. We study high quality GRNs consisting of undirected protein-protein, genetic and homologous interactions, and directed protein-DNA, regulatory and miRNA-mRNA interactions in the worm Caenorhabditis elegans and the plant Arabidopsis thaliana. Our data-integration framework integrates interactions in composite network motifs, clusters these in biologically relevant, higher-order topological network motif modules, overlays these with gene expression profiles and discovers novel connections between modules and regulators. Similar modules exist in the integrated GRNs of worm and plant. We show how experimental or computational methodologies underlying a certain data type impact network topology. Through phylogenetic decomposition, we found that proteins of worm and plant tend to functionally interact with proteins of a similar age, while at the regulatory level TFs favor same age, but also older target genes. Despite some influence of the duplication mode difference, we also observe at the motif and module level for both species a preference for age homogeneity for undirected and age heterogeneity for directed interactions. This leads to a model where novel genes are added together to the GRNs in a specific biological functional context, regulated by one or more TFs that also target older genes in the GRNs. Overall, we detected topological, functional and evolutionary properties of GRNs that are potentially universal in all species.
  63. De Clerck, Olivier, Kao, S.-M., Bogaert, K., Blomme, J., Foflonker, F., Kwantes, M., Vancaester, E., et al. (2018). Insights into the evolution of multicellularity from the sea lettuce genome. CURRENT BIOLOGY, 28(18), 2921–2933.
    We report here the 98.5 Mbp haploid genome (12,924 protein coding genes) of Ulva mutabilis, a ubiquitous and iconic representative of the Ulvophyceae or green seaweeds. Ulva's rapid and abundant growth makes it a key contributor to coastal biogeochemical cycles; its role in marine sulfur cycles is particularly important because it produces high levels of dimethylsulfoniopropionate (DMSP), the main precursor of volatile dimethyl sulfide (DMS). Rapid growth makes Ulva attractive biomass feedstock but also increasingly a driver of nuisance "green tides." Ulvophytes are key to understanding the evolution of multicellularity in the green lineage, and Ulva morphogenesis is dependent on bacterial signals, making it an important species with which to study cross-kingdom communication. Our sequenced genome informs these aspects of ulvophyte cell biology, physiology, and ecology. Gene family expansions associated with multicellularity are distinct from those of freshwater algae. Candidate genes, including some that arose following horizontal gene transfer from chromalveolates, are present for the transport and metabolism of DMSP. The Ulva genome offers, therefore, new opportunities to understand coastal and marine ecosystems and the fundamental evolution of the green lineage.
  64. Christie, N., Myburg, A. A., Joubert, F., Murray, S. L., Carstens, M., Lin, Y.-C., Meyer, J., et al. (2017). Systems genetics reveals a transcriptional network associated with susceptibility in the maize-grey leaf spot pathosystem. PLANT JOURNAL, 89(4), 746–763.
    We used a systems genetics approach to elucidate the molecular mechanisms of the responses of maize to grey leaf spot (GLS) disease caused by Cercospora zeina, a threat to maize production globally. Expression analysis of ear leaf samples in a subtropical maize recombinant inbred line population (CML444 x SC Malawi) subjected in the field to C. zeina infection allowed detection of 20206 expression quantitative trait loci (eQTLs). Four trans-eQTL hotspots coincided with GLS disease QTLs mapped in the same field experiment. Co-expression network analysis identified three expression modules correlated with GLS disease scores. The module (GY-s) most highly correlated with susceptibility (r=0.71; 179 genes) was enriched for the glyoxylate pathway, lipid metabolism, diterpenoid biosynthesis and responses to pathogen molecules such as chitin. The GY-s module was enriched for genes with trans-eQTLs in hotspots on chromosomes 9 and 10, which also coincided with phenotypic QTLs for susceptibility to GLS. This transcriptional network has significant overlap with the GLS susceptibility response of maize line B73, and may reflect pathogen manipulation for nutrient acquisition and/or unsuccessful defence responses, such as kauralexin production by the diterpenoid biosynthesis pathway. The co-expression module that correlated best with resistance (TQ-r; 1498 genes) was enriched for genes with trans-eQTLs in hotspots coinciding with GLS resistance QTLs on chromosome 9. Jasmonate responses were implicated in resistance to GLS through co-expression of COI1 and enrichment of genes with the Gene Ontology term 'cullin-RING ubiquitin ligase complex' in the TQ-r module. Consistent with this, JAZ repressor expression was highly correlated with the severity of GLS disease in the GY-s susceptibility network.
  65. Mizrachi, E., Verbeke, L., Van de Peer, Y., Marchal, K., & Myburg, A. A. (2017). Principles of systems biology, no. 14 : [...] Network analysis of woody biomass. CELL SYSTEMS.
    This month: sage advice from phage to their offspring; systematic analyses of protein quality control, mitochondrial respiration, and woody biomass; a continental-scale experiment; and engineered protein tools galore.
  66. Van Parys, T., Melckenbeeck, I., Houbraken, M., Audenaert, P., Colle, D., Pickavet, M., Demeester, P., et al. (2017). A Cytoscape app for motif enumeration with ISMAGS. BIOINFORMATICS, 33(3), 461–463.
    We present a Cytoscape app for the ISMAGS algorithm, which can enumerate all instances of a motif in a graph, making optimal use of the motif's symmetries to make the search more efficient. The Cytoscape app provides a handy interface for this algorithm, which allows more efficient network analysis.
  67. Mizrachi, E., Verbeke, L., Christie, N., Fierro Gutierrez, A. C. E., Mansfield, S. D., Davis, M. F., Gjersing, E., et al. (2017). Network-based integration of systems genetics data reveals pathways associated with lignocellulosic biomass accumulation and processing. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 114(5), 1195–1200.
    As a consequence of their remarkable adaptability, fast growth, and superior wood properties, eucalypt tree plantations have emerged as key renewable feedstocks (over 20 million ha globally) for the production of pulp, paper, bioenergy, and other lignocellulosic products. However, most biomass properties such as growth, wood density, and wood chemistry are complex traits that are hard to improve in long-lived perennials. Systems genetics, a process of harnessing multiple levels of component trait information (e.g., transcript, protein, and metabolite variation) in populations that vary in complex traits, has proven effective for dissecting the genetics and biology of such traits. We have applied a network-based data integration (NBDI) method for a systems-level analysis of genes, processes and pathways underlying biomass and bioenergy-related traits using a segregating Eucalyptus hybrid population. We show that the integrative approach can link biologically meaningful sets of genes to complex traits and at the same time reveal the molecular basis of trait variation. Gene sets identified for related woody biomass traits were found to share regulatory loci, cluster in network neighborhoods, and exhibit enrichment for molecular functions such as xylan metabolism and cell wall development. These findings offer a framework for identifying the molecular underpinnings of complex biomass and bioprocessing-related traits. A more thorough understanding of the molecular basis of plant biomass traits should provide additional opportunities for the establishment of a sustainable bio-based economy.
  68. Cormier, A., Avia, K., Sterck, L., Derrien, T., Wucher, V., Andres, G., Monsoor, M., et al. (2017). Re-annotation, improved large-scale assembly and establishment of a catalogue of noncoding loci for the genome of the model brown alga Ectocarpus. NEW PHYTOLOGIST, 214(1), 219–232.
    The genome of the filamentous brown alga Ectocarpus was the first to be completely sequenced from within the brown algal group and has served as a key reference genome both for this lineage and for the stramenopiles. We present a complete structural and functional reannotation of the Ectocarpus genome. The large-scale assembly of the Ectocarpus genome was significantly improved and genome-wide gene re-annotation using extensive RNA-seq data improved the structure of 11 108 existing protein-coding genes and added 2030 new loci. A genome-wide analysis of splicing isoforms identified an average of 1.6 transcripts per locus. A large number of previously undescribed noncoding genes were identified and annotated, including 717 loci that produce long noncoding RNAs. Conservation of lncRNAs between Ectocarpus and another brown alga, the kelp Saccharina japonica, suggests that at least a proportion of these loci serve a function. Finally, a large collection of single nucleotide polymorphism-based markers was developed for genetic analyses. These resources are available through an updated and improved genome database. This study significantly improves the utility of the Ectocarpus genome as a high-quality reference for the study of many important aspects of brown algal biology and as a reference for genomic analyses across the stramenopiles.
  69. De La Torre, A. R., Li, Z., Van de Peer, Y., & Ingvarsson, P. K. (2017). Contrasting rates of molecular evolution and patterns of selection among gymnosperms and flowering plants. MOLECULAR BIOLOGY AND EVOLUTION, 34(6), 1363–1377.
    The majority of variation in rates of molecular evolution among seed plants remains both unexplored and unexplained. Although some attention has been given to flowering plants, reports of molecular evolutionary rates for their sister plant clade (gymnosperms) are scarce, and to our knowledge differences in molecular evolution among seed plant clades have never been tested in a phylogenetic framework. Angiosperms and gymnosperms differ in a number of features, of which contrasting reproductive biology, life spans, and population sizes are the most prominent. The highly conserved morphology of gymnosperms evidenced by similarity of extant species to fossil records and the high levels of macrosynteny at the genomic level have led scientists to believe that gymnosperms are slow-evolving plants, although some studies have offered contradictory results. Here, we used 31,968 nucleotide sites obtained from orthologous genes across a wide taxonomic sampling that includes representatives of most conifers, cycads, ginkgo, and many angiosperms with a sequenced genome. Our results suggest that angiosperms and gymnosperms differ considerably in their rates of molecular evolution per unit time, with gymnosperm rates being, on average, seven times lower than angiosperm species. Longer generation times and larger genome sizes are some of the factors explaining the slow rates of molecular evolution found in gymnosperms. In contrast to their slow rates of molecular evolution, gymnosperms possess higher substitution rate ratios than angiosperm taxa. Finally, our study suggests stronger and more efficient purifying and diversifying selection in gymnosperm than in angiosperm species, probably in relation to larger effective population sizes.
  70. Causier, B., Li, Z., De Smet, R., Lloyd, J. P., Van de Peer, Y., & Davies, B. (2017). Conservation of nonsense-mediated mRNA decay complex components throughout eukaryotic evolution. SCIENTIFIC REPORTS, 7.
    Nonsense-mediated mRNA decay (NMD) is an essential eukaryotic process regulating transcript quality and abundance, and is involved in diverse processes including brain development and plant defenses. Although some of the NMD machinery is conserved between kingdoms, little is known about its evolution. Phosphorylation of the core NMD component UPF1 is critical for NMD and is regulated in mammals by the SURF complex (UPF1, SMG1 kinase, SMG8, SMG9 and eukaryotic release factors). However, since SMG1 is reportedly missing from the genomes of fungi and the plant Arabidopsis thaliana, it remains unclear how UPF1 is activated outside the metazoa. We used comparative genomics to determine the conservation of the NMD pathway across eukaryotic evolution. We show that SURF components are present in all major eukaryotic lineages, including fungi, suggesting that in addition to UPF1 and SMG1, SMG8 and SMG9 also existed in the last eukaryotic common ancestor, 1.8 billion years ago. However, despite the ancient origins of the SURF complex, we also found that SURF factors have been independently lost across the Eukarya, pointing to genetic buffering within the essential NMD pathway. We infer an ancient role for SURF in regulating UPF1, and the intriguing possibility of undiscovered NMD regulatory pathways.
  71. Tasdighian, S., Van Bel, M., Li, Z., Van de Peer, Y., Carretero-Paulet, L., & Maere, S. (2017). Reciprocally retained genes in the angiosperm lineage show the hallmarks of dosage balance sensitivity. PLANT CELL, 29(11), 2766–2785.
    In several organisms, particular functional categories of genes, such as regulatory and complex-forming genes, are preferentially retained after whole-genome multiplications but rarely duplicate through small-scale duplication, a pattern referred to as reciprocal retention. This peculiar duplication behavior is hypothesized to stem from constraints on the dosage balance between the genes concerned and their interaction context. However, the evidence for a relationship between reciprocal retention and dosage balance sensitivity remains fragmentary. Here, we identified which gene families are most strongly reciprocally retained in the angiosperm lineage and studied their functional and evolutionary characteristics. Reciprocally retained gene families exhibit stronger sequence divergence constraints and lower rates of functional and expression divergence than other gene families, suggesting that dosage balance sensitivity is a general characteristic of reciprocally retained genes. Gene families functioning in regulatory and signaling processes are much more strongly represented at the top of the reciprocal retention ranking than those functioning in multiprotein complexes, suggesting that regulatory imbalances may lead to stronger fitness effects than classical stoichiometric protein complex imbalances. Finally, reciprocally retained duplicates are often subject to dosage balance constraints for prolonged evolutionary times, which may have repercussions for the ease with which genome multiplications can engender evolutionary innovation.
  72. Wingfield, B. D., Berger, D. K., Steenkamp, E. T., Lim, H.-J., Duong, T. A., Bluhm, B. H., De Beer, Z. W., et al. (2017). Draft genome of Cercospora zeina, Fusarium pininemorale, Hawksworthiomyces lignivorus, Huntiella decipiens and Ophiostoma ips. IMA FUNGUS, 8(2), 385–396.
    The genomes of Cercospora zeina, Fusarium pininemorale, Hawksworthiomyces lignivorus, Huntiella decipiens, and Ophiostoma ips are presented in this genome announcement. Three of these genomes are from plant pathogens and otherwise economically important fungal species. Fusarium pininemorale and H. decipiens are not known to cause significant disease but are closely related to species of economic importance. The genome sizes range from 25.99 Mb in the case of O. ips to 4.82 Mb for H. lignivorus. These genomes include the first reports of a genome from the genus Hawksworthiomyces. The availability of these genome data will allow the resolution of longstanding questions regarding the taxonomy of these species. In addition, comparative studies of these genome sequences with those of closely related organisms will increase our understanding of how these species or their close relatives cause disease.
  73. Cañas, R. A., Li, Z., Pascual, M. B., Castro-Rodríguez, V., Ávila, C., Sterck, L., Van de Peer, Y., et al. (2017). The gene expression landscape of pine seedling tissues. PLANT JOURNAL, 91(6), 1064–1087.
    Conifers dominate vast regions of the Northern hemisphere. They are the main source of raw materials for the timber industry as well as a wide range of biomaterials. Despite their inherent difficulties as experimental models for classical plant biology research, the technological advances in genomics research are enabling fundamental studies on these plants. The use of laser capture microdissection followed by transcriptomic analysis is a powerful tool for unravelling the molecular and functional organization of conifer tissues and specialized cells. In the present work, 14 different tissues from 1-month-old maritime pine (Pinus pinaster) seedlings have been isolated and their transcriptomes analysed. The results increased the sequence information and number of full-length transcripts from a previous reference transcriptome and added 39 841 new transcripts. In total, 2376 transcripts were ubiquitously expressed in all of the examined tissues. These transcripts could be considered the core 'housekeeping genes' in pine. The genes have been clustered according to their expression profiles. This analysis reduced the number of profiles to 38, most of these defined by expression in a single tissue that is much higher than in the other tissues. The expression and localization data are accessible at ConGenIE.org (http://v22.popgenie.org/microdisection/). This study presents an overview of the gene expression distribution in different pine tissues, specifically highlighting the relationships between tissue gene expression and function. This transcriptome atlas is a valuable resource for functional genomics research in conifers.
  74. Roodt, D., Lohaus, R., Sterck, L., Swanepoel, R. L., Van de Peer, Y., & Mizrachi, E. (2017). Evidence for an ancient whole genome duplication in the cycad lineage. PLOS ONE, 12(9).
    Contrary to the many whole genome duplication events recorded for angiosperms (flowering plants), whole genome duplications in gymnosperms (non-flowering seed plants) seem to be much rarer. Although ancient whole genome duplications have been reported for most gymnosperm lineages as well, some are still contested and need to be confirmed. For instance, data for ginkgo, but particularly cycads have remained inconclusive so far, likely due to the quality of the data available and flaws in the analysis. We extracted and sequenced RNA from both the cycad Encephalartos natalensis and Ginkgo biloba. This was followed by transcriptome assembly, after which these data were used to build paralog age distributions. Based on these distributions, we identified remnants of an ancient whole genome duplication in both cycads and ginkgo. The most parsimonious explanation would be that this whole genome duplication event was shared between both species and had occurred prior to their divergence, about 300 million years ago.
  75. Zhang, G.-Q., Liu, K.-W., Li, Z., Lohaus, R., Hsiao, Y.-Y., Niu, S.-C., Wang, J.-Y., et al. (2017). The Apostasia genome and the evolution of orchids. NATURE, 549(7672), 379–383.
    Constituting approximately 10% of flowering plant species, orchids (Orchidaceae) display unique flower morphologies, possess an extraordinary diversity in lifestyle, and have successfully colonized almost every habitat on Earth. Here we report the draft genome sequence of Apostasia shenzhenica, a representative of one of two genera that form a sister lineage to the rest of the Orchidaceae, providing a reference for inferring the genome content and structure of the most recent common ancestor of all extant orchids and improving our understanding of their origins and evolution. In addition, we present transcriptome data for representatives of Vanilloideae, Cypripedioideae and Orchidoideae, and novel third-generation genome data for two species of Epidendroideae, covering all five orchid subfamilies. A. shenzhenica shows clear evidence of a whole-genome duplication, which is shared by all orchids and occurred shortly before their divergence. Comparisons between A. shenzhenica and other orchids and angiosperms also permitted the reconstruction of an ancestral orchid gene toolkit. We identify new gene families, gene family expansions and contractions, and changes within MADS-box gene classes, which control a diverse suite of developmental processes, during orchid evolution. This study sheds new light on the genetic mechanisms underpinning key orchid innovations, including the development of the labellum and gynostemium, pollinia, and seeds without endosperm, as well as the evolution of epiphytism; reveals relationships between the Orchidaceae subfamilies; and helps clarify the evolutionary history of orchids within the angiosperms.
  76. Orr, R. J. S., Rombauts, S., Van de Peer, Y., & Shalchian-Tabrizi, K. (2017). Draft genome sequences of two unclassified Chitinophagaceae bacteria, IBVUCB1 and IBVUCB2, isolated from environmental samples. GENOME ANNOUNCEMENTS, 5(33).
    We report here the draft genome sequences of two Chitinophagaceae bacteria, IBVUCB1 and IBVUCB2, assembled from metagenomes of surface samples from freshwater lakes. The genomes are >99% complete and may represent new genera within the Chitinophagaceae family, indicating a larger diversity than currently identified.
  77. Unver, T., Wu, Z., Sterck, L., Turktas, M., Lohaus, R., Li, Z., Yang, M., et al. (2017). Genome of wild olive and the evolution of oil biosynthesis. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 114(44), E9413–E9422.
    Here we present the genome sequence and annotation of the wild olive tree (Olea europaea var. sylvestris), called oleaster, which is considered an ancestor of cultivated olive trees. More than 50,000 protein-coding genes were predicted, a majority of which could be anchored to 23 pseudochromosomes obtained through a newly constructed genetic map. The oleaster genome contains signatures of two Oleaceae lineage-specific paleopolyploidy events, dated at ~28 and ~59 Mya. These events contributed to the expansion and neofunctionalization of genes and gene families that play important roles in oil biosynthesis. The functional divergence of oil biosynthesis pathway genes, such as FAD2, SACPD, EAR, and ACPTE, following duplication, has been responsible for the differential accumulation of oleic and linoleic acids produced in olive compared with sesame, a closely related oil crop. Duplicated oleaster FAD2 genes are regulated by an siRNA derived from a transposable element-rich region, leading to suppressed levels of FAD2 gene expression. Additionally, neofunctionalization of members of the SACPD gene family has led to increased expression of SACPD2, 3, 5, and 7, consequently resulting in an increased desaturation of stearic acid. Taken together, decreased FAD2 expression and increased SACPD expression likely explain the accumulation of exceptionally high levels of oleic acid in olive. The oleaster genome thus provides important insights into the evolution of oil biosynthesis and will be a valuable resource for oil crop genomics.
  78. Ruprecht, C., Lohaus, R., Vanneste, K., Mutwil, M., Nikoloski, Z., Van de Peer, Y., & Persson, S. (2017). Revisiting ancestral polyploidy in plants. SCIENCE ADVANCES, 3(7).
    Whole-genome duplications (WGDs) or polyploidy events have been studied extensively in plants. In a now widely cited paper, Jiao et al. presented evidence for two ancient, ancestral plant WGDs predating the origin of flowering and seed plants, respectively. This finding was based primarily on a bimodal age distribution of gene duplication events obtained from molecular dating of almost 800 phylogenetic gene trees. We reanalyzed the phylogenomic data of Jiao et al. and found that the strong bimodality of the age distribution may be the result of technical and methodological issues and may hence not be a "true" signal of two WGD events. By using a state-of-the-art molecular dating algorithm, we demonstrate that the reported bimodal age distribution is not robust and should be interpreted with caution. Thus, there exists little evidence for two ancient WGDs in plants from phylogenomic dating.
  79. Heydari, M., Miclotte, G., Demeester, P., Van de Peer, Y., & Fostier, J. (2017). Evaluation of the impact of Illumina error correction tools on de novo genome assembly. BMC BIOINFORMATICS, 18.
    Background: Recently, many standalone applications have been proposed to correct sequencing errors in Illumina data. The key idea is that downstream analysis tools such as de novo genome assemblers benefit from a reduced error rate in the input data. Surprisingly, a systematic validation of this assumption using state-of-the-art assembly methods is lacking, even for recently published methods. Results: For twelve recent Illumina error correction tools (EC tools) we evaluated both their ability to correct sequencing errors and their ability to improve de novo genome assembly in terms of contig size and accuracy. Conclusions: We confirm that most EC tools reduce the number of errors in sequencing data without introducing many new errors. However, we found that many EC tools suffer from poor performance in certain sequence contexts such as regions with low coverage or regions that contain short repeated or low-complexity sequences. Reads overlapping such regions are often ill-corrected in an inconsistent manner, leading to breakpoints in the resulting assemblies that are not present in assemblies obtained from uncorrected data. Resolving this systematic flaw in future EC tools could greatly improve the applicability of such tools.
  80. Li, Z., De La Torre, A. R., Sterck, L., Cánovas, F. M., Avila, C., Merino, I., Cabezas, J. A., et al. (2017). Single-copy genes as molecular markers for phylogenomic studies in seed plants. GENOME BIOLOGY AND EVOLUTION, 9(5), 1130–1147.
    Phylogenetic relationships among seed plant taxa, especially within the gymnosperms, remain contested. In contrast to angiosperms, for which several genomic, transcriptomic and phylogenetic resources are available, there are few, if any, molecular markers that allow broad comparisons among gymnosperm species. With few gymnosperm genomes available, recently obtained transcriptomes in gymnosperms are a great addition to identifying single-copy gene families as molecular markers for phylogenomic analysis in seed plants. Taking advantage of an increasing number of available genomes and transcriptomes, we identified single-copy genes in a broad collection of seed plants and used these to infer phylogenetic relationships between major seed plant taxa. This study aims at extending the current phylogenetic toolkit for seed plants, assessing its ability for resolving seed plant phylogeny, and discussing potential factors affecting phylogenetic reconstruction. In total, we identified 3,072 single-copy genes in 31 gymnosperms and 2,156 single-copy genes in 34 angiosperms. All studied seed plants shared 1,469 single-copy genes, which are generally involved in functions like DNA metabolism, cell cycle, and photosynthesis. A selected set of 106 single-copy genes provided good resolution for the seed plant phylogeny except for gnetophytes. Although some of our analyses support a sister relationship between gnetophytes and other gymnosperms, phylogenetic trees from concatenated alignments without 3rd codon positions and amino acid alignments under the CAT + GTR model support gnetophytes as a sister group to Pinaceae. Our phylogenomic analyses demonstrate that, in general, single-copy genes can uncover both recent and deep divergences of seed plant phylogeny.
  81. De Smet, R., Sabaghian, E., Li, Z., Saeys, Y., & Van de Peer, Y. (2017). Coordinated functional divergence of genes after genome duplication in Arabidopsis thaliana. PLANT CELL, 29(11), 2786–2800.
    Gene and genome duplications have been rampant during the evolution of flowering plants. Unlike small-scale gene duplications, whole-genome duplications (WGDs) copy entire pathways or networks, and as such create the unique situation in which such duplicated pathways or networks could evolve novel functionality through the coordinated sub- or neofunctionalization of their constituent genes. Here, we describe a remarkable case of coordinated gene expression divergence following WGDs in Arabidopsis thaliana. We identified a set of 92 homoeologous gene pairs that all show a similar pattern of tissue-specific gene expression divergence following WGD, with one homoeolog showing predominant expression in aerial tissues and the other homoeolog showing biased expression in tip-growth tissues. We provide evidence that this pattern of gene expression divergence seems to involve genes with a role in cell polarity and that likely function in the maintenance of cell wall integrity. Following WGD, many of these duplicated genes evolved separate functions through subfunctionalization in growth/development and stress response. Uncoupling these processes through genome duplications likely provided important adaptations with respect to growth and morphogenesis and defense against biotic and abiotic stress.
  82. Orr, R. J. S., Rombauts, S., Van de Peer, Y., & Shalchian-Tabrizi, K. (2017). Draft genome sequences of two unclassified bacteria, Hydrogenophaga sp. strains IBVHS1 and IBVHS2, isolated from environmental samples. GENOME ANNOUNCEMENTS, 5(34).
  83. Miclotte, G., Plaisance, S., Rombauts, S., Van de Peer, Y., Audenaert, P., & Fostier, J. (2017). OMSim : a simulator for optical map data. BIOINFORMATICS, 33(17), 2740–2742.
    Motivation: The Bionano Genomics platform allows for the optical detection of short sequence patterns in very long DNA molecules (up to 2.5 Mbp). Molecules with overlapping patterns can be assembled to generate a consensus optical map of the entire genome. In turn, these optical maps can be used to validate or improve de novo genome assembly projects or to detect large-scale structural variation in genomes. Simulated optical map data can assist in the development and benchmarking of tools that operate on those data, such as alignment and assembly software. Additionally, it can help to optimize the experimental setup for a genome of interest. Such a simulator is currently not available. Results: We have developed a simulator, OMSim, that produces synthetic optical map data that mimics real Bionano Genomics data. These simulated data have been tested for compatibility with the Bionano Genomics Irys software system and the Irys-scaffolding scripts. OMSim is capable of handling very large genomes (over 30 Gbp) with high throughput and low memory requirements.
  84. Van de Peer, Y., Mizrachi, E., & Marchal, K. (2017). The evolutionary significance of polyploidy. NATURE REVIEWS GENETICS, 18(7), 411–424.
    Polyploidy, or the duplication of entire genomes, has been observed in prokaryotic and eukaryotic organisms, and in somatic and germ cells. The consequences of polyploidization are complex and variable, and they differ greatly between systems (clonal or non-clonal) and species, but the process has often been considered to be an evolutionary 'dead end'. Here, we review the accumulating evidence that correlates polyploidization with environmental change or stress, and that has led to an increased recognition of its short-term adaptive potential. In addition, we discuss how, once polyploidy has been established, the unique retention profile of duplicated genes following whole-genome duplication might explain key longer-term evolutionary transitions and a general increase in biological complexity.
  85. Yao, Y., & Van de Peer, Y. (2017). Simulating biological complexity through artificial evolution. In 2017 3RD IEEE INTERNATIONAL CONFERENCE ON CYBERNETICS (CYBCONF) (pp. 101–108). New York, NY, USA: IEEE.
  86. Vlastaridis, P., Kyriakidou, P., Chaliotis, A., Van de Peer, Y., Oliver, S. G., & Amoutzias, G. D. (2017). Estimating the total number of phosphoproteins and phosphorylation sites in eukaryotic proteomes. GIGASCIENCE, 6(2), 1–11.
    Background: Phosphorylation is the most frequent post-translational modification made to proteins and may regulate protein activity as either a molecular digital switch or a rheostat. Despite the cornucopia of high-throughput (HTP) phosphoproteomic data in the last decade, it remains unclear how many proteins are phosphorylated and how many phosphorylation sites (p-sites) can exist in total within a eukaryotic proteome. We present the first reliable estimates of the total number of phosphoproteins and p-sites for four eukaryotes (human, mouse, Arabidopsis, and yeast). Results: In all, 187 HTP phosphoproteomic datasets were filtered, compiled, and studied along with two low-throughput (LTP) compendia. Estimates of the number of phosphoproteins and p-sites were inferred by two methods: Capture-Recapture, and fitting the saturation curve of cumulative redundant vs. cumulative non-redundant phosphoproteins/p-sites. Estimates were also adjusted for different levels of noise within the individual datasets and other confounding factors. We estimate that in total, 13 000, 11 000, and 3000 phosphoproteins and 230 000, 156 000, and 40 000 p-sites exist in human, mouse, and yeast, respectively, whereas estimates for Arabidopsis were not as reliable. Conclusions: Most of the phosphoproteins have been discovered for human, mouse, and yeast, while the dataset for Arabidopsis is still far from complete. The datasets for p-sites are not as close to saturation as those for phosphoproteins. Integration of the LTP data suggests that current HTP phosphoproteomics appears to be capable of capturing 70% to 95% of total phosphoproteins, but only 40% to 60% of total p-sites.
  87. Li, Z., Defoort, J., Tasdighian, S., Maere, S., Van de Peer, Y., & De Smet, R. (2016). Gene duplicability of core genes is highly consistent across all angiosperms. PLANT CELL, 28(2), 326–344.
    Gene duplication is an important mechanism for adding to genomic novelty. Hence, which genes undergo duplication and are preserved following duplication is an important question. It has been observed that gene duplicability, or the ability of genes to be retained following duplication, is a nonrandom process, with certain genes being more amenable to survive duplication events than others. Primarily, gene essentiality and the type of duplication (small-scale versus large-scale) have been shown in different species to influence the (long-term) survival of novel genes. However, an overarching view of "gene duplicability" is lacking, mainly due to the fact that previous studies usually focused on individual species and did not account for the influence of genomic context and the time of duplication. Here, we present a large-scale study in which we investigated duplicate retention for 9178 gene families shared between 37 flowering plant species, referred to as angiosperm core gene families. For most gene families, we observe a strikingly consistent pattern of gene duplicability across species, with gene families being either primarily single-copy or multicopy in all species. An intermediate class contains gene families that are often retained in duplicate for periods extending to tens of millions of years after whole-genome duplication, but ultimately appear to be largely restored to singleton status, suggesting that these genes may be dosage balance sensitive. The distinction between single-copy and multicopy gene families is reflected in their functional annotation, with single-copy genes being mainly involved in the maintenance of genome stability and organelle function and multicopy genes in signaling, transport, and metabolism. The intermediate class was overrepresented in regulatory genes, further suggesting that these represent putative dosage-balance-sensitive genes.
  88. Lohaus, R., & Van de Peer, Y. (2016). Of dups and dinos : evolution at the K/Pg boundary. (Y. Van de Peer & J. C. Pires, Eds.) CURRENT OPINION IN PLANT BIOLOGY, 30, 62–69.
    Fifteen years into sequencing entire plant genomes, more than 30 paleopolyploidy events could be mapped on the tree of flowering plants (and many more when also transcriptome data sets are considered). While some genome duplications are very old and have occurred early in the evolution of dicots and monocots, or even before, others are more recent and seem to have occurred independently in many different plant lineages. Strikingly, a majority of these duplications date somewhere between 55 and 75 million years ago (mya), and thus likely correlate with the K/Pg boundary. If true, this would suggest that plants that had their genome duplicated at that time, had an increased chance to survive the most recent mass extinction event, at 66 mya, which wiped out a majority of plant and animal life, including all non-avian dinosaurs. Here, we review several processes, both neutral and adaptive, that might explain the establishment of polyploid plants, following the K/Pg mass extinction.
  89. Miclotte, G., Heydari, M., Demeester, P., Rombauts, S., Van de Peer, Y., Audenaert, P., & Fostier, J. (2016). Jabba: hybrid error correction for long sequencing reads. ALGORITHMS FOR MOLECULAR BIOLOGY, 11, 10.
    Background: Third generation sequencing platforms produce longer reads with higher error rates than second generation technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. Results: In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is the use of a pseudo alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of MEMs in the context of third generation reads are presented. Conclusion: Jabba produces highly reliable corrected reads: almost all corrected reads align to the reference, and these alignments have a very high identity. Many of the aligned reads are error-free. Additionally, Jabba corrects reads using a very low amount of CPU time. From this we conclude that pseudo alignment with MEMs is a fast and reliable method to map long highly erroneous sequences on a de Bruijn graph.
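    The graph-based mapping idea in the abstract above can be illustrated with a minimal sketch. It builds an uncorrected de Bruijn graph from short reads and then locates exact k-mer seeds of a long read in that graph. This is a simplification under stated assumptions: Jabba itself uses a corrected graph and maximal exact matches (MEMs) with seed-and-extend, none of which is reproduced here, and the function names `build_de_bruijn` and `exact_seeds` are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def exact_seeds(long_read, graph, k):
    """Start positions in the long read whose k-mer is an edge of the graph.

    Simplified stand-in for MEM seeding (assumption, not Jabba's method).
    """
    return [
        i
        for i in range(len(long_read) - k + 1)
        if long_read[i + 1:i + k] in graph.get(long_read[i:i + k - 1], ())
    ]


# Toy data, made up for illustration.
short_reads = ["ACGTAC", "GTACGG"]
graph = build_de_bruijn(short_reads, k=4)
print(exact_seeds("ACGTACGG", graph, k=4))  # error-free long read
print(exact_seeds("ACGTAGGG", graph, k=4))  # long read with one substitution
```

    In the toy example, the substitution removes every seed that overlaps the erroneous base, which is why real tools extend from the surviving seeds rather than relying on exact matches alone.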
  90. Xie, Q., Tzfadia, O., Levy, M., Weithorn, E., Peled-Zehavi, H., Van Parys, T., Van de Peer, Y., et al. (2016). hfAIM: a reliable bioinformatics approach for in silico genome-wide identification of autophagy-associated Atg8-interacting motifs in various organisms. AUTOPHAGY, 12(5), 876–887.
    Most of the proteins that are specifically turned over by selective autophagy are recognized by the presence of short Atg8 interacting motifs (AIMs) that facilitate their association with the autophagy apparatus. Such AIMs can be identified by bioinformatics methods based on their defined degenerate consensus F/W/Y-X-X-L/I/V sequences in which X represents any amino acid. Achieving reliability and/or fidelity of the prediction of such AIMs on a genome-wide scale represents a major challenge. Here, we present a bioinformatics approach, high fidelity AIM (hfAIM), which uses additional sequence requirements (the presence of acidic amino acids and the absence of positively charged amino acids in certain positions) to reliably identify AIMs in proteins. We demonstrate that the use of the hfAIM method allows for in silico high fidelity prediction of AIMs in AIM-containing proteins (ACPs) on a genome-wide scale in various organisms. Furthermore, by using hfAIM to identify putative AIMs in the Arabidopsis proteome, we illustrate a potential contribution of selective autophagy to various biological processes. More specifically, we identified 9 peroxisomal PEX proteins that contain hfAIM motifs, among which AtPEX1, AtPEX6 and AtPEX10 possess evolutionary-conserved AIMs. Bimolecular fluorescence complementation (BiFC) results verified that AtPEX6 and AtPEX10 indeed interact with Atg8 in planta. In addition, we show that mutations occurring within or nearby hfAIMs in PEX1, PEX6 and PEX10 caused defects in the growth and development of various organisms. Taken together, the above results suggest that the hfAIM tool can be used to effectively perform genome-wide in silico screens of proteins that are potentially regulated by selective autophagy. The hfAIM system is a web tool that can be accessed at http://bioinformatics.psb.ugent.be/hfAIM/.
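    As a rough illustration of the degenerate consensus mentioned in the abstract above, the sketch below scans a protein sequence for the core F/W/Y-X-X-L/I/V pattern only. It does not implement hfAIM: the additional requirements on acidic and positively charged residues are not specified in the abstract, so they are left out, and the function name `find_core_aims` and the toy sequence are assumptions made for illustration.

```python
import re

# Core consensus from the abstract: [F/W/Y]-X-X-[L/I/V], X = any amino acid.
# The lookahead wrapper lets overlapping candidate motifs be reported as well.
AIM_CORE = re.compile(r"(?=([FWY]..[LIV]))")


def find_core_aims(protein):
    """Return (1-based start position, motif) for every core consensus match."""
    return [(m.start() + 1, m.group(1)) for m in AIM_CORE.finditer(protein.upper())]


# Made-up sequence, not a real protein.
for position, motif in find_core_aims("MDDDWTHLAAFYVLK"):
    print(position, motif)
```

    Because the core pattern is short and degenerate, a scan like this is expected to return many candidates in any proteome, which is the motivation the abstract gives for adding further sequence requirements.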
  91. Jelen, V., de Jonge, R., Van de Peer, Y., Javornik, B., & Jakše, J. (2016). Complete mitochondrial genome of the Verticillium-wilt causing plant pathogen Verticillium nonalfalfae. PLOS ONE, 11(2).
    Verticillium nonalfalfae is a fungal plant pathogen that causes wilt disease by colonizing the vascular tissues of host plants. The disease induced by hop isolates of V. nonalfalfae manifests in two different forms, ranging from mild symptoms to complete plant dieback, caused by mild and lethal pathotypes, respectively. Pathogenicity variations between the causal strains have been attributed to differences in genomic sequences and perhaps also to differences in their mitochondrial genomes. We used data from our recent Illumina NGS-based project of genome sequencing V. nonalfalfae to study the mitochondrial genomes of its different strains. The aim of the research was to prepare a V. nonalfalfae reference mitochondrial genome and to determine its phylogenetic placement in the fungal kingdom. The resulting 26,139 bp circular DNA molecule contains a full complement of the 14 "standard" fungal mitochondrial protein-coding genes of the electron transport chain and ATP synthase subunits, together with a small rRNA subunit, a large rRNA subunit (which contains ribosomal protein S3 encoded within a type IA intron), and 26 tRNAs. Phylogenetic analysis of this mitochondrial genome placed it in the Verticillium spp. lineage in the Glomerellales group, which is also supported by previous phylogenetic studies based on nuclear markers. The clustering with the closely related Verticillium dahliae mitochondrial genome showed a very conserved synteny and a high sequence similarity. Two distinguishing mitochondrial genome features were also found: a potential long non-coding RNA (orf414) contained only in the Verticillium spp. of the fungal kingdom, and a specific fragment length polymorphism observed only in V. dahliae and V. nubilum of all the Verticillium spp., thus showing potential as a species-specific biomarker.
  92. Olsen, J. L., Rouzé, P., Verhelst, B., Lin, Y.-C., Bayer, T., Collen, J., Dattolo, E., et al. (2016). The genome of the seagrass Zostera marina reveals angiosperm adaptation to the sea. NATURE, 530(7590), 331–335.
    Seagrasses colonized the sea(1) on at least three independent occasions to form the basis of one of the most productive and widespread coastal ecosystems on the planet(2). Here we report the genome of Zostera marina (L.), the first, to our knowledge, marine angiosperm to be fully sequenced. This reveals unique insights into the genomic losses and gains involved in achieving the structural and physiological adaptations required for its marine lifestyle, arguably the most severe habitat shift ever accomplished by flowering plants. Key angiosperm innovations that were lost include the entire repertoire of stomatal genes(3), genes involved in the synthesis of terpenoids and ethylene signalling, and genes for ultraviolet protection and phytochromes for far-red sensing. Seagrasses have also regained functions enabling them to adjust to full salinity. Their cell walls contain all of the polysaccharides typical of land plants, but also contain polyanionic, low-methylated pectins and sulfated galactans, a feature shared with the cell walls of all macroalgae(4) and that is important for ion homoeostasis, nutrient uptake and O2/CO2 exchange through leaf epidermal cells. The Z. marina genome resource will markedly advance a wide range of functional ecological studies from adaptation of marine ecosystems under climate warming(5,6), to unravelling the mechanisms of osmoregulation under high salinities that may further inform our understanding of the evolution of salt tolerance in crop plants(7).
  93. Bolton, M. D., Ebert, M. K., Faino, L., Rivera-Varas, V., de Jonge, R., Van de Peer, Y., Thomma, B. P., et al. (2016). RNA-sequencing of Cercospora beticola DMI-sensitive and -resistant isolates after treatment with tetraconazole identifies common and contrasting pathway induction. FUNGAL GENETICS AND BIOLOGY, 92, 1–13.
    Cercospora beticola causes Cercospora leaf spot of sugar beet. Cercospora leaf spot management measures often include application of the sterol demethylation inhibitor (DMI) class of fungicides. The reliance on DMIs and the consequent selection pressures imposed by their widespread use has led to the emergence of resistance in C. beticola populations. Insight into the molecular basis of tetraconazole resistance may lead to molecular tools to identify DMI-resistant strains for fungicide resistance management programs. Previous work has shown that expression of the gene encoding the DMI target enzyme (CYP51) is generally higher and inducible in DMI-resistant C. beticola field strains. In this study, we extended the molecular basis of DMI resistance in this pathosystem by profiling the transcriptional response of two C. beticola strains contrasting for resistance to tetraconazole. A majority of the genes in the ergosterol biosynthesis pathway were induced to similar levels in both strains with the exception of CbCyp51, which was induced several-fold higher in the DMI-resistant strain. In contrast, a secondary metabolite gene cluster was induced in the resistant strain, but repressed in the sensitive strain. Genes encoding proteins with various cell membrane fortification processes were induced in the resistant strain. Site-directed and ectopic mutants of candidate DMI-resistance genes all resulted in significantly higher EC50 values than the wild type strain, suggesting that the cell wall and/or membrane modified as a result of the transformation process increased resistance to tetraconazole. Taken together, this study identifies important cell membrane components and provides insight into the molecular events underlying DMI resistance in C. beticola.
  94. Van de Peer, Y., & Pires, J. C. (2016). Editorial overview: Genome studies and molecular genetics : of plant genes, genomes, and genomics. CURRENT OPINION IN PLANT BIOLOGY.
  95. Kaewphan, S., Van Landeghem, S., Ohta, T., Van de Peer, Y., Ginter, F., & Pyysalo, S. (2016). Cell line name recognition in support of the identification of synthetic lethality in cancer from text. BIOINFORMATICS, 32(2), 276–282.
    Motivation: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus. Results: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers.
  96. Tzfadia, O., Diels, T., De Meyer, S., Vandepoele, K., Aharoni, A., & Van de Peer, Y. (2016). CoExpNetViz: comparative co-expression networks construction and visualization tool. FRONTIERS IN PLANT SCIENCE, 6.
    Motivation: Comparative transcriptomics is a common approach in functional gene discovery efforts. It allows for finding conserved co-expression patterns between orthologous genes in closely related plant species, suggesting that these genes potentially share similar function and regulation. Several efficient co-expression-based tools have been commonly used in plant research but most of these pipelines are limited to data from model systems, which greatly limits their utility. Moreover, none of the existing pipelines allow plant researchers to make use of their own unpublished gene expression data for performing a comparative co-expression analysis and generating multi-species co-expression networks. Results: We introduce CoExpNetViz, a computational tool that uses a set of query or "bait" genes as an input (chosen by the user) and a minimum of one pre-processed gene expression dataset. The CoExpNetViz algorithm proceeds in three main steps: (i) for every bait gene submitted, co-expression values are calculated using mutual information and Pearson correlation coefficients, (ii) non-bait (or target) genes are grouped based on cross-species orthology, and (iii) output files are generated and results can be visualized as network graphs in Cytoscape. Availability: The CoExpNetViz tool is freely available both as a PHP web server (http://bioinformatics.psb.ugent.be/webtools/coexpr/) (implemented in C++) and as a Cytoscape plugin (implemented in Java). Both versions of the CoExpNetViz tool support LINUX and Windows platforms.
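    Step (i) of the algorithm described above can be sketched for a single expression dataset as follows. This is an illustration only, assuming a plain Pearson correlation and a fixed cutoff: the mutual information step, the orthology grouping, and the Cytoscape output are not reproduced, and the names `pearson` and `coexpressed_targets` as well as the threshold of 0.8 are hypothetical choices, not values taken from CoExpNetViz.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpressed_targets(expression, baits, threshold=0.8):
    """Map each bait gene to the non-bait genes with |r| >= threshold.

    The threshold is an assumed illustrative cutoff.
    """
    return {
        bait: sorted(
            gene
            for gene, profile in expression.items()
            if gene not in baits
            and abs(pearson(expression[bait], profile)) >= threshold
        )
        for bait in baits
    }


# Toy expression matrix (gene -> values across four samples), made up.
expression = {
    "baitA": [1, 2, 3, 4],
    "g1": [2, 4, 6, 8],   # perfectly correlated with baitA
    "g2": [4, 3, 2, 1],   # perfectly anti-correlated with baitA
    "g3": [1, 3, 2, 2],   # weakly correlated with baitA
}
print(coexpressed_targets(expression, baits={"baitA"}))
```

    Using the absolute value keeps both positively and negatively correlated targets; whether anti-correlated genes should be retained depends on the biological question.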
  97. Van Landeghem, S., Van Parys, T., Dubois, M., Inzé, D., & Van de Peer, Y. (2016). Diffany: an ontology-driven framework to infer, visualise and analyse differential molecular networks. BMC BIOINFORMATICS, 17.
    Background: Differential networks have recently been introduced as a powerful way to study the dynamic rewiring capabilities of an interactome in response to changing environmental conditions or stimuli. Currently, such differential networks are generated and visualised using ad hoc methods, and are often limited to the analysis of only one condition-specific response or one interaction type at a time. Results: In this work, we present a generic, ontology-driven framework to infer, visualise and analyse an arbitrary set of condition-specific responses against one reference network. To this end, we have implemented novel ontology-based algorithms that can process highly heterogeneous networks, accounting for both physical interactions and regulatory associations, symmetric and directed edges, edge weights and negation. We propose this integrative framework as a standardised methodology that allows a unified view on differential networks and promotes comparability between differential network studies. As an illustrative application, we demonstrate its usefulness on a plant abiotic stress study and we experimentally confirmed a predicted regulator. Availability: Diffany is freely available as an open-source Java library and Cytoscape plugin from http://bioinformatics.psb.ugent.be/supplementary_data/solan/diffany/.
  98. Zhang, G.-Q., Xu, Q., Bian, C., Tsai, W.-C., Yeh, C.-M., Liu, K.-W., Yoshida, K., et al. (2016). The Dendrobium catenatum Lindl. genome sequence provides insights into polysaccharide synthase, floral development and adaptive evolution. SCIENTIFIC REPORTS, 6.
    Orchids make up about 10% of all seed plant species, have great economic value, and are of specific scientific interest because of their renowned flowers and ecological adaptations. Here, we report the first draft genome sequence of a lithophytic orchid, Dendrobium catenatum. We predict 28,910 protein-coding genes, and find evidence of a whole genome duplication shared with Phalaenopsis. We observed the expansion of many resistance-related genes, suggesting a powerful immune system responsible for adaptation to a wide range of ecological niches. We also discovered extensive duplication of genes involved in glucomannan synthase activities, likely related to the synthesis of medicinal polysaccharides. Expansion of MADS-box gene clades ANR1, StMADS11, and MIKC*, involved in the regulation of development and growth, suggests that these expansions are associated with the astonishing diversity of plant architecture in the genus Dendrobium. On the contrary, members of the type I MADS box gene family are missing, which might explain the loss of the endospermous seed. The findings reported here will be important for future studies into polysaccharide synthesis, adaptations to diverse environments and flower architecture of Orchidaceae.
  99. Yao, Yao, Marchal, K., & Van de Peer, Y. (2016). Adaptive self-organizing organisms using a bio-inspired gene regulatory network controller: for the aggregation of evolutionary robots under a changing environment. In Ying Tan (Ed.), Handbook of research on design, control and modeling of swarm robotics (pp. 68–82). Hershey, PA, USA: IGI Global.
    This work has explored the adaptive potential of simulated swarm robots that contain a genomic encoding of a bio-inspired gene regulatory network (GRN). An artificial genome is combined with a flexible agent-based system, representing the activated part of the regulatory network that transduces environmental cues into phenotypic behavior. Using an Alife simulation framework that mimics a changing environment, we have shown that separating the static from the conditionally active part of the network contributes to a better adaptive behavior. This chapter describes the biologically inspired concept of GRNs to develop a distributed robot self-organizing approach. In particular, it shows that by using this approach, multiple swarm robots can aggregate to form a robotic organism that can adapt its configuration as a response to a dynamically changing environment. In addition, through the comparison of several different simulation experiments, the results illustrate the impact of evolutionary operators such as mutations and duplications on improving the adaptability of organisms.
  100. Vlastaridis, P., Oliver, S. G., Van de Peer, Y., & Amoutzias, G. D. (2016). The challenges of interpreting phosphoproteomics data : a critical view through the bioinformatics lens. In C. Angelini, P. M. Rancoita, & S. Rovetta (Eds.), Lecture Notes in Computer Science (Vol. 9874, pp. 196–204). Presented at the 12th International meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2015), Cham, Switzerland: Springer.
    During the last decade, there has been great progress in high-throughput (HTP) phosphoproteomics and hundreds or even thousands of phosphorylation sites (p-sites) can now be detected in a single experiment. This success is attributable to a combination of very sensitive Mass Spectrometry instruments, better phosphopeptide enrichment techniques and bioinformatics software that are capable of detecting peptides and localizing p-sites. These new technologies have opened up a whole new level of gene regulation to be studied, with great potential for therapeutics and synthetic biology. Nevertheless, many challenges remain to be resolved; these concern the biases and noise of these proteomic technologies, the biological noise that is present, as well as the incompleteness of the current datasets. Despite these problems, the datasets published so far appear to represent a good sample of a complete phosphoproteome of some organisms and are capable of revealing their major properties.
  101. Perazzolli, M., Herrero, N., Sterck, L., Lenzi, L., Pellegrini, A., Puopolo, G., Van de Peer, Y., et al. (2016). Transcriptomic responses of a simplified soil microcosm to a plant pathogen and its biocontrol agent reveal a complex reaction to harsh habitat. BMC GENOMICS, 17.
    Background: Soil microorganisms are key determinants of soil fertility and plant health. Soil phytopathogenic fungi are one of the most important causes of crop losses worldwide. Microbial biocontrol agents have been extensively studied as alternatives for controlling phytopathogenic soil microorganisms, but molecular interactions between them have mainly been characterised in dual cultures, without taking into account the soil microbial community. We used an RNA sequencing approach to elucidate the molecular interplay of a soil microbial community in response to a plant pathogen and its biocontrol agent, in order to examine the molecular patterns activated by the microorganisms. Results: A simplified soil microcosm containing 11 soil microorganisms was incubated with a plant root pathogen (Armillaria mellea) and its biocontrol agent (Trichoderma atroviride) for 24 h under controlled conditions. More than 46 million paired-end reads were obtained for each replicate and 28,309 differentially expressed genes were identified in total. Pathway analysis revealed complex adaptations of soil microorganisms to the harsh conditions of the soil matrix and to reciprocal microbial competition/cooperation relationships. Both the phytopathogen and its biocontrol agent were specifically recognised by the simplified soil microcosm: defence reaction mechanisms and neutral adaptation processes were activated in response to competitive (T. atroviride) or non-competitive (A. mellea) microorganisms, respectively. Moreover, activation of resistance mechanisms dominated in the simplified soil microcosm in the presence of both A. mellea and T. atroviride. Biocontrol processes of T. atroviride were already activated during incubation in the simplified soil microcosm, possibly to occupy niches in a competitive ecosystem, and they were not further enhanced by the introduction of A. mellea. 
    Conclusions: This work represents an additional step towards understanding molecular interactions between plant pathogens and biocontrol agents within a soil ecosystem. Global transcriptional analysis of the simplified soil microcosm revealed complex metabolic adaptation in the soil environment and specific responses to antagonistic or neutral intruders.
  102. Le, P., Makhalanyane, T. P., Guerrero, L. D., Vikram, S., Van de Peer, Y., & Cowan, D. A. (2016). Comparative metagenomic analysis reveals mechanisms for stress response in hypoliths from extreme hyperarid deserts. GENOME BIOLOGY AND EVOLUTION, 8(9), 2737–2747.
    Understanding microbial adaptation to environmental stressors is crucial for interpreting broader ecological patterns. In the most extreme hot and cold deserts, cryptic niche communities are thought to play key roles in ecosystem processes and represent excellent model systems for investigating microbial responses to environmental stressors. However, relatively little is known about the genetic diversity underlying such functional processes in climatically extreme desert systems. This study presents the first comparative metagenome analysis of cyanobacteria-dominated hypolithic communities in hot (Namib Desert, Namibia) and cold (Miers Valley, Antarctica) hyperarid deserts. The most abundant phyla in both hypolith metagenomes were Actinobacteria, Proteobacteria, Cyanobacteria and Bacteroidetes with Cyanobacteria dominating in Antarctic hypoliths. However, no significant differences between the two metagenomes were identified. The Antarctic hypolithic metagenome displayed a high number of sequences assigned to sigma factors, replication, recombination and repair, translation, ribosomal structure, and biogenesis. In contrast, the Namib Desert metagenome showed a high abundance of sequences assigned to carbohydrate transport and metabolism. Metagenome data analysis also revealed significant divergence in the genetic determinants of amino acid and nucleotide metabolism between these two metagenomes and those of soil from other polar deserts, hot deserts, and non-desert soils. Our results suggest extensive niche differentiation in hypolithic microbial communities from these two extreme environments and a high genetic capacity for survival under environmental extremes.
  103. Yao, Y., Storme, V., Marchal, K., & Van de Peer, Y. (2016). Emergent adaptive behaviour of GRN-controlled simulated robots in a changing environment. PEERJ, 4.
    We developed a bio-inspired robot controller combining an artificial genome with an agent-based control system. The genome encodes a gene regulatory network (GRN) that is switched on by environmental cues and, following the rules of transcriptional regulation, provides output signals to actuators. Whereas the genome represents the full encoding of the transcriptional network, the agent-based system mimics the active regulatory network and signal transduction system also present in naturally occurring biological systems. Using such a design that separates the static from the conditionally active part of the gene regulatory network contributes to a better general adaptive behaviour. Here, we have explored the potential of our platform with respect to the evolution of adaptive behaviour, such as preying when food becomes scarce, in a complex and changing environment and show through simulations of swarm robots in an A-life environment that evolution of collective behaviour likely can be attributed to bio-inspired evolutionary processes acting at different levels, from the gene and the genome to the individual robot and robot population.
  104. Cao, T. N. P., Greenhalgh, R., Dermauw, W., Rombauts, S., Bajda-Wybouw, S., Zhurov, V., … Clark, R. M. (2016). Complex evolutionary dynamics of massively expanded chemosensory receptor families in an extreme generalist chelicerate herbivore. GENOME BIOLOGY AND EVOLUTION, 8(11), 3323–3339.
    While mechanisms to detoxify plant produced, anti-herbivore compounds have been associated with plant host use by herbivores, less is known about the role of chemosensory perception in their life histories. This is especially true for generalists, including chelicerate herbivores that evolved herbivory independently from the more studied insect lineages. To shed light on chemosensory perception in a generalist herbivore, we characterized the chemosensory receptors (CRs) of the chelicerate two-spotted spider mite, Tetranychus urticae, an extreme generalist. Strikingly, T. urticae has more CRs than reported in any other arthropod to date. Including pseudogenes, 689 gustatory receptors were identified, as were 136 degenerin/Epithelial Na+ Channels (ENaCs) that have also been implicated as CRs in insects. The genomic distribution of T. urticae gustatory receptors indicates recurring bursts of lineage-specific proliferations, with the extent of receptor clusters reminiscent of those observed in the CR-rich genomes of vertebrates or C. elegans. Although pseudogenization of many gustatory receptors within clusters suggests relaxed selection, a subset of receptors is expressed. Consistent with functions as CRs, the genomic distribution and expression of ENaCs in lineage-specific T. urticae expansions mirrors that observed for gustatory receptors. The expansion of ENaCs in T. urticae to > 3-fold that reported in other animals was unexpected, raising the possibility that ENaCs in T. urticae have been co-opted to fulfill a major role performed by unrelated CRs in other animals. More broadly, our findings suggest an elaborate role for chemosensory perception in generalist herbivores that are of key ecological and agricultural importance.
  105. Kerchev, P., Waszczak, C., Lewandowska, A., Willems, P., Shapiguzov, A., Li, Z., … Van Breusegem, F. (2016). Lack of GLYCOLATE OXIDASE1, but not GLYCOLATE OXIDASE2, attenuates the photorespiratory phenotype of CATALASE2-deficient Arabidopsis. PLANT PHYSIOLOGY, 171(3), 1704–1719.
    The genes coding for the core metabolic enzymes of the photorespiratory pathway that allows plants with C3-type photosynthesis to survive in an oxygen-rich atmosphere, have been largely discovered in genetic screens aimed to isolate mutants that are unviable under ambient air. As an exception, glycolate oxidase (GOX) mutants with a photorespiratory phenotype have not been described yet in C3 species. Using Arabidopsis (Arabidopsis thaliana) mutants lacking the peroxisomal CATALASE2 (cat2-2) that display stunted growth and cell death lesions under ambient air, we isolated a second-site loss-of-function mutation in GLYCOLATE OXIDASE1 (GOX1) that attenuated the photorespiratory phenotype of cat2-2. Interestingly, knocking out the nearly identical GOX2 in the cat2-2 background did not affect the photorespiratory phenotype, indicating that GOX1 and GOX2 play distinct metabolic roles. We further investigated their individual functions in single gox1-1 and gox2-1 mutants and revealed that their phenotypes can be modulated by environmental conditions that increase the metabolic flux through the photorespiratory pathway. High light negatively affected the photosynthetic performance and growth of both gox1-1 and gox2-1 mutants, but the negative consequences of severe photorespiration were more pronounced in the absence of GOX1, which was accompanied with lesser ability to process glycolate. Taken together, our results point toward divergent functions of the two photorespiratory GOX isoforms in Arabidopsis and contribute to a better understanding of the photorespiratory pathway.
  106. Proost, S., Van Bel, M., Vaneechoutte, D., Van de Peer, Y., Inzé, D., Mueller-Roeber, B., & Vandepoele, K. (2015). PLAZA 3.0 : an access point for plant comparative genomics. NUCLEIC ACIDS RESEARCH, 43(D1), D974–D981.
    Comparative sequence analysis has significantly altered our view on the complexity of genome organization and gene functions in different kingdoms. PLAZA 3.0 is designed to make comparative genomics data for plants available through a user-friendly web interface. Structural and functional annotation, gene families, protein domains, phylogenetic trees and detailed information about genome organization can easily be queried and visualized. Compared with the first version released in 2009, which featured nine organisms, the number of integrated genomes is more than four times higher, and now covers 37 plant species. The new species provide a wider phylogenetic range as well as a more in-depth sampling of specific clades, and genomes of additional crop species are present. The functional annotation has been expanded and now comprises data from Gene Ontology, MapMan, UniProtKB/Swiss-Prot, PlnTFDB and PlantTFDB. Furthermore, we improved the algorithms to transfer functional annotation from well-characterized plant genomes to other species. The additional data and new features make PLAZA 3.0 (http://bioinformatics.psb.ugent.be/plaza/) a versatile and comprehensible resource for users wanting to explore genome information to study different aspects of plant biology, both in model and non-model organisms.
  107. Szakonyi, D., Van Landeghem, S., Baerenfaller, K., Baeyens, L., Blomme, J., Casanova-Sáez, R., De Bodt, S., et al. (2015). The KnownLeaf literature curation system captures knowledge about Arabidopsis leaf growth and development and facilitates integrated data mining. CURRENT PLANT BIOLOGY, 2, 1–11.
    The information that connects genotypes and phenotypes is essentially embedded in research articles written in natural language. To facilitate access to this knowledge, we constructed a framework for the curation of the scientific literature studying the molecular mechanisms that control leaf growth and development in Arabidopsis thaliana (Arabidopsis). Standard structured statements, called relations, were designed to capture diverse data types, including phenotypes and gene expression linked to genotype description, growth conditions, genetic and molecular interactions, and details about molecular entities. Relations were then annotated from the literature, defining the relevant terms according to standard biomedical ontologies. This curation process was supported by a dedicated graphical user interface, called Leaf Knowtator. A total of 283 primary research articles were curated by a community of annotators, yielding 9947 relations monitored for consistency and over 12,500 references to Arabidopsis genes. This information was converted into a relational database (KnownLeaf) and merged with other public Arabidopsis resources relative to transcriptional networks, protein–protein interaction, gene co-expression, and additional molecular annotations. Within KnownLeaf, leaf phenotype data can be searched together with molecular data originating either from this curation initiative or from external public resources. Finally, we built a network (LeafNet) with a portion of the KnownLeaf database content to graphically represent the leaf phenotype relations in a molecular context, offering an intuitive starting point for knowledge mining. Literature curation efforts such as ours provide high quality structured information accessible to computational analysis, and thereby to a wide range of applications. 
DATA: The presented work was performed in the framework of the AGRON-OMICS project (Arabidopsis GROwth Network integrating OMICS technologies), supported by the European Commission 6th Framework Programme (Grant number LSHG-CT-2006-037704). The project's data integration and data sharing portal collects all the major results from the consortium. All data presented in our paper are available at https://agronomics.ethz.ch/.
  108. Cai, J., Liu, X., Vanneste, K., Proost, S., Tsai, W.-C., Liu, K.-W., Chen, L.-J., et al. (2015). The genome sequence of the orchid Phalaenopsis equestris. NATURE GENETICS, 47(1), 65–72.
    Orchidaceae, renowned for its spectacular flowers and other reproductive and ecological adaptations, is one of the most diverse plant families. Here we present the genome sequence of the tropical epiphytic orchid Phalaenopsis equestris, a frequently used parent species for orchid breeding. P. equestris is the first plant with crassulacean acid metabolism (CAM) for which the genome has been sequenced. Our assembled genome contains 29,431 predicted protein-coding genes. We find that contigs likely to be underassembled, owing to heterozygosity, are enriched for genes that might be involved in self-incompatibility pathways. We find evidence for an orchid-specific paleopolyploidy event that preceded the radiation of most orchid clades, and our results suggest that gene duplication might have contributed to the evolution of CAM photosynthesis in P. equestris. Finally, we find expanded and diversified families of MADS-box C/D-class, B-class AP3 and AGL6-class genes, which might contribute to the highly specialized morphology of orchid flowers.
  109. Ranade, S. S., Lin, Y.-C., Van de Peer, Y., & García-Gil, M. R. (2015). Comparative in silico analysis of SSRs in coding regions of high confidence predicted genes in Norway spruce (Picea abies) and Loblolly pine (Pinus taeda). BMC GENETICS, 16.
    Background: Microsatellites or simple sequence repeats (SSRs) are DNA sequences consisting of 1-6 bp tandem repeat motifs present in the genome. SSRs are considered to be one of the most powerful tools in genetic studies. We carried out a comparative study of perfect SSR loci belonging to class I (>= 20 bp) and class II (>= 12 and < 20 bp) types located in coding regions of high confidence genes in Picea abies and Pinus taeda. SSRLocator was used to retrieve SSRs from the full length CDS of predicted genes in both species. Results: Trimers were the most abundant motifs in class I followed by hexamers in Picea abies, while trimers and hexamers were equally abundant in Pinus taeda class I SSRs. Hexamers were most frequent within class II SSRs followed by trimers, in both species. Although the frequency of genes containing SSRs was slightly higher in Pinus taeda, SSR counts per Mbp for class I were similar in both species (P-value = 0.22), while for class II SSRs they were significantly higher in Picea abies (P-value = 0.00009). AT-rich motifs were more abundant than GC-rich motifs within class II SSRs in both species (P-values = 10^-9 and 0). With reference to class I SSRs, AT-rich and GC-rich motifs were detected with equal frequency in Pinus taeda (P-value = 0.24), while in Picea abies, GC-rich motifs were detected with higher frequency than AT-rich motifs (P-value = 0.0005). Conclusions: Our study gives a comparative overview of the genomic SSR composition based on high confidence genes in the two recently sequenced and economically important conifers, and also provides information on functional molecular markers that can be applied in genetic studies in Pinus and Picea species.
  110. Crauwels, S., Van Assche, A., de Jonge, R., Borneman, A., Verreth, C., Troels, P., De Samblanx, G., et al. (2015). Comparative phenomics and targeted use of genomics reveals variation in carbon and nitrogen assimilation among different Brettanomyces bruxellensis strains. APPLIED MICROBIOLOGY AND BIOTECHNOLOGY, 99(21), 9123–9134.
    Recent studies have suggested a correlation between genotype groups of Brettanomyces bruxellensis and their source of isolation. To further explore this relationship, the objective of this study was to assess metabolic differences in carbon and nitrogen assimilation between different B. bruxellensis strains from three beverages, including beer, wine, and soft drink, using Biolog Phenotype Microarrays. While some similarities of physiology were noted, many traits were variable among strains. Interestingly, some phenotypes were found that could be linked to strain origin, especially for the assimilation of particular alpha- and beta-glycosides as well as alpha- and beta-substituted monosaccharides. Based upon gene presence or absence, an alpha-glucosidase and beta-glucosidase were found explaining the observed phenotypes. Further, using a PCR screen on a large number of isolates, we have been able to specifically link a genomic deletion to the beer strains, suggesting that this region may have a fitness cost for B. bruxellensis in certain fermentation systems such as brewing. More specifically, none of the beer strains were found to contain a beta-glucosidase, which may have direct impacts on the ability for these strains to compete with other microbes or on flavor production.
  111. Sundell, D., Mannapperuma, C., Netotea, S., Delhomme, N., Lin, Y.-C., Sjödin, A., Van de Peer, Y., et al. (2015). The plant genome integrative explorer resource : PlantGenIE.org. NEW PHYTOLOGIST, 208(4), 1149–1156.
    Accessing and exploring large-scale genomics data sets remains a significant challenge to researchers without specialist bioinformatics training. We present the integrated PlantGenIE.org platform for exploration of Populus, conifer and Arabidopsis genomics data, which includes expression networks and associated visualization tools. Standard features of a model organism database are provided, including genome browsers, gene list annotation, BLAST homology searches and gene information pages. Community annotation updating is supported via integration of WebApollo. We have produced an RNA-sequencing (RNA-Seq) expression atlas for Populus tremula and have integrated these data within the expression tools. An updated version of the COMPLEX resource for performing comparative plant expression analyses of gene coexpression network conservation between species has also been integrated. The PlantGenIE.org platform provides intuitive access to large-scale and genome-wide genomics data from model forest tree species, facilitating both community contributions to annotation improvement and tools supporting use of the included data resources to inform biological insight.
  112. Soltis, P. S., Marchant, D. B., Van de Peer, Y., & Soltis, D. E. (2015). Polyploidy and genome evolution in plants. CURRENT OPINION IN GENETICS & DEVELOPMENT, 35, 119–125.
    Plant genomes vary in size and complexity, fueled in part by processes of whole-genome duplication (WGD; polyploidy) and subsequent genome evolution. Despite repeated episodes of WGD throughout the evolutionary history of angiosperms in particular, the genomes are not uniformly large, and even plants with very small genomes carry the signatures of ancient duplication events. The processes governing the evolution of plant genomes following these ancient events are largely unknown. Here, we consider mechanisms of diploidization, evidence of genome reorganization in recently formed polyploid species, and macroevolutionary patterns of WGD in plant genomes and propose that the ongoing genomic changes observed in recent polyploids may illustrate the diploidization processes that result in ancient signatures of WGD over geological timescales.
  113. Delhomme, N., Sundstrom, G., Zamani, N., Lantz, H., Lin, Y.-C., Hvidsten, T. R., Hoppner, M. P., et al. (2015). Serendipitous meta-transcriptomics : the fungal community of Norway spruce (Picea abies). PLOS ONE, 10(9).
    After performing de novo transcript assembly of >1 billion RNA-Sequencing reads obtained from 22 samples of different Norway spruce (Picea abies) tissues that were not surface sterilized, we found that assembled sequences captured a mix of plant, lichen, and fungal transcripts. The latter were likely expressed by endophytic and epiphytic symbionts, indicating that these organisms were present, alive, and metabolically active. Here, we show that these serendipitously sequenced transcripts need not be considered merely as contamination, as is common, but that they provide insight into the plant's phyllosphere. Notably, we could classify these transcripts as originating predominantly from Dothideomycetes and Leotiomycetes species, with functional annotation of gene families indicating active growth and metabolism, with particular regard to glucose intake and processing, as well as gene regulation.
  114. Van den Berge, K., De Smet, R., Van de Peer, Y., & Clement, L. (2015). Quantifying expression divergence of duplicated genes with microarrays. Belgian Statistical Society, 23rd Annual meeting, Abstracts. Presented at the 23rd Annual meeting of the Belgian Statistical Society.
    Whole genome duplication (WGD) events are widespread among flowering plants. They result in two redundant genomes within the individual. Most duplicated genes derived from a WGD event (i.e. homeologous genes) will get lost during evolution. Nonetheless, they provide raw material for the evolution of genes with novel functions. Expression divergence is often used to assess the contribution of WGD in this respect. Microarray technology can be used for this purpose. With microarrays, the expression of a gene is measured by multiple 'probes', i.e. a probeset. Quantifying expression divergence involves differential expression analysis between two distinct genes, which is challenging as it involves different probesets, each having different characteristics. We show that standard analysis methods adopted in the evolutionary genomics literature typically lead to an excess of false positives, explaining the high number of reported significantly diverged genes. We propose a novel data analysis strategy to account for these probe effects. An empirical null distribution is established by adopting a test statistic on probes within a probeset. This null distribution can be incorporated in a local fdr estimate for every gene pair, which rigorously defines significant expression divergence. We illustrate our method in a case study on Arabidopsis thaliana.
  115. Vanneste, Kevin, Sterck, L., Myburg, A. A., Van de Peer, Y., & Mizrachi, E. (2015). Horsetails are ancient polyploids : evidence from Equisetum giganteum. PLANT CELL, 27(6), 1567–1578.
    Horsetails represent an enigmatic clade within the land plants. Despite consisting only of one genus (Equisetum) that contains 15 species, they are thought to represent the oldest extant genus within the vascular plants dating back possibly as far as the Triassic. Horsetails have retained several ancient features and are also characterized by a particularly high chromosome count (n = 108). Whole-genome duplications (WGDs) have been uncovered in many angiosperm clades and have been associated with the success of angiosperms, both in terms of species richness and biomass dominance, but remain understudied in nonangiosperm clades. Here, we report unambiguous evidence of an ancient WGD in the fern lineage, based on sequencing and de novo assembly of an expressed gene catalog (transcriptome) from the giant horsetail (Equisetum giganteum). We demonstrate that horsetails underwent an independent paleopolyploidy during the Late Cretaceous prior to the diversification of the genus but did not experience any recent polyploidizations that could account for their high chromosome number. We also discuss the specific retention of genes following the WGD and how this may be linked to their long-term survival.
  116. Zhang, Zhonghua, Mao, L., Chen, H., Bu, F., Li, G., Sun, J., Li, S., et al. (2015). Genome-wide mapping of structural variations reveals a copy number variant that determines reproductive morphology in cucumber. PLANT CELL, 27(6), 1595–1604.
    Structural variations (SVs) represent a major source of genetic diversity. However, the functional impact and formation mechanisms of SVs in plant genomes remain largely unexplored. Here, we report a nucleotide-resolution SV map of cucumber (Cucumis sativus) that comprises 26,788 SVs based on deep resequencing of 115 diverse accessions. The largest proportion of cucumber SVs was formed through nonhomologous end-joining rearrangements, and the occurrence of SVs is closely associated with regions of high nucleotide diversity. These SVs affect the coding regions of 1676 genes, some of which are associated with cucumber domestication. Based on the map, we discovered a copy number variation (CNV) involving four genes that defines the Female (F) locus and gives rise to gynoecious cucumber plants, which bear only female flowers and set fruit at almost every node. The CNV arose from a recent 30.2-kb duplication at a meiotically unstable region, likely via microhomology-mediated break-induced replication. The SV set provides a snapshot of structural variations in plants and will serve as an important resource for exploring genes underlying key traits and for facilitating practical breeding in cucumber.
  117. Potenza, E., Racchi, M. L., Sterck, L., Coller, E., Asquini, E., Tosatto, S. C., Velasco, R., et al. (2015). Exploration of alternative splicing events in ten different grapevine cultivars. BMC GENOMICS, 16.
    Background: The complex dynamics of gene regulation in plants are still far from being fully understood. Among many factors involved, alternative splicing (AS) in particular is one of the least well documented. For many years, AS has been considered less relevant in plants, especially when compared to animals; however, since the introduction of next generation sequencing techniques, the number of plant genes believed to be alternatively spliced has increased exponentially. Results: Here, we performed a comprehensive high-throughput transcript sequencing of ten different grapevine cultivars, which resulted in the first high coverage atlas of the grape berry transcriptome. We also developed findAS, a software tool for the analysis of alternatively spliced junctions. We demonstrate that at least 44 % of multi-exonic genes undergo AS and that a large number of low abundance splice variants are present within the 131,622 splice junctions we have annotated from Pinot noir. Conclusions: Our analysis shows that approximately 70 % of AS events have relatively low expression levels; furthermore, alternative splice sites seem to be enriched near the constitutive ones to some extent, showing the noise of the splicing mechanisms. However, AS seems to be extensively conserved among the 10 cultivars.
  118. De La Torre, A. R., Lin, Y.-C., Van de Peer, Y., & Ingvarsson, P. K. (2015). Genome-wide analysis reveals diverged patterns of codon bias, gene expression, and rates of sequence evolution in Picea gene families. GENOME BIOLOGY AND EVOLUTION, 7(4), 1002–1015.
    The recent sequencing of several gymnosperm genomes has greatly facilitated studying the evolution of their genes and gene families. In this study, we examine the evidence for expression-mediated selection in the first two fully sequenced representatives of the gymnosperm plant clade (Picea abies and Picea glauca). We use genome-wide estimates of gene expression (> 50,000 expressed genes) to study the relationship between gene expression, codon bias, rates of sequence divergence, protein length, and gene duplication. We found that gene expression is correlated with rates of sequence divergence and codon bias, suggesting that natural selection is acting on Picea protein-coding genes for translational efficiency. Gene expression, rates of sequence divergence, and codon bias are correlated with the size of gene families, with large multicopy gene families having, on average, a lower expression level and breadth, lower codon bias, and higher rates of sequence divergence than single-copy gene families. Tissue-specific patterns of gene expression were more common in large gene families with large gene expression divergence than in single-copy families. Recent family expansions combined with large gene expression variation in paralogs and increased rates of sequence evolution suggest that some Picea gene families are rapidly evolving to cope with biotic and abiotic stress. Our study highlights the importance of gene expression and natural selection in shaping the evolution of protein-coding genes in Picea species, and sets the ground for further studies investigating the evolution of individual gene families in gymnosperms.
  119. De Tiège, A., Tanghe, K., Braeckman, J., & Van de Peer, Y. (2015). Life’s dual nature: a way out of the impasse of the gene-centred “versus” complex systems controversy on life. In P. Pontarotti (Ed.), Evolutionary biology : biodiversification from genotype to phenotype (pp. 35–52). Berlin, Germany: Springer.
    Living cells and organisms are complex physical systems. Does their organization or complexity primarily rely on the intra-molecular crystalline structure of genetic nucleic acid sequences? Or is it, as critics of the ‘gene-centred’ perspective claim, predominantly a result of the inter- and supra-molecular – thus ‘holistic’ – network dynamics of genetic and various extra-genetic factors? The twentieth-century successes in several branches of genetics caused intensive focus on the causal role of genes in the biochemistry, development and evolution of living organisms, resulting in a relative abstraction or even neglect of life’s complex systems dynamics. Today, however, partly due to the success of systems biology, a number of authors defend life’s systems complexity while criticizing the gene-centred approach. Here, we offer a way out of the impasse of the gene-centred ‘versus’ complex systems perspective to arrive at a more balanced and complete understanding of life’s multifaceted nature. After sketching the conceptual and historical background of the controversy, we show how the present state of knowledge in biology vindicates both the holistically complex and gene-centred nature of life on Earth, but decisively falsifies extreme genetic ‘determinism’ and ‘reductionism’ as well as extreme ‘gene-de-centrism’. Contrary to what is often claimed, the fact that genes are one among many extra-genetic causal factors contributing to the biochemistry and development of cells and organisms, only undermines or falsifies genetic determinism and reductionism but not necessarily gene-centrism. Some implications for evolutionary theory, i.e., for the controversy between the Modern Synthesis and an ‘Extended Synthesis’, are outlined.
  120. Morel, G., Sterck, L., Swennen, D., Marcet-Houben, M., Onesime, D., Levasseur, A., Jacques, N., et al. (2015). Differential gene retention as an evolutionary mechanism to generate biodiversity and adaptation in yeasts. SCIENTIFIC REPORTS, 5.
    The evolutionary history of the characters underlying the adaptation of microorganisms to food and biotechnological uses is poorly understood. We undertook comparative genomics to investigate evolutionary relationships of the dairy yeast Geotrichum candidum within Saccharomycotina. Surprisingly, a remarkable proportion of genes showed discordant phylogenies, clustering with the filamentous fungus subphylum (Pezizomycotina), rather than the yeast subphylum (Saccharomycotina), of the Ascomycota. These genes appear not to be the result of Horizontal Gene Transfer (HGT), but to have been specifically retained by G. candidum after the filamentous fungi-yeasts split concomitant with the yeasts' genome contraction. We refer to these genes as SRAGs (Specifically Retained Ancestral Genes), having been lost by all or nearly all other yeasts, and thus contributing to the phenotypic specificity of lineages. SRAG functions include lipases consistent with a role in cheese making and novel endoglucanases associated with degradation of plant material. Similar gene retention was observed in three other distantly related yeasts representative of this ecologically diverse subphylum. The phenomenon thus appears to be widespread in the Saccharomycotina and argues that, alongside neo-functionalization following gene duplication and HGT, specific gene retention must be recognized as an important mechanism for generation of biodiversity and adaptation in yeasts.
  121. Ghorbani, S., Lin, Y.-C., Parizot, B., Fernandez Salina, A., Njo, M., Van de Peer, Y., Beeckman, T., et al. (2015). Expanding the repertoire of secretory peptides controlling root development with comparative genome analysis and functional assays. JOURNAL OF EXPERIMENTAL BOTANY, 66(17), 5257–5269.
    Plant genomes encode numerous small secretory peptides (SSPs) whose functions have yet to be explored. Based on structural features that characterize SSP families known to take part in postembryonic development, this comparative genome analysis resulted in the identification of genes coding for oligopeptides potentially involved in cell-to-cell communication. Because genome annotation based on short sequence homology is difficult, the criteria for the de novo identification and aggregation of conserved SSP sequences were first benchmarked across five reference plant species. The resulting gene families were then extended to 32 genome sequences, including major crops. The global phylogenetic pattern common to the functionally characterized SSP families suggests that their appearance and expansion coincide with that of the land plants. The SSP families can be searched online for members, sequences and consensus (http://bioinformatics.psb.ugent.be/webtools/PlantSSP/). Looking for putative regulators of root development, Arabidopsis thaliana SSP genes were further selected through transcriptome meta-analysis based on their expression at specific stages and in specific cell types in the course of lateral root formation. As an additional indication that formerly uncharacterized SSPs may control development, this study showed that root growth and branching were altered by the application of synthetic peptides matching conserved SSP motifs, sometimes in very specific ways. The strategy used in the study, combining comparative genomics, transcriptome meta-analysis and peptide functional assays in planta, pinpoints factors potentially involved in non-cell-autonomous regulatory mechanisms. A similar approach can be implemented in different species for the study of a wide range of developmental programmes.
  122. Saltykova, A., Pulido-Tamayo, S., Pazoutova, M., Rensing, S. A., Nishiyama, T., Van de Peer, Y., Marchal, K., et al. (2015). Identifying prokaryotic consortia that live in close interactions with algae. EUROPEAN JOURNAL OF PHYCOLOGY (Vol. 50, pp. 145–146). Presented at the 6th European Phycological Congress.
  123. De Tiège, A., Tanghe, K., Braeckman, J., & Van de Peer, Y. (2014). From DNA- to NA-centrism and the conditions for gene-centrism revisited. BIOLOGY & PHILOSOPHY, 29(1), 55–69.
    First the 'Weismann barrier' and later on Francis Crick's 'central dogma' of molecular biology nourished the gene-centric paradigm of life, i.e., the conception of the gene/genome as a 'central source' from which hereditary specificity unidirectionally flows or radiates into cellular biochemistry and development. Today, due to advances in molecular genetics and epigenetics, such as the discovery of complex post-genomic and epigenetic processes in which genes are causally integrated, many theorists argue that a gene-centric conception of the organism has become problematic. Here, we first explore the causal implications of the following two central dogma-related issues: (1) widespread reverse transcription, arguing for an extension from 'DNA-genome' to RNA-encompassing 'NA-genome' and, thus, from traditional DNA-centrism to a broader 'NA-centrism'; and (2) the absence of a mechanism of reverse translation, arguing for the 'structural primacy' of NA-sequence over protein in cellular biochemistry. Secondly, we explore whether this latter conclusion can be extended to a 'functional primacy' of NA-sequence over protein in cellular biochemistry, which would imply a limited kind of 'gene/NA-centrism' confined to the subcellular level of NA/protein-based biochemistry. Finally, we explore the conditions, and their (non)fulfilment, for a more generalised form of gene-centrism extendable to higher levels of biological organisation. We conclude that the higher we go in the biological hierarchy, the more dubious gene-centric claims become.
  124. Myburg, A. A., Grattapaglia, D., Tuskan, G. A., Hellsten, U., Hayes, R. D., Grimwood, J., Jenkins, J., et al. (2014). The genome of Eucalyptus grandis. NATURE, 510(7505), 356–362.
    Eucalypts are the world's most widely planted hardwood trees. Their outstanding diversity, adaptability and growth have made them a global renewable resource of fibre and energy. We sequenced and assembled >94% of the 640-megabase genome of Eucalyptus grandis. Of 36,376 predicted protein-coding genes, 34% occur in tandem duplications, the largest proportion thus far in plant genomes. Eucalyptus also shows the highest diversity of genes for specialized metabolites such as terpenes that act as chemical defence and provide unique pharmaceutical oils. Genome sequencing of the E. grandis sister species E. globulus and a set of inbred E. grandis tree genomes reveals dynamic genome evolution and hotspots of inbreeding depression. The E. grandis genome is the first reference for the eudicot order Myrtales and is placed here sister to the eurosids. This resource expands our understanding of the unique biology of large woody perennials and provides a powerful tool to accelerate comparative biology, breeding and biotechnology.
  125. Vanneste, Kevin, Maere, S., & Van de Peer, Y. (2014). Tangled up in two: a burst of genome duplications at the end of the Cretaceous and the consequences for plant evolution. PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES, 369(1648).
    Genome sequencing has demonstrated that besides frequent small-scale duplications, large-scale duplication events such as whole genome duplications (WGDs) are found on many branches of the evolutionary tree of life. Especially in the plant lineage, there is evidence for recurrent WGDs, and the ancestor of all angiosperms was in fact most likely a polyploid species. The number of WGDs found in sequenced plant genomes allows us to investigate questions about the roles of WGDs that were hitherto impossible to address. An intriguing observation is that many plant WGDs seem associated with periods of increased environmental stress and/or fluctuations, a trend that is evident for both present-day polyploids and palaeopolyploids formed around the Cretaceous-Palaeogene (K-Pg) extinction at 66 Ma. Here, we revisit the WGDs in plants that mark the K-Pg boundary, and discuss some specific examples of biological innovations and/or diversifications that may be linked to these WGDs. We review evidence for the processes that could have contributed to increased polyploid establishment at the K-Pg boundary, and discuss the implications on subsequent plant evolution in the Cenozoic.
  126. Ciesielska, K., Van Bogaert, I., Chevineau, S., Li, B., Groeneboer, S., Soetaert, W., Van de Peer, Y., et al. (2014). Exoproteome analysis of Starmerella bombicola results in the discovery of an esterase required for lactonization of sophorolipids. JOURNAL OF PROTEOMICS, 98, 159–174.
  127. Blanc-Mathieu, R., Verhelst, B., Derelle, E., Rombauts, S., Bouget, F.-Y., Carre, I., Chateau, A., et al. (2014). An improved genome of the model marine alga Ostreococcus tauri unfolds by assessing Illumina de novo assemblies. BMC GENOMICS, 15.
    Background: Cost effective next generation sequencing technologies now enable the production of genomic datasets for many novel planktonic eukaryotes, representing an understudied reservoir of genetic diversity. O. tauri is the smallest free-living photosynthetic eukaryote known to date, a coccoid green alga that was first isolated in 1995 in a lagoon by the Mediterranean sea. Its simple features, ease of culture and the sequencing of its 13 Mb haploid nuclear genome have promoted this microalga as a new model organism for cell biology. Here, we investigated the quality of genome assemblies of Illumina GAIIx 75 bp paired end reads from Ostreococcus tauri, thereby also improving the existing assembly and showing the genome to be stably maintained in culture. Results: The 3 assemblers used, ABySS, CLCBio and Velvet, produced 95% complete genomes in 1402 to 2080 scaffolds with a very low rate of misassembly. Reciprocally, these assemblies improved the original genome assembly by filling in 930 gaps. Combined with additional analysis of raw reads and PCR sequencing effort, 1194 gaps have been solved in total adding up to 460 kb of sequence. Mapping of RNAseq Illumina data on this updated genome led to a twofold reduction in the proportion of multi-exon protein coding genes, representing 19% of the total 7699 protein coding genes. The comparison of the DNA extracted in 2001 and 2009 revealed the fixation of 8 single nucleotide substitutions and 2 deletions during the approximately 6000 generations in the lab. The deletions either knocked out or truncated two predicted transmembrane proteins, including a glutamate receptor like gene. Conclusion: High coverage (>80 fold) paired end Illumina sequencing enables a high quality 95% complete genome assembly of a compact 13 Mb haploid eukaryote. This genome sequence has remained stable for 6000 generations of lab culture.
  128. Vanneste, Kevin, Baele, G., Maere, S., & Van de Peer, Y. (2014). Analysis of 41 plant genomes supports a wave of successful genome duplications in association with the Cretaceous-Paleogene boundary. GENOME RESEARCH, 24(8), 1334–1347.
    Ancient whole-genome duplications (WGDs), also referred to as paleopolyploidizations, have been reported in most evolutionary lineages. Their attributed role remains a major topic of discussion, ranging from an evolutionary dead end to a road toward evolutionary success, with evidence supporting both fates. Previously, based on dating WGDs in a limited number of plant species, we found a clustering of angiosperm paleopolyploidizations around the Cretaceous-Paleogene (K-Pg) extinction event about 66 million years ago. Here we revisit this finding, which has proven controversial, by combining genome sequence information for many more plant lineages and using more sophisticated analyses. We include 38 full genome sequences and three transcriptome assemblies in a Bayesian evolutionary analysis framework that incorporates uncorrelated relaxed clock methods and fossil uncertainty. In accordance with earlier findings, we demonstrate a strongly nonrandom pattern of genome duplications over time with many WGDs clustering around the K-Pg boundary. We interpret these results in the context of recent studies on invasive polyploid plant species, and suggest that polyploid establishment is promoted during times of environmental stress. We argue that considering the evolutionary potential of polyploids in light of the environmental and ecological conditions present around the time of polyploidization could mitigate the stark contrast in the proposed evolutionary fates of polyploids.
  129. Mushthofa, M., Torres Torres, G. A., Van de Peer, Y., Marchal, K., & De Cock, M. (2014). ASP-G: an ASP-based method for finding attractors in genetic regulatory networks. BIOINFORMATICS, 30(21), 3086–3092.
    Motivation: Boolean network models are suitable to simulate GRNs in the absence of detailed kinetic information. However, reducing the biological reality implies making assumptions on how genes interact (interaction rules) and how their state is updated during the simulation (update scheme). The exact choice of the assumptions largely determines the outcome of the simulations. In most cases, however, the biologically correct assumptions are unknown. An ideal simulation thus implies testing different rules and schemes to determine those that best capture an observed biological phenomenon. This is not trivial because most current methods to simulate Boolean network models of GRNs and to compute their attractors impose specific assumptions that cannot be easily altered, as they are built into the system. Results: To allow for a more flexible simulation framework, we developed ASP-G. We show the correctness of ASP-G in simulating Boolean network models and obtaining attractors under different assumptions by successfully recapitulating the detection of attractors of previously published studies. We also provide an example of how performing simulation of network models under different settings helps determine the assumptions under which a certain conclusion holds. The main added value of ASP-G is in its modularity and declarativity, making it more flexible and less error-prone than traditional approaches. The declarative nature of ASP-G comes at the expense of being slower than the more dedicated systems but still achieves a good efficiency with respect to computational time. Availability and implementation: The source code of ASP-G is available at http://bioinformatics.intec.ugent.be/kmarchal/Supplementary_Information_Musthofa_2014/asp-g.zip.
  130. Ranade, S. S., Lin, Y.-C., Zuccolo, A., Van de Peer, Y., & Garcia-Gil, M. del R. (2014). Comparative in silico analysis of EST-SSRs in angiosperm and gymnosperm tree genera. BMC PLANT BIOLOGY, 14.
    Background: Simple Sequence Repeats (SSRs) derived from Expressed Sequence Tags (ESTs) belong to the expressed fraction of the genome and are important for gene regulation, recombination, DNA replication, cell cycle and mismatch repair. Here, we present a comparative analysis of the SSR motif distribution in the 5'UTR, ORF and 3'UTR fractions of ESTs across selected genera of woody trees representing gymnosperms (17 species from seven genera) and angiosperms (40 species from eight genera). Results: Our analysis supports a modest contribution of EST-SSR length to genome size in gymnosperms, while EST-SSR density was not associated with genome size in either angiosperms or gymnosperms. Multiple factors seem to have contributed to the lower abundance of EST-SSRs in gymnosperms that has resulted in a non-linear relationship with genome size diversity. The AG/CT motif was found to be the most abundant in SSRs of both angiosperms and gymnosperms, with a relative increase in AT/AT in the latter. Our data also reveal a higher abundance of hexamers across the gymnosperm genera. Conclusions: Our analysis provides the foundation for future comparative studies at the species level to unravel the evolutionary processes that control the SSR genesis and divergence between angiosperm and gymnosperm tree species.
  131. Vermeirssen, V., De Clercq, I., Van Parys, T., Van Breusegem, F., & Van de Peer, Y. (2014). Arabidopsis ensemble reverse-engineered gene regulatory network discloses interconnected transcription factors in oxidative stress. PLANT CELL, 26(12), 4656–4679.
    The abiotic stress response in plants is complex and tightly controlled by gene regulation. We present an abiotic stress gene regulatory network of 200,014 interactions for 11,938 target genes by integrating four complementary reverse-engineering solutions through average rank aggregation on an Arabidopsis thaliana microarray expression compendium. This ensemble performed the most robustly in benchmarking and greatly expands upon the availability of interactions currently reported. Besides recovering 1182 known regulatory interactions, cis-regulatory motifs and coherent functionalities of target genes corresponded with the predicted transcription factors. We provide a valuable resource of 572 abiotic stress modules of coregulated genes with functional and regulatory information, from which we deduced functional relationships for 1966 uncharacterized genes and many regulators. Using gain- and loss-of-function mutants of seven transcription factors grown under control and salt stress conditions, we experimentally validated 141 out of 271 predictions (52% precision) for 102 selected genes and mapped 148 additional transcription factor-gene regulatory interactions (49% recall). We identified an intricate core oxidative stress regulatory network where NAC13, NAC053, ERF6, WRKY6, and NAC032 transcription factors interconnect and function in detoxification. Our work shows that ensemble reverse-engineering can generate robust biological hypotheses of gene regulation in a multicellular eukaryote that can be tested by medium-throughput experimental validation.
  132. Ahmed, S., Cock, J. M., Pessia, E., Luthringer, R., Cormier, A., Robuchon, M., Sterck, L., et al. (2014). A haploid system of sex determination in the brown alga Ectocarpus sp. CURRENT BIOLOGY, 24(17), 1945–1957.
    Background: A common feature of most genetic sex-determination systems studied so far is that sex is determined by nonrecombining genomic regions, which can be of various sizes depending on the species. These regions have evolved independently and repeatedly across diverse groups. A number of such sex-determining regions (SDRs) have been studied in animals, plants, and fungi, but very little is known about the evolution of sexes in other eukaryotic lineages. Results: We report here the sequencing and genomic analysis of the SDR of Ectocarpus, a brown alga that has been evolving independently from plants, animals, and fungi for over one giga-annum. In Ectocarpus, sex is expressed during the haploid phase of the life cycle, and both the female (U) and the male (V) sex chromosomes contain nonrecombining regions. The U and V of this species have been diverging for more than 70 mega-annum, yet gene degeneration has been modest, and the SDR is relatively small, with no evidence for evolutionary strata. These features may be explained by the occurrence of strong purifying selection during the haploid phase of the life cycle and the low level of sexual dimorphism. V is dominant over U, suggesting that femaleness may be the default state, adopted when the male haplotype is absent. Conclusions: The Ectocarpus UV system has clearly had a distinct evolutionary trajectory not only to the well-studied XY and ZW systems but also to the UV systems described so far. Nonetheless, some striking similarities exist, indicating remarkable universality of the underlying processes shaping sex chromosome evolution across distant lineages.
  133. Pajoro, A., Biewers, S., Dougali, E., Valentim, F. L., Mendes, M. A., Porri, A., Coupland, G., et al. (2014). The (r)evolution of gene regulatory networks controlling Arabidopsis plant reproduction: a two-decade history. JOURNAL OF EXPERIMENTAL BOTANY, 65(17), 4731–4745.
    Successful plant reproduction relies on the perfect orchestration of singular processes that culminate in the product of reproduction: the seed. The floral transition, floral organ development, and fertilization are well-studied processes and the genetic regulation of the various steps is being increasingly unveiled. Initially, based predominantly on genetic studies, the regulatory pathways were considered to be linear, but recent genome-wide analyses, using high-throughput technologies, have begun to reveal a different scenario. Complex gene regulatory networks underlie these processes, including transcription factors, microRNAs, movable factors, hormones, and chromatin-modifying proteins. Here we review recent progress in understanding the networks that control the major steps in plant reproduction, showing how new advances in experimental and computational technologies have been instrumental. As these recent discoveries were obtained using the model species Arabidopsis thaliana, we will restrict this review to regulatory networks in this important model species. However, more fragmentary information obtained from other species reveals that both the developmental processes and the underlying regulatory networks are largely conserved, making this review also of interest to those studying other plant species.
  134. Lin, Y.-C., Boone, M., Meuris, L., Lemmens, I., Van Roy, N., Soete, A., Reumers, J., et al. (2014). Genome dynamics of the human embryonic kidney 293 lineage in response to cell biology manipulations. NATURE COMMUNICATIONS, 5.
    The HEK293 human cell lineage is widely used in cell biology and biotechnology. Here we use whole-genome resequencing of six 293 cell lines to study the dynamics of this aneuploid genome in response to the manipulations used to generate common 293 cell derivatives, such as transformation and stable clone generation (293T); suspension growth adaptation (293S); and cytotoxic lectin selection (293SG). Remarkably, we observe that copy number alteration detection could identify the genomic region that enabled cell survival under selective conditions (i.c. ricin selection). Furthermore, we present methods to detect human/vector genome breakpoints and a user-friendly visualization tool for the 293 genome data. We also establish that the genome structure composition is in steady state for most of these cell lines when standard cell culturing conditions are used. This resource enables novel and more informed studies with 293 cells, and we will distribute the sequenced cell lines to this effect.
  135. Chaves, I., Lin, Y.-C., Pinto-Ricardo, C., Van de Peer, Y., & Miguel, C. (2014). miRNA profiling in leaf and cork tissues of Quercus suber reveals novel miRNAs and tissue-specific expression patterns. TREE GENETICS & GENOMES, 10(3), 721–737.
    The differentiation of cork (phellem) cells from the phellogen (cork cambium) is a secondary growth process observed in the cork oak tree conferring a unique ability to produce a thick layer of cork. At present, the molecular regulators of phellem differentiation are unknown. The previously documented involvement of microRNAs (miRNAs) in the regulation of developmental processes, including secondary growth, motivated the search for these regulators in cork oak tissues. We performed deep sequencing of the small RNA fraction obtained from cork oak leaves and differentiating phellem. RNA sequences with lengths of 19-25 nt derived from the two libraries were analysed, leading to the identification of 41 families of conserved miRNAs, of which the most abundant were miR167, miR165/166, miR396 and miR159. Thirty novel miRNA candidates were also unveiled, 11 of which were unique to leaves and 13 to phellem. Northern blot detection of a set of conserved and novel miRNAs confirmed their differential expression profile. Prediction and analysis of putative miRNA target genes provided clues regarding processes taking place in leaf and phellem tissues, but further experimental work will be needed for functional characterization. In conclusion, we here provide a first characterization of the miRNA population in a Fagaceae species, and the comparative analysis of miRNA expression in leaf and phellem libraries represents an important step to uncovering specific regulatory networks controlling phellem differentiation.
  136. Morreel, K., Saeys, Y., Dima, O., Lu, F., Van de Peer, Y., Vanholme, R., Ralph, J., et al. (2014). Systematic structural characterization of metabolites in Arabidopsis via candidate substrate-product pair networks. PLANT CELL, 26(3), 929–945.
    Plant metabolomics is increasingly used for pathway discovery and to elucidate gene function. However, the main bottleneck is the identification of the detected compounds. This is more pronounced for secondary metabolites as many of their pathways are still underexplored. Here, an algorithm is presented in which liquid chromatography-mass spectrometry profiles are searched for pairs of peaks that have mass and retention time differences corresponding with those of substrates and products from well-known enzymatic reactions. Concatenating the latter peak pairs, called candidate substrate-product pairs (CSPP), into a network displays tentative (bio)synthetic routes. Starting from known peaks, propagating the network along these routes allows the characterization of adjacent peaks leading to their structure prediction. As a proof-of-principle, this high-throughput cheminformatics procedure was applied to the Arabidopsis thaliana leaf metabolome where it allowed the characterization of the structures of 60% of the profiled compounds. Moreover, based on searches in the Chemical Abstract Service database, the algorithm led to the characterization of 61 compounds that had never been described in plants before. The CSPP-based annotation was confirmed by independent MSn experiments. In addition to being high throughput, this method allows the annotation of low-abundance compounds that are otherwise not amenable to isolation and purification. This method will greatly advance the value of metabolomics in systems biology.
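    The peak-pair search described above can be sketched in a few lines. This is only a simplified illustration, not the published algorithm: it matches on mass differences alone and ignores retention time, and the function name `candidate_pairs`, the tolerance, and the peak masses are all hypothetical. The three mass shifts are the standard monoisotopic values for adding CH2, O, and a hexose residue (C6H10O5).

```python
from itertools import combinations

# Monoisotopic mass shifts (Da) for a few common enzymatic conversions.
REACTION_MASS_SHIFTS = {
    "methylation": 14.01565,    # + CH2
    "oxygenation": 15.99491,    # + O
    "hexosylation": 162.05282,  # + C6H10O5
}


def candidate_pairs(peaks, tolerance=0.005):
    """Return (substrate_id, product_id, reaction) triples.

    peaks maps a peak identifier to its neutral monoisotopic mass in Da.
    Two peaks form a candidate substrate-product pair when their mass
    difference matches a known reaction shift within the tolerance.
    """
    # Sort by mass so that the first peak of every pair is the lighter one.
    by_mass = sorted(peaks.items(), key=lambda item: item[1])
    pairs = []
    for (light_id, light_mass), (heavy_id, heavy_mass) in combinations(by_mass, 2):
        delta = heavy_mass - light_mass
        for reaction, shift in REACTION_MASS_SHIFTS.items():
            if abs(delta - shift) <= tolerance:
                pairs.append((light_id, heavy_id, reaction))
    return pairs


# Hypothetical peak list, for illustration only.
example_peaks = {
    "p1": 224.06847,
    "p2": 386.12129,  # p1 + hexose
    "p3": 238.08412,  # p1 + CH2
    "p4": 300.00000,  # unrelated peak
}

for substrate, product, reaction in candidate_pairs(example_peaks):
    print(substrate, "->", product, reaction)
```

    Treating each returned triple as a directed edge would give the kind of network along which annotations can be propagated from known peaks; a fuller version would also filter pairs on the expected retention time shift.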
  137. Bolton, M. D., de Jonge, R., Inderbitzin, P., Liu, Z., Birla, K., Van de Peer, Y., Subbarao, K. V., et al. (2014). The heterothallic sugarbeet pathogen Cercospora beticola contains exon fragments of both MAT genes that are homogenized by concerted evolution. FUNGAL GENETICS AND BIOLOGY, 62, 43–54.
    Dothideomycetes is one of the most ecologically diverse and economically important classes of fungi. Sexual reproduction in this group is governed by mating type (MAT) genes at the MAT1 locus. Self-sterile (heterothallic) species contain one of two genes at MAT1 (MAT1-1-1 or MAT1-2-1) and only isolates of opposite mating type are sexually compatible. In contrast, self-fertile (homothallic) species contain both MAT genes at MAT1. Knowledge of the reproductive capacities of plant pathogens is of particular interest because recombining populations tend to be more difficult to manage in agricultural settings. In this study, we sequenced MAT1 in the heterothallic Dothideomycete fungus Cercospora beticola to gain insight into the reproductive capabilities of this important plant pathogen. In addition to the expected MAT gene at MAT1, each isolate contained fragments of both MAT1-1-1 and MAT1-2-1 at ostensibly random loci across the genome. When MAT fragments from each locus were manually assembled, they reconstituted MAT1-1-1 and MAT1-2-1 exons with high identity, suggesting a retroposition event occurred in a homothallic ancestor in which both MAT genes were fused. The genome sequences of related taxa revealed that the MAT gene fragment pattern of Cercospora zeae-maydis was analogous to that of C. beticola. In contrast, the genome of the more distantly related Mycosphaerella graminicola did not contain MAT fragments. Although fragments occurred in syntenic regions of the C. beticola and C. zeae-maydis genomes, each MAT fragment was more closely related to the intact MAT gene of the same species. Taken together, these data suggest MAT genes fragmented after divergence of M. graminicola from the remaining taxa, and concerted evolution functioned to homogenize MAT fragments and MAT genes in each species.
  138. Yao, Yao, Marchal, K., & Van de Peer, Y. (2014). Improving the adaptability of simulated evolutionary swarm robots in dynamically changing environments. PLOS ONE, 9(3).
    One of the important challenges in the field of evolutionary robotics is the development of systems that can adapt to a changing environment. However, the ability to adapt to unknown and fluctuating environments is not straightforward. Here, we explore the adaptive potential of simulated swarm robots that contain a genomic encoding of a bio-inspired gene regulatory network (GRN). An artificial genome is combined with a flexible agent-based system, representing the activated part of the regulatory network that transduces environmental cues into phenotypic behaviour. Using an artificial life simulation framework that mimics a dynamically changing environment, we show that separating the static from the conditionally active part of the network contributes to a better adaptive behaviour. Furthermore, in contrast with most hitherto developed ANN-based systems that need to re-optimize their complete controller network from scratch each time they are subjected to novel conditions, our system uses its genome to store GRNs whose performance was optimized under a particular environmental condition for a sufficiently long time. When subjected to a new environment, the previous condition-specific GRN might become inactivated, but remains present. This ability to store 'good behaviour' and to disconnect it from the novel rewiring that is essential under a new condition allows faster re-adaptation if any of the previously observed environmental conditions is reencountered. As we show here, applying these evolutionary-based principles leads to accelerated and improved adaptive evolution in a non-stable environment.
  139. Bracken-Grissom, H., Collins, A. G., Collins, T., Crandall, K., Distel, D., Dunn, C., Giribet, G., et al. (2014). The Global Invertebrate Genomics Alliance (GIGA): developing community resources to study diverse invertebrate genomes. JOURNAL OF HEREDITY, 105(1), 1–18.
    Over 95% of all metazoan (animal) species comprise the invertebrates, but very few genomes from these organisms have been sequenced. We have, therefore, formed a Global Invertebrate Genomics Alliance (GIGA). Our intent is to build a collaborative network of diverse scientists to tackle major challenges (e.g., species selection, sample collection and storage, sequence assembly, annotation, analytical tools) associated with genome/transcriptome sequencing across a large taxonomic spectrum. We aim to promote standards that will facilitate comparative approaches to invertebrate genomics and collaborations across the international scientific community. Candidate study taxa include species from Porifera, Ctenophora, Cnidaria, Placozoa, Mollusca, Arthropoda, Echinodermata, Annelida, Bryozoa, and Platyhelminthes, among others. GIGA will target 7000 noninsect/nonnematode species, with an emphasis on marine taxa because of the unrivaled phyletic diversity in the oceans. Priorities for selecting invertebrates for sequencing will include, but are not restricted to, their phylogenetic placement; relevance to organismal, ecological, and conservation research; and their importance to fisheries and human health. We highlight benefits of sequencing both whole genomes (DNA) and transcriptomes and also suggest policies for genomic-level data access and sharing based on transparency and inclusiveness. The GIGA Web site has been launched to facilitate this collaborative venture.
  140. Zhurov, V., Navarro, M., Bruinsma, K. A., Arbona, V., Santamaria, M. E., Cazaux, M., … Grbić, V. (2014). Reciprocal responses in the interaction between Arabidopsis and the cell-content feeding chelicerate herbivore spider mite. PLANT PHYSIOLOGY, 164(1), 384–399.
    Most molecular-genetic studies of plant defense responses to arthropod herbivores have focused on insects. However, plant-feeding mites are also pests of diverse plants, and mites induce different patterns of damage to plant tissues than do well-studied insects (e.g. lepidopteran larvae or aphids). The two-spotted spider mite (Tetranychus urticae) is among the most significant mite pests in agriculture, feeding on a staggering number of plant hosts. To understand the interactions between spider mite and a plant at the molecular level, we examined reciprocal genome-wide responses of mites and their host Arabidopsis (Arabidopsis thaliana). Despite differences in feeding guilds, we found that transcriptional responses of Arabidopsis to mite herbivory resembled those observed for lepidopteran herbivores. Mutant analysis of induced plant defense pathways showed functionally that only a subset of induced programs, including jasmonic acid signaling and biosynthesis of indole glucosinolates, are central to Arabidopsis's defense to mite herbivory. On the herbivore side, indole glucosinolates dramatically increased mite mortality and development times. We identified an indole glucosinolate dose-dependent increase in the number of differentially expressed mite genes belonging to pathways associated with detoxification of xenobiotics. This demonstrates that spider mite is sensitive to Arabidopsis defenses that have also been associated with the deterrence of insect herbivores that are very distantly related to chelicerates. Our findings provide molecular insights into the nature of, and response to, herbivory for a representative of a major class of arthropod herbivores.
  141. Fawcett, J., Van de Peer, Y., & Maere, S. (2013). Significance and biological consequences of polyploidization in land plant evolution. In J. Greilhuber, J. Doležel, & J. F. Wendel (Eds.), Physical structure, behaviour and evolution of plant genomes (Vol. 2, pp. 277–293). Vienna, Austria: Springer.
  142. Vanneste, Kevin, Van de Peer, Y., & Maere, S. (2013). Inference of genome duplications from age distributions revisited. MOLECULAR BIOLOGY AND EVOLUTION, 30(1), 177–190.
    Whole-genome duplications (WGDs), thought to facilitate evolutionary innovations and adaptations, have been uncovered in many phylogenetic lineages. WGDs are frequently inferred from duplicate age distributions, where they manifest themselves as peaks against a small-scale duplication background. However, the interpretation of duplicate age distributions is complicated by the use of Ks, the number of synonymous substitutions per synonymous site, as a proxy for the age of paralogs. Two particular concerns are the stochastic nature of synonymous substitutions leading to increasing uncertainty in Ks with increasing age since duplication and Ks saturation caused by the inability of evolutionary models to fully correct for the occurrence of multiple substitutions at the same site. Ks stochasticity is expected to erode the signal of older WGDs, whereas Ks saturation may lead to artificial peaks in the distribution. Here, we investigate the consequences of these effects on Ks-based age distributions and WGD inference by simulating the evolution of duplicated sequences according to predefined real age distributions and re-estimating the corresponding Ks distributions. We show that, although Ks estimates can be used for WGD inference far beyond the commonly accepted Ks threshold of 1, Ks saturation effects can cause artificial peaks at higher ages. Moreover, Ks stochasticity and saturation may lead to confounded peaks encompassing multiple WGD events and/or saturation artifacts. We argue that Ks effects need to be properly accounted for when inferring WGDs from age distributions and that the failure to do so could lead to false inferences.
  143. Andolfo, G., Sanseverino, W., Rombauts, S., Van de Peer, Y., Bradeen, J., Carputo, D., Frusciante, L., et al. (2013). Overview of tomato (Solanum lycopersicum) candidate pathogen recognition genes reveals important Solanum R locus dynamics. NEW PHYTOLOGIST, 197(1), 223–237.
    To investigate the genome-wide spatial arrangement of R loci, a complete catalogue of tomato (Solanum lycopersicum) and potato (Solanum tuberosum) nucleotide-binding site (NBS), receptor-like protein (RLP) and receptor-like kinase (RLK) gene repertories was generated. Candidate pathogen recognition genes were characterized with respect to structural diversity, phylogenetic relationships and chromosomal distribution. NBS genes frequently occur in clusters of related gene copies that also include RLP or RLK genes. This scenario is compatible with the existence of selective pressures optimizing coordinated transcription. A number of duplication events associated with lineage-specific evolution were discovered. These findings suggest that different evolutionary mechanisms shaped pathogen recognition gene cluster architecture to expand and to modulate the defence repertoire. Analysis of pathogen recognition gene clusters associated with documented resistance function allowed the identification of adaptive divergence events and the reconstruction of the evolution history of these loci. Differences in candidate pathogen recognition gene number and organization were found between tomato and potato. Most candidate pathogen recognition gene orthologues were distributed at less than perfectly matching positions, suggesting an ongoing lineage-specific rearrangement. Indeed, a local expansion of Toll/Interleukin-1 receptor (TIR)-NBS-leucine-rich repeat (LRR) (TNL) genes in the potato genome was evident. Taken together, these findings have implications for improved understanding of the mechanisms of molecular adaptive selection at Solanum R loci.
  144. Van Landeghem, S., Bjorne, J., Wei, C.-H., Hakala, K., Pyysalo, S., Ananiadou, S., Kao, H.-Y., et al. (2013). Large-scale event extraction from literature with multi-level gene normalization. PLOS ONE, 8(4).
    Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.
  145. Zimmer, A. D., Lang, D., Buchta, K., Rombauts, S., Nishiyama, T., Hasebe, M., Van de Peer, Y., et al. (2013). Reannotation and extended community resources for the genome of the non-seed plant Physcomitrella patens provide insights into the evolution of plant gene structures and functions. BMC GENOMICS, 14.
    Background: The moss Physcomitrella patens as a model species provides an important reference for early-diverging lineages of plants and the release of the genome in 2008 opened the doors to genome-wide studies. The usability of a reference genome greatly depends on the quality of the annotation and the availability of centralized community resources. Therefore, in the light of accumulating evidence for missing genes, fragmentary gene structures, false annotations and a low rate of functional annotations on the original release, we decided to improve the moss genome annotation. Results: Here, we report the complete moss genome re-annotation (designated V1.6) incorporating the increased transcript availability from a multitude of developmental stages and tissue types. We demonstrate the utility of the improved P. patens genome annotation for comparative genomics and new extensions to the cosmoss.org resource as a central repository for this plant "flagship" genome. The structural annotation of 32,275 protein-coding genes results in 8387 additional loci including 1456 loci with known protein domains or homologs in Plantae. This is the first release to include information on transcript isoforms, suggesting alternative splicing events for at least 10.8% of the loci. Furthermore, this release now also provides information on non-protein-coding loci. Functional annotations were improved regarding quality and coverage, resulting in 58% annotated loci (previously: 41%) that comprise also 7200 additional loci with GO annotations. Access and manual curation of the functional and structural genome annotation is provided via the www.cosmoss.org model organism database. Conclusions: Comparative analysis of gene structure evolution along the green plant lineage provides novel insights, such as a comparatively high number of loci with 5'-UTR introns in the moss. Comparative analysis of functional annotations reveals expansions of moss house-keeping and metabolic genes and further possibly adaptive, lineage-specific expansions and gains including at least 13% orphan genes.
  146. Read, B. A., Kegel, J., Klute, M. J., Kuo, A., Lefebvre, S. C., Maumus, F., Mayer, C., et al. (2013). Pan genome of the phytoplankton Emiliania underpins its global distribution. NATURE, 499(7457), 209–213.
    Coccolithophores have influenced the global climate for over 200 million years. These marine phytoplankton can account for 20 per cent of total carbon fixation in some systems. They form blooms that can occupy hundreds of thousands of square kilometres and are distinguished by their elegantly sculpted calcium carbonate exoskeletons (coccoliths), rendering them visible from space. Although coccolithophores export carbon in the form of organic matter and calcite to the sea floor, they also release CO2 in the calcification process. Hence, they have a complex influence on the carbon cycle, driving either CO2 production or uptake, sequestration and export to the deep ocean. Here we report the first haptophyte reference genome, from the coccolithophore Emiliania huxleyi strain CCMP1516, and sequences from 13 additional isolates. Our analyses reveal a pan genome (core genes plus genes distributed variably between strains) probably supported by an atypical complement of repetitive sequence in the genome. Comparisons across strains demonstrate that E. huxleyi, which has long been considered a single species, harbours extensive genome variability reflected in different metabolic repertoires. Genome variability within this species complex seems to underpin its capacity both to thrive in habitats ranging from the equator to the subarctic and to form large-scale episodic blooms under a wide variety of environmental conditions.
  147. Galagan, J. E., Minch, K., Peterson, M., Lyubetskaya, A., Azizi, E., Sweet, L., Gomes, A., et al. (2013). The Mycobacterium tuberculosis regulatory network and hypoxia. NATURE, 499(7457), 178–183.
    We have taken the first steps towards a complete reconstruction of the Mycobacterium tuberculosis regulatory network based on ChIP-Seq and combined this reconstruction with system-wide profiling of messenger RNAs, proteins, metabolites and lipids during hypoxia and re-aeration. Adaptations to hypoxia are thought to have a prominent role in M. tuberculosis pathogenesis. Using ChIP-Seq combined with expression data from the induction of the same factors, we have reconstructed a draft regulatory network based on 50 transcription factors. This network model revealed a direct interconnection between the hypoxic response, lipid catabolism, lipid anabolism and the production of cell wall lipids. As a validation of this model, in response to oxygen availability we observe substantial alterations in lipid content and changes in gene expression and metabolites in corresponding metabolic pathways. The regulatory network reveals transcription factors underlying these changes, allows us to computationally predict expression changes, and indicates that Rv0081 is a regulatory hub.
  148. Nystedt, B., Street, N. R., Wetterbom, A., Zuccolo, A., Lin, Y.-C., Scofield, D. G., Vezzi, F., et al. (2013). The Norway spruce genome sequence and conifer genome evolution. NATURE, 497(7451), 579–584.
    Conifers have dominated forests for more than 200 million years and are of huge ecological and economic importance. Here we present the draft assembly of the 20-gigabase genome of Norway spruce (Picea abies), the first available for any gymnosperm. The number of well-supported genes (28,354) is similar to the >100 times smaller genome of Arabidopsis thaliana, and there is no evidence of a recent whole-genome duplication in the gymnosperm lineage. Instead, the large genome size seems to result from the slow and steady accumulation of a diverse set of long-terminal repeat transposable elements, possibly owing to the lack of an efficient elimination mechanism. Comparative sequencing of Pinus sylvestris, Abies sibirica, Juniperus communis, Taxus baccata and Gnetum gnemon reveals that the transposable element diversity is shared among extant conifers. Expression of 24-nucleotide small RNAs, previously implicated in transposable element silencing, is tissue-specific and much lower than in other plants. We further identify numerous long (>10,000 base pairs) introns, gene-like fragments, uncharacterized long non-coding RNAs and short RNAs. This opens up new genomic avenues for conifer forestry and breeding.
  149. Ruttink, T., Sterck, L., Rohde, A., Bendixen, C., Rouzé, P., Asp, T., Van de Peer, Y., et al. (2013). Orthology Guided Assembly in highly heterozygous crops: creating a reference transcriptome to uncover genetic diversity in Lolium perenne. PLANT BIOTECHNOLOGY JOURNAL, 11(5), 605–617.
    Despite current advances in next-generation sequencing data analysis procedures, de novo assembly of a reference sequence required for SNP discovery and expression analysis is still a major challenge in genetically uncharacterized, highly heterozygous species. High levels of polymorphism inherent to outbreeding crop species hamper De Bruijn Graph-based de novo assembly algorithms, causing transcript fragmentation and the redundant assembly of allelic contigs. If multiple genotypes are sequenced to study genetic diversity, primary de novo assembly is best performed per genotype to limit the level of polymorphism and avoid transcript fragmentation. Here, we propose an Orthology Guided Assembly procedure that first uses sequence similarity (tBLASTn) to proteins of a model species to select allelic and fragmented contigs from all genotypes and then performs CAP3 clustering on a gene-by-gene basis. Thus, we simultaneously annotate putative orthologues for each protein of the model species, resolve allelic redundancy and fragmentation and create a de novo transcript sequence representing the consensus of all alleles present in the sequenced genotypes. We demonstrate the procedure using RNA-seq data from 14 genotypes of Lolium perenne to generate a reference transcriptome for gene discovery and translational research, to reveal the transcriptome-wide distribution and density of SNPs in an outbreeding crop and to illustrate the effect of polymorphisms on the assembly procedure. The results presented here illustrate that constructing a non-redundant reference sequence is essential for comparative genomics, orthology-based annotation and candidate gene selection but also for read mapping and subsequent polymorphism discovery and/or read count-based gene expression analysis.
  150. Vandepoele, Klaas, Van Bel, M., Richard, G., Van Landeghem, S., Verhelst, B., Moreau, H., Van de Peer, Y., et al. (2013). pico-PLAZA, a genome database of microbial photosynthetic eukaryotes. ENVIRONMENTAL MICROBIOLOGY, 15(8), 2147–2153.
    With the advent of next generation genome sequencing, the number of sequenced algal genomes and transcriptomes is rapidly growing. Although a few genome portals exist to browse individual genome sequences, exploring complete genome information from multiple species for the analysis of user-defined sequences or gene lists remains a major challenge. pico-PLAZA is a web-based resource (http://bioinformatics.psb.ugent.be/pico-plaza/) for algal genomics that combines different data types with intuitive tools to explore genomic diversity, perform integrative evolutionary sequence analysis and study gene functions. Apart from homologous gene families, multiple sequence alignments, phylogenetic trees, Gene Ontology, InterPro and text-mining functional annotations, different interactive viewers are available to study genome organization using gene collinearity and synteny information. Different search functions, documentation pages, export functions and an extensive glossary are available to guide non-expert scientists. PLAZA can be used to functionally characterize large-scale EST/RNA-Seq data sets and to perform environmental genomics. Functional enrichment analysis of 16 Phaeodactylum tricornutum transcriptome libraries offers a molecular view on diatom adaptation to different environments of ecological relevance. Furthermore, we show how complementary genomic data sources can easily be combined to identify marker genes to study the diversity and distribution of algal species, for example in metagenomes, or to quantify intraspecific diversity from environmental strains.
  151. Van Bogaert, Inge, Holvoet, K., Roelants, S., Li, B., Lin, Y.-C., Van de Peer, Y., & Soetaert, W. (2013). The biosynthetic gene cluster for sophorolipids : a biotechnological interesting biosurfactant produced by Starmerella bombicola. MOLECULAR MICROBIOLOGY, 88(3), 501–509.
    Sophorolipids are promising biologically derived surfactants or detergents which find application in household cleaning, personal care and cosmetics. They are produced by specific yeast species and among those, Starmerella bombicola (former Candida bombicola) is the most widely used and studied one. Despite the commercial interest in sophorolipids, the biosynthetic pathway of these secondary metabolites remained hitherto partially unsolved. In this manuscript we present the sophorolipid gene cluster consisting of five genes directly involved in sophorolipid synthesis: a cytochrome P450 monooxygenase, two glucosyltransferases, an acetyltransferase and a transporter. It was demonstrated that disabling the first step of the pathway (cytochrome P450 monooxygenase-mediated terminal or subterminal hydroxylation of a common fatty acid) results in complete abolishment of sophorolipid production. This phenotype could be complemented by supplying the yeast with hydroxylated fatty acids. On the other hand, knocking out the transporter gene yields mutants still able to secrete sophorolipids, though only at levels of 10% as compared with the wild type, suggesting alternative routes for secretion. Finally, it was proved that hampering sophorolipid production does not affect cell growth or cell viability in laboratory conditions, as can be expected for secondary metabolites.
  152. Van Landeghem, S., De Bodt, S., Drebert, Z., Inzé, D., & Van de Peer, Y. (2013). The potential of text mining in data integration and network biology for plant research : a case study on Arabidopsis. PLANT CELL, 25(3), 794–807.
    Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein-protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, hereby stimulating the application of text mining data in future plant biology studies.
  153. De Smet, Riet, Adams, K. L., Vandepoele, K., Van Montagu, M., Maere, S., & Van de Peer, Y. (2013). Convergent gene loss following gene and genome duplications creates single-copy families in flowering plants. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 110(8), 2898–2903.
    The importance of gene gain through duplication has long been appreciated. In contrast, the importance of gene loss has only recently attracted attention. Indeed, studies in organisms ranging from plants to worms and humans suggest that duplication of some genes might be better tolerated than that of others. Here we have undertaken a large-scale study to investigate the existence of duplication-resistant genes in the sequenced genomes of 20 flowering plants. We demonstrate that there is a large set of genes that is convergently restored to single-copy status following multiple genome-wide and smaller scale duplication events. We rule out the possibility that such a pattern could be explained by random gene loss only and therefore propose that there is selection pressure to preserve such genes as singletons. This is further substantiated by the observation that angiosperm single-copy genes do not comprise a random fraction of the genome, but instead are often involved in essential housekeeping functions that are highly conserved across all eukaryotes. Furthermore, single-copy genes are generally expressed more highly and in more tissues than non-single-copy genes, and they exhibit higher sequence conservation. Finally, we propose different hypotheses to explain their resistance against duplication.
  154. Van Bel, M., Proost, S., Van Neste, C., Deforce, D., Van de Peer, Y., & Vandepoele, K. (2013). TRAPID : an efficient online tool for the functional and comparative analysis of de novo RNA-Seq transcriptomes. GENOME BIOLOGY, 14(12).
    Transcriptome analysis through next-generation sequencing technologies allows the generation of detailed gene catalogs for non-model species, at the cost of new challenges with regards to computational requirements and bioinformatics expertise. Here, we present TRAPID, an online tool for the fast and efficient processing of assembled RNA-Seq transcriptome data, developed to mitigate these challenges. TRAPID offers high-throughput open reading frame detection, frameshift correction and includes a functional, comparative and phylogenetic toolbox, making use of 175 reference proteomes. Benchmarking and comparison against state-of-the-art transcript analysis tools reveals the efficiency and unique features of the TRAPID system.
  155. De Clercq, I., Vermeirssen, V., Van Aken, O., Vandepoele, K., Murcha, M. W., Law, S. R., Inzé, A., et al. (2013). The membrane-bound NAC transcription factor ANAC013 functions in mitochondrial retrograde regulation of the oxidative stress response in Arabidopsis. PLANT CELL, 25(9), 3472–3490.
    Upon disturbance of their function by stress, mitochondria can signal to the nucleus to steer the expression of responsive genes. This mitochondria-to-nucleus communication is often referred to as mitochondrial retrograde regulation (MRR). Although reactive oxygen species and calcium are likely candidate signaling molecules for MRR, the protein signaling components in plants remain largely unknown. Through meta-analysis of transcriptome data, we detected a set of genes that are common and robust targets of MRR and used them as a bait to identify its transcriptional regulators. In the upstream regions of these mitochondrial dysfunction stimulon (MDS) genes, we found a cis-regulatory element, the mitochondrial dysfunction motif (MDM), which is necessary and sufficient for gene expression under various mitochondrial perturbation conditions. Yeast one-hybrid analysis and electrophoretic mobility shift assays revealed that the transmembrane domain-containing NO APICAL MERISTEM/ARABIDOPSIS TRANSCRIPTION ACTIVATION FACTOR/CUP-SHAPED COTYLEDON transcription factors (ANAC013, ANAC016, ANAC017, ANAC053, and ANAC078) bound to the MDM cis-regulatory element. We demonstrate that ANAC013 mediates MRR-induced expression of the MDS genes by direct interaction with the MDM cis-regulatory element and triggers increased oxidative stress tolerance. In conclusion, we characterized ANAC013 as a regulator of MRR upon stress in Arabidopsis thaliana.
  156. Verhelst, Bram, Van de Peer, Y., & Rouzé, P. (2013). The complex intron landscape and massive intron invasion in a picoeukaryote provides insights into intron evolution. GENOME BIOLOGY AND EVOLUTION, 5(12), 2393–2401.
    Genes in pieces and spliceosomal introns are a landmark of eukaryotes, with intron invasion usually assumed to have happened early on in evolution. Here, we analyse the intron landscape of Micromonas, a unicellular green alga in the Mamiellophyceae lineage, demonstrating the co-existence of several classes of introns and the occurrence of recent massive intron invasion. This study focuses on two strains, CCMP1545 and RCC299, and their related individuals from ocean samplings, showing that they not only harbour different classes of introns depending on their location in the genome, as for other Mamiellophyceae, but uniquely carry several classes of repeat introns. These introns, dubbed introner elements (IEs), are found at novel positions in genes and have conserved sequences, contrary to canonical introns. This IE invasion has a huge impact on the genome, doubling the number of introns in the CCMP1545 strain. We hypothesize that each IE class originated from a single ancestral IE that has been colonizing the genome after strain divergence by inserting copies of itself into genes by intron transposition, likely involving reverse splicing. Along with similar cases recently observed in other organisms, our observations in Micromonas strains shed a new light on the evolution of introns, suggesting that intron gain is more widespread than previously thought.
  157. Ciesielska, K., Li, B., Groeneboer, S., Van Bogaert, I., Lin, Y.-C., Soetaert, W., Van de Peer, Y., et al. (2013). SILAC-based proteome analysis of Starmerella bombicola sophorolipid production. JOURNAL OF PROTEOME RESEARCH, 12(10), 4376–4392.
    Starmerella (Candida) bombicola is the biosurfactant-producing species that caught the greatest deal of attention in the academic and industrial world due to its ability of producing large amounts of sophorolipids. Despite its high economic potential, the biochemistry behind the sophorolipid biosynthesis is still poorly understood. Here we present the first proteomic characterization of S. bombicola for which we created a lys1Δ mutant to allow the use of SILAC for quantitative analysis. To characterize the processes behind the production of these biosurfactants, we compared the proteome of sophorolipid producing (early stationary phase) and nonproducing cells (exponential phase). We report the simultaneous production of all known enzymes involved in sophorolipid biosynthesis including a predicted sophorolipid transporter. In addition, we identified the heme binding protein Dap1 as a possible regulator for Cyp52M1. Our results further indicate that ammonium and phosphate limitation are not the sole limiting factors inducing sophorolipid biosynthesis.
  158. Roelants, S., Saerens, K., Derycke, T., Li, B., Lin, Y.-C., Van de Peer, Y., De Maeseneire, S., et al. (2013). Candida bombicola as a platform organism for the production of tailor-made biomolecules. BIOTECHNOLOGY AND BIOENGINEERING, 110(9), 2494–2503.
  159. Van Bel, M., Proost, S., Wischnitzki, E., Movahedi, S., Scheerlinck, C., Van de Peer, Y., & Vandepoele, K. (2012). Dissecting plant genomes with the PLAZA comparative genomics platform. PLANT PHYSIOLOGY, 158(2), 590–600.
    With the arrival of low-cost, next-generation sequencing, a multitude of new plant genomes are being publicly released, providing unseen opportunities and challenges for comparative genomics studies. Here, we present PLAZA 2.5, a user-friendly online research environment to explore genomic information from different plants. This new release features updates to previous genome annotations and a substantial number of newly available plant genomes as well as various new interactive tools and visualizations. Currently, PLAZA hosts 25 organisms covering a broad taxonomic range, including 13 eudicots, five monocots, one lycopod, one moss, and five algae. The available data consist of structural and functional gene annotations, homologous gene families, multiple sequence alignments, phylogenetic trees, and colinear regions within and between species. A new Integrative Orthology Viewer, combining information from different orthology prediction methodologies, was developed to efficiently investigate complex orthology relationships. Cross-species expression analysis revealed that the integration of complementary data types extended the scope of complex orthology relationships, especially between more distantly related species. Finally, based on phylogenetic profiling, we propose a set of core gene families within the green plant lineage that will be instrumental to assess the gene space of draft or newly sequenced plant genomes during the assembly or annotation phase.
  160. Whitford, R., Fernandez Salina, A., Tejos Ulloa, R., Cuéllar Pérez, A., Kleine-Vehn, J., Vanneste, S., Drozdzecki, A., et al. (2012). GOLVEN secretory peptides regulate auxin carrier turnover during plant gravitropic responses. DEVELOPMENTAL CELL, 22(3), 678–685.
  161. Hacquard, S., Joly, D. L., Lin, Y.-C., Tisserant, E., Feau, N., Delaruelle, C., Legué, V., et al. (2012). A comprehensive analysis of genes encoding small secreted proteins identifies candidate effectors in Melampsora larici-populina (poplar leaf rust). MOLECULAR PLANT-MICROBE INTERACTIONS, 25(3), 279–293.
    The obligate biotrophic rust fungus Melampsora larici-populina is the most devastating and widespread pathogen of poplars. Studies over recent years have identified various small secreted proteins (SSP) from plant biotrophic filamentous pathogens and have highlighted their role as effectors in host-pathogen interactions. The recent analysis of the M. larici-populina genome sequence has revealed the presence of 1,184 SSP-encoding genes in this rust fungus. In the present study, the expression and evolutionary dynamics of these SSP were investigated to pinpoint the arsenal of putative effectors that could be involved in the interaction between the rust fungus and poplar. Similarity with effectors previously described in Melampsora spp., richness in cysteines, and organization in large families were extensively detailed and discussed. Positive selection analyses conducted over clusters of paralogous genes revealed fast-evolving candidate effectors. Transcript profiling of selected M. larici-populina SSP showed a timely coordinated expression during leaf infection, and the accumulation of four candidate effectors in distinct rust infection structures was demonstrated by immunolocalization. This integrated and multifaceted approach helps to prioritize candidate effector genes for functional studies.
  162. Malacarne, G., Perazzolli, M., Cestaro, A., Sterck, L., Fontana, P., Van de Peer, Y., Viola, R., et al. (2012). Deconstruction of the (paleo)polyploid grapevine genome based on the analysis of transposition events involving NBS resistance genes. PLOS ONE, 7(1).
    Plants have followed a reticulate type of evolution and taxa have frequently merged via allopolyploidization. A polyploid structure of sequenced genomes has often been proposed, but the chromosomes belonging to putative component genomes are difficult to identify. The 19 grapevine chromosomes are evolutionarily stable structures: their homologous triplets have strongly conserved gene order, interrupted by rare translocations. The aim of this study is to examine how the grapevine nucleotide-binding site (NBS)-encoding resistance (NBS-R) genes have evolved in the genomic context and to understand mechanisms of genome evolution. We show that, in grapevine, i) helitrons have significantly contributed to transposition of NBS-R genes, and ii) NBS-R gene cluster similarity indicates the existence of two groups of chromosomes (named as Va and Vc) that may have evolved independently. Chromosome triplets consist of two Va and one Vc chromosomes, as expected from the tetraploid and diploid conditions of the two component genomes. The hexaploid state could have been derived from either allopolyploidy or the separation of the Va and Vc component genomes in the same nucleus before fusion, as known for Rosaceae species. Time estimation indicates that grapevine component genomes may have fused about 60 mya, having had at least 40-60 mya to evolve independently. Chromosome number variation in the Vitaceae and related families, and the gap between the time of eudicot radiation and the age of Vitaceae fossils, are accounted for by our hypothesis.
  163. Abeel, T., Van Parys, T., Saeys, Y., Galagan, J., & Van de Peer, Y. (2012). GenomeView : a next-generation genome browser. NUCLEIC ACIDS RESEARCH, 40(2).
    Due to ongoing advances in sequencing technologies, billions of nucleotide sequences are now produced on a daily basis. A major challenge is to visualize these data for further downstream analysis. To this end, we present GenomeView, a stand-alone genome browser specifically designed to visualize and manipulate a multitude of genomics data. GenomeView enables users to dynamically browse high volumes of aligned short-read data, with dynamic navigation and semantic zooming, from the whole genome level to the single nucleotide. At the same time, the tool enables visualization of whole genome alignments of dozens of genomes relative to a reference sequence. GenomeView is unique in its capability to interactively handle huge data sets consisting of tens of aligned genomes, thousands of annotation features and millions of mapped short reads both as viewer and editor. GenomeView is freely available as an open source software package.
  164. Fawcett, J., Rouzé, P., & Van de Peer, Y. (2012). Higher intron loss rate in Arabidopsis thaliana than A. lyrata is consistent with stronger selection for a smaller genome. MOLECULAR BIOLOGY AND EVOLUTION, 29(2), 849–859.
    The number of introns varies considerably among different organisms. This can be explained by the differences in the rates of intron gain and loss. Two factors that are likely to influence these rates are selection for or against introns and the mutation rate that generates the novel intron or the intronless copy. Although it has been speculated that stronger selection for a compact genome might result in a higher rate of intron loss and a lower rate of intron gain, clear evidence is lacking, and the role of selection in determining these rates has not been established. Here, we studied the gain and loss of introns in the two closely related species Arabidopsis thaliana and A. lyrata as it was recently shown that A. thaliana has been undergoing a faster genome reduction driven by selection. We found that A. thaliana has lost six times more introns than A. lyrata since the divergence of the two species but gained very few introns. We suggest that stronger selection for genome reduction probably resulted in the much higher intron loss rate in A. thaliana, although further analysis is required as we could not find evidence that the loss rate increased in A. thaliana as opposed to having decreased in A. lyrata compared with the rate in the common ancestor. We also examined the pattern of the intron gains and losses to better understand the mechanisms by which they occur. Microsimilarity was detected between the splice sites of several gained and lost introns, suggesting that nonhomologous end joining repair of double-strand breaks might be a common pathway not only for intron gain but also for intron loss.
  165. Proost, Sebastian, Fostier, J., De Witte, D., Dhoedt, B., Demeester, P., Van de Peer, Y., & Vandepoele, K. (2012). i-ADHoRe 3.0 : fast and sensitive detection of genomic homology in extremely large data sets. NUCLEIC ACIDS RESEARCH, 40(2).
  166. Van de Peer, Y., & Pires, J. C. (2012). Getting up to speed. CURRENT OPINION IN PLANT BIOLOGY.
  167. Van Landeghem, S., Björne, J., Abeel, T., De Baets, B., Salakoski, T., & Van de Peer, Y. (2012). Semantically linking molecular entities in literature through entity relationships. BMC BIOINFORMATICS, 13. Presented at the Conference on BioNLP Shared Task.
    Background: Text mining tools have gained popularity to process the vast amount of available research articles in the biomedical literature. It is crucial that such tools extract information with a sufficient level of detail to be applicable in real life scenarios. Studies of mining non-causal molecular relations contribute to this goal by formally identifying the relations between genes, promoters, complexes and various other molecular entities found in text. More importantly, these studies help to enhance integration of text mining results with database facts. Results: We describe, compare and evaluate two frameworks developed for the prediction of non-causal or 'entity' relations (REL) between gene symbols and domain terms. For the corresponding REL challenge of the BioNLP Shared Task of 2011, these systems ranked first (57.7% F-score) and second (41.6% F-score). In this paper, we investigate the performance discrepancy of 16 percentage points by benchmarking on a related and more extensive dataset, analysing the contribution of both the term detection and relation extraction modules. We further construct a hybrid system combining the two frameworks and experiment with intersection and union combinations, achieving respectively high-precision and high-recall results. Finally, we highlight extremely high-performance results (F-score >90%) obtained for the specific subclass of embedded entity relations that are essential for integrating text mining predictions with database facts. Conclusions: The results from this study will enable us in the near future to annotate semantic relations between molecular entities in the entire scientific literature available through PubMed. The recent release of the EVEX dataset, containing biomolecular event predictions for millions of PubMed articles, is an interesting and exciting opportunity to overlay these entity relations with event predictions on a literature-wide scale.
  168. Van Landeghem, S., Hakala, K., Rönnqvist, S., Salakoski, T., Van de Peer, Y., & Ginter, F. (2012). Exploring biomolecular literature with EVEX : connecting genes through events, homology, and indirect associations. ADVANCES IN BIOINFORMATICS, 2012.
    Technological advancements in the field of genetics have led not only to an abundance of experimental data, but also caused an exponential increase of the number of published biomolecular studies. Text mining is widely accepted as a promising technique to help researchers in the life sciences deal with the amount of available literature. This paper presents a freely available web application built on top of 21.3 million detailed biomolecular events extracted from all PubMed abstracts. These text mining results were generated by a state-of-the-art event extraction system and enriched with gene family associations and abstract generalizations, accounting for lexical variants and synonymy. The EVEX resource locates relevant literature on phosphorylation, regulation targets, binding partners, and several other biomolecular events and assigns confidence values to these events. The search function accepts official gene/protein symbols as well as common names from all species. Finally, the web application is a powerful tool for generating homology-based hypotheses as well as novel, indirect associations between genes and proteins such as coregulators.
  169. Björne, J., Van Landeghem, S., Pyysalo, S., Ohta, T., Ginter, F., Van de Peer, Y., Ananiadou, S., et al. (2012). PubMed-scale event extraction for post-translational modifications, epigenetics and protein structural relations. Proceedings of the 2012 workshop on biomedical natural language processing (pp. 82–90). Presented at the 2012 Workshop on Biomedical Natural Language Processing (BioNLP 2012), Association for Computational Linguistics (ACL).
    Recent efforts in biomolecular event extraction have mainly focused on core event types involving genes and proteins, such as gene expression, protein-protein interactions, and protein catabolism. The BioNLP’11 Shared Task extended the event extraction approach to sub-protein events and relations in the Epigenetics and Post-translational Modifications (EPI) and Protein Relations (REL) tasks. In this study, we apply the Turku Event Extraction System, the best-performing system for these tasks, to all PubMed abstracts and all available PMC full-text articles, extracting 1.4M EPI events and 2.2M REL relations from 21M abstracts and 372K articles. We introduce several entity normalization algorithms for genes, proteins, protein complexes and protein components, aiming to uniquely identify these biological entities. This normalization effort allows direct mapping of the extracted events and relations with post-translational modifications from UniProt, epigenetics from PubMeth, functional domains from InterPro and macromolecular structures from PDB. The extraction of such detailed protein information provides a unique text mining dataset, offering the opportunity to further deepen the information provided by existing PubMed-scale event extraction efforts. The methods and data introduced in this study are freely available from bionlp.utu.fi.
  170. Amoutzias, G. D., He, Y., Lilley, K. S., Van de Peer, Y., & Oliver, S. G. (2012). Evaluation and properties of the budding yeast phosphoproteome. MOLECULAR & CELLULAR PROTEOMICS, 11(6).
    We have assembled a reliable phosphoproteomic data set for budding yeast Saccharomyces cerevisiae and have investigated its properties. Twelve publicly available phosphoproteome data sets were triaged to obtain a subset of high-confidence phosphorylation sites (p-sites), free of "noisy" phosphorylations. Analysis of this combined data set suggests that the inventory of phosphoproteins in yeast is close to completion, but that these proteins may have many undiscovered p-sites. Proteins involved in budding and protein kinase activity have high numbers of p-sites and are highly over-represented in the vast majority of the yeast phosphoproteome data sets. The yeast phosphoproteome is characterized by a few proteins with many p-sites and many proteins with a few p-sites. We confirm a tendency for p-sites to cluster together and find evidence that kinases may phosphorylate off-target amino acids that are within one or two residues of their cognate target. This suggests that the precise position of the phosphorylated amino acid is not a stringent requirement for regulatory fidelity. Compared with nonphosphorylated proteins, phosphoproteins are more ancient, more abundant, have longer unstructured regions, have more genetic interactions, more protein interactions, and are under tighter post-translational regulation. It appears that phosphoproteins constitute the raw material for pathway rewiring and adaptation at various evolutionary rates.
  171. Brown, J. R., Hanna, M., Tesar, B., Pochet, N., Vartanov, A., Fernandes, S., Werner, L., et al. (2012). Germline copy number variation associated with Mendelian inheritance of CLL in two families. LEUKEMIA, 26(7), 1710–1713.
  172. Brown, J. R., Hanna, M., Tesar, B., Werner, L., Pochet, N., Asara, J. M., Wang, Y. E., et al. (2012). Integrative genomic analysis implicates gain of PIK3CA at 3q26 and MYC at 8q24 in chronic lymphocytic leukemia. CLINICAL CANCER RESEARCH, 18(14), 3791–3802.
    Purpose: The disease course of chronic lymphocytic leukemia (CLL) varies significantly within cytogenetic groups. We hypothesized that high-resolution genomic analysis of CLL would identify additional recurrent abnormalities associated with short time-to-first therapy (TTFT). Experimental Design: We undertook high-resolution genomic analysis of 161 prospectively enrolled CLLs using Affymetrix 6.0 SNP arrays, and integrated analysis of this data set with gene expression profiles. Results: Copy number analysis (CNA) of nonprogressive CLL reveals a stable genotype, with a median of only 1 somatic CNA per sample. Progressive CLL with 13q deletion was associated with additional somatic CNAs, and a greater number of CNAs was predictive of TTFT. We identified other recurrent CNAs associated with short TTFT: 8q24 amplification focused on the cancer susceptibility locus near MYC in 3.7%; 3q26 amplifications focused on PIK3CA in 5.6%; and 8p deletions in 5% of patients. Sequencing of MYC further identified somatic mutations in two CLLs. We determined which catalytic subunits of phosphoinositide 3-kinase (PI3K) were in active complex with the p85 regulatory subunit and showed enrichment for the α subunit in three CLLs carrying PIK3CA amplification. Conclusions: Our findings implicate amplifications of 3q26 focused on PIK3CA and 8q24 focused on MYC in CLL.
  173. De Smet, R., & Van de Peer, Y. (2012). Redundancy and rewiring of genetic networks following genome-wide duplication events. CURRENT OPINION IN PLANT BIOLOGY, 15(2), 168–176.
    Polyploidy or whole-genome duplication is a frequent phenomenon within the plant kingdom and has been associated with the occurrence of evolutionary novelty and increase in biological complexity. Because genome-wide duplication events duplicate whole molecular networks it is of interest to investigate how these networks evolve subsequent to such events. Although genome duplications are generally followed by massive gene loss, at least part of the network is usually retained in duplicate and can rewire to execute novel functions. Alternatively, the network can remain largely redundant and as such confer robustness against mutations. The increasing availability of high-throughput data makes it possible to study evolution following whole genome duplication events at the network level. Here we discuss how the use of 'omics' data in network analysis can provide novel insights on network redundancy and rewiring and conclude with some directions for future research.
  174. Milner, D. A., Jr, Pochet, N., Krupka, M., Williams, C., Seydel, K., Taylor, T., Van de Peer, Y., et al. (2012). Transcriptional profiling of Plasmodium falciparum parasites from patients with severe malaria identifies distinct low vs. high parasitemic clusters. PLOS ONE, 7(7).
    Background: In the past decade, estimates of malaria infections have dropped from 500 million to 225 million per year; likewise, mortality rates have dropped from 3 million to 791,000 per year. However, approximately 90% of these deaths continue to occur in sub-Saharan Africa, and 85% involve children less than 5 years of age. Malaria mortality in children generally results from one or more of the following clinical syndromes: severe anemia, acidosis, and cerebral malaria (CM). Although much is known about the clinical and pathological manifestations of CM, insights into the biology of the malaria parasite, specifically transcription during this manifestation of severe infection, are lacking. Methods and Findings: We collected peripheral blood from children meeting the clinical case definition of cerebral malaria from a cohort in Malawi, examined the patients for the presence or absence of malaria retinopathy, and performed whole genome transcriptional profiling for Plasmodium falciparum using a custom designed Affymetrix array. We identified two distinct physiological states that showed highly significant association with the level of parasitemia. We compared both groups of Malawi expression profiles with our previously acquired ex vivo expression profiles of parasites derived from infected patients with mild disease; a large collection of in vitro Plasmodium falciparum life cycle gene expression profiles; and an extensively annotated compendium of expression data from Saccharomyces cerevisiae. The high parasitemia patient group demonstrated a unique biology with elevated expression of Hrd1, a member of endoplasmic reticulum-associated protein degradation system. Conclusions: The presence of a unique high parasitemia state may be indicative of the parasite biology of the clinically recognized hyperparasitemic severe disease syndrome.
  175. Sato, S., Tabata, S., Hirakawa, H., Asamizu, E., Shirasawa, K., Isobe, S., Kaneko, T., et al. (2012). The tomato genome sequence provides insights into fleshy fruit evolution. NATURE, 485(7400), 635–641.
    Tomato (Solanum lycopersicum) is a major crop plant and a model system for fruit development. Solanum is one of the largest angiosperm genera(1) and includes annual and perennial plants from diverse habitats. Here we present a high-quality genome sequence of domesticated tomato, a draft sequence of its closest wild relative, Solanum pimpinellifolium(2), and compare them to each other and to the potato genome (Solanum tuberosum). The two tomato genomes show only 0.6% nucleotide divergence and signs of recent admixture, but show more than 8% divergence from potato, with nine large and several smaller inversions. In contrast to Arabidopsis, but similar to soybean, tomato and potato small RNAs map predominantly to gene-rich chromosomal regions, including gene promoters. The Solanum lineage has experienced two consecutive genome triplications: one that is ancient and shared with rosids, and a more recent one. These triplications set the stage for the neofunctionalization of genes controlling fruit characteristics, such as colour and fleshiness.
  176. Moreau, H., Verhelst, B., Couloux, A., Derelle, E., Rombauts, S., Grimsley, N., Van Bel, M., et al. (2012). Gene functionalities and genome structure in Bathycoccus prasinos reflect cellular specializations at the base of the green lineage. GENOME BIOLOGY, 13(8).
    Background: Bathycoccus prasinos is an extremely small cosmopolitan marine green alga whose cells are covered with intricate spider's web patterned scales that develop within the Golgi cisternae before their transport to the cell surface. The objective of this work is to sequence and analyze its genome, and to present a comparative analysis with other known genomes of the green lineage. Results: Its small genome of 15 Mb consists of 19 chromosomes and lacks transposons. Although 70% of all B. prasinos genes share similarities with other Viridiplantae genes, up to 428 genes were probably acquired by horizontal gene transfer, mainly from other eukaryotes. Two chromosomes, one big and one small, are atypical, an unusual synapomorphic feature within the Mamiellales. Genes on these atypical outlier chromosomes show lower GC content and a significant fraction of putative horizontal gene transfer genes. Whereas the small outlier chromosome lacks colinearity with other Mamiellales and contains many unknown genes without homologs in other species, the big outlier shows a higher intron content, increased expression levels and a unique clustering pattern of housekeeping functionalities. Four gene families are highly expanded in B. prasinos, including sialyltransferases, sialidases, ankyrin repeats and zinc ion-binding genes, and we hypothesize that these genes are associated with the process of scale biogenesis. Conclusion: The minimal genomes of the Mamiellophyceae provide a baseline for evolutionary and functional analyses of metabolic processes in green plants.
  177. Sterck, L., Billiau, K., Abeel, T., Rouzé, P., & Van de Peer, Y. (2012). ORCAE: online resource for community annotation of eukaryotes. NATURE METHODS, 9(11), 1041.
  178. Vekemans, D., Proost, S., Vanneste, K., Coenen, H., Viaene, T., Ruelens, P., Maere, S., et al. (2012). Gamma paleohexaploidy in the stem lineage of core eudicots: significance for MADS-box gene and species diversification. MOLECULAR BIOLOGY AND EVOLUTION, 29(12), 3793–3806.
    Comparative genome biology has unveiled the polyploid origin of all angiosperms and the role of recurrent polyploidization in the amplification of gene families and the structuring of genomes. Which species share certain ancient polyploidy events, and which do not, is ill defined because of the limited number of sequenced genomes and transcriptomes and their uneven phylogenetic distribution. Previously, it has been suggested that most, but probably not all, of the eudicots have shared an ancient hexaploidy event, referred to as the gamma triplication. In this study, detailed phylogenies of subfamilies of MADS-box genes suggest that the gamma triplication has occurred before the divergence of Gunnerales but after the divergence of Buxales and Trochodendrales. Large-scale phylogenetic and Ks-based approaches on the inflorescence transcriptomes of Gunnera manicata (Gunnerales) and Pachysandra terminalis (Buxales) provide further support for this placement, enabling us to position the gamma triplication in the stem lineage of the core eudicots. This triplication likely initiated the functional diversification of key regulators of reproductive development in the core eudicots, comprising 75% of flowering plants. Although it is possible that the gamma event triggered early core eudicot diversification, our dating estimates suggest that the event occurred early in the stem lineage, well before the rapid speciation of the earliest core eudicot lineages. The evolutionary significance of this paleopolyploidy event may thus rather lie in establishing a species lineage that was resilient to extinction, but with the genomic potential for later diversification. We consider that the traits generated from this potential characterize extant core eudicots both chemically and morphologically.
  179. Klochendler, A., Weinberg-Corem, N., Moran, M., Swisa, A., Pochet, N., Savova, V., Vikeså, J., et al. (2012). A transgenic mouse marking live replicating cells reveals in vivo transcriptional program of proliferation. DEVELOPMENTAL CELL, 23(4), 681–690.
    Most adult mammalian tissues are quiescent, with rare cell divisions serving to maintain homeostasis. At present, the isolation and study of replicating cells from their in vivo niche typically involves immunostaining for intracellular markers of proliferation, causing the loss of sensitive biological material. We describe a transgenic mouse strain, expressing a CyclinB1-GFP fusion reporter, that marks replicating cells in the S/G2/M phases of the cell cycle. Using flow cytometry, we isolate live replicating cells from the liver and compare their transcriptome to that of quiescent cells to reveal gene expression programs associated with cell proliferation in vivo. We find that replicating hepatocytes have reduced expression of genes characteristic of liver differentiation. This reporter system provides a powerful platform for gene expression and metabolic and functional studies of replicating cells in their in vivo niche.
  180. Cock, J. M., Sterck, L., Ahmed, S., Allen, A. E., Amoutzias, G., Anthouard, V., Artiguenave, F., et al. (2012). The Ectocarpus genome and brown algal genomics : the Ectocarpus Genome Consortium. (G. Piganeau, Ed.) Advances in Botanical Research, 64, 141–184.
    Brown algae are important organisms both because of their key ecological roles in coastal ecosystems and because of the remarkable biological features that they have acquired during their unusual evolutionary history. The recent sequencing of the complete genome of the filamentous brown alga Ectocarpus has provided unprecedented access to the molecular processes that underlie brown algal biology. Analysis of the genome sequence, which exhibits several unusual structural features, identified genes that are predicted to play key roles in several aspects of brown algal metabolism, in the construction of the multicellular bodyplan and in resistance to biotic and abiotic stresses. Information from the genome sequence is currently being used in combination with other genomic, genetic and biochemical tools to further investigate these and other aspects of brown algal biology at the molecular level. Here, we review some of the major discoveries that emerged from the analysis of the Ectocarpus genome sequence, with a particular focus on the unusual genome structure, inferences about brown algal evolution and novel aspects of brown algal metabolism.
  181. Torres Torres, G. A., Marchal, K., Van de Peer, Y., & De Cock, M. (2012). An ASP-based simulation method for finding all synchronous and asynchronous attractors in genetic regulatory networks. In ISCB Student Council, 8th Symposium, Abstracts. Long Beach, CA, USA: International Society for Computational Biology (ISCB).
  182. Torres Torres, G. A., Marchal, K., Van de Peer, Y., & De Cock, M. (2012). Predicting long term behavior of genetic regulatory networks with answer set programming. In B. De Baets, B. Manderick, M. Rademaker, & W. Waegeman (Eds.), Proceedings of the 21st Belgian-Dutch conference on machine learning. Ghent, Belgium.
  183. Murat, F., Van de Peer, Y., & Salse, J. (2012). Decoding plant and animal genome plasticity from differential paleo-evolutionary patterns and processes. GENOME BIOLOGY AND EVOLUTION, 4(9), 917–928.
    Continuing advances in genome sequencing technologies and computational methods for comparative genomics currently allow inferring the evolutionary history of entire plant and animal genomes. Based on the comparison of the plant and animal genome paleohistory, major differences are unveiled in 1) evolutionary mechanisms (i.e., polyploidization versus diploidization processes), 2) genome conservation (i.e., coding versus noncoding sequence maintenance), and 3) modern genome architecture (i.e., genome organization including repeats expansion versus contraction phenomena). This article discusses how extant animal and plant genomes are the result of inherently different rates and modes of genome evolution resulting in relatively stable animal and much more dynamic and plastic plant genomes.
  184. Van Jaarsveld, I., Mizrachi, E., Joubert, F., Van de Peer, Y., & Myburg, A. (2012). Ensemble optimisation of cis-regulatory element discovery : in planta benchmark and discovery in Eucalyptus. SOUTH AFRICAN JOURNAL OF BOTANY (Vol. 79, pp. 218–219).
  185. Baele, G., Van de Peer, Y., & Vansteelandt, S. (2011). Context-dependent codon partition models provide significant increases in model fit in atpB and rbcL protein-coding genes. BMC EVOLUTIONARY BIOLOGY, 11.
    Background: Accurate modelling of substitution processes in protein-coding sequences is often hampered by the computational burdens associated with full codon models. Lately, codon partition models have been proposed as a viable alternative, mimicking the substitution behaviour of codon models at a low computational cost. Such codon partition models however impose independent evolution of the different codon positions, which is overly restrictive from a biological point of view. Given that empirical research has provided indications of context-dependent substitution patterns at four-fold degenerate sites, we take those indications into account in this paper. Results: We present so-called context-dependent codon partition models to assess previous empirical claims that the evolution of four-fold degenerate sites is strongly dependent on the composition of its two flanking bases. To this end, we have estimated and compared various existing independent models, codon models, codon partition models and context-dependent codon partition models for the atpB and rbcL genes of the chloroplast genome, which are frequently used in plant systematics. Such context-dependent codon partition models employ a full dependency scheme for four-fold degenerate sites, whilst maintaining the independence assumption for the first and second codon positions. Conclusions: We show that, both in the atpB and rbcL alignments of a collection of land plants, these context-dependent codon partition models significantly improve model fit over existing codon partition models. Using Bayes factors based on thermodynamic integration, we show that in both datasets the same context-dependent codon partition model yields the largest increase in model fit compared to an independent evolutionary model. Context-dependent codon partition models hence perform closer to codon models, which remain the best performing models at a drastically increased computational cost, compared to codon partition models, but remain computationally interesting alternatives to codon models. Finally, we observe that the substitution patterns in both datasets are drastically different, leading to the conclusion that combined analysis of these two genes using a single model may not be advisable from a context-dependent point of view.
  186. Hu, T. T., Pattyn, P., Bakker, E. G., Cao, J., Cheng, J.-F., Clark, R. M., Fahlgren, N., et al. (2011). The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. NATURE GENETICS, 43(5), 476–481.
    We report the 207-Mb genome sequence of the North American Arabidopsis lyrata strain MN47 based on 8.3x dideoxy sequence coverage. We predict 32,670 genes in this outcrossing species compared to the 27,025 genes in the selfing species Arabidopsis thaliana. The much smaller 125-Mb genome of A. thaliana, which diverged from A. lyrata 10 million years ago, likely constitutes the derived state for the family. We found evidence for DNA loss from large-scale rearrangements, but most of the difference in genome size can be attributed to hundreds of thousands of small deletions, mostly in noncoding DNA and transposons. Analysis of deletions and insertions still segregating in A. thaliana indicates that the process of DNA loss is ongoing, suggesting pervasive selection for a smaller genome. The high-quality reference genome sequence for A. lyrata will be an important resource for functional, evolutionary and ecological studies in the genus Arabidopsis.
  187. Movahedi, S., Van de Peer, Y., & Vandepoele, K. (2011). Comparative network analysis reveals that tissue specificity and gene function are important factors influencing the mode of expression evolution in Arabidopsis and rice. PLANT PHYSIOLOGY, 156(3), 1316–1330.
    Microarray experiments have yielded massive amounts of expression information measured under various conditions for the model species Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa). Expression compendia grouping multiple experiments make it possible to define correlated gene expression patterns within one species and to study how expression has evolved between species. We developed a robust framework to measure expression context conservation (ECC) and found, by analyzing 4,630 pairs of orthologous Arabidopsis and rice genes, that 77% showed conserved coexpression. Examples of nonconserved ECC categories suggested a link between regulatory evolution and environmental adaptations and included genes involved in signal transduction, response to different abiotic stresses, and hormone stimuli. To identify genomic features that influence expression evolution, we analyzed the relationship between ECC, tissue specificity, and protein evolution. Tissue-specific genes showed higher expression conservation compared with broadly expressed genes but were fast evolving at the protein level. No significant correlation was found between protein and expression evolution, implying that both modes of gene evolution are not strongly coupled in plants. By integration of cis-regulatory elements, many ECC conserved genes were significantly enriched for shared DNA motifs, hinting at the conservation of ancestral regulatory interactions in both model species. Surprisingly, for several tissue-specific genes, patterns of concerted network evolution were observed, unveiling conserved coexpression in the absence of conservation of tissue specificity. These findings demonstrate that orthologs inferred through sequence similarity in many cases do not share similar biological functions and highlight the importance of incorporating expression information when comparing genes across species.
  188. Chancerel, E., Lepoittevin, C., Le Provost, G., Lin, Y.-C., Jaramillo-Correa, J. P., Eckert, A. J., Wegrzyn, J. L., et al. (2011). Development and implementation of a highly-multiplexed SNP array for genetic mapping in maritime pine and comparative mapping with loblolly pine. BMC GENOMICS, 12.
    Background: Single nucleotide polymorphisms (SNPs) are the most abundant source of genetic variation among individuals of a species. New genotyping technologies allow examining hundreds to thousands of SNPs in a single reaction for a wide range of applications such as genetic diversity analysis, linkage mapping, fine QTL mapping, association studies, marker-assisted or genome-wide selection. In this paper, we evaluated the potential of highly-multiplexed SNP genotyping for genetic mapping in maritime pine (Pinus pinaster Ait.), the main conifer used for commercial plantation in southwestern Europe. Results: We designed a custom GoldenGate assay for 1,536 SNPs detected through the resequencing of gene fragments (707 in vitro SNPs/Indels) and from Sanger-derived Expressed Sequence Tags assembled into a unigene set (829 in silico SNPs/Indels). Offspring from three-generation outbred (G2) and inbred (F2) pedigrees were genotyped. The success rate of the assay was 63.6% and 74.8% for in silico and in vitro SNPs, respectively. A genotyping error rate of 0.4% was further estimated from segregating data of SNPs belonging to the same gene. Overall, 394 SNPs were available for mapping. A total of 287 SNPs were integrated with previously mapped markers in the G2 parental maps, while 179 SNPs were localized on the map generated from the analysis of the F2 progeny. Based on 98 markers segregating in both pedigrees, we were able to generate a consensus map comprising 357 SNPs from 292 different loci. Finally, the analysis of sequence homology between mapped markers and their orthologs in a Pinus taeda linkage map, made it possible to align the 12 linkage groups of both species. Conclusions: Our results show that the GoldenGate assay can be used successfully for high-throughput SNP genotyping in maritime pine, a conifer species that has a genome seven times the size of the human genome. This SNP-array will be extended thanks to recent sequencing effort using new generation sequencing technologies and will include SNPs from comparative orthologous sequences that were identified in the present study, providing a wider collection of anchor points for comparative genomics among the conifers.
  189. Van de Peer, Y. (2011). A mystery unveiled. GENOME BIOLOGY.
    A recent phylogenomic study has provided new evidence for two ancient whole genome duplications in plants, with potential importance for the evolution of seed and flowering plants.
  190. Van Landeghem, S., De Baets, B., Van de Peer, Y., & Saeys, Y. (2011). High-precision bio-molecular event extraction from text using parallel binary classifiers. COMPUTATIONAL INTELLIGENCE, 27(4), 645–664.
    We have developed a machine learning framework to accurately extract complex genetic interactions from text. Employing type-specific classifiers, this framework processes research articles to extract various biological events. Subsequently, the algorithm identifies regulation events that take other events as arguments, allowing a nested structure of predictions. All predictions are merged into an integrated network, useful for visualization and for deduction of new biological knowledge. In this paper, we discuss several design choices for an event-based extraction framework. These detailed studies help improve on existing systems, which is illustrated by the relative performance gain of 10% of our system compared to the official results in the recent BioNLP'09 Shared Task. Our framework now achieves state-of-the-art performance with 37.43 recall, 54.81 precision and 44.48 F-score. We further present the first study of feature selection for bio-molecular event extraction from text. While producing more cost-effective models, feature selection can also lead to a better insight into the complexity of the challenge. Finally, this paper tries to bridge the gap between theoretical relation extraction from text and experimental work on bio-molecular interactions by discussing interesting opportunities to employ event-based text mining tools for real-life tasks such as hypothesis generation, database curation and knowledge discovery.
  191. Young, N. D., Debellé, F., Oldroyd, G. E., Geurts, R., Cannon, S. B., Udvardi, M. K., Benedito, V. A., et al. (2011). The Medicago genome provides insight into the evolution of rhizobial symbioses. NATURE, 480(7378), 520–524.
    Legumes (Fabaceae or Leguminosae) are unique among cultivated plants for their ability to carry out endosymbiotic nitrogen fixation with rhizobial bacteria, a process that takes place in a specialized structure known as the nodule. Legumes belong to one of the two main groups of eurosids, the Fabidae, which includes most species capable of endosymbiotic nitrogen fixation(1). Legumes comprise several evolutionary lineages derived from a common ancestor 60 million years ago (Myr ago). Papilionoids are the largest clade, dating nearly to the origin of legumes and containing most cultivated species(2). Medicago truncatula is a long-established model for the study of legume biology. Here we describe the draft sequence of the M. truncatula euchromatin based on a recently completed BAC assembly supplemented with Illumina shotgun sequence, together capturing ~94% of all M. truncatula genes. A whole-genome duplication (WGD) approximately 58 Myr ago had a major role in shaping the M. truncatula genome and thereby contributed to the evolution of endosymbiotic nitrogen fixation. Subsequent to the WGD, the M. truncatula genome experienced higher levels of rearrangement than two other sequenced legumes, Glycine max and Lotus japonicus. M. truncatula is a close relative of alfalfa (Medicago sativa), a widely cultivated crop with limited genomics tools and complex autotetraploid genetics. As such, the M. truncatula genome sequence provides significant opportunities to expand alfalfa's genomic toolbox.
  192. Kano, Y., Bjorne, J., Ginter, F., Salakoski, T., Buyko, E., Hahn, U., … Tsujii, J. (2011). U-Compare bio-event meta-service : compatible BioNLP event extraction services. BMC BIOINFORMATICS, 12.
    Background: Bio-molecular event extraction from literature is recognized as an important task of bio text mining and, as such, many relevant systems have been developed and made available during the last decade. While such systems provide useful services individually, there is a need for a meta-service to enable comparison and ensemble of such services, offering optimal solutions for various purposes. Results: We have integrated nine event extraction systems in the U-Compare framework, making them inter-compatible and interoperable with other U-Compare components. The U-Compare event meta-service provides various meta-level features for comparison and ensemble of multiple event extraction systems. Experimental results show that the performance improvements achieved by the ensemble are significant. Conclusions: While individual event extraction systems themselves provide useful features for bio text mining, the U-Compare meta-service is expected to improve the accessibility to the individual systems, and to enable meta-level uses over multiple event extraction systems such as comparison and ensemble.
  193. Van Landeghem, S., Ginter, F., Van de Peer, Y., & Salakoski, T. (2011). EVEX: a PubMed-scale resource for homology-based generalization of text mining predictions. Proceedings of the 2011 workshop on biomedical natural language processing (pp. 28–37). Presented at the Workshop on Biomedical Natural Language Processing (ACL-HLT 2011), Association for Computational Linguistics (ACL).
    In comparative genomics, functional annotations are transferred from one organism to another relying on sequence similarity. With more than 20 million citations in PubMed, text mining provides the ideal tool for generating additional large-scale homology-based predictions. To this end, we have refined a recent dataset of biomolecular events extracted from text, and integrated these predictions with records from public gene databases. Accounting for lexical variation of gene symbols, we have implemented a disambiguation algorithm that uniquely links the arguments of 11.2 million biomolecular events to well-defined gene families, providing interesting opportunities for query expansion and hypothesis generation. The resulting MySQL database, including all 19.2 million original events as well as their homology-based variants, is publicly available at http://bionlp.utu.fi/.
  194. Michoel, T., Joshi, A., Nachtergaele, B., & Van de Peer, Y. (2011). Enrichment and aggregation of topological motifs are independent organizational principles of integrated interaction networks. MOLECULAR BIOSYSTEMS, 7(10), 2769–2778.
    Topological network motifs represent functional relationships within and between regulatory and protein-protein interaction networks. Enriched motifs often aggregate into self-contained units forming functional modules. Theoretical models for network evolution by duplication-divergence mechanisms and for network topology by hierarchical scale-free networks have suggested a one-to-one relation between network motif enrichment and aggregation, but this relation has never been tested quantitatively in real biological interaction networks. Here we introduce a novel method for assessing the statistical significance of network motif aggregation and for identifying clusters of overlapping network motifs. Using an integrated network of transcriptional, posttranslational and protein-protein interactions in yeast we show that network motif aggregation reflects a local modularity property which is independent of network motif enrichment. In particular our method identified novel functional network themes for a set of motifs which are not enriched yet aggregate significantly and challenges the conventional view that network motif enrichment is the most basic organizational principle of complex networks.
  195. Joshi, A., Van de Peer, Y., & Michoel, T. (2011). Structural and functional organization of RNA regulons in the post-transcriptional regulatory network of yeast. NUCLEIC ACIDS RESEARCH, 39(21), 9108–9117.
    Post-transcriptional control of mRNA transcript processing by RNA binding proteins (RBPs) is an important step in the regulation of gene expression and protein production. The post-transcriptional regulatory network is similar in complexity to the transcriptional regulatory network and is thought to be organized in RNA regulons, coherent sets of functionally related mRNAs combinatorially regulated by common RBPs. We integrated genome-wide transcriptional and translational expression data in yeast with large-scale regulatory networks of transcription factor and RBP binding interactions to analyze the functional organization of post-transcriptional regulation and RNA regulons at a system level. We found that post-transcriptional feedback loops and mixed bifan motifs are overrepresented in the integrated regulatory network and control the coordinated translation of RNA regulons, manifested as clusters of functionally related mRNAs which are strongly coexpressed in the translatome data. These translatome clusters are more functionally coherent than transcriptome clusters and are expressed with higher mRNA and protein levels and less noise. Our results show how the post-transcriptional network is intertwined with the transcriptional network to regulate gene expression in a coordinated way and that the integration of heterogeneous genome-wide datasets makes it possible to relate structure to function in regulatory networks at a system level.
  196. Grbić, M., Van Leeuwen, T., Clark, R. M., Rombauts, S., Rouzé, P., Grbić, V., Osborne, E. J., et al. (2011). The genome of Tetranychus urticae reveals herbivorous pest adaptations. NATURE, 479(7374), 487–492.
    The spider mite Tetranychus urticae is a cosmopolitan agricultural pest with an extensive host plant range and an extreme record of pesticide resistance. Here we present the completely sequenced and annotated spider mite genome, representing the first complete chelicerate genome. At 90 megabases T. urticae has the smallest sequenced arthropod genome. Compared with other arthropods, the spider mite genome shows unique changes in the hormonal environment and organization of the Hox complex, and also reveals evolutionary innovation of silk production. We find strong signatures of polyphagy and detoxification in gene families associated with feeding on different hosts and in new gene families acquired by lateral gene transfer. Deep transcriptome analysis of mites feeding on different plants shows how this pest responds to a changing host environment. The T. urticae genome thus offers new insights into arthropod evolution and plant-herbivore interactions, and provides unique opportunities for developing novel plant protection strategies.
  197. Coyne, R. S., Hannick, L., Shanmugam, D., Hostetler, J. B., Brami, D., Joardar, V. S., Johnson, J., et al. (2011). Comparative genomics of the pathogenic ciliate Ichthyophthirius multifiliis, its free-living relatives and a host species provide insights into adoption of a parasitic lifestyle and prospects for disease control. GENOME BIOLOGY, 12(10).
    BACKGROUND: Ichthyophthirius multifiliis, commonly known as Ich, is a highly pathogenic ciliate responsible for 'white spot', a disease causing significant economic losses to the global aquaculture industry. Options for disease control are extremely limited, and Ich's obligate parasitic lifestyle makes experimental studies challenging. Unlike most well-studied protozoan parasites, Ich belongs to a phylum composed primarily of free-living members. Indeed, it is closely related to the model organism Tetrahymena thermophila. Genomic studies represent a promising strategy to reduce the impact of this disease and to understand the evolutionary transition to parasitism. RESULTS: We report the sequencing, assembly and annotation of the Ich macronuclear genome. Compared with its free-living relative T. thermophila, the Ich genome is reduced approximately two-fold in length and gene density and three-fold in gene content. We analyzed in detail several gene classes with diverse functions in behavior, cellular function and host immunogenicity, including protein kinases, membrane transporters, proteases, surface antigens and cytoskeletal components and regulators. We also mapped by orthology Ich's metabolic pathways in comparison with other ciliates and a potential host organism, the zebrafish Danio rerio. CONCLUSIONS: Knowledge of the complete protein-coding and metabolic potential of Ich opens avenues for rational testing of therapeutic drugs that target functions essential to this parasite but not to its fish hosts. Also, a catalog of surface protein-encoding genes will facilitate development of more effective vaccines. The potential to use T. thermophila as a surrogate model offers promise toward controlling 'white spot' disease and understanding the adaptation to a parasitic lifestyle.
  198. Proost, S., Pattyn, P., Gerats, T., & Van de Peer, Y. (2011). Journey through the past: 150 million years of plant genome evolution. PLANT JOURNAL, 66(1), 58–65.
  199. Armananzas, R., Saeys, Y., Inza, I., Garcia-Torres, M., Bielza, C., Van de Peer, Y., & Larranaga, P. (2011). Peakbin selection in mass spectrometry data using a consensus approach with estimation of distribution algorithms. IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 8(3), 760–774.
    Progress is continuously being made in the quest for stable biomarkers linked to complex diseases. Mass spectrometers are one of the devices for tackling this problem. The data profiles they produce are noisy and unstable. In these profiles, biomarkers are detected as signal regions (peaks), where control and disease samples behave differently. Mass spectrometry (MS) data generally contain a limited number of samples described by a high number of features. In this work, we present a novel class of evolutionary algorithms, estimation of distribution algorithms (EDA), as an efficient peak selector in this MS domain. There is a trade-off between the reliability of the detected biomarkers and the low number of samples for analysis. For this reason, we introduce a consensus approach, built upon the classical EDA scheme, that improves stability and robustness of the final set of relevant peaks. An entire data workflow is designed to yield unbiased results. Four publicly available MS data sets (two MALDI-TOF and another two SELDI-TOF) are analyzed. The results are compared to the original works, and a new plot (peak frequential plot) for graphically inspecting the relevant peaks is introduced. A complete online supplementary page, which can be found at http://www.sc.ehu.es/ccwbayes/members/ruben/ms, includes extended info and results, in addition to Matlab scripts and references.
  200. Fostier, J., Proost, S., Dhoedt, B., Saeys, Y., Demeester, P., Van de Peer, Y., & Vandepoele, K. (2011). A greedy, graph-based algorithm for the alignment of multiple homologous gene lists. BIOINFORMATICS, 27(6), 749–756.
  201. Van de Peer, Y. (2011). Genomes: the truth is in there. EMBO REPORTS.
  202. Duplessis, S., Cuomo, C. A., Lin, Y.-C., Aerts, A., Tisserant, E., Veneault-Fourrey, C., Joly, D. L., et al. (2011). Obligate biotrophy features unraveled by the genomic analysis of rust fungi. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 108(22), 9166–9171.
    Rust fungi are some of the most devastating pathogens of crop plants. They are obligate biotrophs, which extract nutrients only from living plant tissues and cannot grow apart from their hosts. Their lifestyle has slowed the dissection of molecular mechanisms underlying host invasion and avoidance or suppression of plant innate immunity. We sequenced the 101-Mb genome of Melampsora larici-populina, the causal agent of poplar leaf rust, and the 89-Mb genome of Puccinia graminis f. sp. tritici, the causal agent of wheat and barley stem rust. We then compared the 16,399 predicted proteins of M. larici-populina with the 17,773 predicted proteins of P. graminis f. sp. tritici. Genomic features related to their obligate biotrophic lifestyle include expanded lineage-specific gene families, a large repertoire of effector-like small secreted proteins, impaired nitrogen and sulfur assimilation pathways, and expanded families of amino acid and oligopeptide membrane transporters. The dramatic up-regulation of transcripts coding for small secreted proteins, secreted hydrolytic enzymes, and transporters in planta suggests that they play a role in host infection and nutrient acquisition. Some of these genomic hallmarks are mirrored in the genomes of other microbial eukaryotes that have independently evolved to infect plants, indicating convergent adaptation to a biotrophic existence inside plant cells.
  203. Audenaert, P., Van Parys, T., Brondel, F., Pickavet, M., Demeester, P., Van de Peer, Y., & Michoel, T. (2011). CyClus3D: a Cytoscape plugin for clustering network motifs in integrated networks. BIOINFORMATICS, 27(11), 1587–1588.
    Network motifs in integrated molecular networks represent functional relationships between distinct data types. They aggregate to form dense topological structures corresponding to functional modules which cannot be detected by traditional graph clustering algorithms. We developed CyClus3D, a Cytoscape plugin for clustering composite three-node network motifs using a 3D spectral clustering algorithm.
  204. Cock, J. M., Collén, J., Sterck, L., Rouzé, P., Scornet, D., Anthouard, V., … Wincker, P. (2011). Nature, nurture and the structure of macroalgal genomes. In EUROPEAN JOURNAL OF PHYCOLOGY (Vol. 46, pp. 39–39). Rhodes, Greece.
  205. Yao, Y., Baele, G., & Van de Peer, Y. (2011). A bio-inspired agent-based system for controlling robot behaviour. 2011 IEEE symposium on intelligent agent (IA). Presented at the 2011 IEEE Symposium on Intelligent Agent (IA), New York, NY, USA: IEEE.
    In this paper, we present an agent-based system to control a single robot’s behaviour. We present an artificial genome structure, based on gene regulatory networks, in which several regions can be distinguished such as promoter regions, indicator genes, transcription factor binding sites, regulatory genes and expressed genes. We use agent-based modeling (ABM) to simulate a bio-inspired system based on the artificial genome, with the ultimate goal of providing phenotypic information for a simulated robot. We show that the presence of a feedback loop in the agent based system, along with the corresponding agent replacements, is essential to allow the robot to perform its tasks.
  206. Van Landeghem, S., Pyysalo, S., Ohta, T., & Van de Peer, Y. (2010). Integration of static relations to enhance event extraction from text. Proceedings of the 2010 workshop on biomedical natural language processing (pp. 144–152). Presented at the 2010 Workshop on Biomedical Natural Language Processing (ACL 2010), Association for Computational Linguistics (ACL).
  207. Saeys, Y., Van Landeghem, S., & Van de Peer, Y. (2010). Event based text mining for integrated network construction. In S. Džeroski, P. Geurts, & J. Rousu (Eds.), JMLR Workshop and Conference Proceedings (Vol. 8, pp. 112–121). Presented at the 3rd International workshop on Machine Learning in Systems Biology (MLSB 2009), Brookline, MA, USA: Microtome Publishing.
    The scientific literature is a rich and challenging data source for research in systems biology, providing numerous interactions between biological entities. Text mining techniques have been increasingly useful to extract such information from the literature in an automatic way, but up to now the main focus of text mining in the systems biology field has been restricted mostly to the discovery of protein-protein interactions. Here, we take this approach one step further, and use machine learning techniques combined with text mining to extract a much wider variety of interactions between biological entities. Each particular interaction type gives rise to a separate network, represented as a graph, all of which can be subsequently combined to yield a so-called integrated network representation. This provides a much broader view on the biological system as a whole, which can then be used in further investigations to analyse specific properties of the network.
  208. Van Leene, J., Hollunder, J., Eeckhout, D., Persiau, G., Van De Slijke, E., Stals, H., Van Isterdael, G., et al. (2010). Targeted interactomics reveals a complex core cell cycle machinery in Arabidopsis thaliana. MOLECULAR SYSTEMS BIOLOGY, 6.
  209. Bonnet, E., He, Y., Billiau, K., & Van de Peer, Y. (2010). TAPIR, a web server for the prediction of plant microRNA targets, including target mimics. BIOINFORMATICS, 26(12), 1566–1568.
    We present a new web server called TAPIR, designed for the prediction of plant microRNA targets. The server offers the possibility to search for plant miRNA targets using a fast and a precise algorithm. The precise option is much slower but guarantees to find less perfectly paired miRNA-target duplexes. Furthermore, the precise option allows the prediction of target mimics, which are characterized by a miRNA-target duplex having a large loop, making them undetectable by traditional tools.
  210. Bonnet, E., Michoel, T., & Van de Peer, Y. (2010). Prediction of a gene regulatory network linked to prostate cancer from gene expression, microRNA and clinical data. BIOINFORMATICS, 26(18), i638–i644. Presented at the 9th European Conference on Computational Biology.
    Motivation: Cancer is a complex disease, triggered by mutations in multiple genes and pathways. There is a growing interest in the application of systems biology approaches to analyze various types of cancer-related data to understand the overwhelming complexity of changes induced by the disease. Results: We reconstructed a regulatory module network using gene expression, microRNA expression and a clinical parameter, all measured in lymphoblastoid cell lines derived from patients having aggressive or non-aggressive forms of prostate cancer. Our analysis identified several modules enriched in cell cycle-related genes as well as novel functional categories that might be linked to prostate cancer. Almost one-third of the regulators predicted to control the expression levels of the modules are microRNAs. Several of them have already been characterized as causal in various diseases, including cancer. We also predicted novel microRNAs that have never been associated to this type of tumor. Furthermore, the condition-dependent expression of several modules could be linked to the value of a clinical parameter characterizing the aggressiveness of the prostate cancer. Taken together, our results help to shed light on the consequences of aggressive and non-aggressive forms of prostate cancer.
  211. Van Landeghem, S., Abeel, T., Saeys, Y., & Van de Peer, Y. (2010). Discriminative and informative features for biomolecular text mining with ensemble feature selection. BIOINFORMATICS, 26(18), i554–i560. Presented at the 9th European Conference on Computational Biology.
    Motivation: In the field of biomolecular text mining, black box behavior of machine learning systems currently limits understanding of the true nature of the predictions. However, feature selection (FS) is capable of identifying the most relevant features in any supervised learning setting, providing insight into the specific properties of the classification algorithm. This allows us to build more accurate classifiers while at the same time bridging the gap between the black box behavior and the end-user who has to interpret the results. Results: We show that our FS methodology successfully discards a large fraction of machine-generated features, improving classification performance of state-of-the-art text mining algorithms. Furthermore, we illustrate how FS can be applied to gain understanding in the predictions of a framework for biomolecular event extraction from text. We include numerous examples of highly discriminative features that model either biological reality or common linguistic constructs. Finally, we discuss a number of insights from our FS analyses that will provide the opportunity to considerably improve upon current text mining tools. Availability: The FS algorithms and classifiers are available in Java-ML (http://java-ml.sf.net). The datasets are publicly available from the BioNLP'09 Shared Task web site (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/).
  212. Velasco, R., Zharkikh, A., Affourtit, J., Dhingra, A., Cestaro, A., Kalyanaraman, A., Fontana, P., et al. (2010). The genome of the domesticated apple (Malus x domestica Borkh.). NATURE GENETICS, 42(10), 833–839.
    We report a high-quality draft genome sequence of the domesticated apple (Malus x domestica). We show that a relatively recent (> 50 million years ago) genome-wide duplication (GWD) has resulted in the transition from nine ancestral chromosomes to 17 chromosomes in the Pyreae. Traces of older GWDs partly support the monophyly of the ancestral paleohexaploidy of eudicots. Phylogenetic reconstruction of Pyreae and the genus Malus, relative to major Rosaceae taxa, identified the progenitor of the cultivated apple as M. sieversii. Expansion of gene families reported to be involved in fruit development may explain formation of the pome, a Pyreae-specific false fruit that develops by proliferation of the basal part of the sepals, the receptacle. In apple, a subclade of MADS-box genes, normally involved in flower and fruit development, is expanded to include 15 members, as are other gene families involved in Rosaceae-specific metabolism, such as transport and assimilation of sorbitol.
  213. Baele, G., Van de Peer, Y., & Vansteelandt, S. (2010). Modelling the ancestral sequence distribution and model frequencies in context-dependent models for primate non-coding sequences. BMC EVOLUTIONARY BIOLOGY, 10.
    Background: Recent approaches for context-dependent evolutionary modelling assume that the evolution of a given site depends upon its ancestor and that ancestor's immediate flanking sites. Because such dependency pattern cannot be imposed on the root sequence, we consider the use of different orders of Markov chains to model dependence at the ancestral root sequence. Root distributions which are coupled to the context-dependent model across the underlying phylogenetic tree are deemed more realistic than decoupled Markov chains models, as the evolutionary process is responsible for shaping the composition of the ancestral root sequence. Results: We find strong support, in terms of Bayes Factors, for using a second-order Markov chain at the ancestral root sequence along with a context-dependent model throughout the remainder of the phylogenetic tree in an ancestral repeats dataset, and for using a first-order Markov chain at the ancestral root sequence in a pseudogene dataset. Relaxing the assumption of a single context-independent set of independent model frequencies as presented in previous work, yields a further drastic increase in model fit. We show that the substitution rates associated with the CpG-methylation-deamination process can be modelled through context-dependent model frequencies and that their accuracy depends on the (order of the) Markov chain imposed at the ancestral root sequence. In addition, we provide evidence that this approach (which assumes that root distribution and evolutionary model are decoupled) outperforms an approach inspired by the work of Arndt et al., where the root distribution is coupled to the evolutionary model. We show that the continuous-time approximation of Hwang and Green has stronger support in terms of Bayes Factors, but the parameter estimates show minimal differences. 
    Conclusions: We show that the combination of a dependency scheme at the ancestral root sequence and a context-dependent evolutionary model across the remainder of the tree allows for accurate estimation of the model's parameters. The different assumptions tested in this manuscript clearly show that designing accurate context-dependent models is a complex process, with many different assumptions that require validation. Further, these assumptions are shown to change across different datasets, making the search for an adequate model for a given dataset quite challenging.
  214. Amoutzias, G., & Van de Peer, Y. (2010). Single-gene and whole-genome duplications and the evolution of protein-protein interaction networks. In G. Caetano-Anollés (Ed.), Evolutionary genomics and systems biology (pp. 413–429). Hoboken, NJ, USA: Wiley-Blackwell.
  215. Sanchez-Rodriguez, A., Martens, C., Engelen, K., Van de Peer, Y., & Marchal, K. (2010). The potential for pathogenicity was present in the ancestor of the Ascomycete subphylum Pezizomycotina. BMC EVOLUTIONARY BIOLOGY, 10.
  216. Martens, C., & Van de Peer, Y. (2010). The hidden duplication past of the plant pathogen Phytophthora and its consequences for infection. BMC GENOMICS, 11.
    Background: Oomycetes of the genus Phytophthora are pathogens that infect a wide range of plant species. For dicot hosts such as tomato, potato and soybean, Phytophthora is even the most important pathogen. Previous analyses of Phytophthora genomes uncovered many genes, large gene families and large genome sizes that can partially be explained by significant repeat expansion patterns. Results: Analysis of the complete genomes of three different Phytophthora species, using a newly developed approach, unveiled a large number of small duplicated blocks, mainly consisting of two or three consecutive genes. Further analysis of these duplicated genes and comparison with the known gene and genome duplication history of ten other eukaryotes including parasites, algae, plants, fungi, vertebrates and invertebrates, suggests that the ancestor of P. infestans, P. sojae and P. ramorum most likely underwent a whole genome duplication (WGD). Genes that have survived in duplicate are mainly genes that are known to be preferentially retained following WGDs, but also genes important for pathogenicity and infection of the different hosts seem to have been retained in excess. As a result, the WGD might have contributed to the evolutionary and pathogenic success of Phytophthora. Conclusions: The fact that we find many small blocks of duplicated genes indicates that the genomes of Phytophthora species have been heavily rearranged following the WGD. Most likely, the high repeat content in these genomes has played an important role in this rearrangement process. As a consequence, the paucity of retained larger duplicated blocks has greatly complicated previous attempts to detect remnants of a large-scale duplication event in Phytophthora. However, as we show here, our newly developed strategy to identify very small duplicated blocks might be a useful approach to uncover ancient polyploidy events, in particular for heavily rearranged genomes.
  217. Baele, G., Van de Peer, Y., & Vansteelandt, S. (2010). Using non-reversible context-dependent evolutionary models to study substitution patterns in primate non-coding sequences. JOURNAL OF MOLECULAR EVOLUTION, 71(1), 34–50.
    We discuss the importance of non-reversible evolutionary models when analyzing context-dependence. Given the inherent non-reversible nature of the well-known CpG-methylation-deamination process in mammalian evolution, non-reversible context-dependent evolutionary models may be well able to accurately model such a process. In particular, the lack of constraints on non-reversible substitution models might allow for more accurate estimation of context-dependent substitution parameters. To demonstrate this, we have developed different time-homogeneous context-dependent evolutionary models to analyze a large genomic dataset of primate ancestral repeats based on existing independent evolutionary models. We have calculated the difference in model fit for each of these models using Bayes Factors obtained via thermodynamic integration. We find that non-reversible context-dependent models can drastically increase model fit when compared to independent models, and that this holds on two primate non-coding datasets. Further, we show that additional improvements are possible by clustering similar parameters across contexts.
  218. Fawcett, J., & Van de Peer, Y. (2010). Angiosperm polyploids and their road to evolutionary success. TRENDS IN EVOLUTIONARY BIOLOGY, 2(1), 16–21.
    The abundance of polyploidy among flowering plants has long been recognized, and recent studies have uncovered multiple ancient polyploidization events in the evolutionary history of several angiosperm lineages. Once polyploids are formed they must get locally established and then propagate and survive while adapting to different environments and avoiding extinction. This might ultimately lead to their long-term evolutionary success, where their descendant lineages survive for tens of millions of years. Along this road to evolutionary success, polyploids must overcome several obstacles, to which several genetic and ecological factors are likely to contribute. One recurrent observation, based on present-day polyploids, has been the high frequency of polyploids in harsh environments. Also, recent studies proposed that the success of certain ancient polyploids might be linked to periods of climatic change. Although we are still in the early stages of unraveling the factors that resulted in the long-term evolutionary success of ancient polyploids, the advances in genomic sequencing and molecular dating methods promise to enhance our understanding. It, therefore, seems timely to review our current knowledge of what determines the success of polyploids. Here, we discuss especially how harsh conditions or periods of climatic change might affect the rate of formation, establishment, persistence and long-term evolutionary success of polyploids in angiosperms.
  219. Maere, S., & Van de Peer, Y. (2010). Duplicate retention after small- and large-scale duplications. In K. Dittmar & D. Liberles (Eds.), Evolution after gene duplication (pp. 31–56). Hoboken, NJ, USA: John Wiley & Sons.
  220. Abeel, T., Van Landeghem, S., Morante, R., Van Asch, V., Van de Peer, Y., Daelemans, W., & Saeys, Y. (2010). Highlights of the BioTM 2010 workshop on advances in bio text mining. BMC BIOINFORMATICS.
    This meeting report gives an overview of the keynote lectures, the panel discussion and a selection of the contributed presentations. The workshop was held in Gent, Belgium on May 10-11. It featured a tutorial aimed towards a broad audience of (computational) biologists, (computational) linguists and researchers working purely on text mining.
  221. Michoel, T., Joshi, A. A., Bonnet, E., Vermeirssen, V., & Van de Peer, Y. (2010). Towards system level modeling of functional modules and regulatory pathways using genome-scale data. In Proceedings of the Seventh International Workshop on Computational Systems Biology (WCSB 2010) (pp. 71–74).
  222. Rehrauer, H., Aquino, C., Gruissem, W., Henz, S. R., Hilson, P., Laubinger, S., Naouar, N., et al. (2010). AGRONOMICS1: A New Resource for Arabidopsis Transcriptome Profiling. PLANT PHYSIOLOGY, 152(2), 487–499.
  223. Abeel, T., Helleputte, T., Van de Peer, Y., Dupont, P., & Saeys, Y. (2010). Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. BIOINFORMATICS, 26(3), 392–398.
    Motivation: Biomarker discovery is an important topic in biomedical applications of computational biology, including applications such as gene and SNP selection from high-dimensional data. Surprisingly, the stability with respect to sampling variation or robustness of such selection processes has received attention only recently. However, robustness of biomarkers is an important issue, as it may greatly influence subsequent biological validations. In addition, a more robust set of markers may strengthen the confidence of an expert in the results of a selection method. Results: Our first contribution is a general framework for the analysis of the robustness of a biomarker selection algorithm. Secondly, we conducted a large-scale analysis of the recently introduced concept of ensemble feature selection, where multiple feature selections are combined in order to increase the robustness of the final set of selected features. We focus on selection methods that are embedded in the estimation of support vector machines (SVMs). SVMs are powerful classification models that have shown state-of-the-art performance on several diagnosis and prognosis tasks on biological data. Their feature selection extensions also offered good results for gene selection tasks. We show that the robustness of SVMs for biomarker discovery can be substantially increased by using ensemble feature selection techniques, while at the same time improving upon classification performances. The proposed methodology is evaluated on four microarray datasets showing increases of up to almost 30% in robustness of the selected biomarkers, along with an improvement of ~15% in classification performance. The stability improvement with ensemble methods is particularly noticeable for small signature sizes (a few tens of genes), which is most relevant for the design of a diagnosis or prognosis model from a gene signature.
  224. Van de Peer, Y., Maere, S., & Meyer, A. (2010). 2R or not 2R is not the question anymore. NATURE REVIEWS GENETICS, 11(2), 166–166.
  225. Huysman, M., Martens, C., Vandepoele, K., Gillard, J., Rayko, E., Heijde, M., Bowler, C., et al. (2010). Genome-wide analysis of the diatom cell cycle unveils a novel type of cyclins involved in environmental signaling. GENOME BIOLOGY, 11(2).
    Background: Despite the enormous importance of diatoms in aquatic ecosystems and their broad industrial potential, little is known about their life cycle control. Diatoms typically inhabit rapidly changing and unstable environments, suggesting that cell cycle regulation in diatoms must have evolved to adequately integrate various environmental signals. The recent genome sequencing of Thalassiosira pseudonana and Phaeodactylum tricornutum allows us to explore the molecular conservation of cell cycle regulation in diatoms. Results: By profile-based annotation of cell cycle genes, counterparts of conserved as well as new regulators were identified in T. pseudonana and P. tricornutum. In particular, the cyclin gene family was found to be expanded extensively compared to that of other eukaryotes and a novel type of cyclins was discovered, the diatom-specific cyclins. We established a synchronization method for P. tricornutum that enabled assignment of the different annotated genes to specific cell cycle phase transitions. The diatom-specific cyclins are predominantly expressed at the G1-to-S transition and some respond to phosphate availability, hinting at a role in connecting cell division to environmental stimuli. Conclusion: The discovery of highly conserved and new cell cycle regulators suggests the evolution of unique control mechanisms for diatom cell division, probably contributing to their ability to adapt and survive under highly fluctuating environmental conditions.
  226. Amoutzias, G., He, Y., Gordon, J., Mossialos, D., Oliver, S. G., & Van de Peer, Y. (2010). Posttranslational regulation impacts the fate of duplicated genes. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 107(7), 2967–2971.
    Gene and genome duplications create novel genetic material on which evolution can work and have therefore been recognized as a major source of innovation for many eukaryotic lineages. Following duplication, the most likely fate is gene loss; however, a considerable fraction of duplicated genes survive. Not all genes have the same probability of survival, but it is not fully understood what evolutionary forces determine the pattern of gene retention. Here, we use genome sequence data as well as large-scale phosphoproteomics data from the baker's yeast Saccharomyces cerevisiae, which underwent a whole-genome duplication ~100 mya, and show that the number of phosphorylation sites on the proteins they encode is a major determinant of gene retention. Protein phosphorylation motifs are short amino acid sequences that are usually embedded within unstructured and rapidly evolving protein regions. Reciprocal loss of those ancestral sites and the gain of new ones are major drivers in the retention of the two surviving duplicates and in their acquisition of distinct functions. This way, small changes in the sequences of unstructured regions in proteins can contribute to the rapid rewiring and adaptation of regulatory networks.
  227. Bonnet, E., Tatari, M., Joshi, A. M., Michoel, T., Marchal, K., Berx, G., & Van de Peer, Y. (2010). Module network inference from a cancer gene expression data set identifies microRNA regulated modules. PLOS ONE, 5(4).
    Background: MicroRNAs (miRNAs) are small RNAs that recognize and regulate mRNA target genes. Multiple lines of evidence indicate that they are key regulators of numerous critical functions in development and disease, including cancer. However, defining the place and function of miRNAs in complex regulatory networks is not straightforward. Systems approaches, like the inference of a module network from expression data, can help to achieve this goal. Methodology/Principal Findings: During the last decade, much progress has been made in the development of robust and powerful module network inference algorithms. In this study, we analyze and assess experimentally a module network inferred from both miRNA and mRNA expression data, using our recently developed module network inference algorithm based on probabilistic optimization techniques. We show that several miRNAs are predicted as statistically significant regulators for various modules of tightly co-expressed genes. A detailed analysis of three of those modules demonstrates that the specific assignment of miRNAs is functionally coherent and supported by literature. We further designed a set of experiments to test the assignment of miR-200a as the top regulator of a small module of nine genes. The results strongly suggest that miR-200a is regulating the module genes via the transcription factor ZEB1. Interestingly, this module is most likely involved in epithelial homeostasis and its dysregulation might contribute to the malignant process in cancer cells. Conclusions/Significance: Our results show that a robust module network analysis of expression data can provide novel insights of miRNA function in important cellular processes. Such a computational approach, starting from expression data alone, can be helpful in the process of identifying the function of miRNAs by suggesting modules of co-expressed genes in which they play a regulatory role. As shown in this study, those modules can then be tested experimentally to further investigate and refine the function of the miRNA in the regulatory network.
  228. Joshi, A. M., Van Parys, T., Van de Peer, Y., & Michoel, T. (2010). Characterizing regulatory path motifs in integrated networks using perturbational data. Genome Biology, 11(3).
    We introduce Pathicular (http://bioinformatics.psb.ugent.be/software/details/Pathicular), a Cytoscape plugin for studying the cellular response to perturbations of transcription factors by integrating perturbational expression data with transcriptional, protein-protein and phosphorylation networks. Pathicular searches for 'regulatory path motifs', short paths in the integrated physical networks which occur significantly more often than expected between transcription factors and their targets in the perturbational data. A case study in Saccharomyces cerevisiae identifies eight regulatory path motifs and demonstrates their biological significance.
  229. Cock, J. M., Sterck, L., Rouzé, P., Scornet, D., Allen, A. E., Amoutzias, G., … Wincker, P. (2010). The Ectocarpus genome and the independent evolution of multicellularity in brown algae. Nature, 465(7298), 617–621.
    Brown algae (Phaeophyceae) are complex photosynthetic organisms with a very different evolutionary history to green plants, to which they are only distantly related. These seaweeds are the dominant species in rocky coastal ecosystems and they exhibit many interesting adaptations to these, often harsh, environments. Brown algae are also one of only a small number of eukaryotic lineages that have evolved complex multicellularity. We report the 214 million base pair (Mbp) genome sequence of the filamentous seaweed Ectocarpus siliculosus (Dillwyn) Lyngbye, a model organism for brown algae, closely related to the kelps. Genome features such as the presence of an extended set of light-harvesting and pigment biosynthesis genes and new metabolic processes such as halide metabolism help explain the ability of this organism to cope with the highly variable tidal environment. The evolution of multicellularity in this lineage is correlated with the presence of a rich array of signal transduction genes. Of particular interest is the presence of a family of receptor kinases, as the independent evolution of related molecules has been linked with the emergence of multicellularity in both the animal and green plant lineages. The Ectocarpus genome sequence represents an important step towards developing this organism as a model species, providing the possibility to combine genomic and genetic approaches to explore these and other aspects of brown algal biology further.
  230. Kernbach, S., Hamann, H., Stradner, J., Thenius, R., Schmickl, T., Crailsheim, K., van Rossum, A. C., et al. (2009). On adaptive self-organization in artificial robot organisms. 2009 Computation world : future computing, service computation, cognitive, adaptive, content, patterns conference (pp. 33–43). Presented at the 2009 Computation World : Future computing, service computation, cognitive, adaptive, content, patterns conference, New York, NY, USA: IEEE.
    Self-organization in natural systems demonstrates very reliable and scalable collective behavior without using any central elements. When providing collective robotic systems with self-organizing principles, we are facing new problems of making self-organization purposeful, self-adapting to changing environments and faster, in order to meet requirements from a technical perspective. This paper describes on-going work of creating such an artificial self-organization within artificial robot organisms, performed in the framework of several European projects.
  231. Joshi, A. M., De Smet, R., Marchal, K., Van de Peer, Y., & Michoel, T. (2009). Module networks revisited: computational assessment and prioritization of model predictions. Bioinformatics, 25(4), 490–496.
    Motivation: The solution of high-dimensional inference and prediction problems in computational biology is almost always a compromise between mathematical theory and practical constraints, such as limited computational resources. As time progresses, computational power increases but well-established inference methods often remain locked in their initial suboptimal solution. Results: We revisit the approach of Segal et al. to infer regulatory modules and their condition-specific regulators from gene expression data. In contrast to their direct optimization-based solution, we use a more representative centroid-like solution extracted from an ensemble of possible statistical models to explain the data. The ensemble method automatically selects a subset of most informative genes and builds a quantitatively better model for them. Genes which cluster together in the majority of models produce functionally more coherent modules. Regulators which are consistently assigned to a module are more often supported by literature, but a single model always contains many regulator assignments not supported by the ensemble. Reliably detecting condition-specific or combinatorial regulation is particularly hard in a single optimum but can be achieved using ensemble averaging.
  232. Van Landeghem, S., Saeys, Y., De Baets, B., & Van de Peer, Y. (2009). Analyzing text in search of bio-molecular events: a high-precision machine learning framework. Proceedings of the workshop on BioNLP : shared task (pp. 128–136). Presented at the Natural Language Processing in Biomedicine (BioNLP) NAACL 2009 Workshop, Association for Computational Linguistics (ACL).
    The BioNLP'09 Shared Task on Event Extraction is a challenge which concerns the detection of bio-molecular events from text. In this paper, we present a detailed account of the challenges encountered during the construction of a machine learning framework for participation in this task. We have focused our work mainly around the filtering of false positives, creating a high-precision extraction method. We have tested techniques such as SVMs, feature selection and various filters for data pre- and post-processing, and report on the influence on performance for each of them. To detect negation and speculation in text, we describe a custom-made rule-based system which is simple in design, but effective in performance.
  233. Van de Peer, Y. (2009). Phylogenetic inference based on distance methods : theory. In P. Lemey, M. Salemi, & A.-M. Vandamme (Eds.), The phylogenetic handbook : a practical approach to phylogenetic analysis and hypothesis testing (pp. 142–160). Cambridge, UK: Cambridge University Press.
  234. Baele, G., Van de Peer, Y., & Vansteelandt, S. (2009). Efficient context-dependent model building based on clustering posterior distributions for non-coding sequences. BMC Evolutionary Biology, 9, 87.1–87.23.
    Background: Many recent studies that relax the assumption of independent evolution of sites have done so at the expense of a drastic increase in the number of substitution parameters. While additional parameters cannot be avoided to model context-dependent evolution, a large increase in model dimensionality is only justified when accompanied with careful model-building strategies that guard against overfitting. An increased dimensionality leads to increases in numerical computations of the models, increased convergence times in Bayesian Markov chain Monte Carlo algorithms and even more tedious Bayes Factor calculations. Results: We have developed two model-search algorithms which reduce the number of Bayes Factor calculations by clustering posterior densities to decide on the equality of substitution behavior in different contexts. The selected model's fit is evaluated using a Bayes Factor, which we calculate via model-switch thermodynamic integration. To reduce computation time and to increase the precision of this integration, we propose to split the calculations over different computers and to appropriately calibrate the individual runs. Using the proposed strategies, we find, in a dataset of primate Ancestral Repeats, that careful modeling of context-dependent evolution may increase model fit considerably and that the combination of a context-dependent model with the assumption of varying rates across sites offers even larger improvements in terms of model fit. Using a smaller nuclear SSU rRNA dataset, we show that context-dependence may only become detectable upon applying model-building strategies. Conclusion: While context-dependent evolutionary models can increase the model fit over traditional independent evolutionary models, such complex models will often contain too many parameters. Justification for the added parameters is thus required so that only those parameters that model evolutionary processes previously unaccounted for are added to the evolutionary model. To obtain an optimal balance between the number of parameters in a context-dependent model and the performance in terms of model fit, we have designed two parameter-reduction strategies and we have shown that model fit can be greatly improved by reducing the number of parameters in a context-dependent evolutionary model.
  235. Vandepoele, K., Quimbaya Gomez, M. A., Casneuf, T., De Veylder, L., & Van de Peer, Y. (2009). Unraveling Transcriptional Control in Arabidopsis Using cis-Regulatory Elements and Coexpression Networks. Plant Physiology, 150(2), 535–546.
    Analysis of gene expression data generated by high-throughput microarray transcript profiling experiments has demonstrated that genes with an overall similar expression pattern are often enriched for similar functions. This guilt-by-association principle can be applied to define modular gene programs, identify cis-regulatory elements, or predict gene functions for unknown genes based on their coexpression neighborhood. We evaluated the potential to use Gene Ontology (GO) enrichment of a gene's coexpression neighborhood as a tool to predict its function but found overall low sensitivity scores (13%-34%). This indicates that for many functional categories, coexpression alone performs poorly to infer known biological gene functions. However, integration of cis-regulatory elements shows that 46% of the gene coexpression neighborhoods are enriched for one or more motifs, providing a valuable complementary source to functionally annotate genes. Through the integration of coexpression data, GO annotations, and a set of known cis-regulatory elements combined with a novel set of evolutionarily conserved plant motifs, we could link many genes and motifs to specific biological functions. Application of our coexpression framework extended with cis-regulatory element analysis on transcriptome data from the cell cycle-related transcription factor OBP1 yielded several coexpressed modules associated with specific cis-regulatory elements. Moreover, our analysis strongly suggests a feed-forward regulatory interaction between OBP1 and the E2F pathway. The ATCOECIS resource (http://bioinformatics.psb.ugent.be/ATCOECIS/) makes it possible to query coexpression data and GO and cis-regulatory element annotations and to submit user-defined gene sets for motif analysis, providing an access point to unravel the regulatory code underlying transcriptional control in Arabidopsis (Arabidopsis thaliana).
  236. Vermeirssen, V., Joshi, A. M., Michoel, T., Bonnet, E., Casneuf, T., & Van de Peer, Y. (2009). Transcription regulatory networks in Caenorhabditis elegans inferred through reverse-engineering of gene expression profiles constitute biological hypotheses for metazoan development. Molecular BioSystems, 5(12), 1817–1830.
    Differential gene expression governs the development, function and pathology of multicellular organisms. Transcription regulatory networks study differential gene expression at a systems level by mapping the interactions between regulatory proteins and target genes. While microarray transcription profiles are the most abundant data for gene expression, it remains challenging to correctly infer the underlying transcription regulatory networks. The reverse-engineering algorithm LeMoNe (learning module networks) uses gene expression profiles to extract ensemble transcription regulatory networks of coexpression modules and their prioritized regulators. Here we apply LeMoNe to a compendium of microarray studies of the worm Caenorhabditis elegans. We obtain 248 modules with a regulation program for 5020 genes and 426 regulators and a total of 24 012 predicted transcription regulatory interactions. Through GO enrichment analysis, comparison with the gene-gene association network WormNet and integration of other biological data, we show that LeMoNe identifies functionally coherent coexpression modules and prioritizes regulators that relate to similar biological processes as the module genes. Furthermore, we can predict new functional relationships for uncharacterized genes and regulators. Based on modules involved in molting, meiosis and oogenesis, ciliated sensory neurons and mitochondrial metabolism, we illustrate the value of LeMoNe as a biological hypothesis generator for differential gene expression in greater detail. In conclusion, through reverse-engineering of C. elegans expression data, we obtained transcription regulatory networks that can provide further insight into metazoan development.
  237. Piganeau, G., Vandepoele, K., Gourbière, S., Van de Peer, Y., & Moreau, H. (2009). Unravelling cis-Regulatory Elements in the Genome of the Smallest Photosynthetic Eukaryote: Phylogenetic Footprinting in Ostreococcus. Journal of Molecular Evolution, 69(3), 249–259.
    We used a phylogenetic footprinting approach, adapted to high levels of divergence, to estimate the level of constraint in intergenic regions of the extremely gene dense Ostreococcus algae genomes (Chlorophyta, Prasinophyceae). We first benchmarked our method against the Saccharomyces sensu stricto genome data and found that the proportion of conserved non-coding sites was consistent with those obtained with methods using calibration by the neutral substitution rate. We then applied our method to the complete genomes of Ostreococcus tauri and O. lucimarinus, which are the most divergent species from the same genus sequenced so far. We found that 77% of intergenic regions in Ostreococcus still contain some phylogenetic footprints, as compared to 88% for Saccharomyces, corresponding to an average rate of constraint on intergenic region of 17% and 30%, respectively. A comparison with some known functional cis-regulatory elements enabled us to investigate whether some transcriptional regulatory pathways were conserved throughout the green lineage. Strikingly, the size of the phylogenetic footprints depends on gene orientation of neighboring genes, and appears to be genus-specific. In Ostreococcus, 5' intergenic regions contain four times more conserved sites than 3' intergenic regions, whereas in yeast a higher frequency of constrained sites in intergenic regions between genes on the same DNA strand suggests a higher frequency of bidirectional regulatory elements. The phylogenetic footprinting approach can be used despite high levels of divergence in the ultrasmall Ostreococcus algae, to decipher structure of constrained regulatory motifs, and identify putative regulatory pathways conserved within the green lineage.
  238. Van de Peer, Y., Maere, S., & Meyer, A. (2009). The evolutionary significance of ancient genome duplications. Nature Reviews Genetics, 10(10), 725–732.
    Many organisms are currently polyploid, or have a polyploid ancestry and now have secondarily 'diploidized' genomes. This finding is surprising because retained whole-genome duplications (WGDs) are exceedingly rare, suggesting that polyploidy is usually an evolutionary dead end. We argue that ancient genome doublings could probably have survived only under very specific conditions, but that, whenever established, they might have had a pronounced impact on species diversification, and led to an increase in biological complexity and the origin of evolutionary novelties.
  239. De Schutter, K., Lin, Y.-C., Tiels, P., Van Hecke, A., Glinka, S., Weber-Lehmann, J., Rouzé, P., et al. (2009). Genome sequence of the recombinant protein production host Pichia pastoris. Nature Biotechnology, 27(6), 561–566.
    The methylotrophic yeast Pichia pastoris is widely used for the production of proteins and as a model organism for studying peroxisomal biogenesis and methanol assimilation. P. pastoris strains capable of human-type N-glycosylation are now available, which increases the utility of this organism for biopharmaceutical production. Despite its biotechnological importance, relatively few genetic tools or engineered strains have been generated for P. pastoris. To facilitate progress in these areas, we present the 9.43 Mbp genomic sequence of the GS115 strain of P. pastoris. We also provide manually curated annotation for its 5,313 protein-coding genes.
  240. Dittami, S., Scornet, D., Petit, J.-L., Ségurens, B., Da Silva, C., Corre, E., Dondrup, M., et al. (2009). Global expression analysis of the brown alga Ectocarpus siliculosus (Phaeophyceae) reveals large-scale reprogramming of the transcriptome in response to abiotic stress. Genome Biology, 10, R66.1–R66.20.
    Background: Brown algae (Phaeophyceae) are phylogenetically distant from red and green algae and an important component of the coastal ecosystem. They have developed unique mechanisms that allow them to inhabit the intertidal zone, an environment with high levels of abiotic stress. Ectocarpus siliculosus is being established as a genetic and genomic model for the brown algal lineage, but little is known about its response to abiotic stress. Results: Here we examine the transcriptomic changes that occur during the short term acclimation of E. siliculosus to three different abiotic stress conditions (hyposaline, hypersaline and oxidative stress). Our results show that almost 70% of the expressed genes are regulated in response to at least one of these stressors. Although there are several common elements with terrestrial plants, such as repression of growth-related genes, switching from primary production to protein and nutrient recycling processes, and induction of genes involved in vesicular trafficking, many of the stress-regulated genes are either not known to respond to stress in other organisms or have been found exclusively in E. siliculosus. Conclusions: This first large-scale transcriptomic study of a brown alga demonstrates that, unlike terrestrial plants, E. siliculosus undergoes extensive reprogramming of its transcriptome during the acclimation to mild abiotic stress. We identify several new genes and pathways with a putative function in the stress response and thus pave the way for more detailed investigations of the mechanisms underlying the stress tolerance of brown algae.
  241. Mueller, L., Klein Lankhorst, R., Tanksley, S. D., Giovannoni, J. J., White, R., Vrebalov, J., Fei, Z., et al. (2009). A snapshot of the emerging tomato genome sequence. Plant Genome, 2(1), 78–92.
    The genome of tomato (Solanum lycopersicum L.) is being sequenced by an international consortium of 10 countries (Korea, China, the United Kingdom, India, the Netherlands, France, Japan, Spain, Italy, and the United States) as part of the larger “International Solanaceae Genome Project (SOL): Systems Approach to Diversity and Adaptation” initiative. The tomato genome sequencing project uses an ordered bacterial artificial chromosome (BAC) approach to generate a high-quality tomato euchromatic genome sequence for use as a reference genome for the Solanaceae and euasterids. Sequence is deposited at GenBank and at the SOL Genomics Network (SGN). Currently, there are around 1000 BACs finished or in progress, representing more than a third of the projected euchromatic portion of the genome. An annotation effort is also underway by the International Tomato Annotation Group. The expected number of genes in the euchromatin is ∼40,000, based on an estimate from a preliminary annotation of 11% of finished sequence. Here, we present this first snapshot of the emerging tomato genome and its annotation, a short comparison with potato (Solanum tuberosum L.) sequence data, and the tools available for researchers to exploit this new resource. In the future, whole-genome shotgun techniques will be combined with the BAC-by-BAC approach to cover the entire tomato genome. The high-quality reference euchromatic tomato sequence is expected to be near completion by 2010.
  242. Fawcett, J., Maere, S., & Van de Peer, Y. (2009). Plants with double genomes might have had a better chance to survive the Cretaceous-Tertiary extinction event. Proceedings of the National Academy of Sciences of the United States of America, 106(14), 5737–5742.
    Most flowering plants have been shown to be ancient polyploids that have undergone one or more whole genome duplications early in their evolution. Furthermore, many different plant lineages seem to have experienced an additional, more recent genome duplication. Starting from paralogous genes lying in duplicated segments or identified in large expressed sequence tag collections, we dated these youngest duplication events through penalized likelihood phylogenetic tree inference. We show that a majority of these independent genome duplications are clustered in time and seem to coincide with the Cretaceous-Tertiary (KT) boundary. The KT extinction event is the most recent mass extinction caused by one or more catastrophic events such as a massive asteroid impact and/or increased volcanic activity. These events are believed to have generated global wildfires and dust clouds that cut off sunlight during long periods of time resulting in the extinction of approximately 60% of plant species, as well as a majority of animals, including dinosaurs. Recent studies suggest that polyploid species can have a higher adaptability and increased tolerance to different environmental conditions. We propose that polyploidization may have contributed to the survival and propagation of several plant lineages during or following the KT extinction event. Due to advantages such as altered gene expression leading to hybrid vigor and an increased set of genes and alleles available for selection, polyploid plants might have been better able to adapt to the drastically changed environment 65 million years ago.
  243. Michoel, T., De Smet, R., Joshi, A. M., Marchal, K., & Van de Peer, Y. (2009). Reverse-engineering transcriptional modules from gene expression data. (G. Stolovitzky, P. Kahlem, & A. Califano, Eds.) Annals of the New York Academy of Sciences, 1158, 36–43. Presented at the ENFIN-DREAM Conference on the Assessment of Computational Methods in Systems Biology (DREAM2 Conference).
    "Module networks" are a framework to learn gene regulatory networks from expression data using a probabilistic model in which coregulated genes share the same parameters and conditional distributions. We present a method to infer ensembles of such networks and an averaging procedure to extract the statistically most significant modules and their regulators. We show that the inferred probabilistic models extend beyond the dataset used to learn the models.
  244. Worden, A. Z., Lee, J.-H., Mock, T., Rouzé, P., Simmons, M. P., Aerts, A. L., Allen, A. E., et al. (2009). Green evolution and dynamic adaptations revealed by genomes of the marine picoeukaryotes Micromonas. Science, 324(5924), 268–272.
    Picoeukaryotes are a taxonomically diverse group of organisms less than 2 micrometers in diameter. Photosynthetic marine picoeukaryotes in the genus Micromonas thrive in ecosystems ranging from tropical to polar and could serve as sentinel organisms for biogeochemical fluxes of modern oceans during climate change. These broadly distributed primary producers belong to an anciently diverged sister clade to land plants. Although Micromonas isolates have high 18S ribosomal RNA gene identity, we found that genomes from two isolates shared only 90% of their predicted genes. Their independent evolutionary paths were emphasized by distinct riboswitch arrangements as well as the discovery of intronic repeat elements in one isolate, and in metagenomic data, but not in other genomes. Divergence appears to have been facilitated by selection and acquisition processes that actively shape the repertoire of genes that are mutually exclusive between the two isolates differently than the core genes. Analyses of the Micromonas genomes offer valuable insights into ecological differentiation and the dynamic nature of early plant evolution.
  245. Abeel, T., Van de Peer, Y., & Saeys, Y. (2009). Toward a gold standard for promoter prediction evaluation. Bioinformatics, 25(12), i313–i320. Presented at the Joint conference of Intelligent Systems for Molecular Biology and the European conference on Computational Biology.
    Motivation: Promoter prediction is an important task in genome annotation projects, and during the past years many new promoter prediction programs (PPPs) have emerged. However, many of these programs are compared inadequately to other programs. In most cases, only a small portion of the genome is used to evaluate the program, which is not a realistic setting for whole genome annotation projects. In addition, a common evaluation design to properly compare PPPs is still lacking. Results: We present a large-scale benchmarking study of 17 state-of-the-art PPPs. A multi-faceted evaluation strategy is proposed that can be used as a gold standard for promoter prediction evaluation, allowing authors of promoter prediction software to compare their method to existing methods in a proper way. This evaluation strategy is subsequently used to compare the chosen promoter predictors, and an in-depth analysis on predictive performance, promoter class specificity, overlap between predictors and positional bias of the predictions is conducted.
  246. Abeel, T., Van de Peer, Y., & Saeys, Y. (2009). Java-ML: a machine learning library. Journal of Machine Learning Research, 10, 931–934.
    Java-ML is a collection of machine learning and data mining algorithms, which aims to be a readily usable and easily extensible API for both software developers and research scientists. The interfaces for each type of algorithm are kept simple and algorithms strictly follow their respective interface. Comparing different classifiers or clustering algorithms is therefore straightforward, and implementing new algorithms is also easy. The implementations of the algorithms are clearly written, properly documented and can thus be used as a reference. The library is written in Java and is available from http://java-ml.sourceforge.net/ under the GNU GPL license.
  247. Van de Peer, Y., Fawcett, J., Proost, S., Sterck, L., & Vandepoele, K. (2009). The flowering world: a tale of duplications. Trends in Plant Science, 14(12), 680–688.
    Flowering plants contain many genes, most of which were created during the past 200 or so million years through small- and large-scale duplications. Paleo-polyploidy events, in particular, have been the subject of much recent research. There is a growing consensus that one or more genome doubling or merging events occurred early during the evolution of the flowering plants, and that many lineages have since undergone additional, independent and more recent duplication events. Here, we review the difficulties in determining the number of genome duplications and discuss how the completion of some additional genome sequences of species occupying key phylogenetic positions has led to a better understanding of the timing of certain duplication events. This is important if we want to demonstrate the significance of genome duplications for the evolution and radiation of (different groups of) flowering plants.
  248. Baele, G., Bredeche, N., Haasdijk, E., Maere, S., Michiels, N., Van de Peer, Y., Schmickl, T., et al. (2009). Open-ended on-board evolutionary robotics for robot swarms. IEEE Congress on Evolutionary Computation (pp. 1123–1130). Presented at the 2009 IEEE Congress on Evolutionary Computation (CEC 2009), New York, NY, USA: IEEE.
    The SYMBRION project stands at the crossroads of artificial life and evolutionary robotics: a swarm of real robots undergoes online evolution by exchanging information in a decentralized Evolutionary Robotics Scheme: the diffusion of each individual's genotype depends both on its ability to survive in an unknown environment as well as its ability to maximize mating opportunities during its lifetime, which suggests an implicit fitness. This paper presents early research and prospective ideas in the context of large-scale swarm robotics projects, focusing on the open-ended evolutionary approach in the SYMBRION project. One key issue of this work is to perform on-board evolution in a spatially distributed population of robots. A real-world experiment is also described which yields important considerations regarding open-ended evolution with real autonomous robots.
  249. De Bodt, S., Proost, S., Vandepoele, K., Rouzé, P., & Van de Peer, Y. (2009). Predicting protein-protein interactions in Arabidopsis thaliana through integration of orthology, gene ontology and co-expression. BMC Genomics, 10(288), 1–15.
    Background: Large-scale identification of the interrelationships between different components of the cell, such as the interactions between proteins, has recently gained great interest. However, unraveling large-scale protein-protein interaction maps is laborious and expensive. Moreover, assessing the reliability of the interactions can be cumbersome. Results: In this study, we have developed a computational method that exploits the existing knowledge on protein-protein interactions in diverse species through orthologous relations on the one hand, and functional association data on the other hand to predict and filter protein-protein interactions in Arabidopsis thaliana. A highly reliable set of protein-protein interactions is predicted through this integrative approach making use of existing protein-protein interaction data from yeast, human, C. elegans and D. melanogaster. Localization, biological process, and co-expression data are used as powerful indicators for protein-protein interactions. The functional repertoire of the identified interactome reveals interactions between proteins functioning in well-conserved as well as plant-specific biological processes. We observe that although common mechanisms (e.g. actin polymerization) and components (e.g. ARPs, actin-related proteins) exist between different lineages, they are active in specific processes such as growth, cancer metastasis and trichome development in yeast, human and Arabidopsis, respectively. Conclusion: We conclude that the integration of orthology with functional association data is adequate to predict protein-protein interactions. Through this approach, a high number of novel protein-protein interactions with diverse biological roles is discovered. Overall, we have predicted a reliable set of protein-protein interactions suitable for further computational as well as experimental analyses.
  250. Michoel, T., De Smet, R., Joshi, A. M., Van de Peer, Y., & Marchal, K. (2009). Comparative analysis of module-based versus direct methods for reverse-engineering transcriptional regulatory networks. BMC Systems Biology, 3(49), 1–13.
    Background: A myriad of methods to reverse-engineer transcriptional regulatory networks have been developed in recent years. Direct methods directly reconstruct a network of pairwise regulatory interactions while module-based methods predict a set of regulators for modules of coexpressed genes treated as a single unit. To date, there has been no systematic comparison of the relative strengths and weaknesses of both types of methods. Results: We have compared a recently developed module-based algorithm, LeMoNe (Learning Module Networks), to a mutual information based direct algorithm, CLR (Context Likelihood of Relatedness), using benchmark expression data and databases of known transcriptional regulatory interactions for Escherichia coli and Saccharomyces cerevisiae. A global comparison using recall versus precision curves hides the topologically distinct nature of the inferred networks and is not informative about the specific subtasks for which each method is most suited. Analysis of the degree distributions and a regulator specific comparison show that CLR is 'regulator-centric', making true predictions for a higher number of regulators, while LeMoNe is 'target-centric', recovering a higher number of known targets for fewer regulators, with limited overlap in the predicted interactions between both methods. Detailed biological examples in E. coli and S. cerevisiae are used to illustrate these differences and to prove that each method is able to infer parts of the network where the other fails. Biological validation of the inferred networks cautions against over-interpreting recall and precision values computed using incomplete reference networks. Conclusion: Our results indicate that module-based and direct methods retrieve largely distinct parts of the underlying transcriptional regulatory networks. The choice of algorithm should therefore be based on the particular biological problem of interest and not on global metrics which cannot be transferred between organisms. The development of sound statistical methods for integrating the predictions of different reverse-engineering strategies emerges as an important challenge for future research.
  251. Proost, S., Van Bel, M., Sterck, L., Billiau, K., Van Parys, T., Van de Peer, Y., & Vandepoele, K. (2009). PLAZA: a comparative genomics resource to study gene and genome evolution in plants. PLANT CELL, 21(12), 3718–3731.
    The number of sequenced genomes of representatives within the green lineage is rapidly increasing. Consequently, comparative sequence analysis has significantly altered our view on the complexity of genome organization, gene function, and regulatory pathways. To explore all this genome information, a centralized infrastructure is required where all data generated by different sequencing initiatives is integrated and combined with advanced methods for data mining. Here, we describe PLAZA, an online platform for plant comparative genomics (http://bioinformatics.psb.ugent.be/plaza/). This resource integrates structural and functional annotation of published plant genomes together with a large set of interactive tools to study gene function and gene and genome evolution. Precomputed data sets cover homologous gene families, multiple sequence alignments, phylogenetic trees, intraspecies whole-genome dot plots, and genomic colinearity between species. Through the integration of high confidence Gene Ontology annotations and tree-based orthology between related species, thousands of genes lacking any functional description are functionally annotated. Advanced query systems, as well as multiple interactive visualization tools, are available through a user-friendly and intuitive Web interface. In addition, detailed documentation and tutorials introduce the different tools, while the workbench provides an efficient means to analyze user-defined gene sets through PLAZA's interface. In conclusion, PLAZA provides a comprehensible and up-to-date research environment to aid researchers in the exploration of genome information within the green plant lineage.
  252. Van de Peer, Y. (2009). Computational approaches to unveiling ancient genome duplications. FEBS JOURNAL (Vol. 276, pp. 9–9). Presented at the 34th FEBS congress.
  253. Cock, J. M., Scornet, D., Peters, A. F., Sterck, L., Rouzé, P., Van de Peer, Y., … Wincker, P. (2009). Evolution of multicellularity in the heterokont lineage: analysis of the Ectocarpus siliculosus genome sequence. In K. Ishida, H. Nozaki, H. Miyashita, T. Horiguchi, & H. Kawai (Eds.), PHYCOLOGIA (Vol. 48, pp. 22–22). Tokyo, Japan.
  254. Vandenbroucke, K., Robbens, S., Vandepoele, K., Inzé, D., Van de Peer, Y., & Van Breusegem, F. (2008). Hydrogen peroxide-induced gene expression across kingdoms: a comparative analysis. MOLECULAR BIOLOGY AND EVOLUTION, 25(3), 507–516.
    Cells react to oxidative stress conditions by launching a defense response through the induction of nuclear gene expression. The advent of microarray technologies allowed monitoring of oxidative stress-dependent changes of transcript levels at a comprehensive and genome-wide scale, resulting in a series of inventories of differentially expressed genes in different organisms. We performed a meta-analysis on hydrogen peroxide (H2O2)-induced gene expression in the cyanobacterium Synechocystis PCC 6803, the yeast Saccharomyces cerevisiae and Schizosaccharomyces pombe, the land plant Arabidopsis thaliana, and the human HeLa cell line. The H2O2-induced gene expression in both yeast species was highly conserved and more similar to the A. thaliana response than that of the human cell line. Based on the expression characteristics of genuine antioxidant genes, we show that the antioxidant capacity of microorganisms and higher eukaryotes is differentially regulated. Four families of evolutionarily conserved eukaryotic proteins could be identified that were H2O2 responsive across kingdoms: DNAJ domain-containing heat shock proteins, small guanine triphosphate-binding proteins, Ca2+-dependent protein kinases, and ubiquitin-conjugating enzymes.
  255. Tzika, A. C., Helaers, R., Van de Peer, Y., & Milinkovitch, M. C. (2008). MANTIS: a phylogenetic framework for multi-species genome comparisons. BIOINFORMATICS, 24(2), 151–157.
    Motivation: Practitioners of comparative genomics face huge analytical challenges as whole genome sequences and functional/expression data accumulate. Furthermore, the field would greatly benefit from a better integration of this wealth of data with evolutionary concepts. Results: Here, we present MANTIS, a relational database for the analysis of (i) gains and losses of genes on specific branches of the metazoan phylogeny, (ii) reconstructed genome content of ancestral species and (iii) over- or under-representation of functions/processes and tissue specificity of gained, duplicated and lost genes. MANTIS estimates the most likely positions of gene losses on the true phylogeny using a maximum-likelihood function. A user-friendly interface and an extensive query system allow users to investigate questions pertaining to gene identity, phylogenetic mapping and function/expression parameters.
  256. John, U., Beszteri, B., Derelle, E., Van de Peer, Y., Read, B., Moreau, H., & Cembella, A. (2008). Novel insights into evolution of protistan polyketide synthases through phylogenomic analysis. PROTIST, 159(1), 21–30.
  257. Robbens, S., Rouzé, P., Cock, J. M., Spring, J., Worden, A. Z., & Van de Peer, Y. (2008). The FTO gene, implicated in human obesity, is found only in vertebrates and marine algae. JOURNAL OF MOLECULAR EVOLUTION, 66(1), 80–84.
    Human obesity is a main cause of morbidity and mortality. Recently, several studies have demonstrated an association between the FTO gene locus and early onset and severe obesity. To date, the FTO gene has only been discovered in vertebrates. We identified FTO homologs in the complete genome sequences of various evolutionarily diverse marine eukaryotic algae, ranging from unicellular photosynthetic picoplankton to a multicellular seaweed. However, FTO homologs appear to be absent from all other completely sequenced genomes of plants, fungi, and invertebrate animals. Although the biological roles of these marine algal FTO homologs are still unknown, these genes will be useful for exploring basic protein features and could hence help unravel the function of the FTO gene in vertebrates and its inferred link with obesity in humans.
  258. Abeel, T., Saeys, Y., Bonnet, E., Rouzé, P., & Van de Peer, Y. (2008). Generic eukaryotic core promoter prediction using structural features of DNA. GENOME RESEARCH, 18(2), 310–323.
    Despite many recent efforts, in silico identification of promoter regions is still in its infancy. However, the accurate identification and delineation of promoter regions is important for several reasons, such as improving genome annotation and devising experiments to study and understand transcriptional regulation. Current methods to identify the core region of promoters require large amounts of high-quality training data and often behave like black box models that output predictions that are difficult to interpret. Here, we present a novel approach for predicting promoters in whole-genome sequences by using large-scale structural properties of DNA. Our technique requires no training, is applicable to many eukaryotic genomes, and performs extremely well in comparison with the best available promoter prediction programs. Moreover, it is fast, simple in design, and has no size constraints, and the results are easily interpretable. We compared our approach with 14 current state-of-the-art implementations using human gene and transcription start site data and analyzed the ENCODE region in more detail. We also validated our method on 12 additional eukaryotic genomes, including vertebrates, invertebrates, plants, fungi, and protists.
  259. Martin, F., Aerts, A., Ahrén, D., Brun, A., Danchin, E., Duchaussoy, F., Gibon, J., et al. (2008). The genome of Laccaria bicolor provides insights into mycorrhizal symbiosis. NATURE, 452(7183), 88–92.
    Mycorrhizal symbioses - the union of roots and soil fungi - are universal in terrestrial ecosystems and may have been fundamental to land colonization by plants(1,2). Boreal, temperate and montane forests all depend on ectomycorrhizae(1). Identification of the primary factors that regulate symbiotic development and metabolic activity will therefore open the door to understanding the role of ectomycorrhizae in plant development and physiology, allowing the full ecological significance of this symbiosis to be explored. Here we report the genome sequence of the ectomycorrhizal basidiomycete Laccaria bicolor (Fig. 1) and highlight gene sets involved in rhizosphere colonization and symbiosis. This 65-megabase genome assembly contains 20,000 predicted protein-encoding genes and a very large number of transposons and repeated sequences. We detected unexpected genomic features, most notably a battery of effector-type small secreted proteins (SSPs) with unknown function, several of which are only expressed in symbiotic tissues. The most highly expressed SSP accumulates in the proliferating hyphae colonizing the host root. The ectomycorrhizae-specific SSPs probably have a decisive role in the establishment of the symbiosis. The unexpected observation that the genome of L. bicolor lacks carbohydrate-active enzymes involved in degradation of plant cell walls, but maintains the ability to degrade non-plant cell wall polysaccharides, reveals the dual saprotrophic and biotrophic lifestyle of the mycorrhizal fungus that enables it to grow within both soil and living plant roots. The predicted gene inventory of the L. bicolor genome, therefore, points to previously unknown mechanisms of symbiosis operating in biotrophic mycorrhizal fungi. The availability of this genome provides an unparalleled opportunity to develop a deeper understanding of the processes by which symbionts interact with plants within their ecosystem to perform vital functions in the carbon and nitrogen cycles that are fundamental to sustainable plant productivity.
  260. Saeys, Y., Abeel, T., & Van de Peer, Y. (2008). Robust feature selection using ensemble feature selection techniques. In W. Daelemans, B. Goethals, & K. Morik (Eds.), Lecture Notes in Artificial Intelligence (Vol. 5212, pp. 313–325). Presented at the European conference on Principles of Data Mining and Knowledge Discovery, Berlin, Germany: Springer.
    Robustness or stability of feature selection techniques is a topic of recent interest, and is an important issue when selected feature subsets are subsequently analysed by domain experts to gain more insight into the problem modelled. In this work, we investigate the use of ensemble feature selection techniques, where multiple feature selection methods are combined to yield more robust results. We show that these techniques show great promise for high-dimensional domains with small sample sizes, and provide more robust feature subsets than a single feature selection technique. In addition, we also investigate the effect of ensemble feature selection techniques on classification performance, giving rise to a new model selection strategy.
  261. Bowler, C., Allen, A. E., Badger, J. H., Grimwood, J., Jabbari, K., Kuo, A., Maheswari, U., et al. (2008). The Phaeodactylum genome reveals the evolutionary history of diatom genomes. NATURE, 456(7219), 239–244.
    Diatoms are photosynthetic secondary endosymbionts found throughout marine and freshwater environments, and are believed to be responsible for around one-fifth of the primary productivity on Earth(1,2). The genome sequence of the marine centric diatom Thalassiosira pseudonana was recently reported, revealing a wealth of information about diatom biology(3-5). Here we report the complete genome sequence of the pennate diatom Phaeodactylum tricornutum and compare it with that of T. pseudonana to clarify evolutionary origins, functional significance and ubiquity of these features throughout diatoms. In spite of the fact that the pennate and centric lineages have only been diverging for 90 million years, their genome structures are dramatically different and a substantial fraction of genes (~40%) are not shared by these representatives of the two lineages. Analysis of molecular divergence compared with yeasts and metazoans reveals rapid rates of gene diversification in diatoms. Contributing factors include selective gene family expansions, differential losses and gains of genes and introns, and differential mobilization of transposable elements. Most significantly, we document the presence of hundreds of genes from bacteria. More than 300 of these gene transfers are found in both diatoms, attesting to their ancient origins, and many are likely to provide novel possibilities for metabolite management and for perception of environmental signals. These findings go a long way towards explaining the incredible diversity and success of the diatoms in contemporary oceans.
  262. Armañanzas, R., Inza, I., Santana, R., Saeys, Y., Flores, J. L., Lozano, J. A., Van de Peer, Y., et al. (2008). A review of estimation of distribution algorithms in bioinformatics. BIODATA MINING, 1.
    Evolutionary search algorithms have become an essential asset in the algorithmic toolbox for solving high-dimensional optimization problems across a broad range of bioinformatics problems. Genetic algorithms, the most well-known and representative evolutionary search technique, have been the subject of the major part of such applications. Estimation of distribution algorithms (EDAs) offer a novel evolutionary paradigm that constitutes a natural and attractive alternative to genetic algorithms. They make use of a probabilistic model, learnt from the promising solutions, to guide the search process. In this paper, we set out a basic taxonomy of EDA techniques, underlining the nature and complexity of the probabilistic model of each EDA variant. We review a set of innovative works that make use of EDA techniques to solve challenging bioinformatics problems, emphasizing the EDA paradigm's potential for further research in this domain.
  263. Fierro, A. C., Vandenbussche, F., Engelen, K., Van de Peer, Y., & Marchal, K. (2008). Meta analysis of gene expression data within and across species. CURRENT GENOMICS, 9(8), 525–534.
    Since the second half of the 1990s, a large number of genome-wide analyses have been described that study gene expression at the transcript level. To this end, two major strategies have been adopted, a first one relying on hybridization techniques such as microarrays, and a second one based on sequencing techniques such as serial analysis of gene expression (SAGE), cDNA-AFLP, and analysis based on expressed sequence tags (ESTs). Despite both types of profiling experiments becoming routine techniques in many research groups, their application remains costly and laborious. As a result, the number of conditions profiled in individual studies is still relatively small and usually varies from only two to a few hundred samples for the largest experiments. More and more, scientific journals require the deposit of these high throughput experiments in public databases upon publication. Mining the information present in these databases offers molecular biologists the possibility to view their own small-scale analysis in the light of what is already available. However, so far, the richness of the public information remains largely unexploited. Several obstacles such as the correct association between ESTs and microarray probes with the corresponding gene transcript, the incompleteness and inconsistency in the annotation of experimental conditions, and the lack of standardized experimental protocols to generate gene expression data, all impede the successful mining of these data. Here, we review the potential and difficulties of combining publicly available expression data from EST analyses and microarray experiments. With examples from literature, we show how meta-analysis of expression profiling experiments can be used to study expression behavior in a single organism or between organisms, across a wide range of experimental conditions. We also provide an overview of the methods and tools that can aid molecular biologists in exploiting these public data.
  264. Abeel, T., Saeys, Y., & Van de Peer, Y. (2008). ProSOM: core promoter identification in the human genome. In L. Wehenkel, P. Geurts, & R. Marée (Eds.), Benelearn 08: the annual Belgian-Dutch machine learning conference (pp. 77–78). Presented at the 18th Annual Belgian-Dutch Machine Learning Conference (Benelearn 2008), Liège, Belgium: Université de Liège.
    More and more genomes are being sequenced, and to keep up with the pace of sequencing projects, automated annotation techniques are required. One of the most challenging problems in genome annotation is the identification of the core promoter. Better core promoter prediction can improve genome annotation and can be used to guide experimental work. Comparing the average structural profile of transcribed, promoter and intergenic sequences demonstrates that the core promoter has unique features that cannot be found in other sequences. We show that unsupervised clustering by using self-organizing maps can clearly distinguish between the structural profiles of promoter sequences and other genomic sequences. An implementation of this promoter prediction program, called ProSOM, is available and has been compared with the state-of-the-art.
  265. Saeys, Y., Abeel, T., & Van de Peer, Y. (2008). Towards robust feature selection techniques. In L. Wehenkel, P. Geurts, & R. Marée (Eds.), Benelearn 08: the annual Belgian-Dutch machine learning conference (pp. 45–46). Presented at the 18th Annual Belgian-Dutch Machine Learning Conference (Benelearn 2008), Liège, Belgium: Université de Liège.
  266. Van Landeghem, S., Saeys, Y., Van de Peer, Y., & De Baets, B. (2008). Benchmarking machine learning techniques for the extraction of protein-protein interactions from text. In L. Wehenkel, P. Geurts, & R. Marée (Eds.), Benelearn 08: the annual Belgian-Dutch machine learning conference (pp. 79–80). Liège, Belgium: Université de Liège.
  267. Van Landeghem, S., Saeys, Y., Van de Peer, Y., & De Baets, B. (2008). Extracting protein-protein interactions from text using rich feature vectors and feature selection. In T. Salakoski, D. Rebholz-Schuhmann, & S. Pyysalo (Eds.), SMBM ’08 : proceedings of the third symposium on semantic mining in biomedicine (pp. 77–84). Turku, Finland: Turku Centre for Computer Sciences (TUCS).
    Because of the intrinsic complexity of natural language, automatically extracting accurate information from text remains a challenge. We have applied rich feature vectors derived from dependency graphs to predict protein-protein interactions using machine learning techniques. We present the first extensive analysis of applying feature selection in this domain, and show that it can produce more cost-effective models. For the first time, our technique was also evaluated on several large-scale cross-dataset experiments, which offers a more realistic view on model performance. During benchmarking, we encountered several fundamental problems hindering comparability with other methods. We present a set of practical guidelines to set up a meaningful evaluation. Finally, we have analysed the feature sets from our experiments before and after feature selection, and evaluated the contribution of both lexical and syntactic information to our method. The gained insight will be useful to develop better performing methods in this domain.
  268. Baele, G., Van de Peer, Y., & Vansteelandt, S. (2008). A model-based approach to study nearest-neighbor influences reveals complex substitution patterns in non-coding sequences. SYSTEMATIC BIOLOGY, 57(5), 675–692.
    In this article, we present a likelihood-based framework for modeling site dependencies. Our approach builds upon standard evolutionary models but incorporates site dependencies across the entire tree by letting the evolutionary parameters in these models depend upon the ancestral states at the neighboring sites. It thus avoids the need for introducing new and high-dimensional evolutionary models for site-dependent evolution. We propose a Markov chain Monte Carlo approach with data augmentation to infer the evolutionary parameters under our model. Although our approach allows for wide-ranging site dependencies, we illustrate its use, in two non-coding datasets, in the case of nearest-neighbor dependencies (i.e., evolution directly depending only upon the immediate flanking sites). The results reveal that the general time-reversible model with nearest-neighbor dependencies substantially improves the fit to the data as compared to the corresponding model with site independence. Using the parameter estimates from our model, we elaborate on the importance of the 5-methylcytosine deamination process (i.e., the CpG effect) and show that this process also depends upon the 5' neighboring base identity. We hint at the possibility of a so-called TpA effect and show that the observed substitution behavior is very complex in the light of dinucleotide estimates. We also discuss the presence of CpG effects in a nuclear small subunit dataset and find significant evidence that evolutionary models incorporating context-dependent effects perform substantially better than independent-site models and in some cases even outperform models that incorporate varying rates across sites.
  269. Rensing, S. A., Lang, D., Zimmer, A. D., Terry, A., Salamov, A., Shapiro, H., Nishiyama, T., et al. (2008). The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. SCIENCE, 319(5859), 64–69.
    We report the draft genome sequence of the model moss Physcomitrella patens and compare its features with those of flowering plants, from which it is separated by more than 400 million years, and unicellular aquatic algae. This comparison reveals genomic changes concomitant with the evolutionary movement to land, including a general increase in gene family complexity; loss of genes associated with aquatic environments (e.g., flagellar arms); acquisition of genes for tolerating terrestrial stresses (e.g., variation in temperature and water availability); and the development of the auxin and abscisic acid signaling pathways for coordinating multicellular growth and dehydration response. The Physcomitrella genome provides a resource for phylogenetic inferences about gene function and for experimental analysis of plant processes through this plant's unique facility for reverse genetics.
  270. Amoutzias, G., & Van de Peer, Y. (2008). Together we stand: genes cluster to coordinate regulation. DEVELOPMENTAL CELL.
    Although most eukaryotic genomes lack operons, occasionally clusters of genes are discovered that are related in function. Now, a metabolic operon-like gene cluster has been described in Arabidopsis thaliana that is needed for triterpene synthesis.
  271. Amoutzias, G., Robertson, D. L., Van de Peer, Y., & Oliver, S. G. (2008). Choose your partners: dimerization in eukaryotic transcription factors. TRENDS IN BIOCHEMICAL SCIENCES, 33(5), 220–229.
    In many eukaryotic transcription factor gene families, proteins require a physical interaction with an identical molecule or with another molecule within the same family to form a functional dimer and bind DNA. Depending on the choice of partner and the cellular context, each dimer triggers a sequence of regulatory events that lead to a particular cellular fate, for example, proliferation or differentiation. Recent syntheses of genomic and functional data reveal that partner choice is not random; instead, dimerization specificities, which are strongly linked to the evolution of the protein family, apply. Our focus is on understanding these interaction specificities, their functional consequences and how they evolved. This knowledge is essential for understanding gene regulation and designing a new generation of drugs.
  272. Amoutzias, G., Van de Peer, Y., & Mossialos, D. (2008). Evolution and taxonomic distribution of nonribosomal peptide and polyketide synthases. FUTURE MICROBIOLOGY, 3(3), 361–370.
    The majority of nonribosomal peptide synthases and type I polyketide synthases are multimodular megasynthases of oligopeptide and polyketide secondary metabolites, respectively. Owing to their multimodular architecture, they synthesize their metabolites in assembly line logic. The ongoing genomic revolution together with the application of computational tools has provided the opportunity to mine the various genomes for these enzymes and identify those organisms that produce many oligopeptide and polyketide metabolites. In addition, scientists have started to comprehend the molecular mechanisms of megasynthase evolution, by duplication, recombination, point mutation and module skipping. This knowledge and computational analyses have been implemented towards predicting the specificity of these megasynthases and the structure of their end products. It is an exciting field, both for gaining deeper insight into their basic molecular mechanisms and exploiting them biotechnologically.
  273. Foissac, S., Gouzy, J., Rombauts, S., Mathé, C., Amselem, J., Sterck, L., Van de Peer, Y., et al. (2008). Genome annotation in plants and fungi: EuGène as a model platform. CURRENT BIOINFORMATICS, 3(2), 87–97.
    In this era of whole genome sequencing, reliable genome annotations (identification of functional regions) are the cornerstones for many subsequent analyses. Not only is careful annotation important for studying the gene and gene family content of a genome and its host, but also for wide-scale transcriptome and proteome analyses attempting to describe a certain biological process or to get a global picture of a cell's behavior. Although the number of sequenced genomes is increasing thanks to the application of new technologies, genome-wide analyses will critically depend on the quality of the genome annotations. However, the annotation process is more complicated in the plant field than in the animal field because of the limited funding that leads to far fewer experimental data and less annotation expertise. This situation calls for highly automated annotation platforms that can make the best use of all available data, experimental or not. We discuss how the gene prediction (the process of predicting protein gene structures in genomic sequences) research field increasingly shifts from methods that typically exploited one or two types of data to more integrative approaches that simultaneously deal with various experimental, statistical, or other in silico evidence. We illustrate the importance of integrative approaches for producing high-quality automatic annotations of genomes of plants and algae as well as of fungi that live in close association with plants using the platform EuGène as an example.
  274. Joshi, A. M., Van de Peer, Y., & Michoel, T. (2008). Analysis of a Gibbs sampler method for model-based clustering of gene expression data. BIOINFORMATICS, 24(2), 176–183.
    Motivation: Over the last decade, a large variety of clustering algorithms have been developed to detect coregulatory relationships among genes from microarray gene expression data. Model-based clustering approaches have emerged as statistically well-grounded methods, but the properties of these algorithms when applied to large-scale data sets are not always well understood. An in-depth analysis can reveal important insights about the performance of the algorithm, the expected quality of the output clusters, and the possibilities for extracting more relevant information out of a particular data set. Results: We have extended an existing algorithm for model-based clustering of genes to simultaneously cluster genes and conditions, and used three large compendia of gene expression data for Saccharomyces cerevisiae to analyze its properties. The algorithm uses a Bayesian approach and a Gibbs sampling procedure to iteratively update the cluster assignment of each gene and condition. For large-scale data sets, the posterior distribution is strongly peaked on a limited number of equiprobable clusterings. A GO annotation analysis shows that these local maxima are all biologically equally significant, and that simultaneously clustering genes and conditions performs better than only clustering genes and assuming independent conditions. A collection of distinct equivalent clusterings can be summarized as a weighted graph on the set of genes, from which we extract fuzzy, overlapping clusters using a graph spectral method. The cores of these fuzzy clusters contain tight sets of strongly coexpressed genes, while the overlaps exhibit relations between genes showing only partial coexpression.
  275. Simillion, C., Janssens, K., Sterck, L., & Van de Peer, Y. (2008). i-ADHoRe 2.0: an improved tool to detect degenerated genomic homology using genomic profiles. BIOINFORMATICS, 24(1), 127–128.
    i-ADHoRe is a software tool that combines gene content and gene order information of homologous genomic segments into profiles to detect highly degenerated homology relations within and between genomes. The new version offers, besides a significant increase in performance, several optimizations to the algorithm, most importantly to the profile alignment routine. As a result, the annotations of multiple genomes, or parts thereof, can be fed simultaneously into the program, after which it will report all regions of homology, both within and between genomes.
  276. Abeel, T., Saeys, Y., Rouzé, P., & Van de Peer, Y. (2008). ProSOM: core promoter prediction based on unsupervised clustering of DNA physical profiles. BIOINFORMATICS, 24(13), I24–I31. Presented at the 16th ISMB Conference on Intelligent Systems for Molecular Biology.
    Motivation: More and more genomes are being sequenced, and to keep up with the pace of sequencing projects, automated annotation techniques are required. One of the most challenging problems in genome annotation is the identification of the core promoter. Because the identification of the transcription initiation region is such a challenging problem, it is not yet a common practice to integrate transcription start site prediction in genome annotation projects. Nevertheless, better core promoter prediction can improve genome annotation and can be used to guide experimental work. Results: Comparing the average structural profile based on base stacking energy of transcribed, promoter and intergenic sequences demonstrates that the core promoter has unique features that cannot be found in other sequences. We show that unsupervised clustering by using self-organizing maps can clearly distinguish between the structural profiles of promoter sequences and other genomic sequences. An implementation of this promoter prediction program, called ProSOM, is available and has been compared with the state-of-the-art. We propose an objective, accurate and biologically sound validation scheme for core promoter predictors. ProSOM performs at least as well as the software currently available, but our technique is more balanced in terms of the number of predicted sites and the number of false predictions, resulting in a better all-round performance. Additional tests on the ENCODE regions of the human genome show that 98% of all predictions made by ProSOM can be associated with transcriptionally active regions, which demonstrates the high precision.
  277. Martens, C., Vandepoele, K., & Van de Peer, Y. (2008). Whole-genome analysis reveals molecular innovations and evolutionary transitions in chromalveolate species. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 105(9), 3427–3432.
    The chromalveolates form a highly diverse and fascinating assemblage of organisms, ranging from obligatory parasites such as Plasmodium to free-living ciliates and algae such as kelps, diatoms, and dinoflagellates. Many of the species in this monophyletic grouping are of major medical, ecological, and economical importance. Nevertheless, their genome evolution is much less well studied than that of higher plants, animals, or fungi. In the current study, we have analyzed and compared 12 chromalveolate species for which whole-sequence information is available and provide a detailed picture on gene loss and gene gain in the different lineages. As expected, many gene loss and gain events can be directly correlated with the lifestyle and specific adaptations of the organisms studied. For instance, in the obligate intracellular Apicomplexa we observed massive loss of genes that play a role in general basic processes such as amino acid, carbohydrate, and lipid metabolism, reflecting the transition of a free-living to an obligate intracellular lifestyle. In contrast, many gene families show species-specific expansions, such as those in the plant pathogen oomycete Phytophthora that are involved in degrading the plant cell wall polysaccharides to facilitate the pathogen invasion process. In general, chromalveolates show a tremendous difference in genome structure and evolution and in the number of genes they have lost or gained either through duplication or horizontal gene transfer.
  278. Van Bel, M., Saeys, Y., & Van de Peer, Y. (2008). FunSiP: a modular and extensible classifier for the prediction of functional sites in DNA. BIOINFORMATICS, 24(13), 1532–1533.
    Motivation: Many problems in genome annotation are tackled by using a classification model to predict functional sites such as splice sites, translation start sites or stop codons. Locating the correct position of these sites remains one of the most important but also one of the most difficult issues in the structural annotation of genomes. Most of the software currently in use is written for a very specific problem, thereby limiting the possibilities for reuse. Summary: We developed a software platform that uses a very general approach towards the classification of functional sites in DNA sequences. The program uses an ab initio approach towards the identification of these sites, and extends SpliceMachine, a previously developed splice site predictor that shows state-of-the-art performance for both donor and acceptor splice site recognition in the human and Arabidopsis thaliana genome.
  279. Sterck, L., Rombauts, S., Vandepoele, K., Rouzé, P., & Van de Peer, Y. (2007). How many genes are there in plants (... and why are they there)? CURRENT OPINION IN PLANT BIOLOGY, 10(2), 199–203.
    Annotation of the first few complete plant genomes has revealed that plants have many genes. For Arabidopsis, over 26 500 gene loci have been predicted, whereas for rice, the number adds up to 41 000. Recent analysis of the poplar genome suggests more than 45 000 genes, and partial sequence data from Medicago and Lotus also suggest that these plants contain more than 40 000 genes. Nevertheless, estimations suggest that ancestral angiosperms had no more than 12 000-14 000 genes. One explanation for the large increase in gene number during angiosperm evolution is gene duplication. It has been shown previously that the retention of duplicates following small- and large-scale duplication events in plants is substantial. Taking into account the function of genes that have been duplicated, we are now beginning to understand why many plant genes might have been retained, and how their retention might be linked to the typical lifestyle of plants.
  280. Carlton, J. M., Hirt, R. P., Silva, J. C., Delcher, A. L., Schatz, M., Zhao, Q., Wortman, J. R., et al. (2007). Draft genome sequence of the sexually transmitted pathogen Trichomonas vaginalis. SCIENCE, 315(5809), 207–212.
    We describe the genome sequence of the protist Trichomonas vaginalis, a sexually transmitted human pathogen. Repeats and transposable elements comprise about two-thirds of the ~160-megabase genome, reflecting a recent massive expansion of genetic material. This expansion, in conjunction with the shaping of metabolic pathways that likely transpired through lateral gene transfer from bacteria, and amplification of specific gene families implicated in pathogenesis and phagocytosis of host proteins may exemplify adaptations of the parasite during its transition to a urogenital environment. The genome sequence predicts previously unknown functions for the hydrogenosome, which support a common evolutionary origin of this unusual organelle with mitochondria.
  281. Palenik, Brian, Grimwood, J., Aerts, A., Rouzé, P., Salamov, A., Putnam, N., Dupont, C., et al. (2007). The tiny eukaryote Ostreococcus provides genomic insights into the paradox of plankton speciation. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 104(18), 7705–7710.
    The smallest known eukaryotes, at ≈1-μm diameter, are Ostreococcus tauri and related species of marine phytoplankton. The genome of Ostreococcus lucimarinus has been completed and compared with that of O. tauri. This comparison reveals surprising differences across orthologous chromosomes in the two species from highly syntenic chromosomes in most cases to chromosomes with almost no similarity. Species divergence in these phytoplankton is occurring through multiple mechanisms acting differently on different chromosomes and likely including acquisition of new genes through horizontal gene transfer. We speculate that this latter process may be involved in altering the cell-surface characteristics of each species. In addition, the genome of O. lucimarinus provides insights into the unique metal metabolism of these organisms, which are predicted to have a large number of selenocysteine-containing proteins. Selenoenzymes are more catalytically active than similar enzymes lacking selenium, and thus the cell may require less of that protein. As reported here, selenoenzymes, novel fusion proteins, and loss of some major protein families including ones associated with chromatin are likely important adaptations for achieving a small cell size.
  282. Merks, R., Van de Peer, Y., Inzé, D., & Beemster, G. (2007). Canalization without flux sensors: a traveling-wave hypothesis. TRENDS IN PLANT SCIENCE, 12(9), 384–390.
    In 1969, Tsvi Sachs published his seminal hypothesis of vascular development in plants: the canalization hypothesis. A positive feedback loop between the flux of the phytohormone auxin and the cells' auxin transport capacity would canalize auxin progressively into discrete channels, which would then differentiate into vascular tissues. Recent experimental studies confirm the central role of polar auxin flux in plant vasculogenesis, but it is unclear if and by which mechanism plant cells could respond to auxin flux. In this Opinion article, we review auxin perception mechanisms and argue that these respond more likely to auxin concentrations than to auxin flux. We propose an alternative mechanism for polar auxin channeling, which is more consistent with recent molecular observations.
  283. Michoel, T., Maere, S., Bonnet, E., Joshi, A. M., Saeys, Y., Van den Bulcke, T., Van Leemput, K., et al. (2007). Validating module network learning algorithms using simulated data. BMC BIOINFORMATICS, 8(suppl. 2).
    Background: In recent years, several authors have used probabilistic graphical models to learn expression modules and their regulatory programs from gene expression data. Despite the demonstrated success of such algorithms in uncovering biologically relevant regulatory relations, further developments in the area are hampered by a lack of tools to compare the performance of alternative module network learning strategies. Here, we demonstrate the use of the synthetic data generator SynTReN for the purpose of testing and comparing module network learning algorithms. We introduce a software package for learning module networks, called LeMoNe, which incorporates a novel strategy for learning regulatory programs. Novelties include the use of a bottom-up Bayesian hierarchical clustering to construct the regulatory programs, and the use of a conditional entropy measure to assign regulators to the regulation program nodes. Using SynTReN data, we test the performance of LeMoNe in a completely controlled situation and assess the effect of the methodological changes we made with respect to an existing software package, namely Genomica. Additionally, we assess the effect of various parameters, such as the size of the data set and the amount of noise, on the inference performance. Results: Overall, application of Genomica and LeMoNe to simulated data sets gave comparable results. However, LeMoNe offers some advantages, one of them being that the learning process is considerably faster for larger data sets. Additionally, we show that the location of the regulators in the LeMoNe regulation programs and their conditional entropy may be used to prioritize regulators for functional validation, and that the combination of the bottom-up clustering strategy with the conditional entropy-based assignment of regulators improves the handling of missing or hidden regulators. 
    Conclusion: We show that data simulators such as SynTReN are very well suited for the purpose of developing, testing and improving module network algorithms. We used SynTReN data to develop and test an alternative module network learning strategy, which is incorporated in the software package LeMoNe, and we provide evidence that this alternative strategy has several advantages with respect to existing methods.
  284. Saeys, Y., Abeel, T., Degroeve, S., & Van de Peer, Y. (2007). Translation initiation site prediction on a genomic scale: beauty in simplicity. BIOINFORMATICS, 23(13), i418–i423.
    Motivation: The correct identification of translation initiation sites (TIS) remains a challenging problem for computational methods that automatically try to solve this problem. Furthermore, the lion's share of these computational techniques focuses on the identification of TIS in transcript data. However, in the gene prediction context the identification of TIS occurs on the genomic level, which makes things even harder because at the genome level many more pseudo-TIS occur, resulting in models that achieve a higher number of false positive predictions. Results: In this article, we evaluate the performance of several 'simple' TIS recognition methods at the genomic level, and compare them to state-of-the-art models for TIS prediction in transcript data. We conclude that the simple methods largely outperform the complex ones at the genomic scale, and we propose a new model for TIS recognition at the genome level that combines the strengths of these simple models. The new model obtains a false positive rate of 0.125 at a sensitivity of 0.80 on a well annotated human chromosome (chromosome 21). Detailed analyses show that the model is useful, both on its own and in a simple gene prediction setting.
  285. Robbens, S., Derelle, E., Ferraz, C., Wuyts, J., Moreau, H., & Van de Peer, Y. (2007). The complete chloroplast and mitochondrial DNA sequence of Ostreococcus tauri: organelle genomes of the smallest eukaryote are examples of compaction. MOLECULAR BIOLOGY AND EVOLUTION, 24(4), 956–968.
    The complete nucleotide sequence of the mt (mitochondrial) and cp (chloroplast) genomes of the unicellular green alga Ostreococcus tauri has been determined. The mt genome assembles as a circle of 44,237 bp and contains 65 genes. With an overall average length of only 42 bp for the intergenic regions, this is the most gene-dense mt genome of all Chlorophyta. Furthermore, it is characterized by a unique segmental duplication, encompassing 22 genes and covering 44% of the genome. Such a duplication has not been observed before in green algae, although it is also present in the mt genomes of higher plants. The quadripartite cp genome forms a circle of 71,666 bp, containing 86 genes divided over a larger and a smaller single-copy region, separated by 2 inverted repeat sequences. Based on genome size and number of genes, the Ostreococcus cp genome is the smallest known among the green algae. Phylogenetic analyses based on a concatenated alignment of cp, mt, and nuclear genes confirm the position of O. tauri within the Prasinophyceae, an early branch of the Chlorophyta.
  286. Casneuf, T., Van de Peer, Y., & Huber, W. (2007). In situ analysis of cross-hybridisation on microarrays and the inference of expression correlation. BMC BIOINFORMATICS, 8.
    Background: Microarray co-expression signatures are an important tool for studying gene function and relations between genes. In addition to genuine biological co-expression, correlated signals can result from technical deficiencies like hybridization of reporters with off-target transcripts. An approach that is able to distinguish these factors permits the detection of more biologically relevant co-expression signatures. Results: We demonstrate a positive relation between off-target reporter alignment strength and expression correlation in data from oligonucleotide genechips. Furthermore, we describe a method that allows the identification, from their expression data, of individual probe sets affected by off target hybridization. Conclusion: The effects of off-target hybridization on expression correlation coefficients can be substantial, and can be alleviated by more accurate mapping between microarray reporters and the target transcriptome. We recommend attention to the mapping for any microarray analysis of gene expression patterns.
  287. Saeys, Yvan, Rouzé, P., & Van de Peer, Y. (2007). In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists. BIOINFORMATICS, 23(4), 414–420.
    Motivation: Prediction of the coding potential for stretches of DNA is crucial in gene calling and genome annotation, where it is used to identify potential exons and to position their boundaries in conjunction with functional sites, such as splice sites and translation initiation sites. The ability to discriminate between coding and non-coding sequences relates to the structure of coding sequences, which are organized in codons, and by their biased usage. For statistical reasons, the longer the sequences, the easier it is to detect this codon bias. However, in many eukaryotic genomes, where genes harbour many introns, both introns and exons might be small and hard to distinguish based on coding potential. Results: Here, we present novel approaches that specifically aim at a better detection of coding potential in short sequences. The methods use complementary sequence features, combined with identification of which features are relevant in discriminating between coding and non-coding sequences. These newly developed methods are evaluated on different species, representative of four major eukaryotic kingdoms, and extensively compared to state-of-the-art Markov models, which are often used for predicting coding potential. The main conclusions drawn from our analyses are that (1) combining complementary sequence features clearly outperforms current Markov models for coding potential prediction in short sequence fragments, (2) coding potential prediction benefits from length-specific models, and these models are not necessarily the same for different sequence lengths and (3) comparing the results across several species indicates that, although our combined method consistently performs extremely well, there are important differences across genomes.
  288. Saeys, Y., & Van de Peer, Y. (2007). Enhancing coding potential prediction for short sequences using complementary sequence features and feature selection. In K. Tuyts, R. Westra, Y. Saeys, & A. Nowé (Eds.), Lecture Notes in Bioinformatics (Vol. 4366, pp. 107–118). Presented at the 1st International workshop on Knowledge Discovery and Emergent Complexity in Bioinformatics (KDECB 2006), Berlin, Germany: Springer.
    The identification of coding potential in DNA sequences is of major importance in bioinformatics, where it is often used to assist expert systems that automatically try to recognize genes in genomes. For longer sequences, the identification of coding potential tends to be easier due to a better signal-to-noise ratio, whereas for very short sequences the issue becomes more problematic. In this paper, we present new methods that specifically aim at a better prediction of coding potential in short sequences. To this end, we combine different, complementary sequence features together with a feature selection strategy. Results comparing the new classifiers to state-of-the-art models show that our new approach significantly outperforms the existing methods when applied to short sequences.
  289. Van Hellemont, R., Blomme, T., Van de Peer, Y., & Marchal, K. (2007). Divergence of regulatory sequences in duplicated fish genes. In J.-N. Volff (Ed.), Gene and protein evolution (Vol. 3, pp. 81–100). Basel, Switzerland: Karger.
  290. Velasco, R., Zharkikh, A., Troggio, M., Cartwright, D. A., Cestaro, A., Pruss, D., Pindo, M., et al. (2007). A high quality draft consensus sequence of the genome of a heterozygous grapevine variety. PLOS ONE, 2(12).
    Background. Worldwide, grapes and their derived products have a large market. The cultivated grape species Vitis vinifera has potential to become a model for fruit trees genetics. Like many plant species, it is highly heterozygous, which is an additional challenge to modern whole genome shotgun sequencing. In this paper a high quality draft genome sequence of a cultivated clone of V. vinifera Pinot Noir is presented. Principal Findings. We estimate the genome size of V. vinifera to be 504.6 Mb. Genomic sequences corresponding to 477.1 Mb were assembled in 2,093 metacontigs and 435.1 Mb were anchored to the 19 linkage groups (LGs). The number of predicted genes is 29,585, of which 96.1% were assigned to LGs. This assembly of the grape genome provides candidate genes implicated in traits relevant to grapevine cultivation, such as those influencing wine quality, via secondary metabolites, and those connected with the extreme susceptibility of grape to pathogens. Single nucleotide polymorphism (SNP) distribution was consistent with a diffuse haplotype structure across the genome. Of around 2,000,000 SNPs, 1,751,176 were mapped to chromosomes and one or more of them were identified in 86.7% of anchored genes. The relative age of grape duplicated genes was estimated and this made it possible to reveal a relatively recent Vitis-specific large scale duplication event concerning at least 10 chromosomes (duplication not reported before). Conclusions. Sanger shotgun sequencing and highly efficient sequencing by synthesis (SBS), together with dedicated assembly programs, resolved a complex heterozygous genome. A consensus sequence of the genome and a set of mapped marker loci were generated. Homologous chromosomes of Pinot Noir differ by 11.2% of their DNA (hemizygous DNA plus chromosomal gaps). SNP markers are offered as a tool with the potential of introducing a new era in the molecular breeding of grape.
  291. Bonet, Isis, García, M. M., Saeys, Y., Van de Peer, Y., & Grau, R. (2007). Predicting human immunodeficiency virus (HIV) drug resistance using recurrent neural networks. In J. Mira & J. Álvarez (Eds.), Lecture Notes in Computer Science (Vol. 4527, pp. 234–243). Presented at the 2nd International work-conference on the Interplay Between Natural and Artificial Computation (IWINAC 2007), Berlin, Germany: Springer.
    Predicting HIV resistance to drugs is one of many problems for which bioinformaticians have implemented and trained machine learning methods, such as neural networks. Predicting HIV resistance would be much easier if we could directly use the three-dimensional (3D) structure of the targeted protein sequences, but unfortunately we rarely have enough structural information available to train a neural network. Furthermore, prediction of the 3D structure of a protein is not straightforward. However, characteristics related to the 3D structure can be used to train a machine learning algorithm as an alternative to take into account the information of the protein folding in the 3D space. Here, starting from this philosophy, we select the amino acid energies as features to predict HIV drug resistance, using a specific topology of a neural network. In this paper, we demonstrate that the amino acid energies are good features to represent the HIV genotype. In addition, it was shown that Bidirectional Recurrent Neural Networks can be used as an efficient classification method for this problem. The prediction performance that was obtained was greater than or at least comparable to results obtained previously. The accuracies vary between 81.3% and 94.7%.
  292. Rensing, S. A., Ick, J., Fawcett, J., Lang, D., Zimmer, A., Van de Peer, Y., & Reski, R. (2007). An ancient genome duplication contributed to the abundance of metabolic genes in the moss Physcomitrella patens. BMC EVOLUTIONARY BIOLOGY, 7.
    Background: Analyses of complete genomes and large collections of gene transcripts have shown that most, if not all seed plants have undergone one or more genome duplications in their evolutionary past. Results: In this study, based on a large collection of EST sequences, we provide evidence that the haploid moss Physcomitrella patens is a paleopolyploid as well. Based on the construction of linearized phylogenetic trees we infer the genome duplication to have occurred between 30 and 60 million years ago. Gene Ontology and pathway association of the duplicated genes in P. patens reveal different biases of gene retention compared with seed plants. Conclusion: Metabolic genes seem to have been retained in excess following the genome duplication in P. patens. This might, at least partly, explain the versatility of metabolism, as described for P. patens and other mosses, in comparison to other land plants.
  293. Robbens, S., Petersen, J., Brinkmann, H., Rouzé, P., & Van de Peer, Y. (2007). Unique regulation of the Calvin cycle in the ultrasmall green alga Ostreococcus. JOURNAL OF MOLECULAR EVOLUTION, 64(5), 601–604.
    Glyceraldehyde-3-phosphate dehydrogenase (GapAB) and CP12 are two major players in controlling the inactivation of the Calvin cycle in land plants at night. GapB originated from a GapA gene duplication and differs from GapA by the presence of a specific C-terminal extension that was recruited from CP12. While GapA and CP12 are assumed to be generally present in the Plantae (glaucophytes, red and green algae, and plants), up to now GapB was exclusively found in Streptophyta, including the enigmatic green alga Mesostigma viride. However, here we show that two closely related prasinophycean green algae, Ostreococcus tauri and Ostreococcus lucimarinus, also possess a GapB gene, while CP12 is missing. This remarkable finding either antedates the GapA/B gene duplication or indicates a lateral recruitment. Moreover, Ostreococcus is the first case where the crucial CP12 function may be completely replaced by GapB-mediated GapA/B aggregation.
  294. Van de Peer, Y. (2007). The future for plants and plants for the future. GENOME BIOLOGY.
    A report of the 2007 EMBO Conference Series on Plant Molecular Biology ‘From basic genomics to systems biology’, Ghent, Belgium, 2-4 May 2007
  295. Fawcett, J., Rombauts, S., Pattyn, P., Sterck, L., & Van de Peer, Y. (2007). The annotation and analysis of the genome of Arabidopsis lyrata. GENES & GENETIC SYSTEMS (Vol. 82, p. 520). Presented at the 79th Annual meeting of the Genetics Society of Japan.
  296. Derelle, E., Ferraz, C., Rombauts, S., Rouzé, P., Worden, A. Z., Robbens, S., Partensky, F., et al. (2006). Genome analysis of the smallest free-living eukaryote Ostreococcus tauri unveils many unique features. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 103(31), 11647–11652.
    The green lineage is reportedly 1,500 million years old, evolving shortly after the endosymbiosis event that gave rise to early photosynthetic eukaryotes. In this study, we unveil the complete genome sequence of an ancient member of this lineage, the unicellular green alga Ostreococcus tauri (Prasinophyceae). This cosmopolitan marine primary producer is the world's smallest free-living eukaryote known to date. Features likely reflecting optimization of environmentally relevant pathways, including resource acquisition, unusual photosynthesis apparatus, and genes potentially involved in C4 photosynthesis, were observed, as was downsizing of many gene families. Overall, the 12.56-Mb nuclear genome has an extremely high gene density, in part because of extensive reduction of intergenic regions and other forms of compaction such as gene fusion. However, the genome is structurally complex. It exhibits previously unobserved levels of heterogeneity for a eukaryote. Two chromosomes differ structurally from the other eighteen. Both have a significantly biased G+C content, and, remarkably, they contain the majority of transposable elements. Many chromosome 2 genes also have unique codon usage and splicing, but phylogenetic analysis and composition do not support alien gene origin. In contrast, most chromosome 19 genes show no similarity to green lineage genes and a large number of them are specialized in cell surface processes. Taken together, the complete genome sequence, unusual features, and downsized gene families, make O. tauri an ideal model system for research on eukaryotic genome evolution, including chromosome specialization and green lineage ancestry.
  297. Bonet, Isis, Saeys, Y., Ábalo, R. G., García, M. M., Sanchez, R., & Van de Peer, Y. (2006). Feature extraction using clustering of protein. (J. F. Martínez Trinidad, J. A. Carrasco Ochoa, & J. Kittler, Eds.) Lecture Notes in Computer Science, 4225, 614–623. Presented at the 11th Iberoamerican conference in Pattern Recognition (CIARP 2006).
    In this paper we investigate the usage of a clustering algorithm as a feature extraction technique to find new features to represent the protein sequence. In particular, our work focuses on the prediction of HIV protease resistance to drugs. We use a biologically motivated similarity function based on the contact energy of the amino acid and the position in the sequence. The performance measure was computed taking into account the clustering reliability and the classification validity. An SVM using 10-fold crossvalidation and the k-means algorithm were used for classification and clustering respectively. The best results were obtained by reducing an initial set of 99 features to a lower dimensional feature set of 36-66 features.
  298. Saeys, Yvan, & Van de Peer, Y. (2006). Combining signal processing and machine learning techniques for coding potential prediction. Proceedings of the First International Workshop on Bioinformatics Cuba-Flanders 2006.
  299. Saeys, Yvan, & Van de Peer, Y. (2006). Enhancing coding potential prediction for short sequences using complementary sequence features and feature selection. Proceedings of the 15th Dutch Belgian machine learning conference (Benelearn 2006) (pp. 105–112). Presented at the 15th Annual Machine Learning Conference of Belgium and the Netherlands (Benelearn 2006), Ghent, Belgium: Ghent University. Faculty of Sciences.
  300. Faes, P., Minnaert, B., Christiaens, M., Bonnet, E., Saeys, Y., Stroobandt, D., & Van de Peer, Y. (2006). A Scalable Hardware Accelerator for Comparing Protein Sequences. Proceedings of the First International Conference on Scalable Information Systems. Hong Kong.
  301. Bonet, Isis, García, M. M., Salazar, S., Sanchez, R., Saeys, Y., Van de Peer, Y., & Grau, R. (2006). Predicting human immunodeficiency virus (HIV) drug resistance using recurrent neural networks. In J. A. Seijas, S.-K. Lin, & M. P. Vázquez Tato (Eds.), Proceedings: November 1-30, 2006. Presented at the 10th International electronic conference on Synthetic Organic Chemistry (ECSOC-10), Basel, Switzerland: MDPI.
  302. Saeys, Yvan, Tsiporkova, E., De Baets, B., & Van de Peer, Y. (Eds.). (2006). Annual Machine Learning Conference of Belgium and The Netherlands.
  303. Abeel, T., Saeys, Y., & Van de Peer, Y. (2006). Improved core promoter prediction using ensembles of Support Vector Machines. Proceedings of the 15th Dutch Belgian Machine Learning Conference (Benelearn 2006) (pp. 180–181).
  304. Saeys, Yvan, Degroeve, S., & Van de Peer, Y. (2006). Feature ranking using an EDA-based wrapper approach. In J. A. Lozano, P. Larrañaga, I. Inza, & E. Bengoetxea (Eds.), Towards a new evolutionary computation: advances in estimation of distribution algorithms (Vol. 192, pp. 243–257). Berlin, Germany: Springer.
  305. Michoel, T., & Van de Peer, Y. (2006). Helicoidal transfer matrix model for inhomogeneous DNA melting. PHYSICAL REVIEW E, 73(1).
    An inhomogeneous helicoidal nearest-neighbor model with continuous degrees of freedom is shown to predict the same DNA melting properties as traditional long-range Ising models, for free DNA molecules in solution, as well as superhelically stressed DNA with a fixed linking number constraint. Without loss of accuracy, the continuous degrees of freedom can be discretized using a minimal number of discretization points, yielding an effective transfer matrix model of modest dimension (d=36). The resulting algorithms to compute DNA melting profiles are both simple and efficient.
  306. Van den Bulcke, T., Lemmens, K., Van de Peer, Y., & Marchal, K. (2006). Inferring transcriptional networks by mining “omics” data. CURRENT BIOINFORMATICS, 1(3), 301–313.
    Inferring comprehensive regulatory networks from high-throughput data is one of the foremost challenges of modern computational biology. As high-throughput expression profiling experiments have gained common ground in many laboratories, different techniques have been proposed to infer transcriptional regulatory networks from them. Furthermore, with the advent of diverse types of high-throughput data, the research in network inference has received a new impulse. The use of diverse types of data, together with the increasing tendency of building the inference on biologically plausible simplifications, allows a more reliable and more complete description of networks. Here, we discuss how the research focus in the field of network inference is increasingly shifting from methods trying to reconstruct networks from a single data type towards integrative approaches dealing with several data sources simultaneously to infer regulatory modules.
  307. Gevers, D., & Van de Peer, Y. (2006). Gene duplicates in vibrio genomes. In F. L. Thompson, B. Austin, & J. Swings (Eds.), The biology of vibrios (pp. 76–83). Washington, DC, USA: ASM Press.
  308. Blomme, T., Vandepoele, K., De Bodt, S., Simillion, C., Maere, S., & Van de Peer, Y. (2006). The gain and loss of genes during 600 million years of vertebrate evolution. GENOME BIOLOGY, 7(5).
    Background: Gene duplication is assumed to have played a crucial role in the evolution of vertebrate organisms. Apart from a continuous mode of duplication, two or three whole genome duplication events have been proposed during the evolution of vertebrates, one or two at the dawn of vertebrate evolution, and an additional one in the fish lineage, not shared with land vertebrates. Here, we have studied gene gain and loss in seven different vertebrate genomes, spanning an evolutionary period of about 600 million years. Results: We show that: first, the majority of duplicated genes in extant vertebrate genomes are ancient and were created at times that coincide with proposed whole genome duplication events; second, there exist significant differences in gene retention for different functional categories of genes between fishes and land vertebrates; third, there seems to be a considerable bias in gene retention of regulatory genes towards the mode of gene duplication (whole genome duplication events compared to smaller-scale events), which is in accordance with the so-called gene balance hypothesis; and fourth, that ancient duplicates that have survived for many hundreds of millions of years can still be lost. Conclusion: Based on phylogenetic analyses, we show that both the mode of duplication and the functional class the duplicated genes belong to have been of major importance for the evolution of the vertebrates. In particular, we provide evidence that massive gene duplication (probably as a consequence of entire genome duplications) at the dawn of vertebrate evolution might have been particularly important for the evolution of complex vertebrates.
  309. Baele, Guy, Raes, J., Van de Peer, Y., & Vansteelandt, S. (2006). An improved statistical method for detecting heterotachy in nucleotide sequences. MOLECULAR BIOLOGY AND EVOLUTION, 23(7), 1397–1405.
    The principle of heterotachy states that the substitution rate of sites in a gene can change through time. In this article, we propose a powerful statistical test to detect sites that evolve according to the process of heterotachy. We apply this test to an alignment of 1289 eukaryotic rRNA molecules to 1) determine how widespread the phenomenon of heterotachy is in ribosomal RNA, 2) test whether these heterotachous sites are nonrandomly distributed, that is, linked to secondary structure features of ribosomal RNA, and 3) determine the impact of heterotachous sites on the bootstrap support of monophyletic groupings. Our study revealed that with 21 monophyletic taxa, approximately two-thirds of the sites in the considered set of sequences are heterotachous. Although the detected heterotachous sites do not appear bound to specific structural features of the small subunit rRNA, their presence is shown to have a large beneficial influence on the bootstrap support of monophyletic groups. Using extensive testing, we show that this may not be due to heterotachy itself but merely due to the increased substitution rate at the detected heterotachous sites.
  310. De Bodt, Stefanie, Theissen, G., & Van de Peer, Y. (2006). Promoter analysis of MADS-box genes in eudicots through phylogenetic footprinting. MOLECULAR BIOLOGY AND EVOLUTION, 23(6), 1293–1303.
    The MIKC MADS-box gene family has been shaped by extensive gene duplications giving rise to subfamilies of genes with distinct functions and expression patterns. However, within these subfamilies the functional assignment is not that clear-cut, and considerable functional redundancy exists. One way to investigate the diversity in regulation present in these subfamilies is promoter sequence analysis. With the advent of genome sequencing projects, we are now able to exert a comparative analysis of Arabidopsis and poplar promoters of MADS-box genes belonging to the same subfamily. Based on the principle of phylogenetic footprinting, sequences conserved between the promoters of homologous genes are thought to be functional. Here, we have investigated the evolution of MADS-box genes at the promoter level and show that many genes have diverged in their regulatory sequences after duplication and/or speciation. Furthermore, using phylogenetic footprinting, a distinction can be made between redundancy, neo/nonfunctionalization, and subfunctionalization.
  311. Casneuf, T., De Bodt, S., Raes, J., Maere, S., & Van de Peer, Y. (2006). Nonrandom divergence of gene expression following gene and genome duplications in the flowering plant Arabidopsis thaliana. GENOME BIOLOGY, 7(2).
    Background: Genome analyses have revealed that gene duplication in plants is rampant. Furthermore, many of the duplicated genes seem to have been created through ancient genome-wide duplication events. Recently, we have shown that gene loss is strikingly different for large- and small-scale duplication events and highly biased towards the functional class to which a gene belongs. Here, we study the expression divergence of genes that were created during large- and small-scale gene duplication events by means of microarray data and investigate both the influence of the origin (mode of duplication) and the function of the duplicated genes on expression divergence. Results: Duplicates that have been created by large-scale duplication events and that can still be found in duplicated segments have expression patterns that are more correlated than those that were created by small-scale duplications or those that no longer lie in duplicated segments. Moreover, the former tend to have highly redundant or overlapping expression patterns and are mostly expressed in the same tissues, while the latter show asymmetric divergence. In addition, a strong bias in divergence of gene expression was observed towards gene function and the biological process genes are involved in. Conclusion: By using microarray expression data for Arabidopsis thaliana, we show that the mode of duplication, the function of the genes involved, and the time since duplication play important roles in the divergence of gene expression and, therefore, in the functional divergence of genes after duplication.
  312. Cannon, S. B., Sterck, L., Rombauts, S., Sato, S., Cheung, F., Gouzy, J., Wang, X., et al. (2006). Legume genome evolution viewed through the Medicago truncatula and Lotus japonicus genomes. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 103(40), 14959–14964.
    Genome sequencing of the model legumes, Medicago truncatula and Lotus japonicus, provides an opportunity for large-scale sequence-based comparison of two genomes in the same plant family. Here we report synteny comparisons between these species, including details about chromosome relationships, large-scale synteny blocks, microsynteny within blocks, and genome regions lacking clear correspondence. The Lotus and Medicago genomes share a minimum of 10 large-scale synteny blocks, each with substantial collinearity and frequently extending the length of whole chromosome arms. The proportion of genes syntenic and collinear within each synteny block is relatively homogeneous. Medicago-Lotus comparisons also indicate similar and largely homogeneous gene densities, although gene-containing regions in Mt occupy 20-30% more space than Lj counterparts, primarily because of larger numbers of Mt retrotransposons. Because the interpretation of genome comparisons is complicated by large-scale genome duplications, we describe synteny, synonymous substitutions and phylogenetic analyses to identify and date a probable whole-genome duplication event. There is no direct evidence for any recent large-scale genome duplication in either Medicago or Lotus but instead a duplication predating speciation. Phylogenetic comparisons place this duplication within the Rosid I clade, clearly after the split between legumes and Salicaceae (poplar).
  313. Van de Peer, Y. (2006). When duplicated genes don’t stick to the rules. HEREDITY.
  314. Vandepoele, Klaas, Casneuf, T., & Van de Peer, Y. (2006). Identification of novel regulatory modules in dicotyledonous plants using expression data and comparative genomics. GENOME BIOLOGY, 7(11).
    Background: Transcriptional regulation plays an important role in the control of many biological processes. Transcription factor binding sites (TFBSs) are the functional elements that determine transcriptional activity and are organized into separable cis-regulatory modules, each defining the cooperation of several transcription factors required for a specific spatio-temporal expression pattern. Consequently, the discovery of novel TFBSs in promoter sequences is an important step to improve our understanding of gene regulation. Results: Here, we applied a detection strategy that combines features of classic motif overrepresentation approaches in co-regulated genes with general comparative footprinting principles for the identification of biologically relevant regulatory elements and modules in Arabidopsis thaliana, a model system for plant biology. In total, we identified 80 TFBSs and 139 regulatory modules, most of which are novel, and primarily consist of two or three regulatory elements that could be linked to different important biological processes, such as protein biosynthesis, cell cycle control, photosynthesis and embryonic development. Moreover, studying the physical properties of some specific regulatory modules revealed that Arabidopsis promoters have a compact nature, with cooperative TFBSs located in close proximity of each other. Conclusion: These results create a starting point to unravel regulatory networks in plants and to study the regulation of biological processes from a systems biology point of view.
  315. Bonnet, E., Van de Peer, Y., & Rouzé, P. (2006). The small RNA world of plants. NEW PHYTOLOGIST, 171(3), 451–468.
    RNA has many functions in addition to being a simple messenger between the genome and the proteome. Over two decades, several classes of small noncoding RNAs c. 21 nucleotides (nt) long have been uncovered in eukaryotic genomes, which appear to play a central role in diverse and fundamental processes. In plants, small RNA-based mechanisms are involved in genome stability, gene expression and defense. Many of the discoveries in this new 'small RNA world' were made by plant biologists. Here, we discuss the three major classes of small RNAs that are found in the plant kingdom, namely small interfering RNAs, microRNAs, and the recently discovered trans-acting small interfering RNAs. Recent results shed light on the identification, integration and specialization of the different components (Dicer-like, Argonaute, and others) involved in the biogenesis of the different classes of small RNAs in plants. Owing to the development of better experimental and computational methods, an ever increasing number of small noncoding RNAs are uncovered in different plant genomes. In particular the well-studied microRNAs seem to act as key regulators in several different developmental pathways, with a marked preference for transcription factors as targets. In addition, an increasing amount of data suggest that they also play an important role in other mechanisms, such as response to stress or environmental changes.
  316. Tuskan, G., DiFazio, S., Jansson, S., Bohlmann, J., Grigoriev, I., Hellsten, U., Putnam, N., et al. (2006). The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). SCIENCE, 313(5793), 1596–1604.
    We report the draft genome of the black cottonwood tree, Populus trichocarpa. Integration of shotgun sequence assembly with genetic mapping enabled chromosome-scale reconstruction of the genome. More than 45,000 putative protein-coding genes were identified. Analysis of the assembled genome revealed a whole-genome duplication event; about 8000 pairs of duplicated genes from that event survived in the Populus genome. A second, older duplication event is indistinguishably coincident with the divergence of the Populus and Arabidopsis lineages. Nucleotide substitution, tandem gene duplication, and gross chromosomal rearrangement appear to proceed substantially more slowly in Populus than in Arabidopsis. Populus has more protein-coding genes than Arabidopsis, ranging on average from 1.4 to 1.6 putative Populus homologs for each Arabidopsis gene. However, the relative frequency of protein domains in the two genomes is similar. Overrepresented exceptions in Populus include genes associated with lignocellulosic wall biosynthesis, meristem development, disease resistance, and metabolite transport.
  317. Degroeve, S., Saeys, Y., De Baets, B., Rouzé, P., & Van de Peer, Y. (2005). SpliceMachine: predicting splice sites from high-dimensional local context representations. BIOINFORMATICS, 21(8), 1332–1338.
    Motivation: In this age of complete genome sequencing, finding the location and structure of genes is crucial for further molecular research. The accurate prediction of intron boundaries largely facilitates the correct prediction of gene structure in nuclear genomes. Many tools for localizing these boundaries on DNA sequences have been developed and are available to researchers through the internet. Nevertheless, these tools still make many false positive predictions. Results: This manuscript presents a novel publicly available splice site prediction tool named SpliceMachine that (i) shows state-of-the-art prediction performance on Arabidopsis thaliana and human sequences, (ii) performs a computationally fast annotation and (iii) can be trained by the user on its own data.
  318. Coenye, T., Gevers, D., Van de Peer, Y., Vandamme, P., & Swings, J. (2005). Towards a prokaryotic genomic taxonomy. FEMS MICROBIOLOGY REVIEWS, 29(2), 147–167.
  319. Maere, S., De Bodt, S., Raes, J., Casneuf, T., Van Montagu, M., Kuiper, M., & Van de Peer, Y. (2005). Modeling gene and genome duplications in eukaryotes. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 102(15), 5454–5459.
    Recent analysis of complete eukaryotic genome sequences has revealed that gene duplication has been rampant. Moreover, next to a continuous mode of gene duplication, in many eukaryotic organisms the complete genome has been duplicated in their evolutionary past. Such large-scale gene duplication events have been associated with important evolutionary transitions or major leaps in development and adaptive radiations of species. Here, we present an evolutionary model that simulates the duplication dynamics of genes, considering genome-wide duplication events and a continuous mode of gene duplication. Modeling the evolution of the different functional categories of genes assesses the importance of different duplication events for gene families involved in specific functions or processes. By applying our model to the Arabidopsis genome, for which there is compelling evidence for three whole-genome duplications, we show that gene loss is strikingly different for large-scale and small-scale duplication events and highly biased toward certain functional classes. We provide evidence that some categories of genes were almost exclusively expanded through large-scale gene duplication events. In particular, we show that the three whole-genome duplications in Arabidopsis have been directly responsible for >90% of the increase in transcription factors, signal transducers, and developmental genes in the last 350 million years. Our evolutionary model is widely applicable and can be used to evaluate different assumptions regarding small- or large-scale gene duplication events in eukaryotic genomes.
  320. Robbens, S., Khadaroo, B., Camasses, A., Derelle, E., Ferraz, C., Inzé, D., Van de Peer, Y., et al. (2005). Genome-wide analysis of core cell cycle genes in the unicellular green alga Ostreococcus tauri. MOLECULAR BIOLOGY AND EVOLUTION, 22(3), 589–597.
    The cell cycle has been extensively studied in various organisms, and the recent access to an overwhelming amount of genomic data has given birth to a new integrated approach called comparative genomics. Comparing the cell cycle across species shows that its regulation is evolutionarily conserved; the best-known example is the pivotal role of cyclin-dependent kinases in all the eukaryotic lineages hitherto investigated. Interestingly, the molecular network associated with the activity of the CDK-cyclin complexes is also evolutionarily conserved, thus defining a core cell cycle set of genes together with lineage-specific adaptations. In this paper, we describe the core cell cycle genes of Ostreococcus tauri, the smallest free-living eukaryotic cell having a minimal cellular organization with a nucleus, a single chloroplast, and only one mitochondrion. This unicellular marine green alga, which has diverged at the base of the green lineage, shows the minimal yet complete set of core cell cycle genes described to date. It has only one homolog of CDKA, CDKB, CDKD, cyclin A, cyclin B, cyclin D, cyclin H, Cks, Rb, E2F, DP, DEL, Cdc25, and Wee1. We have also added the APC and SCF E3 ligases to the core cell cycle gene set. We discuss the potential of genome-wide analysis in the identification of divergent orthologs of cell cycle genes in different lineages by mining the genomes of evolutionarily important and strategic organisms.
  321. Sterck, L., Rombauts, S., Jansson, S., Sterky, F., Rouzé, P., & Van de Peer, Y. (2005). EST data suggest that poplar is an ancient polyploid. NEW PHYTOLOGIST, 167(1), 165–170.
    We analysed the publicly available expressed sequence tag (EST) collections for the genus Populus to examine whether evidence can be found for large-scale gene-duplication events in the evolutionary past of this genus. The ESTs were clustered into unigenes for each poplar species examined. Gene families were constructed for all proteins deduced from these unigenes, and K-S dating was performed on all paralogs within a gene family. The fraction of paralogs was then plotted against the K-S values, which resulted in a distribution reflecting the age of duplicated genes in poplar. Sufficient EST data were available for seven different poplar species spanning four of the six sections of the genus Populus. For all these species, there was evidence that a large-scale gene-duplication event had occurred. From our analysis it is clear that all poplar species have shared the same large-scale gene-duplication event, suggesting that this event must have occurred in the ancestor of poplar, or at least very early in the evolution of the Populus genus.
  322. Vandepoele, Klaas, & Van de Peer, Y. (2005). Exploring the plant transcriptome through phylogenetic profiling. PLANT PHYSIOLOGY, 137(1), 31–42.
    Publicly available protein sequences represent only a small fraction of the full catalog of genes encoded by the genomes of different plants, such as green algae, mosses, gymnosperms, and angiosperms. By contrast, an enormous amount of expressed sequence tags (ESTs) exists for a wide variety of plant species, representing a substantial part of all transcribed plant genes. Integrating protein and EST sequences in comparative and evolutionary analyses is not straightforward because of the heterogeneous nature of both types of sequence data. By combining information from publicly available EST and protein sequences for 32 different plant species, we identified more than 250,000 plant proteins organized in more than 12,000 gene families. Approximately 60% of the proteins are absent from current sequence databases but provide important new information about plant gene families. Analysis of the distribution of gene families over different plant species through phylogenetic profiling reveals interesting insights into plant gene evolution, and identifies species- and lineage-specific gene families, orphan genes, and conserved core genes across the green plant lineage. We counted a similar number of approximately 9,500 gene families in monocotyledonous and eudicotyledonous plants and found strong evidence for the existence of at least 33,700 genes in rice (Oryza sativa). Interestingly, the larger number of genes in rice compared to Arabidopsis (Arabidopsis thaliana) can partially be explained by a larger amount of species-specific single-copy genes and species-specific gene families. In addition, a majority of large gene families, typically containing more than 50 genes, are bigger in rice than Arabidopsis, whereas the opposite seems true for small gene families.
  323. De Bodt, Stefanie, Maere, S., & Van de Peer, Y. (2005). Genome duplication and the origin of angiosperms. TRENDS IN ECOLOGY & EVOLUTION.
    Despite intensive research, little is known about the origin of the angiosperms and their rise to ecological dominance during the Early Cretaceous. Based on whole-genome analyses of Arabidopsis thaliana, there is compelling evidence that angiosperms underwent two whole-genome duplication events early during their evolutionary history. Recent studies have shown that these events were crucial for the creation of many important developmental and regulatory genes found in extant angiosperm genomes. Here, we argue that these ancient polyploidy events might have also had an important role in the origin and diversification of the angiosperms.
  324. Meyer, Axel, & Van de Peer, Y. (2005). From 2R to 3R: evidence for a fish-specific genome duplication (FSGD). BIOESSAYS, 27(9), 937–945.
  325. Raes, Jeroen, & Van de Peer, Y. (2005). Functional divergence of proteins through frameshift mutations. TRENDS IN GENETICS, 21(8), 428–431.
    Frameshift mutations are generally considered to be deleterious and of little importance for the evolution of novel gene functions. However, by screening an exhaustive set of vertebrate gene families, we found that, when a second transcript encoding the original gene product compensates for this mutation, frameshift mutations can be retained for millions of years and enable new gene functions to be acquired.
  326. Paterson, A. H., Bowers, J. E., Van de Peer, Y., & Vandepoele, K. (2005). Ancient duplication of cereal genomes. NEW PHYTOLOGIST.
  327. Vandepoele, Klaas, Vlieghe, K., Florquin, K., Hennig, L., Beemster, G., Gruissem, W., Van de Peer, Y., et al. (2005). Genome-wide identification of potential plant E2F target genes. PLANT PHYSIOLOGY, 139(1), 316–328.
    Entry into the S phase of the cell cycle is controlled by E2F transcription factors that induce the transcription of genes required for cell cycle progression and DNA replication. Although the E2F pathway is highly conserved in higher eukaryotes, only a few E2F target genes have been experimentally validated in plants. We have combined microarray analysis and bioinformatics tools to identify plant E2F-responsive genes. Promoter regions of genes that were induced at the transcriptional level in Arabidopsis (Arabidopsis thaliana) seedlings ectopically expressing genes for the E2Fa and DPa transcription factors were searched for the presence of E2F-binding sites, resulting in the identification of 181 putative E2F target genes. In most cases, the E2F-binding element was located close to the transcription start site, but occasionally could also be localized in the 5' untranslated region. Comparison of our results with available microarray data sets from synchronized cell suspensions revealed that the E2F target genes were expressed almost exclusively during G1 and S phases and activated upon reentry of quiescent cells into the cell cycle. To test the robustness of the data for the Arabidopsis E2F target genes, we also searched for the presence of E2F-cis-acting elements in the promoters of the putative orthologous rice (Oryza sativa) genes. Using this approach, we identified 70 potential conserved plant E2F target genes. These genes encode proteins involved in cell cycle regulation, DNA replication, and chromatin dynamics. In addition, we identified several genes for potentially novel S phase regulatory proteins.
  328. Van Hellemont, R., Monsieurs, P., Thijs, G., De Moor, B., Van de Peer, Y., & Marchal, K. (2005). A novel approach to identifying regulatory motifs in distantly related genomes. GENOME BIOLOGY, 6(13).
    Although proven successful in the identification of regulatory motifs, phylogenetic footprinting methods still show some shortcomings. To assess these difficulties, most apparent when applying phylogenetic footprinting to distantly related organisms, we developed a two-step procedure that combines the advantages of sequence alignment and motif detection approaches. The results on well-studied benchmark datasets indicate that the presented method outperforms other methods when the sequences become either too long or too heterogeneous in size.
  329. Van de Peer, Y., & Meyer, A. (2005). Large-scale gene and ancient genome duplications. In T. R. Gregory (Ed.), The evolution of the genome (pp. 329–368). Amsterdam, The Netherlands: Elsevier Academic Press.
  330. Florquin, K., Saeys, Y., Degroeve, S., Rouzé, P., & Van de Peer, Y. (2005). Large-scale structural analysis of the core promoter in mammalian and plant genomes. NUCLEIC ACIDS RESEARCH, 33(13), 4255–4264.
    DNA encodes at least two independent levels of functional information. The first level is for encoding proteins and sequence targets for DNA-binding factors, while the second one is contained in the physical and structural properties of the DNA molecule itself. Although the physical and structural properties are ultimately determined by the nucleotide sequence itself, the cell exploits these properties in a way in which the sequence itself plays no role other than to support or facilitate certain spatial structures. In this work, we focus on these structural properties, comparing them between different organisms and assessing their ability to describe the core promoter. We prove the existence of distinct types of core promoters, based on a clustering of their structural profiles. These results indicate that the structural profiles are much conserved within plants (Arabidopsis and rice) and animals (human and mouse), but differ considerably between plants and animals. Furthermore, we demonstrate that these structural profiles can be an alternative way of describing the core promoter, in addition to more classical motif or IUPAC-based approaches. Using the structural profiles as discriminatory elements to separate promoter regions from non-promoter regions, reliable models can be built to identify core-promoter regions using a strictly computational approach.
  331. Florquin, K., Saeys, Y., Degroeve, S., & Van de Peer, Y. (2005). Large-scale structural analysis of the core promoter in mammalian and plant genomes. Proceedings of the 7th International EMBL PhD Symposium, Heidelberg, Germany. Presented at the 7th International EMBL PhD Symposium.
  332. Robbens, S., Rombauts, S., Rouzé, P., Wuyts, J., Saeys, Y., Moreau, H., & Van de Peer, Y. (2005). Genome analysis of the world’s smallest free-living eukaryote Ostreococcus tauri unveils unique genome heterogeneity. Proceedings of the Molecular Biology and Evolution Conference (MBE) 2005.
  333. Gevers, D., Cohan, F. M., Lawrence, J. G., Spratt, B. G., Coenye, T., Feil, E. J., Stackebrandt, E., et al. (2005). Re-evaluating prokaryotic species. NATURE REVIEWS MICROBIOLOGY, 3(9), 733–739.
    There is no widely accepted concept of species for prokaryotes, and assignment of isolates to species is based on measures of phenotypic or genome similarity. The current methods for defining prokaryotic species are inadequate and incapable of keeping pace with the levels of diversity that are being uncovered in nature. Prokaryotic taxonomy is being influenced by advances in microbial population genetics, ecology and genomics, and by the ease with which sequence data can be obtained. Here, we review the classical approaches to prokaryotic species definition and discuss the current and future impact of multilocus nucleotide-sequence-based approaches to prokaryotic systematics. We also consider the potential, and difficulties, of assigning species status to biologically or ecologically meaningful sequence clusters.
  334. Beysen, D., Raes, J., Leroy, B., Lucassen, A., Yates, J., Clayton-Smith, J., Ilyina, H., et al. (2005). Deletions involving long-range conserved nongenic sequences upstream and downstream of FOXL2 as a novel disease-causing mechanism in Blepharophimosis syndrome. AMERICAN JOURNAL OF HUMAN GENETICS, 77(2), 205–218.
    The expression of a gene requires not only a normal coding sequence but also intact regulatory regions, which can be located at large distances from the target genes, as demonstrated for an increasing number of developmental genes. In previous mutation studies of the role of FOXL2 in blepharophimosis syndrome (BPES), we identified intragenic mutations in 70% of our patients. Three translocation breakpoints upstream of FOXL2 in patients with BPES suggested a position effect. Here, we identified novel microdeletions outside of FOXL2 in cases of sporadic and familial BPES. Specifically, four rearrangements, with an overlap of 126 kb, are located 230 kb upstream of FOXL2, telomeric to the reported translocation breakpoints. Moreover, the shortest region of deletion overlap (SRO) contains several conserved nongenic sequences (CNGs) harboring putative transcription-factor binding sites and representing potential long-range cis-regulatory elements. Interestingly, the human region orthologous to the 12-kb sequence deleted in the polled intersex syndrome in goat, which is an animal model for BPES, is contained in this SRO, providing evidence of human-goat conservation of FOXL2 expression and of the mutational mechanism. Surprisingly, in a fifth family with BPES, one rearrangement was found downstream of FOXL2. In addition, we report nine novel rearrangements encompassing FOXL2 that range from partial gene deletions to submicroscopic deletions. Overall, genomic rearrangements encompassing or outside of FOXL2 account for 16% of all molecular defects found in our families with BPES. In summary, this is the first report of extragenic deletions in BPES, providing further evidence of potential long-range cis-regulatory elements regulating FOXL2 expression. It contributes to the enlarging group of developmental diseases caused by defective distant regulation of gene expression. Finally, we demonstrate that CNGs are candidate regions for genomic rearrangements in developmental genes.
  335. Bonnet, E., Wuyts, J., Rouzé, P., & Van de Peer, Y. (2004). Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences. BIOINFORMATICS, 20(17), 2911–2917.
    Motivation: Most non-coding RNAs are characterized by a specific secondary and tertiary structure that determines their function. Here, we investigate the folding energy of the secondary structure of non-coding RNA sequences, such as microRNA precursors, transfer RNAs and ribosomal RNAs in several eukaryotic taxa. Statistical biases are assessed by a randomization test, in which the predicted minimum free energy of folding is compared with values obtained for structures inferred from randomly shuffling the original sequences. Results: In contrast with transfer RNAs and ribosomal RNAs, the majority of the microRNA sequences clearly exhibit a folding free energy that is considerably lower than that for shuffled sequences, indicating a high tendency in the sequence towards a stable secondary structure. A possible usage of this statistical test in the framework of the detection of genuine miRNA sequences is discussed.
  336. Simillion, C., Vandepoele, K., & Van de Peer, Y. (2004). Recent developments in computational approaches for uncovering genomic homology. BIOESSAYS, 26(11), 1225–1235.
  337. Van de Peer, Y. (2004). Computational approaches to unveiling ancient genome duplications. NATURE REVIEWS GENETICS, 5(10), 752–763.
    Recent analyses of complete genome sequences have revealed that many genomes have been duplicated in their evolutionary past. Such events have been associated with important biological transitions, major leaps in evolution and adaptive radiations of species. Here, we consider recently developed computational methods to detect such ancient large-scale gene duplication events. Several new approaches have been used to show that large-scale gene duplications are more common than previously thought.
  338. Bonnet, E., Wuyts, J., Rouzé, P., & Van de Peer, Y. (2004). Detection of 91 potential conserved plant microRNAs in Arabidopsis thaliana and Oryza sativa identifies important target genes. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 101(31), 11511–11516.
    MicroRNAs (miRNAs) are an extensive class of tiny RNA molecules that regulate the expression of target genes by means of complementary base pair interactions. Although the first miRNAs were discovered in Caenorhabditis elegans, >300 miRNAs were recently documented in animals and plants, both by cloning methods and computational predictions. We present a genome-wide computational approach to detect miRNA genes in the Arabidopsis thaliana genome. Our method is based on the conservation of short sequences between the genomes of Arabidopsis and rice (Oryza sativa) and on properties of the secondary structure of the miRNA precursor. The method was fine-tuned to take into account plant-specific properties, such as the variable length of the miRNA precursor sequences. In total, 91 potential miRNA genes were identified, of which 58 had at least one nearly perfect match with an Arabidopsis mRNA, constituting the potential targets of those miRNAs. In addition to already known transcription factors involved in plant development, the targets also comprised genes involved in several other cellular processes, such as sulfur assimilation and ubiquitin-dependent protein degradation. These findings considerably broaden the scope of miRNA functions in plants.
  339. Alvares, L. E., Wuyts, J., Van de Peer, Y., Silva, E. P., Coutinho, L. L., Brison, O., & Ruiz, I. R. (2004). The 18S rRNA from Odontophrynus americanus 2n and 4n (Amphibia, Anura) reveals unusual extra sequences in the variable region V2. GENOME, 47(3), 421–428.
    The nucleotide sequence of the rDNA 18S region isolated from diploid and tetraploid species of the amphibian Odontophrynus americanus was determined and used to predict the secondary structure of the corresponding 18S rRNA molecules. Comparison of the primary and secondary structures for the 2n and 4n species confirmed that these species are very closely related. Only three nucleotide substitutions were observed, accounting for 99% identity between the 18S sequences, whereas several changes were detected by comparison with the Xenopus laevis 18S sequence (96% identity). Most changes were located in highly variable regions of the molecule. A noticeable feature of the Odontophrynus 18S rRNA was the presence of unusual extra sequences in the V2 region, between helices 9 and 11. These extra sequences do not fit the model for secondary structure predicted for vertebrate 18S rRNA.
  340. Vandepoele, K., De Vos, W., Taylor, J. S., Meyer, A., & Van de Peer, Y. (2004). Major events in the genome evolution of vertebrates: paranome age and size differ considerably between ray-finned fishes and land vertebrates. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 101(6), 1638–1643.
    It has been suggested that fish have more genes than humans. Whether most of these additional genes originated through a complete (fish-specific) genome duplication or through many lineage-specific tandem gene or smaller block duplications and family expansions continues to be debated. We analyzed the complete genome of the pufferfish Takifugu rubripes (Fugu) and compared it with the paranome of humans. We show that most paralogous genes of Fugu are the result of three complete genome duplications. Both relative and absolute dating of the complete predicted set of protein-coding genes suggest that initial genome duplications, estimated to have occurred at least 600 million years ago, shaped the genome of all vertebrates. In addition, analysis of >150 block duplications in the Fugu genome clearly supports a fish-specific genome duplication (~320 million years ago) that coincided with the vast radiation of most modern ray-finned fishes. Unlike the human genome, Fugu contains very few recently duplicated genes; hence, many human genes are much younger than fish genes. This lack of recent gene duplication, or, alternatively, the accelerated rate of gene loss, is possibly one reason for the drastic reduction of the genome size of Fugu observed during the past 100 million years or so, subsequent to the additional genome duplication that ray-finned fishes but not land vertebrates experienced.
  341. Khadaroo, B., Robbens, S., Ferraz, C., Derelle, E., Eychenié, S., Cooke, R., Peaucellier, G., et al. (2004). The first green lineage cdc25 dual-specificity phosphatase. CELL CYCLE, 3(4), 513–518.
    The Cdc25 protein phosphatase is a key enzyme involved in the regulation of the G2/M transition in metazoans and yeast. However, no Cdc25 ortholog has so far been identified in plants, although functional studies have shown that an activating dephosphorylation of the CDK-cyclin complex regulates the G2/M transition. In this paper, the first green lineage Cdc25 ortholog is described in the unicellular alga Ostreococcus tauri. It encodes a protein which is able to rescue the yeast S. pombe cdc25-22 conditional mutant. Furthermore, microinjection of GST-tagged O. tauri Cdc25 specifically activates prophase-arrested starfish oocytes. In vitro histone H1 kinase assays and anti-phosphotyrosine Western Blotting confirmed the in vivo activating dephosphorylation of starfish CDK1-cyclinB by recombinant O. tauri Cdc25. We propose that there has been coevolution of the regulatory proteins involved in the control of M-phase entry in the metazoan, yeast and green lineages.
  342. Gevers, D., Vandepoele, K., Simillion, C., & Van de Peer, Y. (2004). Gene duplication and biased functional retention of paralogs in bacterial genomes. TRENDS IN MICROBIOLOGY, 12(4), 148–154.
    Gene duplication is considered an important prerequisite for gene innovation that can facilitate adaptation to changing environments. The analysis of 106 bacterial genome sequences has revealed the existence of a significant number of paralogs. Analysis of the functional classification of these paralogs reveals a preferential enrichment in functional classes that are involved in transcription, metabolism and defense mechanisms. From the organization of paralogs in the genome we can conclude that duplicated genes in bacteria appear to have been mainly created by small-scale duplication events, such as tandem and operon duplications.
  343. Wuyts, Jan, Perrière, G., & Van de Peer, Y. (2004). The European ribosomal RNA database. NUCLEIC ACIDS RESEARCH, 32, D101–D103.
    The European ribosomal RNA database aims to compile all complete or nearly complete ribosomal RNA sequences from both the small (SSU) and large (LSU) ribosomal subunits. All sequences are available in aligned format. Sequence alignment is based on the secondary structure of the molecules, as determined by comparative sequence analysis. Additional information about the sequences, such as taxonomic classification of the organism from which they have been obtained, and literature references are also provided. In order to identify the closest relatives to newly determined sequences, BLAST searches can be performed, after which the best matching sequences are aligned and a phylogenetic tree is inferred. As of 2003, the European ribosomal RNA database is maintained at Ghent University (Belgium). The database can be consulted at http://www.psb.ugent.be/rRNA/.
  344. Saeys, Yvan, Degroeve, S., & Van de Peer, Y. (2004). Digging into acceptor splice site prediction : an iterative feature selection approach. (J.-F. Boulicaut, F. Esposito, F. Giannotti, & D. Pedreschi, Eds.) LECTURE NOTES IN ARTIFICIAL INTELLIGENCE, 3202, 386–397. Presented at the 8th European conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2004).
    Feature selection techniques are often used to reduce data dimensionality, increase classification performance, and gain insight into the processes that generated the data. In this paper, we describe an iterative procedure of feature selection and feature construction steps, improving the classification of acceptor splice sites, an important subtask of gene prediction. We show that acceptor prediction can benefit from feature selection, and describe how feature selection techniques can be used to gain new insights in the classification of acceptor sites. This is illustrated by the identification of a new, biologically motivated feature: the AG-scanning feature. The results described in this paper contribute both to the domain of gene prediction, and to research in feature selection techniques, describing a new wrapper based feature weighting method that aids in knowledge discovery when dealing with complex datasets.
  345. Saeys, Y., Degroeve, S., Aeyels, D., Rouzé, P., & Van de Peer, Y. (2004). Feature selection for splice site prediction: A new method using EDA-based feature ranking. BMC BIOINFORMATICS, 5, 64.–64.11.
    Background: The identification of relevant biological features in large and complex datasets is an important step towards gaining insight in the processes underlying the data. Other advantages of feature selection include the ability of the classification system to attain good or even better solutions using a restricted subset of features, and a faster classification. Thus, robust methods for fast feature selection are of key importance in extracting knowledge from complex biological data. Results: In this paper we present a novel method for feature subset selection applied to splice site prediction, based on estimation of distribution algorithms, a more general framework of genetic algorithms. From the estimated distribution of the algorithm, a feature ranking is derived. Afterwards this ranking is used to iteratively discard features. We apply this technique to the problem of splice site prediction, and show how it can be used to gain insight into the underlying biological process of splicing. Conclusion: We show that this technique proves to be more robust than the traditional use of estimation of distribution algorithms for feature selection: instead of returning a single best subset of features (as they normally do) this method provides a dynamical view of the feature selection process, like the traditional sequential wrapper methods. However, the method is faster than the traditional techniques, and scales better to datasets described by a large number of features.
  346. Saeys, Y., Degroeve, S., Aeyels, D., Rouzé, P., & Van de Peer, Y. (2004). Selecting relevant features for gene structure prediction. In A. Nowé, T. Lenaerts, & K. Steenhaut (Eds.), Proceedings of Benelearn 2004 (pp. 103–109). VUB Press.
  347. Trindade, G. S., da Fonseca, F. G., Marques, J. T., Diniz, S., Leite, J. A., De Bodt, S., Van de Peer, Y., et al. (2004). Belo Horizonte virus: a vaccinia-like virus lacking the A-type inclusion body gene isolated from infected mice. JOURNAL OF GENERAL VIROLOGY, 85(7), 2015–2021.
    Here is described the isolation of a naturally occurring A-type inclusion body (ATI)-negative vaccinia-like virus, Belo Horizonte virus (VBH), obtained from a mousepox-like outbreak in Brazil. The isolated virus was identified and characterized as an orthopoxvirus by conventional methods. Molecular characterization of the virus was done by DNA cross-hybridization using Vaccinia virus (VACV) DNA. In addition, conserved orthopoxvirus genes such as vaccinia growth factor, thymidine kinase and haemagglutinin were amplified by PCR and sequenced. All sequences presented high similarity to VACV genes. Based on the sequences, phenograms were constructed for comparison with other poxviruses; VBH clustered consistently with VACV strains. Attempts to amplify the ATI gene (ati) by PCR, currently used to identify orthopoxviruses, were unsuccessful. Results presented here suggest that most of the ati gene is deleted in the VBH genome.
  348. Van de Peer, Y. (2004). Tetraodon genome confirms Takifugu findings : most fish are ancient polyploids. GENOME BIOLOGY, 5(12).
    An evolutionary hypothesis suggested by studies of the genome of the tiger pufferfish Takifugu rubripes has now been confirmed by comparison with the genome of a close relative, the spotted green pufferfish Tetraodon nigroviridis. Ray-finned fish underwent a whole-genome duplication some 350 million years ago that might explain their evolutionary success.
  349. Simillion, C., Vandepoele, K., Saeys, Y., & Van de Peer, Y. (2004). Building genomic profiles for uncovering segmental homology in the twilight zone. GENOME RESEARCH, 14(6), 1095–1106.
    The identification of homologous regions within and between genomes is an essential prerequisite for studying genome structure and evolution. Different methods already exist that allow detecting homologous regions in an automated manner. These methods are based either on finding sequence similarities at the DNA level or on identifying chromosomal regions showing conservation of gene order and content. Especially the latter approach has proven useful for detecting homology between highly divergent chromosomal regions. However, until now, such map-based approaches required that candidate homologous regions show significant collinearity with other segments to be considered as being homologous. Here, we present a novel method that creates profiles combining the gene order and content information of multiple mutually homologous genomic segments. These profiles can be used to scan one or more genomes to detect segments that show significant collinearity with the entire profile but not necessarily with individual segments. When applying this new method to the combined genomes of Arabidopsis and rice, we find additional evidence for ancient duplication events in the rice genome.
  350. Vandepoele, Klaas, Simillion, C., & Van de Peer, Y. (2004). The quest for genomic homology. CURRENT GENOMICS, 5(4), 299–308.
  351. Van de Peer, Y. (2004). “Horizontal” plant biology on the rise. GENOME BIOLOGY.
    A report on the Plant Genomics European Meeting (Plant-GEMS2004), Lyon, France, 22-25 September 2004
  352. Degroeve, S., Saeys, Y., De Baets, B., Van de Peer, Y., & Rouzé, P. (2004). Splice site prediction in eukaryote genome sequences : the algorithmic issues. In J. Seckbach & E. Rubin (Eds.), The new avenues in bioinformatics (pp. 99–111). Dordrecht, The Netherlands: Kluwer Academic.
  353. Simillion, C., Vandepoele, K., Saeys, Y., & Van de Peer, Y. (2004). Building genomic profiles for uncovering segmental homology in the twilight zone. Belgian Bioinformatics Conference, 4th, Abstracts. Presented at the 4th Belgian Bioinformatics Conference (BBC 2004).
  354. Florquin, K., Degroeve, S., Saeys, Y., & Van de Peer, Y. (2004). The role of non-linear DNA structures in transcription. Belgian Bioinformatics Conference, 4th, Abstracts. Presented at the 4th Belgian Bioinformatics Conference (BBC 2004).
  355. Saeys, Yvan, Degroeve, S., Aeyels, D., Van de Peer, Y., & Rouzé, P. (2003). Fast feature selection using a simple estimation of distribution algorithm: a case study on splice site prediction. BIOINFORMATICS, 19(suppl. 2), ii179–ii188.
    Motivation: Feature subset selection is an important preprocessing step for classification. In biology, where structures or processes are described by a large number of features, the elimination of irrelevant and redundant information in a reasonable amount of time has a number of advantages. It enables the classification system to achieve good or even better solutions with a restricted subset of features, allows for a faster classification, and it helps the human expert focus on a relevant subset of features, hence providing useful biological knowledge. Results: We present a heuristic method based on Estimation of Distribution Algorithms to select relevant subsets of features for splice site prediction in Arabidopsis thaliana. We show that this method performs a fast detection of relevant feature subsets using the technique of constrained feature subsets. Compared to the traditional greedy methods the gain in speed can be up to one order of magnitude, with results being comparable or even better than the greedy methods. This makes it a very practical solution for classification tasks that can be solved using a relatively small amount of discriminative features (or feature dependencies), but where the initial set of potential discriminative features is rather large.
  356. Raes, Jeroen, & Van de Peer, Y. (2003). Gene duplication, the evolution of novel gene functions, and detecting functional divergence of duplicates in silico. APPLIED BIOINFORMATICS, 2(2), 91–101.
  357. Meyer, Axel, & Van de Peer, Y. (2003). “Natural selection merely modified while redundancy created”: Susumu Ohno’s idea of the evolutionary importance of gene and genome duplications. JOURNAL OF STRUCTURAL AND FUNCTIONAL GENOMICS, 3(1-4), VII–IX.
  358. Raes, Jeroen, Vandepoele, K., Simillion, C., Saeys, Y., & Van de Peer, Y. (2003). Investigating ancient duplication events in the Arabidopsis genome. JOURNAL OF STRUCTURAL AND FUNCTIONAL GENOMICS, 3(1-4), 117–129.
  359. Van de Peer, Y., Taylor, J. S., & Meyer, A. (2003). Are all fishes ancient polyploids? JOURNAL OF STRUCTURAL AND FUNCTIONAL GENOMICS, 3(1-4), 65–73.
  360. Van de Peer, Y. (2003). Phylogeny inference based on distance methods : theory. In M. Salemi & A.-M. Vandamme (Eds.), The phylogenetic handbook : a practical approach to DNA and protein phylogeny (pp. 101–119). Cambridge, UK: Cambridge University Press.
  361. Van de Peer, Y. (2003). Analysis of nucleotide sequences using TREECON. In M. Salemi & A.-M. Vandamme (Eds.), The phylogenetic handbook : a practical approach to DNA and protein phylogeny (pp. 236–255). Cambridge, UK: Cambridge University Press.
  362. Paraskevis, D., Lemey, P., Salemi, M., Suchard, M., Van de Peer, Y., & Vandamme, A.-M. (2003). Analysis of the evolutionary relationships of HIV-1 and SIVcpz sequences using Bayesian inference: implications for the origin of HIV-1. MOLECULAR BIOLOGY AND EVOLUTION, 20(12), 1986–1996.
    The most plausible origin of HIV-1 group M is an SIV lineage currently represented by SIVcpz isolated from the chimpanzee subspecies Pan troglodytes troglodytes. The origin of HIV-1 group O is less clear. Putative recombination between any of the HIV-1 and SIVcpz sequences was tested using bootscanning and Bayesian-scanning plots, as well as a new method using a Bayesian multiple change-point (BMCP) model to infer parental sequences and crossing-over points. We found that in the case of highly divergent sequences, such as HIV-1/SIVcpz, Bayesian scanning and BMCP methods are more appropriate than bootscanning analysis to investigate spatial phylogenetic variation, including estimating the boundaries of the regions with discordant evolutionary relationships and the levels of support of the phylogenetic clusters under study. According to the Bayesian scanning plots and BMCP method, there was strong evidence for discordant phylogenetic clustering throughout the genome: (1) HIV-1 group O clustered with SIVcpzANT/TAN in middle pol, and partial vif/env; (2) SIVcpzGab1 clustered with SIVcpzANT/TAN in 3'pol/vif, and middle env; (3) HIV-1 group O grouped with SIVcpzCamUS and SIVcpzGab1 in p17/p24; (4) HIV-1 group M was more closely related to SIVcpzCamUS in 3'gag/pol and in middle pol, whereas in partial gp120 group M clustered with group O. Conditionally independent phylogenetic analysis inferred by maximum likelihood (ML) and Bayesian methods further confirmed these findings. The discordant phylogenetic relationships between the HIV-1/SIVcpz sequences may have been caused by ancient recombination events, but they are also due, at least in part, to altered rates of evolution between parental SIVcpz lineages.
  363. Vlieghe, Kobe, Vuylsteke, M., Florquin, K., Rombauts, S., Maes, S., Ormenese, S., Van Hummelen, P., et al. (2003). Microarray analysis of E2Fa-DPa-overexpressing plants uncovers a cross-talking genetic network between DNA replication and nitrogen assimilation. JOURNAL OF CELL SCIENCE, 116(20), 4249–4259.
    Previously we have shown that overexpression of the heterodimeric E2Fa-DPa transcription factor in Arabidopsis thaliana results in ectopic cell division, increased endoreduplication, and an early arrest in development. To gain a better insight into the phenotypic behavior of E2Fa-DPa transgenic plants and to identify E2Fa-DPa target genes, a transcriptomic microarray analysis was performed. Out of 4,390 unique genes, a total of 188 had a twofold or more up- (84) or down-regulated (104) expression level in E2Fa-DPa transgenic plants compared to wild-type lines. Detailed promoter analysis allowed the identification of novel E2Fa-DPa target genes, mainly involved in DNA replication. Secondarily induced genes encoded proteins involved in cell wall biosynthesis, transcription and signal transduction or had an unknown function. A large number of metabolic genes were modified as well, among which, surprisingly, many genes were involved in nitrate assimilation. Our data suggest that the growth arrest observed upon E2Fa-DPa overexpression results at least partly from a nitrogen drain to the nucleotide synthesis pathway, causing decreased synthesis of other nitrogen compounds, such as amino acids and storage proteins.
  364. Vandenbussche, Michiel, Theißen, G., Van de Peer, Y., & Gerats, T. (2003). Structural diversification and neo-functionalization during floral MADS-box gene evolution by C-terminal frameshift mutations. NUCLEIC ACIDS RESEARCH, 31(15), 4401–4409.
    Frameshift mutations generally result in loss-of-function changes since they drastically alter the protein sequence downstream of the frameshift site, besides creating premature stop codons. Here we present data suggesting that frameshift mutations in the C-terminal domain of specific ancestral MADS-box genes may have contributed to the structural and functional divergence of the MADS-box gene family. We have identified putative frameshift mutations in the conserved C-terminal motifs of the B-function DEF/AP3 subfamily, the A-function SQUA/AP1 subfamily and the E-function AGL2 subfamily, which are all involved in the specification of organ identity during flower development. The newly evolved C-terminal motifs are highly conserved, suggesting a de novo generation of functionality. Interestingly, since the new C-terminal motifs in the A- and B-function subfamilies are only found in higher eudicotyledonous flowering plants, the emergence of these two C-terminal changes coincides with the origin of a highly standardized floral structure. We speculate that the frameshift mutations described here are examples of co-evolution of the different components of a single transcription factor complex. 3' terminal frameshift mutations might provide an important but so far unrecognized mechanism to generate novel functional C-terminal motifs instrumental to the functional diversification of transcription factor families.
  365. Meyer, A., & Van de Peer, Y. (Eds.). (2003). Genome Evolution: Gene and Genome Duplications and the Origin of Novel Gene Functions. Kluwer Academic.
  366. Vandepoele, K., Simillion, C., & Van de Peer, Y. (2003). Evidence that rice and other cereals are ancient aneuploids. PLANT CELL, 15(9), 2192–2202.
    Detailed analyses of the genomes of several model organisms revealed that large-scale gene or even entire-genome duplications have played prominent roles in the evolutionary history of many eukaryotes. Recently, strong evidence has been presented that the genomic structure of the dicotyledonous model plant species Arabidopsis is the result of multiple rounds of entire-genome duplications. Here, we analyze the genome of the monocotyledonous model plant species rice, for which a draft of the genomic sequence was published recently. We show that a substantial fraction of all rice genes (~15%) are found in duplicated segments. Dating of these block duplications, their nonuniform distribution over the different rice chromosomes, and comparison with the duplication history of Arabidopsis suggest that rice is not an ancient polyploid, as suggested previously, but an ancient aneuploid that has experienced the duplication of one (or a large part of one) chromosome in its evolutionary past, ~70 million years ago. This date predates the divergence of most of the cereals, and relative dating by phylogenetic analysis shows that this duplication event is shared by most if not all of them.
  367. Taylor, J. S., Braasch, I., Frickey, T., Meyer, A., & Van de Peer, Y. (2003). Genome duplication, a trait shared by 22,000 species of ray-finned fish. GENOME RESEARCH, 13(3), 382–390.
    Through phylogeny reconstruction we identified 49 genes with a single copy in man, mouse, and chicken, one or two copies in the tetraploid frog Xenopus laevis, and two copies in zebrafish (Danio rerio). For 22 of these genes, both zebrafish duplicates had orthologs in the pufferfish (Takifugu rubripes). For another 20 of these genes, we found only one pufferfish ortholog but in each case it was more closely related to one of the zebrafish duplicates than to the other. Forty-three pairs of duplicated genes map to 24 of the 25 zebrafish linkage groups but they are not randomly distributed; we identified 10 duplicated regions of the zebrafish genome that each contain between two and five sets of paralogous genes. These phylogeny and synteny data suggest that the common ancestor of zebrafish and pufferfish, a fish that gave rise to ~22,000 species, experienced a large-scale gene or complete genome duplication event and that the pufferfish has lost many duplicates that the zebrafish has retained.
  368. Van de Peer, Y., Taylor, J. S., & Meyer, A. (2003). Are all fishes ancient polyploids? In Axel Meyer & Y. Van de Peer (Eds.), Genome evolution : gene and genome duplications and the origin of novel gene functions (pp. 65–73). Dordrecht, The Netherlands: Kluwer Academic.
  369. Raes, Jeroen, Vandepoele, K., Simillion, C., Saeys, Y., & Van de Peer, Y. (2003). Investigating ancient duplication events in the Arabidopsis genome. In Axel Meyer & Y. Van de Peer (Eds.), Genome evolution : gene and genome duplications and the origin of novel gene functions (pp. 117–129). Dordrecht, The Netherlands: Kluwer Academic.
  370. Rombauts, S., Florquin, K., Lescot, M., Marchal, K., Rouzé, P., & Van de Peer, Y. (2003). Computational approaches to identify promoters and cis-regulatory elements in plant genomes. PLANT PHYSIOLOGY, 132(3), 1162–1176.
    The identification of promoters and their regulatory elements is one of the major challenges in bioinformatics and integrates comparative, structural, and functional genomics. Many different approaches have been developed to detect conserved motifs in a set of genes that are either coregulated or orthologous. However, although recent approaches seem promising, in general, unambiguous identification of regulatory elements is not straightforward. The delineation of promoters is even harder, due to its complex nature, and in silico promoter prediction is still in its infancy. Here, we review the different approaches that have been developed for identifying promoters and their regulatory elements. We discuss the detection of cis-acting regulatory elements using word-counting or probabilistic methods (so-called "search by signal" methods) and the delineation of promoters by considering both sequence content and structural features ("search by content" methods). As an example of search by content, we explored in greater detail the association of promoters with CpG islands. However, due to differences in sequence content, the parameters used to detect CpG islands in humans and other vertebrates cannot be used for plants. Therefore, a preliminary attempt was made to define parameters that could possibly define CpG and CpNpG islands in Arabidopsis, by exploring the compositional landscape around the transcriptional start site. To this end, a data set of more than 5,000 gene sequences was built, including the promoter region, the 5'-untranslated region, and the first introns and coding exons. Preliminary analysis shows that promoter location based on the detection of potential CpG/CpNpG islands in the Arabidopsis genome is not straightforward. Nevertheless, because the landscape of CpG/CpNpG islands differs considerably between promoters and introns on the one side and exons (whether coding or not) on the other, more sophisticated approaches can probably be developed for the successful detection of "putative" CpG and CpNpG islands in plants.
  371. Rombauts, S., Van de Peer, Y., & Rouzé, P. (2003). AFLPinSilico, simulating AFLP fingerprints. BIOINFORMATICS, 19(6), 776–777.
    A drawback of the Amplified Fragment Length Polymorphism (AFLP) fingerprinting method is the difficulty to correlate the different fragments with their DNA sequence. The AFLPinSilico application presented here simulates AFLP experiments run on either cDNA or genomic sequences, producing virtual fingerprints that allow high throughput identification of AFLP fragments. The program also enables biologists to manage experiments through simulations done beforehand, thereby reducing the number of experiments that have to be run. AFLPinSilico is available through the www or as a stand-alone version, through a command line executable (available upon request, for any platform running PERL).
  372. De Bodt, Stefanie, Raes, J., Florquin, K., Rombauts, S., Rouzé, P., Theißen, G., & Van de Peer, Y. (2003). Genomewide structural annotation and evolutionary analysis of the type I MADS-box genes in plants. JOURNAL OF MOLECULAR EVOLUTION, 56(5), 573–586.
    The type I MADS-box genes constitute a largely unexplored subfamily of the extensively studied MADS-box gene family, well known for its role in flower development. Genes of the type I MADS-box subfamily possess the characteristic MADS box but are distinguished from type II MADS-box genes by the absence of the keratin-like box. In this in silico study, we have structurally annotated all 47 members of the type I MADS-box gene family in Arabidopsis thaliana and performed a thorough analysis of the C-terminal regions of the translated proteins. On the basis of conserved motifs in the C-terminal region, we could classify the gene family into three main groups, two of which could be further subdivided. Phylogenetic trees were inferred to study the evolutionary relationships within this large MADS-box gene subfamily. These suggest for plant type I genes a dynamic of evolution that is significantly different from the mode of both animal type I (SRF) and plant type II (MIKC-type) gene phylogeny. The presence of conserved motifs in the majority of these genes, the identification of Oryza sativa MADS-box type I homologues, and the detection of expressed sequence tags for Arabidopsis thaliana and other plant type I genes suggest that these genes are indeed of functional importance to plants. It is therefore even more intriguing that, from an experimental point of view, almost nothing is known about the function of these MADS-box type I genes.
  373. De Bodt, Stefanie, Raes, J., Van de Peer, Y., & Theißen, G. (2003). And then there were many: MADS goes genomic. TRENDS IN PLANT SCIENCE, 8(10), 475–483.
    During the past decade, MADS-box genes have become known as key regulators in both reproductive and vegetative plant development. Traditional genetics and functional genomics tools are now available to elucidate the expression and function of this complex gene family on a much larger scale. Moreover, comparative analysis of the MADS-box genes in diverse flowering and non-flowering plants, boosted by bioinformatics, contributes to our understanding of how this important gene family has expanded during the evolution of land plants. Therefore, the recent advances in comparative and functional genomics should enable researchers to identify the full range of MADS-box gene functions, which should help us significantly in developing a better understanding of plant development and evolution.
  374. Raes, Jeroen, Rohde, A., Christensen, J. H., Van de Peer, Y., & Boerjan, W. (2003). Genome-wide characterization of the lignification toolbox in Arabidopsis. PLANT PHYSIOLOGY, 133(3), 1051–1071.
    Lignin, one of the most abundant terrestrial biopolymers, is indispensable for plant structure and defense. With the availability of the full genome sequence, large collections of insertion mutants, and functional genomics tools, Arabidopsis constitutes an excellent model system to profoundly unravel the monolignol biosynthetic pathway. In a genome-wide bioinformatics survey of the Arabidopsis genome, 34 candidate genes were annotated that encode genes homologous to the 10 presently known enzymes of the monolignol biosynthesis pathway, nine of which have not been described before. By combining evolutionary analysis of these 10 gene families with in silico promoter analysis and expression data (from a reverse transcription-polymerase chain reaction analysis on an extensive tissue panel, mining of expressed sequence tags from publicly available resources, and assembling expression data from literature), 12 genes could be pinpointed as the most likely candidates for a role in vascular lignification. Furthermore, a possible novel link was detected between the presence of the AC regulatory promoter element and the biosynthesis of G lignin during vascular development. Together, these data describe the full complement of monolignol biosynthesis genes in Arabidopsis, provide a unified nomenclature, and serve as a basis for further functional studies.
  375. Saeys, Yvan, Degroeve, S., Aeyels, D., Van de Peer, Y., & Rouzé, P. (2003). Feature selection using a simple estimation of distribution algorithm: a case study on splice site prediction. Belgian Bioinformatics Conference, 3rd, Abstracts. Presented at the 3rd Belgian Bioinformatics Conference (BBC 2003).
  376. Degroeve, S., De Baets, B., Van de Peer, Y., & Rouzé, P. (2002). Feature subset selection for splice site prediction. BIOINFORMATICS, 18(suppl. 2), S75–S83. Presented at the European Conference on Computational Biology 2002 (ECCB 2002).
    Motivation: The large amount of available annotated Arabidopsis thaliana sequences allows the induction of splice site prediction models with supervised learning algorithms (see Haussler (1998) for a review and references). These algorithms need information sources or features from which the models can be computed. For splice site prediction, the features we consider in this study are the presence or absence of certain nucleotides in close proximity to the splice site. Since it is not known how many and which nucleotides are relevant for splice site prediction, the set of features is chosen large enough such that the probability that all relevant information sources are in the set is very high. Using only those features that are relevant for constructing a splice site prediction system might improve the system and might also provide us with useful biological knowledge. Using fewer features will of course also improve the prediction speed of the system. Results: A wrapper-based feature subset selection algorithm using a support vector machine or a naive Bayes prediction method was evaluated against the traditional method for selecting features relevant for splice site prediction. Our results show that this wrapper approach selects features that improve the performance against the use of all features and against the use of the features selected by the traditional method.
  377. Lescot, M., Déhais, P., Thijs, G., Marchal, K., Moreau, Y., Van de Peer, Y., Rouzé, P., et al. (2002). PlantCARE, a database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences. NUCLEIC ACIDS RESEARCH, 30(1), 325–327.
    PlantCARE is a database of plant cis-acting regulatory elements, enhancers and repressors. Regulatory elements are represented by positional matrices, consensus sequences and individual sites on particular promoter sequences. Links to the EMBL, TRANSFAC and MEDLINE databases are provided when available. Data about the transcription sites are extracted mainly from the literature, supplemented with an increasing number of in silico predicted data. Apart from a general description for specific transcription factor sites, levels of confidence for the experimental evidence, functional information and the position on the promoter are given as well. New features have been implemented to search for plant cis-acting regulatory elements in a query sequence. Furthermore, links are now provided to a new clustering and motif search method to investigate clusters of co-expressed genes. New regulatory elements can be sent automatically and will be added to the database after curation.
  378. Oborník, M., Van de Peer, Y., Hypša, V., Frickey, T., Šlapeta, J. R., Meyer, A., & Lukeš, J. (2002). Phylogenetic analyses suggest lateral gene transfer from the mitochondrion to the apicoplast. GENE, 285(1-2), 109–118.
  379. Rensing, S. A., Rombauts, S., Van de Peer, Y., & Reski, R. (2002). Moss transcriptome and beyond. TRENDS IN PLANT SCIENCE.
    The ancient land plant Physcomitrella patens is a model system that is becoming increasingly important for plant functional genomics because gene knockouts can be produced with relative ease. Recently, several EST-sequencing projects have been launched as a first step towards a thorough functional characterization of the moss. However, for careful comparison with other plant model systems, the complete genomic sequence is needed as well as the transcriptome.
  380. Simillion, C., Vandepoele, K., Van Montagu, M., Zabeau, M., & Van de Peer, Y. (2002). The hidden duplication past of Arabidopsis thaliana. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 99(21), 13627–13632.
    Analysis of the genome sequence of Arabidopsis thaliana shows that this genome, like that of many other eukaryotic organisms, has undergone large-scale gene duplications or even duplications of the entire genome. However, the high frequency of gene loss after duplication events reduces colinearity and therefore the chance of finding duplicated regions that, at the extreme, no longer share homologous genes. In this study we show that heavily degenerated block duplications that can no longer be recognized by directly comparing two segments because of differential gene loss, can still be detected through indirect comparison with other segments. When these so-called hidden duplications in Arabidopsis are taken into account, many homologous genomic regions can be found in five to eight copies. This finding strongly implies that Arabidopsis has undergone three, but probably no more, rounds of genome duplications. Therefore, adding such hidden blocks to the duplication landscape of Arabidopsis sheds light on the number of polyploidy events that this model plant genome has undergone in its evolutionary past.
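The idea of detecting a hidden duplication through indirect comparison can be illustrated with a toy sketch. This is not the method used in the paper: the segment names, the gene-family labels, and the `min_shared` threshold are hypothetical, and real analyses also take gene order and statistical significance into account. Two segments count as directly homologous if they share enough gene families; a pair is "hidden" if it fails that test but both segments are directly homologous to a common third segment.

```python
from itertools import combinations


def direct_pairs(segments, min_shared=2):
    """Pairs of segments that share at least `min_shared` gene families."""
    return {
        frozenset((x, y))
        for x, y in combinations(segments, 2)
        if len(segments[x] & segments[y]) >= min_shared
    }


def hidden_pairs(segments, min_shared=2):
    """Pairs with no direct similarity that are linked through a third segment."""
    direct = direct_pairs(segments, min_shared)
    hidden = set()
    for x, y in combinations(segments, 2):
        if frozenset((x, y)) in direct:
            continue  # already detectable by direct comparison
        for z in segments:
            if z in (x, y):
                continue
            if frozenset((x, z)) in direct and frozenset((y, z)) in direct:
                hidden.add(frozenset((x, y)))
                break
    return hidden


# Hypothetical example: A and C have lost complementary genes after duplication,
# so they share nothing with each other, but both still resemble B.
example = {
    "A": {"f1", "f2"},
    "B": {"f1", "f2", "f3", "f4"},
    "C": {"f3", "f4"},
}
```

In this example, `hidden_pairs(example)` links A and C through B, even though A and C share no gene families.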
  381. Van de Peer, Y., Frickey, T., Taylor, J. S., & Meyer, A. (2002). Dealing with saturation at the amino acid level: a case study based on anciently duplicated zebrafish genes. GENE, 295(2), 205–211. Presented at the 3rd Anton Dohrn Workshop.
  382. Van de Peer, Y., Taylor, J. S., Joseph, J., & Meyer, A. (2002). Wanda : a database of duplicated fish genes. NUCLEIC ACIDS RESEARCH, 30(1), 109–112.
    Comparative genomics has shown that ray-finned fish (Actinopterygii) contain more copies of many genes than other vertebrates. A large number of these additional genes appear to have been produced during a genome duplication event that occurred early during the evolution of Actinopterygii (i.e. before the teleost radiation). In addition to this ancient genome duplication event, many lineages within Actinopterygii have experienced more recent genome duplications. Here we introduce a curated database named Wanda that lists groups of orthologous genes with one copy from man, mouse and chicken, one or two from tetraploid Xenopus and two or more ancient copies (i.e. paralogs) from ray-finned fish. The database also contains the sequence alignments and phylogenetic trees that were necessary for determining the correct orthologous and paralogous relationships among genes. Where available, map positions and functional data are also reported. The Wanda database should be of particular use to evolutionary and developmental biologists who are interested in the evolutionary and functional divergence of genes after duplication. Wanda is available at http://www.evolutionsbiologie.uni-konstanz.de/Wanda/.
  383. Vandepoele, K., Saeys, Y., Simillion, C., Raes, J., & Van de Peer, Y. (2002). The automatic detection of homologous regions (ADHoRe) and its application to microcolinearity between Arabidopsis and rice. GENOME RESEARCH, 12(11), 1792–1801.
    It is expected that one of the merits of comparative genomics lies in the transfer of structural and functional information from one genome to another. This is based on the observation that, although the number of chromosomal rearrangements that occur in genomes is extensive, different species still exhibit a certain degree of conservation regarding gene content and gene order. It is in this respect that we have developed a new software tool for the Automatic Detection of Homologous Regions (ADHoRe). ADHoRe was primarily developed to find large regions of microcolinearity, taking into account different types of microrearrangements such as tandem duplications, gene loss and translocations, and inversions. Such rearrangements often complicate the detection of colinearity, in particular when comparing more anciently diverged species. Application of ADHoRe to the complete genome of Arabidopsis and a large collection of concatenated rice BACs yields more than 20 regions showing statistically significant microcolinearity between both plant species. These regions comprise from 4 up to 11 conserved homologous gene pairs. We predict the number of homologous regions and the extent of microcolinearity to increase significantly once better annotations of the rice genome become available.
  384. Vandepoele, Klaas, Simillion, C., & Van de Peer, Y. (2002). Detecting the undetectable : uncovering duplicated segments in Arabidopsis by comparison with rice. TRENDS IN GENETICS, 18(12), 606–608.
    Genome analysis shows that large-scale gene duplications have occurred in fungi, animals and plants, creating genomic regions that show similarity in gene content and order. However, the high frequency of gene loss reduces colinearity, resulting in duplicated regions that, in the extreme, no longer share homologous genes. Here, we show that by comparison with an appropriate second genome, such paralogous regions can still be identified.
  385. Wuyts, Jan, Van de Peer, Y., Winkelmans, T., & De Wachter, R. (2002). The European database on small subunit ribosomal RNA. NUCLEIC ACIDS RESEARCH, 30(1), 183–185.
    The European database on SSU rRNA can be consulted via the World Wide Web at http://rrna.uia.ac.be/ssu/ and compiles all complete or nearly complete small subunit ribosomal RNA sequences. Sequences are provided in aligned format. The alignment takes into account the secondary structure information derived by comparative sequence analysis of thousands of sequences. Additional information such as literature references, taxonomy, secondary structure models and nucleotide variability maps, is also available.
  386. Ben Ali, A., De Baere, R., De Wachter, R., & Van de Peer, Y. (2002). Evolutionary relationships among heterokont algae (the autotrophic stramenopiles) based on combined analyses of small and large subunit ribosomal RNA. PROTIST, 153(2), 123–132.
    In order to study the phylogenetic relationships within the stramenopiles, and particularly among the heterokont algae, we have determined complete or nearly complete large-subunit ribosomal RNA sequences for different species of raphidophytes, phaeophytes, xanthophytes, chrysophytes, synurophytes and pinguiophytes. With the small- and large-subunit ribosomal RNA sequences of representatives for nearly all known groups of heterokont algae, phylogenetic trees were constructed from a concatenated alignment of both ribosomal RNAs, including more than 5,000 positions. By using different tree construction methods, inferred phylogenies showed phaeophytes and xanthophytes as sister taxa, as well as the pelagophytes and dictyochophytes, and the chrysophytes/synurophytes and eustigmatophytes. All these relationships are highly supported by bootstrap analysis. However, apart from these sister group relationships, very few other internodes are well resolved and most groups of heterokont algae seem to have diverged within a relatively short time frame.
  387. Saeys, Yvan, Degroeve, S., Aeyels, D., Van de Peer, Y., & Rouzé, P. (2002). Selecting Relevant Features for Splice Site Prediction by Estimation of Distribution Algorithms. Proceedings of Benelearn 2002 (pp. 64–71).
  388. Saeys, Yvan, Aeyels, D., Stanssens, P., Van de Peer, Y., & Zabeau, M. (2002). Retrieving DNA sequence information from mass spectra of nucleic acids: application to the detection and identification of SNPs. Belgian Bioinformatics Conference, 2nd, Abstracts. Presented at the 2nd Belgian Bioinformatics Conference (BBC 2002).
  389. Vandepoele, Klaas, Saeys, Y., Simillion, C., Raes, J., & Van de Peer, Y. (2002). Detecting microcolinearity between Arabidopsis and Rice. Proceedings of the 6th Gatersleben Research Conference (2002), “Plant Genetic Resources in the Genomic Era: Genetic Diversity, Genome Evolution and New Applications”.
  390. Saeys, Yvan, Degroeve, S., Aeyels, D., Van de Peer, Y., & Rouzé, P. (2002). Feature subset selection for splice site prediction by estimation of distribution algorithms. Computational Biology, European conference, Abstracts. Presented at the European conference on Computational Biology 2002 (ECCB 2002).
  391. Rombauts, S., Lescot, M., Thijs, G., Marchal, K., Moreau, Y., Déhais, P., Van de Peer, Y., et al. (2002). The PlantCARE database and tools for in silico search of plant cis-acting regulatory elements. JOBIM 2002 : journées ouvertes biologie, informatique, mathématique (pp. 183–184). Presented at the Journées Ouvertes Biologie, Informatique, Mathématique 2002 (JOBIM 2002).
  392. Bonnet, E., & Van de Peer, Y. (2002). zt : a software tool for simple and partial Mantel tests. JOURNAL OF STATISTICAL SOFTWARE, 7(10), 1.
    Different methods of data analysis (e.g. clustering and ordination) are based on distance matrices. In some cases, researchers may wish to compare several distance matrices with one another in order to test a hypothesis concerning a possible relationship between these matrices. However, this is not always self-evident. Usually, values in distance matrices are, in some way, correlated and therefore the usual assumption of independence between objects is violated in the classical tests approach. Furthermore, often, spurious correlations can be observed when comparing two distance matrices. A classic example is the comparison between genetic and environmental distances. Colonies that are in close proximity of each other tend to have similar environments and therefore there will be a positive correlation between environmental and geographical distances. Such colonies will also be more likely to exchange migrants so that genetic distances will be positively correlated with spatial distances. The consequence is that an observed positive association between genetic and environmental distances may be simply due to spatial effects. The most widely used method to account for distance correlations is a procedure known as the Mantel test (Mantel, 1967; Mantel and Valand, 1970, following the pioneering work of Daniels, 1944; Daniels and Kendall, 1947). The simple Mantel test considers two matrices while an extension known as the partial Mantel test considers three matrices. These tools are widely used in different fields of research such as population genetics, ecology, anthropology, psychometrics and sociology.
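As an illustration of the simple Mantel test described above, the following is a minimal sketch in Python. It is not the zt program itself; the function names, the default of 999 permutations, the fixed seed, and the one-sided p-value are assumptions made for this example. It correlates the upper triangles of two symmetric distance matrices, then permutes the rows and columns of one matrix together to build a null distribution, which respects the non-independence of the matrix entries.

```python
import random
from math import sqrt


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simple_mantel(a, b, permutations=999, seed=0):
    """Return (r, p) for a one-sided simple Mantel test on matrices a and b."""
    rng = random.Random(seed)
    n = len(a)
    flat_a = upper_triangle(a)
    observed = pearson(flat_a, upper_triangle(b))
    at_least_as_large = 1  # the observed arrangement counts as one permutation
    order = list(range(n))
    for _ in range(permutations):
        rng.shuffle(order)
        # Permute rows and columns of b with the same ordering of objects.
        permuted = [[b[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(flat_a, upper_triangle(permuted)) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / (permutations + 1)


# Hypothetical data: five sites on a line; the second matrix is the first one doubled.
coords = [0, 1, 3, 6, 10]
geo = [[abs(p - q) for q in coords] for p in coords]
gen = [[2 * d for d in row] for row in geo]
r, p = simple_mantel(geo, gen)
```

Permuting whole objects rather than individual matrix cells is the key design choice: it keeps each permuted matrix a valid distance matrix for a relabelled set of objects.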
  393. Van de Peer, Y., Taylor, J. S., Braasch, I., & Meyer, A. (2001). The ghost of selection past: rates of evolution and functional divergence of anciently duplicated genes. JOURNAL OF MOLECULAR EVOLUTION, 53(4-5), 436–446.
    The duplication of genes and even complete genomes may be a prerequisite for major evolutionary transitions and the origin of evolutionary novelties. However, the evolutionary mechanisms of gene evolution and the origin of novel gene functions after gene duplication have been a subject of many debates. Recently, we compiled 26 groups of orthologous genes, which included one gene from human, mouse, and chicken, one or two genes from the tetraploid Xenopus and two genes from zebrafish. Comparative analysis and mapping data showed that these pairs of zebrafish genes were probably produced during a fish-specific genome duplication that occurred between 300 and 450 Mya, before the teleost radiation (Taylor et al. 2001). As discussed here, many of these retained duplicated genes code for DNA binding proteins. Different models have been developed to explain the retention of duplicated genes and in particular the subfunctionalization model of Force et al. (1999) could explain why so many developmental control genes have been retained. Other models are harder to reconcile with this particular set of duplicated genes. Most genes seem to have been subjected to strong purifying selection, keeping properties such as charge and polarity the same in both duplicates, although some evidence was found for positive Darwinian selection, in particular for Hox genes. However, since only the cumulative pattern of nucleotide substitutions can be studied, clear indications of positive Darwinian selection or neutrality may be hard to find for such anciently duplicated genes. Nevertheless, an increase in evolutionary rate in about half of the duplicated genes seems to suggest that either positive Darwinian selection has occurred or that functional constraints have been relaxed at one point in time during functional divergence.
  394. Taylor, J. S., Van de Peer, Y., & Meyer, A. (2001). Revisiting recent challenges to the ancient fish-specific genome duplication hypothesis. CURRENT BIOLOGY, 11(24), R1005–R1007.
  395. Van de Peer, Y. (2001). Phylogeny branches out. NATURE.
  396. Wuyts, Jan, Van de Peer, Y., & De Wachter, R. (2001). Distribution of substitution rates and location of insertion sites in the tertiary structure of ribosomal RNA. NUCLEIC ACIDS RESEARCH, 29(24), 5017–5028.
    The relative substitution rate of each nucleotide site in bacterial small subunit rRNA, large subunit rRNA and 5S rRNA was calculated from sequence alignments for each molecule. Two-dimensional and three-dimensional variability maps of the rRNAs were obtained by plotting the substitution rates on secondary structure models and on the tertiary structure of the rRNAs available from X-ray diffraction results. This showed that the substitution rates are generally low near the centre of the ribosome, where the nucleotides essential for its function are situated, and that they increase towards the surface. An inventory was made of insertions characteristic of the Archaea, Bacteria and Eucarya domains, and for additional insertions present in specific eukaryotic taxa. All these insertions occur at the ribosome surface. The taxon-specific insertions seem to arise randomly in the eukaryotic evolutionary tree, without any phylogenetic relatedness between the taxa possessing them.
  397. Taylor, J. S., Van de Peer, Y., & Meyer, A. (2001). Genome duplication, divergent resolution and speciation. TRENDS IN GENETICS.
    What are the evolutionary consequences of gene duplication? One answer is speciation, according to a model initially called Reciprocal Silencing and recently expanded and renamed Divergent Resolution. This model shows how the loss of different copies of a duplicated gene in allopatric populations (divergent resolution) can promote speciation by genetically isolating these populations should they become reunited. Genome duplication events produce thousands of duplicated genes. Therefore, lineages with a history of genome duplication might have been especially prone to speciation via divergent resolution.
  398. Wuyts, Jan, De Rijk, P., Van de Peer, Y., Winkelmans, T., & De Wachter, R. (2001). The European Large Subunit Ribosomal RNA database. NUCLEIC ACIDS RESEARCH, 29(1), 175–177.
    The European Large Subunit Ribosomal RNA Database compiles all complete or nearly complete large subunit ribosomal RNA sequences available from public sequence databases. These are provided in aligned format and the secondary structure, as derived by comparative sequence analysis, is included. Additional information about the sequences such as literature references and taxonomic information is also included. The database is available from our WWW server at http://rrna.uia.ac.be/lsu/.
  399. Ben Ali, A., De Baere, R., Van der Auwera, G., De Wachter, R., & Van de Peer, Y. (2001). Phylogenetic relationships among algae based on complete large-subunit rRNA sequences. INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY, 51(3), 737–749.
    The complete or nearly complete large-subunit rRNA (LSU rRNA) sequences were determined for representatives of several algal groups such as the chlorarachniophytes, cryptomonads, haptophytes, bacillariophytes, dictyochophytes and pelagophytes. Our aim was to study the phylogenetic position and relationships of the different groups of algae, and in particular to study the relationships among the different classes of heterokont algae. In LSU rRNA phylogenies, the chlorarachniophytes, cryptomonads and haptophytes seem to form independent evolutionary lineages, for which a specific relationship with any of the other eukaryotic taxa cannot be demonstrated. This is in accordance with phylogenies inferred on the basis of the small-subunit rRNA (SSU rRNA). Regarding the heterokont algae, which form a well-supported monophyletic lineage on the basis of LSU rRNA, resolution between the different classes could be improved by combining the SSU and LSU rRNA data. Based on a concatenated alignment of both molecules, the phaeophytes and the xanthophytes are sister taxa, as well as the pelagophytes and the dictyochophytes, and the chrysophytes and the eustigmatophytes. All these sister group relationships are highly supported by bootstrap analysis and by different methods of tree construction.
  400. Taylor, J. S., Van de Peer, Y., Braasch, I., & Meyer, A. (2001). Comparative genomics provides evidence for an ancient genome duplication event in fish. PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES, 356(1414), 1661–1679.
    There are approximately 25 000 species in the division Teleostei and most are believed to have arisen during a relatively short period of time ca. 200 Myr ago. The discovery of 'extra' Hox gene clusters in zebrafish (Danio rerio), medaka (Oryzias latipes), and pufferfish (Fugu rubripes), has led to the hypothesis that genome duplication provided the genetic raw material necessary for the teleost radiation. We identified 27 groups of orthologous genes which included one gene from man, mouse and chicken, one or two genes from tetraploid Xenopus and two genes from zebrafish. A genome duplication in the ancestor of teleost fishes is the most parsimonious explanation for the observations that for 15 of these genes, the two zebrafish orthologues are sister sequences in phylogenies that otherwise match the expected organismal tree, the zebrafish gene pairs appear to have been formed at approximately the same time, and are unlinked. Phylogenies of nine genes differ a little from the tree predicted by the fish-specific genome duplication hypothesis: one tree shows a sister sequence relationship for the zebrafish genes but differs slightly from the expected organismal tree and in eight trees, one zebrafish gene is the sister sequence to a clade which includes the second zebrafish gene and orthologues from Xenopus, chicken, mouse and man. For these nine gene trees, deviations from the predictions of the fish-specific genome duplication hypothesis are poorly supported. The two zebrafish orthologues for each of the three remaining genes are tightly linked and are, therefore, unlikely to have been formed during a genome duplication event. We estimated that the unlinked duplicated zebrafish genes are between 300 and 450 Myr. Thus, genome duplication could have provided the genetic raw material for teleost radiation. Alternatively, the loss of different duplicates in different populations (i.e. 'divergent resolution') may have promoted speciation in ancient teleost populations.
  401. Wuyts, Jan, De Rijk, P., Van de Peer, Y., Pison, G., Rousseeuw, P., & De Wachter, R. (2000). Comparative analysis of more than 3000 sequences reveals the existence of two pseudoknots in area V4 of eukaryotic small subunit ribosomal RNA. NUCLEIC ACIDS RESEARCH, 28(23), 4698–4708.
    The secondary structure of V4, the largest variable area of eukaryotic small subunit ribosomal RNA, was re-examined by comparative analysis of 3253 nucleotide sequences distributed over the animal, plant and fungal kingdoms and a diverse set of protist taxa. An extensive search for compensating base pair substitutions and for base covariation revealed that in most eukaryotes the secondary structure of the area consists of 11 helices and includes two pseudoknots. In one of the pseudoknots, exchange of base pairs between the two stems seems to occur, and covariation analysis points to the presence of a base triple. The area also contains three potential insertion points where additional hairpins or branched structures are present in a number of taxa scattered throughout the eukaryotic domain.
  402. Lockhart, P. J., Huson, D., Maier, U., Fraunholz, M. J., Van de Peer, Y., Barbrook, A. C., Howe, C. J., et al. (2000). How molecules evolve in eubacteria. MOLECULAR BIOLOGY AND EVOLUTION, 17(5), 835–838.
  403. Van de Peer, Y., Ben Ali, A., & Meyer, A. (2000). Microsporidia : accumulating molecular evidence that a group of amitochondriate and suspectedly primitive eukaryotes are just curious fungi. GENE, 246(1-2), 1–8.
  404. Van de Peer, Y., De Rijk, P., Wuyts, J., Winkelmans, T., & De Wachter, R. (2000). The European Small Subunit Ribosomal RNA database. NUCLEIC ACIDS RESEARCH, 28(1), 175–176.
    The European database of the Small Subunit (SSU) Ribosomal RNA is a curated database that strives to collect all information about the primary and secondary structure of completely or nearly-completely sequenced rRNAs. Furthermore, the database compiles additional information such as literature references and taxonomic status of the organism the sequence was derived from. The database can be consulted via the WWW at URL http://rrna.uia.ac.be/ssu/. Through the WWW, sequences can be easily selected either one by one, by taxonomic group, or by a combination of both, and can be retrieved in different sequence and alignment formats.
  405. De Rijk, P., Wuyts, J., Van de Peer, Y., Winkelmans, T., & De Wachter, R. (2000). The European Large Subunit Ribosomal RNA database. NUCLEIC ACIDS RESEARCH, 28(1), 177–178.
    The European Large Subunit (LSU) Ribosomal RNA (rRNA) database is accessible via the rRNA WWW Server at URL http://rrna.uia.ac.be/lsu/. It is a curated database that compiles complete or nearly complete LSU rRNA sequences in aligned form, and also incorporates secondary structure information for each sequence. Taxonomic information, literature references and other information about the sequences are also available, and can be searched via the WWW interface.
  406. Van de Peer, Y., Baldauf, S. L., Doolittle, W. F., & Meyer, A. (2000). An updated and comprehensive rRNA phylogeny of (crown) eukaryotes based on rate-calibrated evolutionary distances. JOURNAL OF MOLECULAR EVOLUTION, 51(6), 565–576.
    Recent experience with molecular phylogeny has shown that all molecular markers have strengths and weaknesses. Nonetheless, despite several notable discrepancies with phylogenies obtained from protein data, the merits of the small subunit ribosomal RNA (SSU rRNA) as a molecular phylogenetic marker remain indisputable. Over the last 10 to 15 years a massive SSU rRNA database has been gathered, including more than 3000 complete sequences from eukaryotes. This creates a huge computational challenge, which is exacerbated by phenomena such as extensive rate variation among sites in the molecule. A few years ago, a fast phylogenetic method was developed that takes into account among-site rate variation in the estimation of evolutionary distances. This "substitution rate calibration" (SRC) method not only corrects for a major source of artifacts in phylogeny reconstruction but, because it is based on a distance approach, allows comprehensive trees including thousands of sequences to be constructed in a reasonable amount of time. In this study, a nucleotide variability map and a phylogenetic tree were constructed, using the SRC method, based on all available (January 2000) complete SSU rRNA sequences (2551) for species belonging to the so-called eukaryotic crown. The resulting phylogeny constitutes the most complete description of overall eukaryote diversity and relationships to date. Furthermore, branch lengths estimated with the SRC method better reflect the huge differences in evolutionary rates among and within eukaryotic lineages. The ribosomal RNA tree is compared with a recent protein phylogeny obtained from concatenated actin, alpha-tubulin, beta-tubulin, and elongation factor 1-alpha amino acid sequences. A consensus phylogeny of the eukaryotic crown based on currently available molecular data is discussed, as well as specific problems encountered in analyzing sequences when large differences in substitution rate are present, either between different sequences (rate variation among lineages) or between different positions within the same sequence (among-site rate variation).
  407. Ben Ali, A., Wuyts, J., De Wachter, R., Meyer, A., & Van de Peer, Y. (1999). Construction of a variability map for eukaryotic large subunit ribosomal RNA. NUCLEIC ACIDS RESEARCH, 27(14), 2825–2831.
    In this paper, we present a variability map of the eukaryotic large subunit ribosomal RNA, showing the distribution of variable and conserved sites in this molecule. The variability of each site in this map is indicated by means of a colored dot. Construction of the variability map was based on the substitution rate calibration (SRC) method, in which the substitution rate of each nucleotide site is computed by looking at the frequency with which sequence pairs differ at that site as a function of their evolutionary distance. Variability maps constructed by this method provide a much more accurate and objective description of site-to-site variability than visual inspection of sequence alignments.
  408. Wenderoth, K., Marquardt, J., Fraunholz, M., Van de Peer, Y., Wastl, J., & Maier, U.-G. (1999). The taxonomic position of Chlamydomyxa labyrinthuloides. EUROPEAN JOURNAL OF PHYCOLOGY, 34(2), 97–108.
    Chlamydomyxa labyrinthuloides is a heterokont alga known since the last century. It lives on Sphagnum and other water plants as aplanospores or plasmodia. We have investigated the taxonomic position of Chlamydomyxa labyrinthuloides by combining results from morphological studies, pigment analyses and a molecular phylogenetic analysis of the small subunit (SSU) rRNA gene. Chlamydomyxa labyrinthuloides shares morphological features with xanthophytes and chrysophytes, whereas pigment composition indicates a grouping with the phaeophytes, raphidophytes and chrysophytes. The sequence of the SSU rRNA gene and its phylogenetic reconstruction unambiguously demonstrate that Chlamydomyxa labyrinthuloides is related to the chrysophytes.
  409. Andersen, R. A., Van de Peer, Y., Potter, D., Sexton, J. P., Kawachi, M., & LaJeunesse, T. (1999). Phylogenetic analysis of the SSU rRNA from members of the Chrysophyceae. PROTIST, 150(1), 71–84.
    The nucleotide sequence for the nuclear-encoded small subunit ribosomal RNA gene (SSU rRNA) was determined for 24 species of the Chrysophyceae sensu stricto. These sequences were aligned, using primary and secondary structure, with nine previously published sequences for the Chrysophyceae, 14 for the Synurophyceae, and five for the Eustigmatophyceae (outgroup). Data analyses were the substitution rate calibration distance method using neighbor-joining (TREECON), Kimura 2-parameter neighbor-joining method (PAUP) and the maximum parsimony method (PAUP, PHYLIP). Trees from the analyses were largely congruent, but bootstrap support was weak at many nodes. The analyses recovered clades of uniflagellate and biflagellate organisms associated with current higher level taxonomy (e.g., subclass, order). The genus Ochromonas was polyphyletic, and O. tuberculata in particular was distantly related to the other Ochromonas species in the analysis. The family Paraphysomonadaceae occupied a basal position in three of four analyses. The class Synurophyceae appeared to be embedded within the Chrysophyceae, but bootstrap support was weak (< 50%) in all analyses except the PHYLIP parsimony analysis (= 81%). It was considered premature to place the Synurophyceae back into the Chrysophyceae based upon the analysis of one gene, especially given the ultrastructural and pigment differences between the two groups, but the relationship of these two groups deserves further study.
  410. De Rijk, P., Robbrecht, E., de Hoog, S., Caers, A., Van de Peer, Y., & De Wachter, R. (1999). Database on the structure of large subunit ribosomal RNA. NUCLEIC ACIDS RESEARCH, 27(1), 174–178.
    The Antwerp database on large subunit ribosomal RNA now contains 607 complete or nearly complete aligned sequences. The alignment incorporates secondary structure information for each sequence. Other information about the sequences, such as literature references, accession numbers and taxonomic information is also available. Information from the database can be downloaded or searched on the rRNA WWW Server at URL http://rrna.uia.ac.be/.
  411. Van de Peer, Y., Robbrecht, E., de Hoog, S., Caers, A., De Rijk, P., & De Wachter, R. (1999). Database on the structure of small subunit ribosomal RNA. NUCLEIC ACIDS RESEARCH, 27(1), 179–183.
    Over 11 500 complete or nearly complete sequences are now available from the Antwerp database on small subunit ribosomal RNA. All these sequences are aligned with one another on the basis of the adopted secondary structure model, which is corroborated by the observation of compensating substitutions in the alignment. Literature references, accession numbers and taxonomic information are also compiled. The database can be consulted via the World Wide Web at URL http://rrna.uia.ac.be/ssu/.
  412. Van de Peer, Y. (1999). Molecular evolution and the incorporation of site-to-site rate variation in distance tree construction methods. BELGIAN JOURNAL OF ZOOLOGY, 129(1), 5–15. Presented at the 5th Benelux congress of Zoology.
    The construction of evolutionary trees based on sequence data is not self-evident. Apart from the plethora of methods and software tools to choose from if one wants to infer phylogenetic tree topologies, one has also to be cautious about the sequence data themselves. In this paper, we discuss how systematic errors can be introduced by one of the phenomena that often characterize sequence data, i.e. differences in substitution rates among the different sites of the molecule. Regarding pairwise distance methods, these systematic errors can often be avoided if an appropriate substitution model is applied to the construction of phylogenetic trees. This is demonstrated for a phylogeny based on animal small subunit ribosomal RNA sequences.
  413. Raes, Jeroen, & Van de Peer, Y. (1999). ForCon : a software tool for the conversion of sequence alignments. EMBNET NEWS, 6(1), 10–12.
    ForCon is a software tool for the conversion of nucleic acid and amino acid sequence alignments that runs on IBM-compatible computers under a Microsoft Windows environment. The program converts alignment formats used by all popular software packages for sequence alignment and phylogenetic tree inference. ForCon is available for free on request from the authors or can be downloaded via internet at URL http://bioc-www.uia.ac.be/u/jraes/index.html. It is also included in the software package TREECON for Windows (see http://bioc-www.uia.ac.be/u/yvdp/index.html).
  414. Vandamme, P., Segers, P., Ryll, M., Hommez, J., Vancanneyt, M., Coopman, R., De Baere, R., et al. (1998). Pelistega europaea gen. nov., sp. nov., a bacterium associated with respiratory disease in pigeons: taxonomic structure and phylogenetic allocation. INTERNATIONAL JOURNAL OF SYSTEMATIC BACTERIOLOGY, 48(2), 431–440.
    Twenty-four strains isolated mainly from infected respiratory tracts of pigeons were characterized by an integrated genotypic and phenotypic approach. An extensive biochemical examination using conventional tests and several API microtest systems indicated that all isolates formed a phenotypically homogeneous taxon with a DNA G+C content between 42 and 43 mol%. Whole-cell protein and fatty acid analysis revealed an unexpected heterogeneity which was confirmed by DNA-DNA hybridizations. Four main genotypic sub-groups (genomovars) were delineated. 16S rDNA sequence analysis of a representative strain indicated that this taxon belongs to the beta-subclass of the Proteobacteria with Taylorella equigenitalis as its closest neighbour (about 94.8 % similarity). A comparison of phenotypic and genotypic characteristics of both taxa suggested that the pigeon isolates represented a novel genus for which the name Pelistega is proposed. In the absence of differential phenotypic characteristics between the genomovars, it was preferred to include all of the isolates into a single species, Pelistega europaea, and strain LMG 10982 was selected as the type strain. The latter strain belongs to fatty acid cluster I and protein electrophoretic sub-group 1, which comprise 13 and 5 isolates, respectively. It is not unlikely that the name P. europaea will be restricted in the future to organisms belonging to fatty acid cluster I, or even to protein electrophoretic sub-group 1, upon discovery of differential diagnostic features.
  415. Winnepenninckx, B. M., Van de Peer, Y., & Backeljau, T. (1998). Metazoan relationships on the basis of 18S rRNA sequences : a few years later... AMERICAN ZOOLOGIST, 38(6), 888–906. Presented at the Symposium on Evolutionary Relationships of Metazoan Phyla ; held at the Annual meeting of the Society for Integrative and Comparative Biology.
    The 18S rRNA database is continuously growing and new tree construction methods are being developed. The present study aims at assessing what effect the addition of recently determined animal 18S rRNA sequences and the use of a recently developed distance matrix calculation method have on the results of some previously published case studies on metazoan phylogeny. When re-assessing three case studies, part of their conclusions was confirmed on the basis of the present 18S rRNA data set: 1) the monophyly of Arthropoda; 2) the monophyly of the Vestimentifera-Pogonophora and their protostome character; 3) the doubt about the monophyletic origin of Echiura-Sipuncula and 4) the coelomate character of Nemertea. Yet, it is also demonstrated that some of the previous results are no longer warranted when updating the analyses: 1) the monophyly of both the Annelida and the Eutrochozoa; 2) the sister-group relationship of Echiura to Vestimentifera-Pogonophora and 3) the polyphyly of the Mesozoa and their close relationship to Myxozoa and Nematodes. In addition, some new very preliminary evidence is provided for: 1) a common ancestry of Platyhelminthes and Mesozoa and the monophyly of the latter group and 2) the monophyly of Clitellata, Hirudinida and Oligochaeta. Finally, doubt is cast on the monophyly of the Polychaeta and the polychaete orders Spionida, Phyllodocida, and Sabellidae. Of course, these hypotheses also need further testing.
  416. Zwart, G., Huismans, R., van Agterveld, M. P., Van de Peer, Y., De Rijk, P., Eenhoorn, H., Muyzer, G., et al. (1998). Divergent members of the bacterial division Verrucomicrobiales in a temperate freshwater lake. FEMS MICROBIOLOGY ECOLOGY, 25(2), 159–169.
  417. Van de Peer, Y., Caers, A., De Rijk, P., & De Wachter, R. (1998). Database on the structure of small ribosomal subunit RNA. NUCLEIC ACIDS RESEARCH, 26(1), 179–182.
    About 8600 complete or nearly complete sequences are now available from the Antwerp database on small ribosomal subunit RNA. All these sequences are aligned with one another on the basis of the adopted secondary structure model, which is corroborated by the observation of compensating substitutions in the alignment. Literature references, accession numbers and detailed taxonomic information are also compiled. The database can be consulted via the World Wide Web at URL http://rrna.uia.ac.be/ssu/.
  418. De Rijk, P., Caers, A., Van de Peer, Y., & De Wachter, R. (1998). Database on the structure of large ribosomal subunit RNA. NUCLEIC ACIDS RESEARCH, 26(1), 183–186.
    The rRNA WWW Server at URL http://rrna.uia.ac.be/ now provides a database of 496 large subunit ribosomal RNA sequences. All these sequences are aligned, incorporate secondary structure information, and can be obtained in a number of formats. Other information about the sequences, such as literature references, accession numbers and taxonomic information is also available and searchable. If necessary, the data on the server can also be obtained by anonymous ftp.
  419. Rensing, S. A., Obrdlik, P., Rober-Kleber, N., Müller, S. B., Hofmann, C. J., Van de Peer, Y., & Maier, U.-G. (1997). Molecular phylogeny of the stress-70 protein family with reference to algal relationships. EUROPEAN JOURNAL OF PHYCOLOGY, 32(3), 279–285. Presented at the 1st European Phycological congress.
    The stress-70 protein family has previously been shown to be a useful tool for molecular phylogeny at the kingdom to family levels. Although sequences of many members of the stress-70 family are available, few genes from the Protoctista have been sequenced to date. Phylogenetic analyses of algae based on various molecules have not, as yet, provided clear results concerning relationships between major divisions. We cloned and sequenced several algal stress-70 genes in order to provide additional data and to further analyse phylogenetic relationships among algal divisions. New nuclear sequences were obtained from Guillardia theta (Cryptophyta), Ascophyllum nodosum (Heterokontophyta) and Cyanophora paradoxa (Glaucocystophyta). Phylogenetic trees of the stress-70 protein family calculated using different methods are presented. In our trees, the heterokont alga Ascophyllum nodosum is closely related to the slime mould Dictyostelium discoideum, while the nucleomorph (eukaryotic endosymbiont) of the cryptophyte Rhodomonas salina seems to be related to the chlorobiont lineage. The glaucocystophyte Cyanophora paradoxa and the nuclear sequence (host) of the cryptomonad alga Guillardia theta also seem to be closely related. The Cryptophyta and the heterokont algae have evolved from different secondary endosymbiotic events involving different hosts and probably different endosymbionts. However, until more stress-70 sequences of algal divisions become available no definitive conclusions can be drawn concerning branching of the major divisions.
  420. Capesius, I., & Van de Peer, Y. (1997). Secundary structure of the large ribosomal subunit RNA of the moss Funaria hygrometrica. JOURNAL OF PLANT PHYSIOLOGY, 151(2), 239–241.
    In this study, the complete nucleotide sequence of the large ribosomal subunit RNA of the bryophyte Funaria hygrometrica was determined. The RNA sequence, which is the first one reported for bryophytes, was used to infer a secondary structure model. It delivers the base for further evolutionary studies in this group.
  421. Van de Peer, Y., & De Wachter, R. (1997). Construction of evolutionary distance trees with TREECON for Windows : accounting for variation in nucleotide substitution rate among sites. COMPUTER APPLICATIONS IN THE BIOSCIENCES, 13(3), 227–230.
    Motivation: To improve the estimation of evolutionary distances between nucleotide sequences by considering the differences in substitution rates among sites. Results: TREECON for Windows (Van de Peer, Y. and De Wachter, R. Comput. Applic. Biosci., 9, 569-570, 1994) is a software package for the construction and drawing of phylogenetic trees based on distance data computed from nucleic acid and amino acid sequences. For nucleic acids, we here describe the implementation of a recently developed method for estimating evolutionary distances taking into account the substitution rate of individual sites in a sequence alignment. Availability: TREECON for Windows is available on request from the authors. A small fee is asked in order to support the work and to reinvest in new computer hard- and software. More information about the program and substitution rate calibration can be found at URL http://bioc-www.uia.ac.be/u/yvdp/treeconw.html.
  422. Král’ová, I., Van de Peer, Y., Jirků, M., Van Ranst, M., Scholz, T., & Lukeš, J. (1997). Phylogenetic analysis of a fish tapeworm, Proteocephalus exiguus, based on the small subunit rRNA gene. MOLECULAR AND BIOCHEMICAL PARASITOLOGY, 84(2), 263–266.
  423. Van de Peer, Y., Jansen, J., De Rijk, P., & De Wachter, R. (1997). Database on the structure of small ribosomal subunit RNA. NUCLEIC ACIDS RESEARCH, 25(1), 111–116.
    The Antwerp database on small ribosomal subunit RNA now offers more than 6000 nucleotide sequences (August 1996). All these sequences are stored in the form of an alignment based on the adopted secondary structure model, which is corroborated by the observation of compensating substitutions in the alignment. Besides the primary and secondary structure information, literature references, accession numbers and detailed taxonomic information are also compiled. For ease of use, the complete database is made available to the scientific community via World Wide Web at URL http://rrna.uia.ac.be/ssu/.
  424. De Rijk, P., Van de Peer, Y., & De Wachter, R. (1997). Database on the structure of large ribosomal subunit RNA. NUCLEIC ACIDS RESEARCH, 25(1), 117–122.
    The latest release of the large ribosomal subunit RNA database contains 429 sequences. All these sequences are aligned, and incorporate secondary structure information. The rRNA WWW Server at URL http://rrna.uia.ac.be/ provides researchers with an easily accessible resource to obtain the data in this database in a number of computer-readable formats. A new query interface has been added to the server. If necessary, the data can also be obtained by anonymous ftp from the same site.
  425. Van de Peer, Y., & De Wachter, R. (1997). Evolutionary relationships among the eukaryotic crown taxa taking into account site-to-site rate variation in 18S rRNA. JOURNAL OF MOLECULAR EVOLUTION, 45(6), 619–630.
    In this study we constructed a bootstrapped distance tree of 500 small subunit ribosomal RNA sequences from organisms belonging to the so-called crown of eukaryote evolution. Taking into account the substitution rate of the individual nucleotides of the rRNA sequence alignment, our results suggest that (1) animals, true fungi, and choanoflagellates share a common origin: The branch joining these taxa is highly supported by bootstrap analysis (bootstrap support [BS] > 90%), (2) stramenopiles and alveolates are sister groups (BS = 75%), (3) within the alveolates, dinoflagellates and apicomplexans share a common ancestor (BS > 95%), while in turn they both share a common origin with the ciliates (BS > 80%), and (4) within the stramenopiles, heterokont algae, hyphochytriomycetes, and oomycetes form a monophyletic grouping well supported by bootstrap analysis (BS > 85%), preceded by the well-supported successive divergence of labyrinthulomycetes and bicosoecids. On the other hand, many evolutionary relationships between crown taxa are still obscure on the basis of 18S rRNA. The branching order between the animal-fungal-choanoflagellates clade and the chlorobionts, the alveolates and stramenopiles, red algae, and several smaller groups of organisms remains largely unresolved. When among-site rate variation is not considered, the inferred tree topologies are inferior to those where the substitution rate spectrum for the 18S rRNA is taken into account. This is primarily indicated by the erroneous branching of fast-evolving sequences. Moreover, when different substitution rates among sites are not considered, the animals no longer appear as a monophyletic grouping in most distance trees.
  426. Moens, L., Vanfleteren, J., Van de Peer, Y., Peeters, K., Kapp, O., Czeluzniak, J., Goodman, M., et al. (1996). Globins in nonvertebrate species: dispersal by horizontal gene transfer and evolution of the structure-function relationships. MOLECULAR BIOLOGY AND EVOLUTION, 13(2), 324–333.
    Using a new template based on an alignment of 145 nonvertebrate globins we examined several recently determined sequences of putative globins and globin-like hemeproteins. We propose that all globins have evolved from a family of ancestral, approx. 17-kDa hemeproteins, which displayed the globin fold and functioned as redox proteins. Once atmospheric O2 became available the acquisition of oxygen-binding properties was initiated, culminating in the various highly specialized functions known at present. During this evolutionary process, we suggest that (1) high oxygen affinity may have been acquired repeatedly and (2) the formation of chimeric proteins containing both a globin and a flavin binding domain was an additional and distinct evolutionary trend. Furthermore, globin-like hemeproteins encompass hemeproteins produced through convergent evolution from nonglobin ancestral proteins to carry out O2-binding functions as well as hemeproteins whose sequences exhibit the loss of some or all of the structural determinants of the globin fold. We also propose that there occurred two cases of horizontal globin gene transfer, one from an ancestor common to the ciliates Paramecium and Tetrahymena and the green alga Chlamydomonas to a cyanobacterium ancestor and the other, from a eukaryote ancestor of the yeasts Saccharomyces and Candida to a bacterial ancestor of the proteobacterial genera Escherichia, Alcaligenes, and Vitreoscilla.
  427. Van de Peer, Y., Vancanneyt, M., & De Wachter, R. (1996). Compilation of pseudomonad sequences present in a database on the structure of ribosomal RNA. SYSTEMATIC AND APPLIED MICROBIOLOGY, 19(4), 493–500.
    The ribosomal RNA database in Antwerp (Belgium) offers extensive alignments of both small and large ribosomal subunit RNA (SSU/LSU rRNA) sequences. In July 1996, the SSU rRNA and LSU rRNA sequence alignments comprised respectively about 6400 and 350 sequences. The alignments are based on the secondary structure models adopted for both molecules, which are corroborated by the observation of compensating mutations. Literature references, accession numbers, and detailed taxonomic information are also compiled. Since part of this issue of Systematic and Applied Microbiology is dedicated to the pseudomonads in particular, all SSU rRNA and LSU rRNA sequences determined for these bacteria and available in the Antwerp databases are listed. The complete databases are accessible to the scientific community through anonymous ftp and World Wide Web. Our server also provides software for sequence alignment and phylogenetic tree construction.
  428. Moore, E. R., Mau, M., Arnscheidt, A., Böttger, E. C., Hutson, R. A., Collins, M. D., Van de Peer, Y., et al. (1996). The determination and comparison of the 16S rRNA gene sequences of species of the genus Pseudomonas (sensu stricto) and estimation of the natural intrageneric relationships. SYSTEMATIC AND APPLIED MICROBIOLOGY, 19(4), 478–492.
    As a consolidated effort on the part of several laboratories, partial and nearly complete sequence determinations of 16S rRNA genes have been applied as one of several analytical methods in a polyphasic study of the pseudomonads. Nearly-complete sequences have been determined of the PCR-amplified 16S rRNA genes of 21 species of the genus Pseudomonas (sensu stricto), including multiple strains of most species. Phylogenetic branching orders and the natural intrageneric relationships among the species have been inferred through sequence comparisons and cluster analysis and have not shown any obvious recognizable correlation with results derived through standard phenotypic criteria commonly used to group the species. This paper also focuses on the ability of 16S rRNA gene sequences, particularly the hypervariable sequence regions, to be used as nested identification markers and as target sites for the development of 16S rRNA sequence-based strategies for the identification of species of the genus Pseudomonas.
  429. Van de Peer, Y., Janssens, W., Heyndrickx, L., Fransen, K., van der Groen, G., & De Wachter, R. (1996). Phylogenetic analysis of the env gene of HIV-1 isolates taking into account individual nucleotide substitution rates. AIDS, 10(13), 1485–1494.
    Objective: To estimate the relative substitution rate of the individual positions in an alignment of HIV-1 env sequences coding for areas V3, V4, V5, and the beginning of gp41, and to study phylogenetic relationships between HIV-1 strains taking into account these substitution rate estimates. Design: Phylogenetic comparison of 145 HIV-1 strains classified in HIV-1 group M, subtypes A-H and isolated from patients of 24 different geographical origins. Methods: A new method recently developed for measuring the substitution rates of the individual nucleotides in a sequence alignment was applied to an alignment of env gene sequences. From the resulting substitution rate distribution, an equation was derived that describes the relationship between dissimilarity and evolutionary distance better than equations previously available. Phylogenetic trees were then constructed from matrices of distances computed using this new equation. Results: 'Substitution rate calibration' offers detailed information on the relative substitution rate or variability of the nucleotides in the env gene. A large phylogenetic tree of 145 env gene sequences constructed by neighbour-joining and taking into account the substitution rate spectrum for this gene, clearly shows the existence of the eight subtypes A-H, all supported at a bootstrap level of 90% or higher. Intersubtype distances were between 0.25 and 0.38, which is considerably higher than those found in trees not considering differences in substitution rates among different alignment positions. Conclusions: Evolutionary distances are seriously underestimated when individual substitution rates are not considered in the estimation of evolutionary distances. Furthermore, due to the more accurate estimation of evolutionary distances, naturally occurring HIV-1 intersubtype recombinants could be recognized more easily.
  430. Chalwatzis, N., Hauf, J., Van de Peer, Y., Kinzelbach, R., & Zimmermann, F. K. (1996). 18S Ribosomal RNA genes of insects : primary structure of the genes and molecular phylogeny of the Holometabola. ANNALS OF THE ENTOMOLOGICAL SOCIETY OF AMERICA, 89(6), 788–803.
    The 18S ribosomal RNA genes of 19 insect species, including 12 Holometabola, were cloned and sequenced. The genes of the insect species investigated so far vary in length from 1,809 to 3,316 bp. The genes were aligned according to the secondary structure model of the 18S ribosomal RNA. An average of 1,580 aligned nucleotide positions per gene was used for the calculation of phylogenetic trees with sequences of this and previous studies. Neighbor-joining trees were calculated with gamma, substitution rate calibration, and LogDet distances. Informative alignment positions were used for a maximum parsimony analysis. The robustness of the trees was tested by bootstrapping and by branch support calculations. All the major groups of the Holometabola, which are commonly regarded as monophyletic except for the Mecoptera, were supported with the neighbor-joining analysis. Some of these groups were not represented in the most parsimonious trees or had a bootstrap support of <80%. However, a sister group relationship of the Strepsiptera and the Diptera was found with both methods and with corrections for different substitution rates at different sites and for differences in the nucleotide composition.
  431. Van de Peer, Y., Chapelle, S., & De Wachter, R. (1996). A quantitative map of nucleotide substitution rates in bacterial rRNA. NUCLEIC ACIDS RESEARCH, 24(17), 3381–3391.
    A recently developed method for estimating the variability of nucleotide sites in a sequence alignment [Van de Peer, Y., Van der Auwera, G. and De Wachter, R. (1996) J. Mol. Evol. 42, 201-210] was applied to bacterial 16S, 5S and 23S rRNAs. In this method, the variability of each nucleotide site is defined as its evolutionary rate relative to the average evolutionary rate of all the nucleotide sites of the molecule. Spectra of evolutionary rates were calculated for each rRNA and show the fastest evolving sites substituting at rates more than 1000 times that of the slowest ones. Variability maps are presented for each rRNA, consisting of secondary structure models where the variability of each nucleotide site is indicated by means of a colored dot. The maps can be interpreted in terms of higher order structure, function and evolution of the molecules and facilitate the selection of areas suitable for the design of PCR primers and hybridization probes. Variability measurement is also important for the precise estimation of evolutionary distances and the inference of phylogenetic trees.
  432. Van de Peer, Y., Rensing, S. A., Maier, U.-G., & De Wachter, R. (1996). Substitution rate calibration of small subunit ribosomal RNA identifies chlorarachniophyte endosymbionts as remnants of green algae. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 93(15), 7732–7736.
    Chlorarachniophytes are amoeboid algae with chlorophyll a and b containing plastids that are surrounded by four membranes instead of two as in plants and green algae. These extra membranes form important support for the hypothesis that chlorarachniophytes have acquired their plastids by the ingestion of another eukaryotic plastid-containing alga. Chlorarachniophytes also contain a small nucleus-like structure called the nucleomorph situated between the two inner and the two outer membranes surrounding the plastid. This nucleomorph is a remnant of the endosymbiont's nucleus and encodes, among other molecules, small subunit ribosomal RNA. Previous phylogenetic analyses on the basis of this molecule provided unexpected and contradictory evidence for the origin of the chlorarachniophyte endosymbiont. We developed a new method for measuring the substitution rates of the individual nucleotides of small subunit ribosomal RNA. From the resulting substitution rate distribution, we derived an equation that gives a more realistic relationship between sequence dissimilarity and evolutionary distance than equations previously available. Phylogenetic trees constructed on the basis of evolutionary distances computed by this new method clearly situate the chlorarachniophyte nucleomorphs among the green algae. Moreover, this relationship is confirmed by transversion analysis of the Chlorarachnion plastid small subunit ribosomal RNA.
  433. Van de Peer, Y., Van der Auwera, G., & De Wachter, R. (1996). The evolution of stramenopiles and alveolates as derived by “substitution rate calibration” of small ribosomal subunit RNA. JOURNAL OF MOLECULAR EVOLUTION, 42(2), 201–210.
    The substitution rate of the individual positions in an alignment of 750 eukaryotic small ribosomal subunit RNA sequences was estimated. From the resulting rate distribution, an equation was derived that gives a more precise relationship between sequence dissimilarity and evolutionary distance than hitherto available. Trees constructed on the basis of evolutionary distances computed by this new equation for small ribosomal subunit RNA sequences from ciliates, apicomplexans, dinoflagellates, oomycetes, hyphochytriomycetes, bicosoecids, labyrinthuloids, and heterokont algae show a more consistent tree topology than trees constructed in the absence of "substitution rate calibration." In particular, they do not suffer from anomalies caused by the presence of extremely long branches.
  434. Van de Peer, Y., & De Wachter, R. (1996). Substitution rate calibration of nucleotide sequence alignments and application to phylogenetic tree construction. ARCHIVES OF PHYSIOLOGY AND BIOCHEMISTRY, 104(3), B53–B53. Presented at the 162nd Annual meeting of the Société Belge de Biochimie et de Biologie Moléculaire/Belgische Vereniging voor Biochemie en Moleculaire Biologie.
  435. Van de Peer, Y., Nicolaï, S., De Rijk, P., & De Wachter, R. (1996). Database on the structure of small ribosomal subunit RNA. NUCLEIC ACIDS RESEARCH, 24(1), 86–91.
    The Antwerp database on small ribosomal subunit RNA offers over 4300 nucleotide sequences (August 1995). All these sequences are stored in the form of an alignment based on the adopted secondary structure model, which in turn is corroborated by the observation of compensating substitutions in the alignment. Besides the primary and secondary structure information, literature references, accession numbers and detailed taxonomic information are also compiled. The complete database is made available to the scientific community through anonymous ftp and World Wide Web (WWW).
  436. De Rijk, P., Van de Peer, Y., & De Wachter, R. (1996). Database on the structure of large ribosomal subunit RNA. NUCLEIC ACIDS RESEARCH, 24(1), 92–97.
    Our database on large ribosomal subunit RNA contained 334 sequences in July, 1995. All sequences in the database are aligned, taking into account secondary structure. The aligned sequences are provided, together with incorporated secondary structure information, in several computer-readable formats. These data can easily be obtained through the World Wide Web. The files in the database are also available via anonymous ftp.
  437. Nelissen, B., Van de Peer, Y., Wilmotte, A., & De Wachter, R. (1995). An early origin of plastids within the cyanobacterial divergence is suggested by evolutionary trees based on complete 16S rRNA sequences. MOLECULAR BIOLOGY AND EVOLUTION, 12(6), 1166–1173.
    It is generally accepted that the plastids arose from a cyanobacterial ancestor, but the exact phylogenetic relationships between cyanobacteria and plastids are still controversial. Most studies based on partial 16S rRNA sequences suggested a relatively late origin of plastids within the cyanobacterial divergence. In order to clarify the exact relationship and divergence order of cyanobacteria and plastids, we studied their phylogeny on the basis of nearly complete 16S rRNA gene sequences. The data set comprised 15 strains of cyanobacteria from different morphological groups, 1 prochlorophyte, and plastids belonging to 8 species of plants and 12 species of diverse algae. This set included three cyanobacterial sequences determined in this study. This is the most comprehensive set of complete cyanobacterial and plastidial 16S rRNA sequences used so far. Phylogenetic trees were constructed using neighbor joining and maximum parsimony, and the reliability of the tree topologies was tested by different methods. Our results suggest an early origin of plastids within the cyanobacterial divergence, preceded only by the divergence of two cyanobacterial genera, Gloeobacter and Pseudanabaena.
  438. De Rijk, P., Van de Peer, Y., Van den Broeck, I., & De Wachter, R. (1995). Evolution according to large ribosomal subunit RNA. JOURNAL OF MOLECULAR EVOLUTION, 41(3), 366–375.
    Evolutionary trees were constructed, by distance methods, from an alignment of 225 complete large subunit (LSU) rRNA sequences, representing Eucarya, Archaea, Bacteria, plastids, and mitochondria. A comparison was made with trees based on sets of small subunit (SSU) rRNA sequences. Trees constructed on the set of 172 species and organelles for which the sequences of both molecules are known had a very similar topology, at least with respect to the divergence order of large taxa such as the eukaryotic kingdoms and the bacterial divisions. However, since there are more than ten times as many SSU as LSU rRNA sequences, it is possible to select many SSU rRNA sequence sets of equivalent size but different species composition. The topologies of these trees showed considerable differences according to the particular species set selected. The effect of the dataset and of different distance correction methods on tree topology was tested for both LSU and SSU rRNA by repetitive random sampling of a single species from each large taxon. The impact of the species set on the topology of the resulting consensus trees is much lower using LSU than using SSU rRNA. This might imply that LSU rRNA is a better molecule for studying wide-range relationships. The mitochondria behave clearly as a monophyletic group, clustering with the Proteobacteria. Gram-positive bacteria appear as two distinct groups, which are found clustered together in very few cases. Archaea behave as if monophyletic in most cases, but with a low confidence.
  439. Van de Peer, Y., & De Wachter, R. (1995). Investigation of fungal phylogeny on the basis of small ribosomal subunit RNA sequences. In A. D. Akkermans, J. D. Van Elsas, & F. J. De Bruijn (Eds.), Molecular microbial ecology manual (pp. 297–308). Dordrecht, The Netherlands: Kluwer Academic.
  440. Haase, G., Sonntag, L., Van de Peer, Y., Uijthof, J. M., Podbielski, A., & Melzer-Krick, B. (1995). Phylogenetic analysis of ten black yeast species using nuclear small subunit rRNA gene sequences. ANTONIE VAN LEEUWENHOEK INTERNATIONAL JOURNAL OF GENERAL AND MOLECULAR MICROBIOLOGY, 68(1), 19–33.
    The nuclear small subunit rRNA genes of authentic strains of the black yeasts Exophiala dermatitidis, Wangiella dermatitidis, Sarcinomyces phaeomuriformis, Capronia mansonii, Nadsoniella nigra var. hesuelica, Phaeoannellomyces elegans, Phaeococcomyces exophialae, Exophiala jeanselmei var. jeanselmei and E. castellanii were amplified by PCR and directly sequenced. A putative secondary structure of the nuclear small subunit rRNA of Exophiala dermatitidis was predicted from the sequence data. Alignment with corresponding sequences from Neurospora crassa and Aureobasidium pullulans was performed and a phylogenetic tree was constructed using the neighbor-joining method. The obtained topology of the tree was confirmed by bootstrap analysis. Based upon this analysis all fungi studied formed a well-supported monophyletic group clustering as a sister group to one group of the Plectomycetes (Trichocomaceae and Onygenales). The analysis confirmed the close relationship postulated between Exophiala dermatitidis, Wangiella dermatitidis and Sarcinomyces phaeomuriformis. This monophyletic clade also contains the teleomorph species Capronia mansonii thus confirming the concept of a teleomorph connection of the genus Exophiala to a member of the Herpotrichiellaceae. However, Exophiala castellanii did not belong to this clade. Therefore, this species is not the anamorph of Capronia mansonii as it was postulated.
  441. Winnepenninckx, B., Van de Peer, Y., Backeljau, T., & De Wachter, R. (1995). CARD : a drawing tool for RNA secondary structure models. BIOTECHNIQUES, 18(6), 1060–1063.
    A graphical editor was developed to create publication-quality representations of RNA secondary structure models. A user-defined model can be gradually assembled from structural elements such as helix segments and loops. The type of structural element to be drawn is chosen from a menu. Its nucleotide sequence has to be entered from the keyboard. Afterwards, drawings can be manipulated by moving, deleting, rotating, copying or changing the structural elements separately. An example of a secondary structure model is given for a complete 18S rRNA molecule.
  442. Van der Auwera, G., De Baere, R., Van de Peer, Y., De Rijk, P., Van den Broeck, I., & De Wachter, R. (1995). The phylogeny of the Hyphochytriomycota as deduced from ribosomal RNA sequences of Hyphochytrium catenoides. MOLECULAR BIOLOGY AND EVOLUTION, 12(4), 671–678.
    Based on biochemical and ultrastructural data, hyphochytriomycetes are believed to share an ancestor with oomycetes and heterokont algae. In order to study the phylogeny of the hyphochytriomycetes, we determined both the small- and large-subunit ribosomal RNA sequences of Hyphochytrium catenoides. Phylogenetic trees were constructed using the neighbor-joining and maximum-parsimony methods and include representatives of Chlorobionta, Fungi, Metazoa, Alveolata, and all known Heterokonta. Our main conclusion is that the hyphochytriomycetes form a monophyletic group with the oomycetes and heterokont algae and that they are probably the closest relatives of the oomycetes. However, the order of divergence between the various heterokont algal phyla and the oomycete-hyphochytriomycete cluster remains uncertain.
  443. Van de Peer, Y., Neefs, J.-M., De Rijk, P., De Vos, P., & De Wachter, R. (1994). About the order of divergence of the major bacterial taxa during evolution. SYSTEMATIC AND APPLIED MICROBIOLOGY, 17(1), 32–38.
    An evolutionary tree, reconstructed from 1232 bacterial small ribosomal subunit RNA sequences by a distance method, reflects the existence of 11 divisions and a number of subdivisions originally recognized by Woese and collaborators. However, the order of divergence that gave rise to these taxa remains indeterminate and the division of Gram positives and relatives does not behave as a monophyletic taxon. Analysis of the data by a novel approach led to a preferred order of divergence for 10 out of 16 tree nodes, but the Gram positives still behaved as biphyletic.
  444. Vanfleteren, Jacques, Van de Peer, Y., Blaxter, M. L., Tweedie, S. A., Trotman, C., Lu, L., Van Hauwaert, M.-L., et al. (1994). Molecular genealogy of some nematode taxa as based on cytochrome c and globin amino acid sequences. MOLECULAR PHYLOGENETICS AND EVOLUTION, 3(2), 92–101.
  445. Van de Peer, Y., & De Wachter, R. (1994). TREECON for Windows : a software package for the construction and drawing of evolutionary trees for the Microsoft Windows environment. COMPUTER APPLICATIONS IN THE BIOSCIENCES.
  446. De Rijk, P., Van de Peer, Y., Chapelle, S., & De Wachter, R. (1994). Database on the structure of large ribosomal subunit RNA. NUCLEIC ACIDS RESEARCH, 22(17), 3495–3501.
    A database on large ribosomal subunit RNA is made available. It contains 258 sequences. It provides sequence, alignment and secondary structure information in computer-readable formats. Files can be obtained using ftp.
  447. Janssens, Wouter, Heyndrickx, L., Van de Peer, Y., Bouckaert, A., Fransen, K., Motte, J., Gershy-Damet, G.-M., et al. (1994). Molecular phylogeny of part of the env gene of HIV-1 strains isolated in Côte d’Ivoire. AIDS, 8(1), 21–26.
    Objectives: To examine the genetic variation of HIV-1 isolates in Abidjan, Côte d’Ivoire, and to determine the extent to which phylogenetic trees based on sequence information of part of the env gene containing the principal neutralizing domain are representative for documenting genetic variability. Design: Phylogenetic comparison of 13 HIV-1 strains isolated from patients in Abidjan with previously documented HIV-1 strains of different geographic origin. Methods: To sequence a 900 base-pair fragment of the env gene containing V3, V4, V5 and the beginning of gp41 of three to four clones per isolate. Phylogenetic tree analysis was performed with the software package TREECON. Results: Eleven HIV-1 isolates of Abidjan were classified as genotype A, while two were classed as genotypes B and D. Intra-genotype A distances at the nucleotide level were a maximum of 14.1%. Inter-genotype distances between genotype A and genotypes B, C, and D varied from 16.0 to 22.6%. Phylogenetic trees, based on sequence data of a 300 base-pair fragment containing the V3 loop, showed significant differences in tree topology and statistical confidence with phylogenetic trees based on sequence data of the 900 base-pair env fragment. Conclusions: Genotype A Côte d’Ivoire HIV-1 strains, which comprise 11 out of 13 isolates, predominate in Abidjan, which may indicate a local burst of particular variants. Phylogenetic trees should be interpreted with caution when based on a more limited number of nucleotides, such as the V3 region.
  448. Van de Peer, Y., Van den Broeck, I., De Rijk, P., & De Wachter, R. (1994). Database on the structure of small ribosomal subunit RNA. NUCLEIC ACIDS RESEARCH, 22(17), 3488–3494.
    The database on small ribosomal subunit RNA structure contains (June 1994) 2824 nucleotide sequences. All these sequences are stored in the form of an alignment based on the adopted secondary structure model, which in turn is corroborated by the observation of compensating substitutions in the alignment. The complete database is made available to the scientific community through anonymous ftp on our server in Antwerp. A special effort was made to improve electronic retrieval, and a program is supplied that allows different file formats to be created. The database can also be obtained from the EMBL nucleotide sequence library.
  449. Van Camp, G., Van de Peer, Y., Nicolai, S., Neefs, J.-M., Vandamme, P., & De Wachter, R. (1993). Structure of 16S and 23S ribosomal RNA genes in Campylobacter species : phylogenetic analysis of the genus Campylobacter and presence of internal transcribed spacers. SYSTEMATIC AND APPLIED MICROBIOLOGY, 16(3), 361–368.
    16S and 23S ribosomal RNA (rRNA) genes from campylobacteria were studied by polymerase chain reaction and DNA sequence analysis using universally conserved oligonucleotide primers. In the 16S rRNA gene sequences of all C. sputorum strains tested, an insertion of about 250 bp was found that was not present in the 16S rRNA genes of other Campylobacter species. This insertion was not present on the rRNA level in C. sputorum, and the 16S rRNA was found to be fragmented in this organism. From the length of the fragments, it could be concluded that the insertion is an internal transcribed spacer, which is probably excised during rRNA maturation. Similar internal transcribed spacers were also found in the 23S rRNA genes from several Campylobacter strains. On the basis of partial 23S rRNA gene sequences about 875 bp in length and comprising some of the most variable helices, phylogenetic analysis was performed on 17 Campylobacter strains. The results of this analysis were compared to a phylogenetic tree based on complete 16S rRNA sequences.
  450. Van de Peer, Y., & De Wachter, R. (1993). TREECON : a software package for the construction and drawing of evolutionary trees. COMPUTER APPLICATIONS IN THE BIOSCIENCES, 9(2), 177–182.
    A package of programs (run by a management program called TREECON) was developed for the construction and drawing of evolutionary trees. The program MATRIX calculates dissimilarity values and can perform bootstrap analysis on nucleic acid sequences. TREE implements different evolutionary tree constructing methods based on distance matrices. Because some of these methods produce unrooted evolutionary trees, a program ROOT places a root on the tree. Finally, the program DRAW draws the evolutionary tree, changes its size or topology, and produces drawings suitable for publication. Whereas MATRIX is suited only for nucleic acids, the modules TREE, ROOT and DRAW are applicable to any kind of dissimilarity matrix. The programs run on IBM-compatible microcomputers using the DOS operating system.
  451. Wilmotte, A., Van de Peer, Y., Goris, A., Chapelle, S., De Baere, R., Nelissen, B., Neefs, J.-M., et al. (1993). Evolutionary relationships among higher fungi inferred from small ribosomal subunit RNA sequence analysis. SYSTEMATIC AND APPLIED MICROBIOLOGY, 16(3), 436–444.
    The primary structure of the small ribosomal subunit RNA (SSU rRNA) was determined for 13 species belonging to 10 ascomycete families and for the basidiomycetous anamorphic yeast Rhodotorula glutinis. The sequences were fitted into an alignment of all hitherto published complete or nearly complete eukaryotic small subunit rRNA sequences. The evolutionary relationships within the fungi were examined by construction of a tree from 87 SSU rRNA sequences, corresponding to 71 different species, by means of a distance matrix method and bootstrap analysis. It confirms the early divergence of the zygomycetes and the classical division of the higher fungi into basidiomycetes and ascomycetes. The basidiomycetes are divided into true basidiomycetes and ustomycetes. Within the ascomycetes, the major subdivisions hemiascomycetes and euascomycetes can be recognized. However, Schizosaccharomyces pombe does not belong to the cluster of the hemiascomycetes, to which it is assigned in classical taxonomic schemes, but forms a distinct lineage. Among the euascomycetes, the plectomycetes and the pyrenomycetes can be distinguished. Within the hemiascomycetes, the polyphyly of genera like Pichia or Candida and of families like the Dipodascaceae and the Saccharomycetaceae can be observed.
  452. Neefs, J.-M., Van de Peer, Y., De Rijk, P., Chapelle, S., & De Wachter, R. (1993). Compilation of small ribosomal subunit RNA structures. NUCLEIC ACIDS RESEARCH, 21(13), 3025–3049.
    The database on small ribosomal subunit RNA structure contained 1804 nucleotide sequences on April 23, 1993. This number comprises 365 eukaryotic, 65 archaeal, 1260 bacterial, 30 plastidial, and 84 mitochondrial sequences. These are stored in the form of an alignment in order to facilitate the use of the database as input for comparative studies on higher-order structure and for reconstruction of phylogenetic trees. The elements of the postulated secondary structure for each molecule are indicated by special symbols. The database is available on-line directly from the authors by ftp and can also be obtained from the EMBL nucleotide sequence library by electronic mail, ftp, and on CD ROM disk.
  453. Van de Peer, Y., Neefs, J.-M., De Rijk, P., & De Wachter, R. (1993). Evolution of eukaryotes as deduced from small ribosomal subunit RNA sequences. BIOCHEMICAL SYSTEMATICS AND ECOLOGY, 21(1), 43–55. Presented at the 1st Meeting of the International Society for Biochemical Systematics.
    Evolutionary trees based on small ribosomal subunit RNA sequences yield a new perspective on eukaryote evolution. In agreement with classical views regarding evolution, animals, green plants, and fungi form monophyletic groups which seem to have originated nearly simultaneously. The evolution of these organisms took place in a relatively short time interval and is characterized by a massive diversification of life forms. In contrast, the dissimilarity among protoctist small ribosomal subunit RNA sequences is huge and exceeds the diversity seen in the entire prokaryotic world. Furthermore, some Protoctista branch off very soon in eukaryote evolution, while others diverge much later. Based on these ribosomal RNA data, Protoctista should be regarded as a collection of independent evolutionary lineages. Because the evolutionary distance between the different groups of Protoctista is, in several cases, larger than the evolutionary distance between plants, fungi and animals, the classification of eukaryotes into four kingdoms seems to be artificial and may not reflect true evolutionary relationships. Generally, eukaryotes are considered to be a relatively recently diverged lineage. Based on ribosomal RNA, however, they seem to be as old as the prokaryote lineages one distinguishes nowadays, namely eubacteria and archaebacteria.
  454. Van de Peer, Y., Neefs, J.-M., De Rijk, P., & De Wachter, R. (1993). Reconstructing evolution from eukaryotic small-ribosomal-subunit RNA sequences : calibration of the molecular clock. JOURNAL OF MOLECULAR EVOLUTION, 37(2), 221–232. Presented at the NATO Advanced research workshop on Genome Organization and Evolution.
    The detailed descriptions now available for the secondary structure of small-ribosomal-subunit RNA, including areas of highly variable primary structure, facilitate the alignment of nucleotide sequences. However, for optimal exploitation of the information contained in the alignment, a method must be available that takes into account the local sequence variability in the computation of evolutionary distance. A quantitative definition for the variability of an alignment position is proposed in this study. It is a parameter in an equation which expresses the probability that the alignment position contains a different nucleotide in two sequences, as a function of the distance separating these sequences, i.e., the number of substitutions per nucleotide that occurred during their divergence. This parameter can be estimated from the distance matrix resulting from the conversion of pairwise sequence dissimilarities into pairwise distances. Alignment positions can then be subdivided into a number of sets of matching variability, and the average variability of each set can be derived. Next, the conversion of dissimilarity into distance can be recalculated for each set of alignment positions separately, using a modified version of the equation that corrects for multiple substitutions and changing for each set the parameter that reflects its average variability. The distances computed for each set are finally averaged, giving a more precise distance estimation. Trees constructed by the algorithm based on variability calibration have a topology markedly different from that of trees constructed from the same alignments in the absence of calibration. This is illustrated by means of trees constructed from small-ribosomal-subunit RNA sequences of Metazoa. A reconstruction of vertebrate evolution based on calibrated alignments matches the consensus view of paleontologists, contrary to trees based on uncalibrated alignments. In trees derived from sequences covering several metazoan phyla, artefacts in topology that are probably due to a high clock rate in certain lineages are avoided.
  455. Hendriks, L., Goris, A., Van de Peer, Y., Neefs, J.-M., Vancanneyt, M., Kersters, K., Berny, J.-F., et al. (1992). Phylogenetic relationships among ascomycetes and ascomycete-like yeasts as deduced from small ribosomal subunit RNA sequences. SYSTEMATIC AND APPLIED MICROBIOLOGY, 15(1), 98–104.
    The primary structure of the small ribosomal subunit RNA (srRNA) molecule of the type strains of the ascosporogenous yeasts Debaryomyces hansenii, Pichia anomala (synonym: Hansenula anomala), Pichia membranaefaciens, Schizosaccharomyces pombe, Zygosaccharomyces rouxii and Dekkera bruxellensis was determined. The srRNA sequences were aligned with previously published sequences from fungi, including those of 5 Candida species, and an evolutionary tree was inferred. The srRNA results were compared with chemotaxonomic criteria, e.g. the coenzyme Q system. The heterogeneity of the genera Candida and Pichia is clearly reflected by the srRNA analysis.
  456. Van de Peer, Y., Hendriks, L., Goris, A., Neefs, J.-M., Vancanneyt, M., Kersters, K., Berny, J.-F., et al. (1992). Evolution of basidiomycetous yeasts as deduced from small ribosomal subunit RNA sequences. SYSTEMATIC AND APPLIED MICROBIOLOGY, 15(2), 250–258.
    Complete small ribosomal subunit RNA sequences were used to infer the relationship between several basidiomycetous yeasts, and to resolve the evolutionary position of the basidiomycetes among the fungi. The sequences were determined for Rhodosporidium toruloides (anamorph Rhodotorula glutinis), Filobasidiella neoformans (anamorph Cryptococcus neoformans), Trichosporon cutaneum, Bullera alba and Sporobolomyces roseus. The sequence of Leucosporidium scottii (anamorph formerly named Candida scottii) srRNA has been published previously (Hendriks et al., J. Mol. Evol. 32, 167-177 (1991)). Using a tree construction program based on a distance matrix, a phylogenetic tree was constructed for all hitherto known fungal srRNA sequences, oomycetes and slime moulds not included. It showed the ascomycetes and the basidiomycetes to be sister groups that probably evolved from a zygomycete-like ancestor and diverged from each other about 840 Myr ago. Among the basidiomycetes, two clearly distinct groups can be recognized, one formed by the teliospore-forming species (Rhodosporidium toruloides and Leucosporidium scottii) and the asexual yeast Sporobolomyces roseus, and the other formed by the non-teliospore-forming species Filobasidiella neoformans and the asexual yeasts Bullera alba and Trichosporon cutaneum.
  457. Winnepenninckx, B., Backeljau, T., Van de Peer, Y., & De Wachter, R. (1992). Structure of the small ribosomal subunit RNA of the pulmonate snail, Limicolaria kambeul, and phylogenetic analysis of the metazoa. FEBS LETTERS, 309(2), 123–126.
    The complete nucleotide sequence of the small ribosomal subunit RNA of the gastropod, Limicolaria kambeul, was determined and used to infer a secondary structure model. In order to clarify the phylogenetic position of the Mollusca among the Metazoa, an evolutionary tree was constructed by neighbor-joining, starting from an alignment of small ribosomal subunit RNA sequences. The Mollusca appear to be a monophyletic group, related to Arthropoda and Chordata in an unresolved trichotomy.
  458. Nelissen, B., Wilmotte, A., De Baere, R., Haes, F., Van de Peer, Y., Neefs, J.-M., & De Wachter, R. (1992). Phylogenetic study of cyanobacteria on the basis of 16S ribosomal RNA sequences. BELGIAN JOURNAL OF BOTANY, 125(2), 210–213. Presented at the Symposium on Macromolecular Identification and Classification of Organisms.
    In this study, the 16S rRNA sequences of five filamentous cyanobacteria (Cyanophyceae) have been determined. These sequences were used to construct, by a distance matrix method, a tree topology to depict the phylogenetic relationships among cyanobacteria.
  459. Winnepenninckx, B., Van de Peer, Y., Peeters, K., De Baere, R., & Moens, L. (1992). Study of invertebrate and plant globins : templates and evolutionary trees. BELGIAN JOURNAL OF BOTANY, 125(2), 191–200. Presented at the Symposium on Macromolecular Identification and Classification of Organisms.
    Globin molecules have a very conservative secondary and tertiary structure, whereas their amino acid sequences are rather variable. To determine the residue restrictions imposed on certain sites of the sequence of non-vertebrate globins (i.e. invertebrate globins and leghemoglobins) to retain the three-dimensional structure, templates were built according to the principles of BASHFORD et al. (1987). A comparison was afterwards made between the templates for non-vertebrate globins and the ones Bashford et al. constructed for mainly vertebrates. We constructed evolutionary trees (phenograms) on the basis of the amino acid sequences for different non-vertebrate phyla by two distance matrix methods: neighbor-joining and UPGMA. Trees made by the latter method were compared with trees based on the amino acid composition of globins.
  460. Van de Peer, Y., Neefs, J.-M., De Rijk, P., De Baere, R., Goris, A., Hendriks, L., & De Wachter, R. (1992). Ribosomal RNA as a tool for studying evolution. BELGIAN JOURNAL OF BOTANY, 125(2), 174–190. Presented at the Symposium on Macromolecular Identification and Classification of Organisms.
    Large databases containing hundreds of sequences are available for 5S ribosomal RNA, small ribosomal subunit RNA and large ribosomal subunit RNA. At the moment, small ribosomal subunit RNA is probably the most appropriate molecule for phylogenetic analysis, due to the large number of available sequences covering a wide range of different organisms, its large chain length and low evolutionary rate. Using this molecule, evolutionary relationships ranging from kingdom level to genus level can be studied. Different natural groups can be distinguished within the three domains Bacteria, Archaea and Eucarya. Comparison of evolutionary trees, constructed by means of small ribosomal subunit rRNA and the far smaller 5S rRNA for several eukaryotic groups of organisms, shows congruencies as well as discrepancies. Although the same clusters can be distinguished, the observed branching order between these groups is different.
  461. Wilmotte, A., Turner, S., Van de Peer, Y., & Pace, N. R. (1992). Taxonomic study of marine oscillatoriacean strains (Cyanobacteria) with narrow trichomes, 2 : nucleotide sequence analysis of the 16S ribosomal RNA. JOURNAL OF PHYCOLOGY, 28(6), 828–838.
    Partial 16S ribosomal RNA sequences from five marine oscillatoriacean strains with narrow trichomes were determined by a dideoxynucleotide-termination method. A phenogram was constructed by a distance matrix method including a bootstrap analysis. In addition, a consensus tree was built using cladistic analysis. The results were largely congruent and indicate that the five strains belong to two different lineages. The first lineage groups four phycoerythrin-producing strains with the strain PCC7375 ("Phormidium ectocarpi Gomont"). The second cluster groups strain PCC7105 ("Oscillatoria williamsii Drouet") with the previously studied strain Microcoleus 10mfx. Comparisons to morphological data are made and the taxonomic level of the separations is estimated.
  462. De Rijk, P., Neefs, J.-M., Van de Peer, Y., & De Wachter, R. (1992). Compilation of small ribosomal subunit RNA sequences. NUCLEIC ACIDS RESEARCH, 20(suppl.), 2075–2089.
  463. De Wachter, R., Neefs, J.-M., Goris, A., & Van de Peer, Y. (1992). The gene coding for small ribosomal subunit RNA in the basidiomycete Ustilago maydis contains a group I intron. NUCLEIC ACIDS RESEARCH, 20(6), 1251–1257.
    The nucleotide sequence of the gene coding for small ribosomal subunit RNA in the basidiomycete Ustilago maydis was determined. It revealed the presence of a group I intron with a length of 411 nucleotides. This is the third occurrence of such an intron discovered in a small subunit rRNA gene encoded by a eukaryotic nuclear genome. The other two occurrences are in Pneumocystis carinii, a fungus of uncertain taxonomic status, and Ankistrodesmus stipitatus, a green alga. The nucleotides of the conserved core structure of 101 group I intron sequences present in different genes and genome types were aligned and their evolutionary relatedness was examined. This revealed a cluster including all group I introns hitherto found in eukaryotic nuclear genes coding for small and large subunit rRNAs. A secondary structure model was designed for the area of the Ustilago maydis small ribosomal subunit RNA precursor where the intron is situated. It shows that the internal guide sequence pairing with the intron boundaries fits between two helices of the small subunit rRNA, and that minimal rearrangement of base pairs suffices to achieve the definitive secondary structure of the 18S rRNA upon splicing.
  464. Hendriks, L., Goris, A., Van de Peer, Y., Neefs, J.-M., Vancanneyt, M., Kersters, K., Hennebert, G. L., et al. (1991). Phylogenetic analysis of five medically important Candida species as deduced on the basis of small ribosomal subunit RNA sequences. JOURNAL OF GENERAL MICROBIOLOGY, 137(5), 1223–1230.
    The classification of species belonging to the genus Candida Berkhout is problematic. Therefore, we have determined the small ribosomal subunit RNA (srRNA) sequences of the type strains of three human pathogenic Candida species: Candida krusei, C. lusitaniae and C. tropicalis. The srRNA sequences were aligned with published eukaryotic srRNA sequences and evolutionary trees were inferred using a matrix optimization method. An evolutionary tree comprising all available eukaryotic srRNA sequences, including two other pathogenic Candida species, C. albicans and C. glabrata, showed that the yeasts diverge rather late in the course of eukaryote evolution, namely at the same depth as green plants, ciliates and some smaller taxa. The cluster of the higher fungi consists of 10 ascomycetes and ascomycete-like species with the first branches leading to Neurospora crassa, Pneumocystis carinii, Candida lusitaniae and C. krusei, in that order. Next there is a dichotomous divergence leading to a group consisting of Torulaspora delbrueckii, Saccharomyces cerevisiae, C. glabrata and Kluyveromyces lactis and a smaller group comprising C. tropicalis and C. albicans. The divergence pattern obtained on the basis of srRNA sequence data is also compared to various other chemotaxonomic data.
  465. Neefs, J.-M., Van de Peer, Y., De Rijk, P., Goris, A., & De Wachter, R. (1991). Compilation of small ribosomal subunit RNA sequences. NUCLEIC ACIDS RESEARCH, 19(suppl.), 1987–2015.
  466. Hendriks, L., De Baere, R., Van de Peer, Y., Neefs, J.-M., Goris, A., & De Wachter, R. (1991). The evolutionary position of the rhodophyte Porphyra umbilicalis and the basidiomycete Leucosporidium scottii among other eukaryotes as deduced from complete sequences of small ribosomal subunit RNA. JOURNAL OF MOLECULAR EVOLUTION, 32(2), 167–177.
    The complete small ribosomal subunit RNA (srRNA) sequence was determined for the red alga Porphyra umbilicalis and the basidiomycete Leucosporidium scottii, representing two taxa for which no srRNA sequences were hitherto known. These sequences were aligned with other published complete srRNA sequences of 58 eukaryotes. Evolutionary trees were reconstructed by a matrix optimization method from a dissimilarity matrix based on sections of the alignment that correspond to structurally conservative areas of the molecule that can be aligned unambiguously. The overall topology of the eukaryotic tree thus constructed is as follows: first there is a succession of early diverging branches, leading to a diplomonad, a microsporidian, a euglenoid plus kinetoplastids, an amoeba, and slime molds. Later, a nearly simultaneous radiation seems to occur into a number of taxa comprising the metazoa, the red alga, the sporozoa, the higher fungi, the ciliates, the green plants, plus some other less numerous groups. Because the red alga diverges late in the evolutionary tree, it does not seem to represent a very primitive organism as proposed on the basis of morphological and 5S rRNA sequence data. Asco- and basidiomycetes do not share a common ancestor in our tree as is generally accepted on the basis of conventional criteria. In contrast, when all alignment positions, rather than the more conservative ones, are used to construct the evolutionary tree, higher fungi do form a monophyletic cluster. The hypothesis that higher fungi and red algae might have shared a common origin has been put forward. Although the red alga and fungi seem to diverge at nearly the same time, no such relationship can be detected. The newly determined sequences can be fitted into a secondary structure model for srRNA, which is now relatively well established with the exception of uncertainties in a number of eukaryote-specific expansion areas. A specific structural model featuring a pseudoknot is proposed for one of these areas.
  467. Moens, L., Van Hauwaert, M.-L., De Smet, K., Ver Donck, K., Van de Peer, Y., Van Beeumen, J., Wodak, S., et al. (1990). Structural interpretation of the amino acid sequence of a second domain from the Artemia covalent polymer globin. JOURNAL OF BIOLOGICAL CHEMISTRY, 265(24), 14285–14291.
  468. Van den Eynde, H., Van de Peer, Y., Vandenabeele, H., Van Bogaert, M., & De Wachter, R. (1990). 5S rRNA sequences of myxobacteria and radioresistant bacteria and implications for eubacterial evolution. INTERNATIONAL JOURNAL OF SYSTEMATIC BACTERIOLOGY, 40(4), 399–404.
    5S rRNA sequences were determined for the myxobacteria Cystobacter fuscus, Myxococcus coralloides, Sorangium cellulosum, and Nannocystis exedens and for the radioresistant bacteria Deinococcus radiodurans and Deinococcus radiophilus. A dendrogram was constructed by using weighted pairwise grouping based on these and all other previously known eubacterial 5S rRNA sequences, and this dendrogram showed differences as well as similarities compared with results derived from 16S rRNA analyses. In the dendrogram, Deinococcus 5S rRNA sequences clustered with 5S rRNA sequences of the genus Thermus, as suggested by the results of 16S rRNA analyses. However, in contrast to the 16S rRNA results, the Deinococcus-Thermus cluster divided the 5S rRNA sequences of the alpha subdivision of the class Proteobacteria from the 5S rRNA sequences of the beta and gamma subgroups of the Proteobacteria. The myxobacterial 5S rRNA sequence data failed to confirm the existence of a delta subgroup of the class Proteobacteria, which was suggested by the results of 16S rRNA analyses.
  469. Hendriks, L., Van de Peer, Y., Van Herck, M., Neefs, J.-M., & De Wachter, R. (1990). The 18S ribosomal RNA sequence of the sea anemone Anemonia sulcata and its evolutionary position among other eukaryotes. FEBS LETTERS, 269(2), 445–449.
    Evolutionary trees based on partial small ribosomal subunit RNA sequences of 22 metazoa species have been published [(1988) Science 239, 748-753]. In these trees, cnidarians (Radiata) seemed to have evolved independently from the Bilateria, which is in contradiction with the general evolutionary view. In order to further investigate this problem, the complete srRNA sequence of the sea anemone Anemonia sulcata was determined and evolutionary trees were constructed using a matrix optimization method. In the tree thus obtained the sea anemone and Bilateria together form a monophyletic cluster, with the sea anemone forming the first line of descent of the metazoan group.
  470. Neefs, J.-M., Van de Peer, Y., Hendriks, L., & De Wachter, R. (1990). Compilation of small ribosomal subunit RNA sequences. NUCLEIC ACIDS RESEARCH, 18(suppl.), 2237–2317.
  471. Van den Eynde, H., Van de Peer, Y., Perry, J., & De Wachter, R. (1990). 5S rRNA sequences of representatives of the genera Chlorobium, Prosthecochloris, Thermomicrobium, Cytophaga, Flavobacterium, Flexibacter and Saprospira and a discussion of the evolution of eubacteria in general. JOURNAL OF GENERAL MICROBIOLOGY, 136, 11–18.
    5S rRNA sequences were determined for the green sulphur bacteria Chlorobium limicola, Chlorobium phaeobacteroides and Prosthecochloris aestuarii, for Thermomicrobium roseum, which is a relative of the green non-sulphur bacteria, and for Cytophaga aquatilis, Cytophaga heparina, Cytophaga johnsonae, Flavobacterium breve, Flexibacter sp. and Saprospira grandis, organisms allotted to the phylum ‘Bacteroides-Cytophaga-Flavobacterium’ and relatives as determined by 16S rRNA analyses. By using a clustering algorithm a dendrogram was constructed from these sequences and from all other known eubacterial 5S RNA sequences. The dendrogram showed differences, as well as similarities, with respect to results obtained by 16S RNA analyses. The 5S RNA sequences of green sulphur bacteria were closely related to one another, and to a cluster containing 5S RNA sequences from Bacteroides and its relatives, including Cytophaga aquatilis. 5S RNA sequences of all other representatives of the ‘Bacteroides-Cytophaga-Flavobacterium’ phylum as distinguished by 16S RNA analysis failed to group with Bacteroides and related clusters. On the basis of 5S RNA sequences, Thermomicrobium roseum clustered with Chloroflexus aurantiacus, as was expected from 16S RNA analysis.
  472. Van de Peer, Y., Neefs, J.-M., & De Wachter, R. (1990). Small ribosomal subunit RNA sequences, evolutionary relationships among different life forms, and mitochondrial origins. JOURNAL OF MOLECULAR EVOLUTION, 30(5), 463–476.
    A tree was constructed from a structurally conserved area in an alignment of 83 small ribosomal subunit sequences of eukaryotic, archaebacterial, eubacterial, plastidial, and mitochondrial origin. The algorithm involved computation and optimization of a dissimilarity matrix. According to the tree, only plant mitochondria belong to the eubacterial primary kingdom, whereas animal, fungal, algal, and ciliate mitochondria branch off from an internal node situated between the three primary kingdoms. This result is at variance with a parsimony tree of similar size published by Cedergren et al. (J Mol Evol 28:98–112, 1988), which postulates the mitochondria to be monophyletic and to belong to the eubacterial primary kingdom. The discrepancy does not follow from the use of conflicting sequence alignments, hence it must be due to the use of different treeing algorithms. We tested our algorithm on a set of sequences resulting from a simulated evolution and found it capable of faithfully reconstructing a branching topology that involved very unequal evolutionary rates. The use of more limited or more extended areas of the complete sequence alignment, comprising only very conserved or also more variable portions of the small ribosomal subunit structure, does have some influence on the tree topology. In all cases, however, the nonplant mitochondria seem to branch off before the emergence of eubacteria, and the differences are limited to the branching pattern among different types of mitochondria.
  473. Van de Peer, Y., De Baere, R., Cauwenberghs, J., & De Wachter, R. (1990). Evolution of green plants and their relationship with other photosynthetic eukaryotes as deduced from 5S ribosomal RNA sequences. PLANT SYSTEMATICS AND EVOLUTION, 170(1-2), 85–96.
    The nucleotide sequences of cytoplasmic 5S ribosomal RNAs from three gymnosperms, Pinus contorta, Taxus baccata and Juniperus media, and from one fern, Pteridium aquilinum, have been determined. These sequences were aligned with all hitherto known cytoplasmic 5S ribosomal RNA sequences of photosynthetic eukaryotes. A dendrogram based on that set of sequences was constructed by a distance matrix method and the resulting tree compared with established views concerning plant and algal evolution. The following monophyletic groups of photosynthetic eukaryotes are recognizable: the Rhodophyta, a group consisting of Phaeophyta, Bacillariophyta and Chrysophyta, and the green plants, the latter comprising green algae, Bryophyta, Pteridophyta and Spermatophyta. According to our 5S ribosomal RNA tree, green plants may have originated from some type of a green flagellated organism such as Chlamydomonas. The land plants seem to have originated from some form of charophyte such as Nitella. 5S ribosomal RNA seems to be less appropriate to estimate dissimilarities between species which have diverged relatively recently, like the angiosperms. Therefore, a precise evolutionary process is difficult to reconstruct for members of this group.
  474. Van den Eynde, H., De Baere, R., Shah, H. N., Gharbia, S. E., Fox, G. E., Michalik, J., Van de Peer, Y., et al. (1989). 5S Ribosomal ribonucleic acid sequences in Bacteroides and Fusobacterium : evolutionary relationships within these genera and among eubacteria in general. INTERNATIONAL JOURNAL OF SYSTEMATIC BACTERIOLOGY, 39(1), 78–84.
    The 5S ribosomal ribonucleic acid (rRNA) sequences were determined for Bacteroides fragilis, Bacteroides thetaiotaomicron, Bacteroides capillosus, Bacteroides veroralis, Porphyromonas gingivalis, Anaerorhabdus furcosus, Fusobacterium nucleatum, Fusobacterium mortiferum, and Fusobacterium varium. A dendrogram constructed by a clustering algorithm from these sequences, which were aligned with all other hitherto known eubacterial 5S rRNA sequences, showed differences as well as similarities with respect to results derived from 16S rRNA analyses. In the 5S rRNA dendrogram, Bacteroides clustered together with Cytophaga and Fusobacterium, as in 16S rRNA analyses. Intraphylum relationships deduced from 5S rRNAs suggested that Bacteroides is specifically related to Cytophaga rather than to Fusobacterium, as was suggested by 16S rRNA analyses. Previous taxonomic considerations concerning the genus Bacteroides, based on biochemical and physiological data, were confirmed by the 5S rRNA sequence analysis.
  475. Hendriks, L., Goris, A., Neefs, J.-M., Van de Peer, Y., Hennebert, G., & De Wachter, R. (1989). The nucleotide sequence of the small ribosomal subunit RNA of the yeast Candida albicans and the evolutionary position of the fungi among the Eukaryotes. SYSTEMATIC AND APPLIED MICROBIOLOGY, 12(3), 223–229.
    Up to now the small ribosomal subunit RNA sequences of about 50 different eukaryotes have been published, of which only three belong to the fungi. We determined the complete srRNA sequence of the imperfect yeast Candida albicans. The sequence is 1788 nucleotides long and was determined at the DNA level using the dideoxy method with a set of primers specific for conserved sequences of small ribosomal subunit RNA. An evolutionary tree, comprising 58 organisms including C. albicans, was constructed. This tree shows a number of early diverging lineages such as a diplomonad, a microsporidian, an amoeba, slime molds, a euglenoid, kinetoplastids and sporozoans. Next within a relatively short time interval there is a radiation into a number of clusters composed of ciliates, metazoa, fungi and green plants. C. albicans was previously classified in the artificial taxon of imperfect fungi. The evolutionary tree presented in this paper clearly shows C. albicans to belong to the ascomycetous yeasts. An additional aim of this study was the refinement of the srRNA secondary structure model. Although the outline of this model is now well established, no consensus model exists in certain eukaryote-specific areas of high structural variability. The srRNA sequence of C. albicans was fitted into the secondary structure model and the existence of a pseudoknot is proposed in one of these eukaryote-specific areas.
  476. Dams, E., Hendriks, L., Van de Peer, Y., Neefs, J.-M., Smits, G., Vandenbempt, I., & De Wachter, R. (1988). Compilation of small ribosomal subunit RNA sequences. NUCLEIC ACIDS RESEARCH, 16(suppl.), r87–r173.
  477. Hendriks, L., Van Broeckhoven, C., Vandenberghe, A., Van de Peer, Y., & De Wachter, R. (1988). Primary and secondary structure of the 18S ribosomal RNA of the bird spider Eurypelma californica and evolutionary relationships among eukaryotic phyla. EUROPEAN JOURNAL OF BIOCHEMISTRY, 177(1), 15–20.
    The primary structure of the 18S rRNA of the bird spider Eurypelma californica has been determined in the framework of a study of metazoan phylogeny on the basis of ribosomal RNA structure. A secondary-structure model was derived by comparison of the sequence with that of 43 other eukaryotic small-ribosomal-subunit RNA sequences presently available. This comparison allows a rather detailed secondary-structure pattern to be postulated for a eukaryote-specific area of highly variable sequence and length for which no consensus model has hitherto been attained. A dendrogram, reflecting evolutionary relationships among the 40 eukaryotic species of known 18S rRNA structure, was constructed by a matrix method selecting the best-fitting tree on the basis of a least-squares criterion. The tree shows an early divergence of a microsporidium, a euglenoid, kinetoplastids and a slime mold. Among the remaining species, two main clusters are distinguishable, one comprising the Ciliata, the other comprising Metazoa, green plants, fungi and several protists. Among the Metazoa, the three phyla presently investigated, viz. Chordata, Arthropoda and Nemathelminthes, are distinguishable as three separate lines of descent.
  478. Van den Eynde, H., De Baere, R., De Roeck, E., Van de Peer, Y., Vandenberghe, A., Willekens, P., & De Wachter, R. (1988). The 5S ribosomal RNA sequences of a red algal rhodoplast and a gymnosperm chloroplast : implications for the evolution of plastids and cyanobacteria. JOURNAL OF MOLECULAR EVOLUTION, 27(2), 126–132.
    The 5S ribosomal RNA sequences have been determined for the rhodoplast of the red alga Porphyra umbilicalis and the chloroplast of the conifer Juniperus media. The 5S RNA sequence of the Vicia faba chloroplast is corrected with respect to a previous report. A survey of the known sequences and secondary structures of 5S RNAs from plastids and cyanobacteria shows a close structural similarity between all 5S RNAs from land plant chloroplasts. The algal plastid 5S RNAs on the other hand show much more structural diversity and have certain structural features in common with bacterial 5S RNAs. A dendrogram constructed from the aligned sequences by a clustering algorithm points to a common ancestor for the present-living cyanobacteria and the land plant plastids. However, the algal plastids branch off at an early stage within the plastid-cyanobacteria cluster, before the divergence between cyanobacteria and land plant chloroplasts. This evolutionary picture points to the occurrence of multiple endosymbiotic events, with the ancestors of the present algal plastids already established as photosynthetic endosymbionts at a time when the ancestors of the present land plant chloroplasts were still free-living cells.
  479. Vandenabeele, H., Van den Eynde, H., Van de Peer, Y., & De Wachter, R. (1988). The sequence of 5-S rRNA of Deinococcus radiodurans and the evolutionary position of the radioresistant bacteria. ARCHIVES INTERNATIONALES DE PHYSIOLOGIE DE BIOCHIMIE ET DE BIOPHYSIQUE (Vol. 96, pp. B115–B115). Presented at the 137e Bijeenkomst van de Belgische Vereniging voor Biochemie = 137ième Réunion de la Société Belge de Biochimie.
  480. Van den Eynde, H., Van de Peer, Y., & De Wachter, R. (1988). Inferring eubacterial phylogeny from 5S ribosomal RNA structure analysis. In J. M. Olson, J. Ormerod, J. Amesz, E. Stackebrandt, & H. Trüper (Eds.), Green photosynthetic bacteria (pp. 217–221). New York, NY, USA: Plenum Press.