Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

Total Articles: 89 (from Reference Citation Analysis)
Article PDFs: 21
Cited by > 0: 83
Searched Name: Shin-Han Shiu

Number	Citation Analysis
1	Assessing the evolution of research topics in a biological field using plant science as an example. PLoS Biol 2024;22:e3002612. [PMID: 38781246 PMCID: PMC11115244 DOI: 10.1371/journal.pbio.3002612] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2023] [Accepted: 04/04/2024] [Indexed: 05/25/2024] Abstract: Scientific advances due to conceptual or technological innovations can be revealed by examining how research topics have evolved. But such topical evolution is difficult to uncover and quantify because of the large body of literature and the need for expert knowledge in a wide range of areas in a field. Using plant biology as an example, we used machine learning and language models to classify plant science citations into topics representing interconnected, evolving subfields. The changes in prevalence of topical records over the last 50 years reflect shifts in major research trends and recent radiation of new topics, as well as turnover of model species and vastly different plant science research trajectories among countries. Our approaches readily summarize the topical diversity and evolution of a scientific field with hundreds of thousands of relevant papers, and they can be applied broadly to other fields. MESH Headings: Plants; Research/trends; Machine Learning; Botany/trends; Botany/methods. Grants: National Science Foundation; US Department of Energy.
2	Effect of wastewater collection and concentration methods on assessment of viral diversity. THE SCIENCE OF THE TOTAL ENVIRONMENT 2024;908:168128. [PMID: 37918732 DOI: 10.1016/j.scitotenv.2023.168128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2023] [Revised: 10/23/2023] [Accepted: 10/24/2023] [Indexed: 11/04/2023] Abstract: Monitoring of potentially pathogenic human viruses in wastewater is of crucial importance to understand disease trends in communities, predict potential outbreaks, and boost preparedness and response by public health departments. High throughput metagenomic sequencing opens an opportunity to expand the capabilities of wastewater surveillance. However, there are major bottlenecks in the metagenomic enabled wastewater surveillance, including the complexities in selecting appropriate sampling and concentration/virus enrichment methods as well as in bioinformatic analysis of complex samples with low human virus concentrations. To evaluate the abilities of two commonly used sampling and concentration methods in virus identification, virus communities concentrated with Virus Adsorption-Elution (VIRADEL) and PolyEthylene Glycol (PEG) precipitation were compared for three interceptor sites. Results indicated that more viral reads were obtained by the VIRADEL concentration method, with 2.84 ± 0.57 % viral reads in the sample. For samples concentrated with PEG, the average proportion of viral reads in the sample was 0.63 ± 0.19 %. In all wastewater samples, bacteriophage affiliated with the families Siphoviridae, Myoviridae and Podoviridae were found to be the abundant populations.
Comparison against a custom Swiss-Prot human virus database indicated that the relatively abundant human viruses (average proportions in human virus community greater than 1.00 %) in samples concentrated with the VIRADEL method were Orthopoxvirus, Rhadinovirus, Parapoxvirus, Varicellovirus, Hepatovirus, Simplexvirus, Molluscipoxvirus, Parechovirus, Lymphocryptovirus, and Spumavirus. In samples concentrated with the PEG method, fewer human viruses were found to be relatively abundant. These were Orthopoxvirus, Rhadinovirus, Varicellovirus, Simplexvirus, Molluscipoxvirus, Lymphocryptovirus, and Betacoronavirus. Contigs of Betacoronavirus, which contains severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), were identified in VIRADEL and PEG samples. Our study demonstrates the feasibility of using metagenomics in wastewater surveillance as a first screening tool and the need for selecting the appropriate virus concentration methods and optimizing bioinformatic approaches in analyzing metagenomic data of wastewater samples. Key Words: Human virus; Metagenomics; Public health, COVID-19 outbreak; Wastewater surveillance. MESH Headings: Humans; Wastewater; Wastewater-Based Epidemiological Monitoring; Viruses; SARS-CoV-2; Bacteriophages.
3	Evolution and diversification of the ACT-like domain associated with plant basic helix-loop-helix transcription factors. Proc Natl Acad Sci U S A 2023;120:e2219469120. [PMID: 37126718 PMCID: PMC10175843 DOI: 10.1073/pnas.2219469120] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/15/2022] [Accepted: 03/21/2023] [Indexed: 05/03/2023] Abstract: Basic helix-loop-helix (bHLH) proteins are one of the largest families of transcription factor (TF) in eukaryotes, and ~30% of all flowering plants' bHLH TFs contain the aspartate kinase, chorismate mutase, and TyrA (ACT)-like domain at variable distances C-terminal from the bHLH. However, the evolutionary history and functional consequences of the bHLH/ACT-like domain association remain unknown. Here, we show that this domain association is unique to the plantae kingdom with green algae (chlorophytes) harboring a small number of bHLH genes with variable frequency of ACT-like domain's presence. bHLH-associated ACT-like domains form a monophyletic group, indicating a common origin. Indeed, phylogenetic analysis results suggest that the association of ACT-like and bHLH domains occurred early in Plantae by recruitment of an ACT-like domain in a common ancestor with widely distributed ACT DOMAIN REPEAT (ACR) genes by an ancestral bHLH gene. We determined the functional significance of this association by showing that Chlamydomonas reinhardtii ACT-like domains mediate homodimer formation and negatively affect DNA binding of the associated bHLH domains. We show that, while ACT-like domains have experienced faster selection than the associated bHLH domain, their rates of evolution are strongly and positively correlated, suggesting that the evolution of the ACT-like domains was constrained by the bHLH domains.
This study proposes an evolutionary trajectory for the association of ACT-like and bHLH domains with the experimental characterization of the functional consequence in the regulation of plant-specific processes, highlighting the impacts of functional domain coevolution. Key Words: evolution; gene regulation; green algae; protein–protein interaction. MESH Headings: Basic Helix-Loop-Helix Transcription Factors/metabolism; Phylogeny; Plants/genetics; Transcription Factors/metabolism; Helix-Loop-Helix Motifs. Grants: National Science Foundation (NSF).
4	Evolutionary analysis of the LORELEI gene family in plants reveals regulatory subfunctionalization. PLANT PHYSIOLOGY 2022;190:2539-2556. [PMID: 36156105 PMCID: PMC9706458 DOI: 10.1093/plphys/kiac444] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/20/2022] [Accepted: 08/31/2022] [Indexed: 06/16/2023] Abstract: A signaling complex comprising members of the LORELEI (LRE)-LIKE GPI-anchored protein (LLG) and Catharanthus roseus RECEPTOR-LIKE KINASE 1-LIKE (CrRLK1L) families perceives RAPID ALKALINIZATION FACTOR (RALF) peptides and regulates growth, reproduction, immunity, and stress responses in Arabidopsis (Arabidopsis thaliana). Genes encoding these proteins are members of multigene families in most angiosperms and could generate thousands of signaling complex variants. However, the links between expansion of these gene families and the functional diversification of this critical signaling complex as well as the evolutionary factors underlying the maintenance of gene duplicates remain unknown. Here, we investigated LLG gene family evolution by sampling land plant genomes and explored the function and expression of angiosperm LLGs. We found that LLG diversity within major land plant lineages is primarily due to lineage-specific duplication events, and that these duplications occurred both early in the history of these lineages and more recently. Our complementation and expression analyses showed that expression divergence (i.e. regulatory subfunctionalization), rather than functional divergence, explains the retention of LLG paralogs. Interestingly, all but one monocot and all eudicot species examined had an LLG copy with preferential expression in male reproductive tissues, while the other duplicate copies showed highest levels of expression in female or vegetative tissues.
The single LLG copy in Amborella trichopoda is expressed vastly higher in male compared to in female reproductive or vegetative tissues. We propose that expression divergence plays an important role in retention of LLG duplicates in angiosperms. MESH Headings: Arabidopsis/metabolism; Multigene Family; Phosphotransferases/genetics; Seeds/metabolism; Embryophyta/genetics; Magnoliopsida/genetics; Magnoliopsida/metabolism; Proteins/genetics; Gene Duplication; Evolution, Molecular; Phylogeny. Grants: T32 GM136536 NIGMS NIH HHS IGERT Comparative Genomics Program at the University of Arizona NSF Graduate Research Fellowship University of Arizona Graduate College Office of Diversity and Inclusion University of Arizona Graduate College University Fellowship NIH Institutional Training Grant in Biochemistry and Molecular Biology NSF University of Arizona Undergraduate Biology Research Program Science and Technology Center Gatsby Charitable Foundation University of Zürich European Research Council under the European Union European Molecular Biology Organization Natural Sciences and Engineering Research Council of Canada Deutsche Forschungsgemeinschaft Natural Science Foundation National Institute of Food and Agriculture U.S. Department of Agriculture Center for Agriculture, Food, and the Environment National Science Foundation U.S. Department of Energy.
5	Temporal regulation of cold transcriptional response in switchgrass. FRONTIERS IN PLANT SCIENCE 2022;13:998400. [PMID: 36299783 PMCID: PMC9589291 DOI: 10.3389/fpls.2022.998400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2022] [Accepted: 09/16/2022] [Indexed: 06/16/2023] Abstract: Switchgrass low-land ecotypes have significantly higher biomass but lower cold tolerance compared to up-land ecotypes. Understanding the molecular mechanisms underlying cold response, including the ones at transcriptional level, can contribute to improving tolerance of high-yield switchgrass under chilling and freezing environmental conditions. Here, by analyzing an existing switchgrass transcriptome dataset, the temporal cis-regulatory basis of switchgrass transcriptional response to cold is dissected computationally. We found that the number of cold-responsive genes and enriched Gene Ontology terms increased as duration of cold treatment increased from 30 min to 24 hours, suggesting an amplified response/cascading effect in cold-responsive gene expression. To identify genomic sequences likely important for regulating cold response, machine learning models predictive of cold response were established using k-mer sequences enriched in the genic and flanking regions of cold-responsive genes but not non-responsive genes. These k-mers, referred to as putative cis-regulatory elements (pCREs) are likely regulatory sequences of cold response in switchgrass. There are in total 655 pCREs where 54 are important in all cold treatment time points. Consistent with this, eight of 35 known cold-responsive CREs were similar to top-ranked pCREs in the models and only these eight were important for predicting temporal cold response. More importantly, most of the top-ranked pCREs were novel sequences in cold regulation.
Our findings suggest additional sequence elements important for cold-responsive regulation previously not known that warrant further studies. Key Words: Temporal transcriptional response; machine learning model interpretation; novel cis-regulatory sequences; random forest classifier; regulation of cold stress. Grants: U.S. Department of Energy.
6	Editorial: Artificial Intelligence and Machine Learning Applications in Plant Genomics and Genetics. Front Artif Intell 2022;5:959470. [PMID: 35832206 PMCID: PMC9271996 DOI: 10.3389/frai.2022.959470] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2022] [Accepted: 06/10/2022] [Indexed: 11/15/2022]
7	Computational prediction of plant metabolic pathways. CURRENT OPINION IN PLANT BIOLOGY 2022;66:102171. [PMID: 35078130 DOI: 10.1016/j.pbi.2021.102171] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/02/2021] [Revised: 12/07/2021] [Accepted: 12/18/2021] [Indexed: 06/14/2023] Abstract: Uncovering genes encoding enzymes responsible for the biosynthesis of diverse plant metabolites is essential for metabolic engineering and production of plant metabolite-derived medicine. With the availability of multi-omics data for an ever-increasing number of plant species and the development of computational approaches, the metabolic pathways of many important plant compounds can be predicted, complementing a more traditional genetic and/or biochemical approach. Here, we summarize recent progress in predicting plant metabolic pathways using genome, transcriptome, proteome, interactome, and/or metabolome data, and the utility of integrating these data with machine learning to further improve metabolic pathway predictions. Key Words: Gene function prediction; Machine learning; Metabolic pathway membership; Multi-omics. MESH Headings: Computational Biology; Metabolic Engineering; Metabolic Networks and Pathways; Metabolome/genetics; Plants/genetics; Plants/metabolism; Transcriptome.
8	The SEEL motif and members of the MYB-related REVEILLE transcription factor family are important for the expression of LORELEI in the synergid cells of the Arabidopsis female gametophyte. PLANT REPRODUCTION 2022;35:61-76. [PMID: 34716496 DOI: 10.1007/s00497-021-00432-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/13/2021] [Accepted: 10/19/2021] [Indexed: 06/13/2023] Abstract: Synergid cells in the micropylar end of the female gametophyte are required for critical cell-cell signaling interactions between the pollen tube and the ovule that precede double fertilization and seed formation in flowering plants. LORELEI (LRE) encodes a putative GPI-anchored protein that is expressed primarily in the synergid cells, and together with FERONIA, a receptor-like kinase, it controls pollen tube reception by the receptive synergid cell. Still, how LRE expression is controlled in synergid cells remains poorly characterized. We identified candidate cis-regulatory elements enriched in LRE and other synergid cell-expressed genes. One of the candidate motifs ('TAATATCT') in the LRE promoter was an uncharacterized variant of the Evening Element motif that we named as the Short Evening Element-like (SEEL) motif. Deletion or point mutations in the SEEL motif of the LRE promoter resulted in decreased reporter expression in synergid cells, demonstrating that the SEEL motif is important for expression of LRE in synergid cells. Additionally, we found that LRE expression is decreased in the loss of function mutants of REVEILLE (RVE) transcription factors, which are clock genes known to bind the SEEL and other closely related motifs. We propose that RVE transcription factors regulate LRE expression in synergid cells by binding to the SEEL motif in the LRE promoter.
Identification of cis-regulatory elements and transcription factors involved in the expression of LRE will serve as a foundation to characterize the gene regulatory networks in synergid cells. MESH Headings: Arabidopsis/genetics; Arabidopsis/metabolism; Arabidopsis Proteins/genetics; Arabidopsis Proteins/metabolism; Ovule/genetics; Ovule/metabolism; Pollen Tube/genetics; Transcription Factors/genetics; Transcription Factors/metabolism.
9	Modeling temporal and hormonal regulation of plant transcriptional response to wounding. THE PLANT CELL 2022;34:867-888. [PMID: 34865154 PMCID: PMC8824630 DOI: 10.1093/plcell/koab287] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/24/2020] [Accepted: 11/18/2021] [Indexed: 06/02/2023] Abstract: Plants respond to wounding stress by changing gene expression patterns and inducing the production of hormones including jasmonic acid. This wounding transcriptional response activates specialized metabolism pathways such as the glucosinolate pathways in Arabidopsis thaliana. While the regulatory factors and sequences controlling a subset of wound-response genes are known, it remains unclear how wound response is regulated globally. Here, we investigated how these responses are regulated by incorporating putative cis-regulatory elements, known transcription factor binding sites, in vitro DNA affinity purification sequencing, and DNase I hypersensitive sites to predict genes with different wound-response patterns using machine learning. We observed that regulatory sites and regions of open chromatin differed between genes upregulated at early and late wounding time-points as well as between genes induced by jasmonic acid and those not induced. Expanding on what we currently know, we identified cis-elements that improved model predictions of expression clusters over known binding sites. Using a combination of genome editing, in vitro DNA-binding assays, and transient expression assays using native and mutated cis-regulatory elements, we experimentally validated four of the predicted elements, three of which were not previously known to function in wound-response regulation.
Our study provides a global model predictive of wound response and identifies new regulatory sequences important for wounding without requiring prior knowledge of the transcriptional regulators. MESH Headings: Arabidopsis/drug effects; Arabidopsis/genetics; Arabidopsis/physiology; Cyclopentanes/pharmacology; Gene Expression Regulation, Plant; Metabolic Networks and Pathways; Models, Biological; Oxylipins/pharmacology; Plant Growth Regulators/pharmacology; Plant Growth Regulators/physiology; Plants, Genetically Modified; Regulatory Sequences, Nucleic Acid; Reproducibility of Results; Transcription Factors/genetics. Grants: National Science Foundation US Department of Energy Great Lakes Bioenergy Research Center.
10	Optimising the use of gene expression data to predict plant metabolic pathway memberships. THE NEW PHYTOLOGIST 2021;231:475-489. [PMID: 33749860 DOI: 10.1111/nph.17355] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/01/2021] [Accepted: 03/13/2021] [Indexed: 06/12/2023] Abstract: Plant metabolites from diverse pathways are important for plant survival, human nutrition and medicine. The pathway memberships of most plant enzyme genes are unknown. While co-expression is useful for assigning genes to pathways, expression correlation may exist only under specific spatiotemporal and conditional contexts. Utilising > 600 tomato (Solanum lycopersicum) expression data combinations, three strategies for predicting memberships in 85 pathways were explored. Optimal predictions for different pathways require distinct data combinations indicative of pathway functions. Naive prediction (i.e. identifying pathways with the most similarly expressed genes) is error prone. In 52 pathways, unsupervised learning performed better than supervised approaches, possibly due to limited training data availability. Using gene-to-pathway expression similarities led to prediction models that outperformed those based simply on expression levels. Using 36 experimental validated genes, the pathway-best model prediction accuracy is 58.3%, significantly better compared with that for predicting annotated genes without experimental evidence (37.0%) or random guess (1.2%), demonstrating the importance of data quality. Our study highlights the need to extensively explore expression-based features and prediction strategies to maximise the accuracy of metabolic pathway membership assignment. The prediction framework outlined here can be applied to other species and serves as a baseline model for future comparisons.
Key Words: gene expression; machine learning; metabolic pathway prediction; tomato. MESH Headings: Gene Expression; Genes, Plant; Solanum lycopersicum/genetics; Metabolic Networks and Pathways/genetics.
11	Predictive Models of Genetic Redundancy in Arabidopsis thaliana. Mol Biol Evol 2021;38:3397-3414. [PMID: 33871641 PMCID: PMC8321531 DOI: 10.1093/molbev/msab111] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 12/21/2022] Abstract: Genetic redundancy refers to a situation where an individual with a loss-of-function mutation in one gene (single mutant) does not show an apparent phenotype until one or more paralogs are also knocked out (double/higher-order mutant). Previous studies have identified some characteristics common among redundant gene pairs, but a predictive model of genetic redundancy incorporating a wide variety of features derived from accumulating omics and mutant phenotype data is yet to be established. In addition, the relative importance of these features for genetic redundancy remains largely unclear. Here, we establish machine learning models for predicting whether a gene pair is likely redundant or not in the model plant Arabidopsis thaliana based on six feature categories: functional annotations, evolutionary conservation including duplication patterns and mechanisms, epigenetic marks, protein properties including posttranslational modifications, gene expression, and gene network properties. The definition of redundancy, data transformations, feature subsets, and machine learning algorithms used significantly affected model performance based on holdout, testing phenotype data. Among the most important features in predicting gene pairs as redundant were having a paralog(s) from recent duplication events, annotation as a transcription factor, downregulation during stress conditions, and having similar expression patterns under stress conditions. We also explored the potential reasons underlying mispredictions and limitations of our studies.
This genetic redundancy model sheds light on characteristics that may contribute to long-term maintenance of paralogs, and will ultimately allow for more targeted generation of functionally informative double mutants, advancing functional genomic studies. Key Words: genetic redundancy; machine learning; molecular evolution.
12	Contrasting transcriptional responses to Fusarium virguliforme colonization in symptomatic and asymptomatic hosts. THE PLANT CELL 2021;33:224-247. [PMID: 33681966 PMCID: PMC8136916 DOI: 10.1093/plcell/koaa021] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/01/2020] [Accepted: 11/06/2020] [Indexed: 06/12/2023] Abstract: The broad host range of Fusarium virguliforme represents a unique comparative system to identify and define differentially induced responses between an asymptomatic monocot host, maize (Zea mays), and a symptomatic eudicot host, soybean (Glycine max). Using a temporal, comparative transcriptome-based approach, we observed that early gene expression profiles of root tissue from infected maize suggest that pathogen tolerance coincides with the rapid induction of senescence dampening transcriptional regulators, including ANACs (Arabidopsis thaliana NAM/ATAF/CUC protein) and Ethylene-Responsive Factors. In contrast, the expression of senescence-associated processes in soybean was coincident with the appearance of disease symptom development, suggesting pathogen-induced senescence as a key pathway driving pathogen susceptibility in soybean. Based on the analyses described herein, we posit that root senescence is a primary contributing factor underlying colonization and disease progression in symptomatic versus asymptomatic host-fungal interactions. This process also supports the lifestyle and virulence of F. virguliforme during biotrophy to necrotrophy transitions. Further support for this hypothesis lies in comprehensive co-expression and comparative transcriptome analyses, and in total, supports the emerging concept of necrotrophy-activated senescence.
We propose that F. virguliforme conditions an environment within symptomatic hosts, which favors susceptibility through transcriptomic reprogramming, and as described herein, the induction of pathways associated with senescence during the necrotrophic stage of fungal development. MESH Headings: Colony Count, Microbial; Fusarium/growth & development; Fusarium/physiology; Gene Expression Regulation, Plant; Host-Pathogen Interactions/genetics; Plant Diseases/genetics; Plant Diseases/microbiology; Glycine max/genetics; Glycine max/microbiology; Time Factors; Transcription Factors/metabolism; Transcription, Genetic; Transcriptome/genetics; Zea mays/genetics; Zea mays/microbiology. Grants: R01 GM125743 NIGMS NIH HHS MSU Plant Resilience Institute and the National Institutes of General Medical Sciences Research in the laboratory of S.-H.S. was supported by the National Science Foundation Department of Energy Great Lakes Bioenergy Research Center.
13	Impact of short-read sequencing on the misassembly of a plant genome. BMC Genomics 2021;22:99. [PMID: 33530937 PMCID: PMC7852129 DOI: 10.1186/s12864-021-07397-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/27/2020] [Accepted: 01/19/2021] [Indexed: 12/16/2022] Abstract: Background: Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively. Results: To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements.
Conclusions: Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads and the generality of these causes and factors should be tested further in other species. Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-021-07397-5. Key Words: Genome misassembly; Machine learning; Read coverage; Solanum lycopersicum.
14	Overcoming the Challenges to Enhancing Experimental Plant Biology With Computational Modeling. FRONTIERS IN PLANT SCIENCE 2021;12:687652. [PMID: 34354723 PMCID: PMC8329482 DOI: 10.3389/fpls.2021.687652] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/29/2021] [Accepted: 06/01/2021] [Indexed: 05/10/2023] Abstract: The study of complex biological systems necessitates computational modeling approaches that are currently underutilized in plant biology. Many plant biologists have trouble identifying or adopting modeling methods to their research, particularly mechanistic mathematical modeling. Here we address challenges that limit the use of computational modeling methods, particularly mechanistic mathematical modeling. We divide computational modeling techniques into either pattern models (e.g., bioinformatics, machine learning, or morphology) or mechanistic mathematical models (e.g., biochemical reactions, biophysics, or population models), which both contribute to plant biology research at different scales to answer different research questions. We present arguments and recommendations for the increased adoption of modeling by plant biologists interested in incorporating more modeling into their research programs. As some researchers find math and quantitative methods to be an obstacle to modeling, we provide suggestions for easy-to-use tools for non-specialists and for collaboration with specialists. This may especially be the case for mechanistic mathematical modeling, and we spend some extra time discussing this. Through a more thorough appreciation and awareness of the power of different kinds of modeling in plant biology, we hope to facilitate interdisciplinary, transformative research.
Key Words: bioinformatics; collaboration; computational modeling; experimental design; mathematical modeling.
15	Within- and cross-species predictions of plant specialized metabolism genes using transfer learning. IN SILICO PLANTS 2020;2:diaa005. [PMID: 33344884 PMCID: PMC7731531 DOI: 10.1093/insilicoplants/diaa005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 03/13/2020] [Accepted: 07/21/2020] [Indexed: 06/12/2023] Abstract Plant specialized metabolites mediate interactions between plants and the environment and have significant agronomical/pharmaceutical value. Most genes involved in specialized metabolism (SM) are unknown because of the large number of metabolites and the challenge in differentiating SM genes from general metabolism (GM) genes. Plant models like Arabidopsis thaliana have extensive, experimentally derived annotations, whereas many non-model species do not. Here we employed a machine learning strategy, transfer learning, where knowledge from A. thaliana is transferred to predict gene functions in cultivated tomato with fewer experimentally annotated genes. The first tomato SM/GM prediction model using only tomato data performs well (F-measure = 0.74, compared with 0.5 for random and 1.0 for perfect predictions), but from manually curating 88 SM/GM genes, we found many mis-predicted entries were likely mis-annotated. When the SM/GM prediction models built with A. thaliana data were used to filter out genes where the A. thaliana-based model predictions disagreed with tomato annotations, the new tomato model trained with filtered data improved significantly (F-measure = 0.92). Our study demonstrates that SM/GM genes can be better predicted by leveraging cross-species information. Additionally, our findings provide an example for transfer learning in genomics where knowledge can be transferred from an information-rich species to an information-poor one. 
Key Words: Cross-species gene prediction specialized metabolism transfer learning
Grants: T32 GM110523 NIGMS NIH HHS National Science Foundation National Institute of General Medical Sciences National Institutes of Health U.S. Department of Energy Great Lakes Bioenergy Research Center Michigan AgBioResearch U.S. Department of Agriculture National Institute of Food and Agriculture
16	The cis-regulatory codes of response to combined heat and drought stress in Arabidopsis thaliana. NAR Genom Bioinform 2020;2:lqaa049. [PMID: 33575601 PMCID: PMC7671360 DOI: 10.1093/nargab/lqaa049] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2020] [Revised: 05/22/2020] [Accepted: 07/06/2020] [Indexed: 11/24/2022] Open Abstract Plants respond to their environment by dynamically modulating gene expression. A powerful approach for understanding how these responses are regulated is to integrate information about cis-regulatory elements (CREs) into models called cis-regulatory codes. Transcriptional response to combined stress is typically not the sum of the responses to the individual stresses. However, cis-regulatory codes underlying combined stress response have not been established. Here we modeled transcriptional response to single and combined heat and drought stress in Arabidopsis thaliana. We grouped genes by their pattern of response (independent, antagonistic and synergistic) and trained machine learning models to predict their response using putative CREs (pCREs) as features (median F-measure = 0.64). We then developed a deep learning approach to integrate additional omics information (sequence conservation, chromatin accessibility and histone modification) into our models, improving performance by 6.2%. While pCREs important for predicting independent and antagonistic responses tended to resemble binding motifs of transcription factors associated with heat and/or drought stress, important synergistic pCREs resembled binding motifs of transcription factors not known to be associated with stress. 
These findings demonstrate how in silico approaches can improve our understanding of the complex codes regulating response to combined stress and help us identify prime targets for future characterization.
17	Evolution of a plant gene cluster in Solanaceae and emergence of metabolic diversity. eLife 2020;9:e56717. [PMID: 32613943 PMCID: PMC7386920 DOI: 10.7554/elife.56717] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2020] [Accepted: 07/01/2020] [Indexed: 12/15/2022] Open Abstract Plants produce phylogenetically and spatially restricted, as well as structurally diverse specialized metabolites via multistep metabolic pathways. Hallmarks of specialized metabolic evolution include enzymatic promiscuity and recruitment of primary metabolic enzymes and examples of genomic clustering of pathway genes. Solanaceae glandular trichomes produce defensive acylsugars, with sidechains that vary in length across the family. We describe a tomato gene cluster on chromosome 7 involved in medium chain acylsugar accumulation due to trichome specific acyl-CoA synthetase and enoyl-CoA hydratase genes. This cluster co-localizes with a tomato steroidal alkaloid gene cluster and is syntenic to a chromosome 12 region containing another acylsugar pathway gene. We reconstructed the evolutionary events leading to this gene cluster and found that its phylogenetic distribution correlates with medium chain acylsugar accumulation across the Solanaceae. This work reveals insights into the dynamics behind gene cluster evolution and cell-type specific metabolite diversity. 
Key Words: Solanum lycopersicum Solanum pennellii Solanum quitoense biochemistry chemical biology plant biology
MESH Headings: Conserved Sequence/genetics Evolution, Molecular Genes, Plant/genetics Genetic Variation/genetics Metabolic Networks and Pathways/genetics Multigene Family/genetics Solanaceae/genetics Solanaceae/metabolism Solanum/genetics Solanum/metabolism Trichomes/metabolism
Grants: 1757043 National Science Foundation T32 GM110523 NIGMS NIH HHS BER DE-SC0018409 U.S. Department of Energy 1811055 National Science Foundation 1655386 National Science Foundation 1546617 National Science Foundation 1727362 National Science Foundation National Institutes of Health
18	Opening the Black Box: Interpretable Machine Learning for Geneticists. Trends Genet 2020;36:442-455. [PMID: 32396837 DOI: 10.1016/j.tig.2020.03.005] [Citation(s) in RCA: 104] [Impact Index Per Article: 26.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2020] [Revised: 03/12/2020] [Accepted: 03/16/2020] [Indexed: 01/16/2023] Abstract Because of its ability to find complex patterns in high dimensional and heterogeneous data, machine learning (ML) has emerged as a critical tool for making sense of the growing amount of genetic and genomic data available. While the complexity of ML models is what makes them powerful, it also makes them difficult to interpret. Fortunately, efforts to develop approaches that make the inner workings of ML models understandable to humans have improved our ability to make novel biological insights. Here, we discuss the importance of interpretable ML, different strategies for interpreting ML models, and examples of how these strategies have been applied. Finally, we identify challenges and promising future directions for interpretable ML in genetics and genomics.
Key Words: deep learning interpretable machine learning predictive biology
19	Putative cis-Regulatory Elements Predict Iron Deficiency Responses in Arabidopsis Roots. PLANT PHYSIOLOGY 2020;182:1420-1439. [PMID: 31937681 PMCID: PMC7054882 DOI: 10.1104/pp.19.00760] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/19/2019] [Accepted: 12/22/2019] [Indexed: 05/03/2023] Abstract Plant iron deficiency (-Fe) activates a complex regulatory network that coordinates root Fe uptake and distribution to sink tissues. In Arabidopsis (Arabidopsis thaliana), FER-LIKE FE DEFICIENCY-INDUCED TRANSCRIPTION FACTOR (FIT), a basic helix-loop-helix (bHLH) transcription factor (TF), regulates root Fe acquisition genes. Many other -Fe-induced genes are FIT independent, and instead regulated by other bHLH TFs and by yet unknown TFs. The cis-regulatory code, that is, the cis-regulatory elements (CREs) and their combinations that regulate plant -Fe-responses, remains largely elusive. Using Arabidopsis root transcriptome data and coexpression clustering, we identified over 100 putative CREs (pCREs) that predicted -Fe-induced gene expression in computational models. To assess pCRE properties and possible functions, we used large-scale in vitro TF binding data, positional bias, and evolutionary conservation. As one example, our approach uncovered pCREs resembling IDE1 (iron deficiency-responsive element 1), a known grass -Fe response CRE. Arabidopsis IDE1-likes were associated with FIT-dependent gene expression, more specifically with biosynthesis of Fe-chelating compounds. Thus, IDE1 seems to be conserved in grass and nongrass species. Our pCREs matched among others in vitro binding sites of B3, NAC, bZIP, and TCP TFs, which might be regulators of -Fe responses. 
Altogether, our findings provide a comprehensive source of cis-regulatory information for -Fe-responsive genes that advance our mechanistic understanding and inform future efforts in engineering plants with more efficient Fe uptake or transport systems.
MESH Headings: Arabidopsis/genetics Arabidopsis/metabolism Arabidopsis Proteins/genetics Arabidopsis Proteins/metabolism Gene Expression Regulation, Plant Plant Roots/genetics Plant Roots/metabolism Regulatory Sequences, Nucleic Acid/genetics
Grants: Deutsche Forschungsgemeinschaft National Science Foundation NSF | BIO (Biological Sciences) | Division of Integrative Organismal Systems NSF | BIO (Biological Sciences) | Division of Environmental Biology NSF | EHR (Education and Human Resources) | Division of Graduate Education U.S. Department of Energy
20	Improved recovery of cell-cycle gene expression in Saccharomyces cerevisiae from regulatory interactions in multiple omics data. BMC Genomics 2020;21:159. [PMID: 32054475 PMCID: PMC7020519 DOI: 10.1186/s12864-020-6554-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2019] [Accepted: 02/04/2020] [Indexed: 12/11/2022] Open Abstract BACKGROUND Gene expression is regulated by DNA-binding transcription factors (TFs). Together with their target genes, these factors and their interactions collectively form a gene regulatory network (GRN), which is responsible for producing patterns of transcription, including cyclical processes such as genome replication and cell division. However, identifying how this network regulates the timing of these patterns, including important interactions and regulatory motifs, remains a challenging task. RESULTS We employed four in vivo and in vitro regulatory data sets to investigate the regulatory basis of expression timing and phase-specific patterns of cell-cycle expression in Saccharomyces cerevisiae. Specifically, we considered interactions based on direct binding between TF and target gene, indirect effects of TF deletion on gene expression, and computational inference. We found that the source of regulatory information significantly impacts the accuracy and completeness of recovering known cell-cycle expressed genes. The best approach involved combining TF-target and TF-TF interaction features from multiple datasets in a single model. In addition, TFs important to multiple phases of cell-cycle expression also have the greatest impact on individual phases. Important TFs regulating a cell-cycle phase also tend to form modules in the GRN, including two sub-modules composed entirely of unannotated cell-cycle regulators (STE12-TEC1 and RAP1-HAP1-MSN4). 
CONCLUSION Our findings illustrate the importance of integrating both multiple omics data and regulatory motifs in order to understand the significance of regulatory interactions involved in timing gene expression. This integrated approach allowed us to recover both known cell-cycle interactions and the overall pattern of phase-specific expression across the cell cycle better than any single data set. Likewise, by looking at regulatory motifs in the form of TF-TF interactions, we identified sets of TFs whose co-regulation of target genes was important for cell-cycle expression, even when regulation by individual TFs was not. Overall, this demonstrates the power of integrating multiple data sets and models of interaction in order to understand the regulatory basis of established biological processes and their associated gene regulatory networks.
Key Words: Computational biology Gene expression Gene regulation Machine learning Modeling
21	Transcriptome-Based Prediction of Complex Traits in Maize. THE PLANT CELL 2020;32:139-151. [PMID: 31641024 PMCID: PMC6961623 DOI: 10.1105/tpc.19.00332] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/06/2019] [Revised: 09/24/2019] [Accepted: 10/21/2019] [Indexed: 05/11/2023] Abstract The ability to predict traits from genome-wide sequence information (i.e., genomic prediction) has improved our understanding of the genetic basis of complex traits and transformed breeding practices. Transcriptome data may also be useful for genomic prediction. However, it remains unclear how well transcript levels can predict traits, particularly when traits are scored at different development stages. Using maize (Zea mays) genetic markers and transcript levels from seedlings to predict mature plant traits, we found that transcript and genetic marker models have similar performance. When the transcripts and genetic markers with the greatest weights (i.e., the most important) in those models were used in one joint model, performance increased. Furthermore, genetic markers important for predictions were not close to or identified as regulatory variants for important transcripts. These findings demonstrate that transcript levels are useful for predicting traits and that their predictive power is not simply due to genetic variation in the transcribed genomic regions. Finally, genetic marker models identified only 1 of 14 benchmark flowering-time genes, while transcript models identified 5. These data highlight that, in addition to being useful for genomic prediction, transcriptome data can provide a link between traits and variation that cannot be readily captured at the sequence level. 
MESH Headings: Genetic Markers Genetic Variation Genome, Plant/genetics Genome-Wide Association Study Genomics Models, Genetic Multifactorial Inheritance Phenotype Transcriptome Zea mays/genetics
Grants: T32 GM110523 NIGMS NIH HHS National Science Foundation Graduate Research Fellowship Program Graduate Research Opportunities Abroad U.S. Department of Energy Great Lakes Bioenergy Research Center National Science Foundation
22	Cis-Regulatory Code for Predicting Plant Cell-Type Transcriptional Response to High Salinity. PLANT PHYSIOLOGY 2019;181:1739-1751. [PMID: 31551359 PMCID: PMC6878017 DOI: 10.1104/pp.19.00653] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/31/2019] [Accepted: 09/11/2019] [Indexed: 05/08/2023] Abstract Multicellular organisms have diverse cell types with distinct roles in development and responses to the environment. At the transcriptional level, the differences in the environmental response between cell types are due to differences in regulatory programs. In plants, although cell-type environmental responses have been examined, it is unclear how these responses are regulated. Here, we identify a set of putative cis-regulatory elements (pCREs) enriched in the promoters of genes responsive to high-salinity stress in six Arabidopsis (Arabidopsis thaliana) root cell types. We then use these pCREs to establish cis-regulatory codes (i.e. models predicting whether a gene is responsive to high salinity for each cell type with machine learning). These pCRE-based models outperform models using in vitro binding data of 758 Arabidopsis transcription factors. Surprisingly, organ pCREs identified based on the whole-root high-salinity response can predict cell-type responses as well as pCREs derived from cell-type data, because organ and cell-type pCREs predict complementary subsets of high-salinity response genes. Our findings not only advance our understanding of the regulatory mechanisms of the plant spatial transcriptional response through cis-regulatory codes but also suggest broad applicability of the approach to any species, particularly those with little or no trans-regulatory data. 
MESH Headings: Base Sequence Gene Expression Regulation, Plant Machine Learning Organ Specificity/genetics Plant Cells/metabolism Plant Roots/genetics Protein Binding Regulatory Sequences, Nucleic Acid/genetics Salinity Transcription Factors/metabolism Transcription, Genetic Up-Regulation/genetics
Grants: U.S. National Science Foundation U.S. Department of Energy Great Lakes Bioenergy Research Center National Science Foundation Graduate Research Fellowship
23	Benchmarking Parametric and Machine Learning Models for Genomic Prediction of Complex Traits. G3 (BETHESDA, MD.) 2019;9:3691-3702. [PMID: 31533955 PMCID: PMC6829122 DOI: 10.1534/g3.119.400498] [Citation(s) in RCA: 65] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/03/2019] [Accepted: 09/09/2019] [Indexed: 12/21/2022] Abstract The usefulness of genomic prediction in crop and livestock breeding programs has prompted efforts to develop new and improved genomic prediction algorithms, such as artificial neural networks and gradient tree boosting. However, the performance of these algorithms has not been compared in a systematic manner using a wide range of datasets and models. Using data of 18 traits across six plant species with different marker densities and training population sizes, we compared the performance of six linear and six non-linear algorithms. First, we found that hyperparameter selection was necessary for all non-linear algorithms and that feature selection prior to model training was critical for artificial neural networks when the markers greatly outnumbered the number of training lines. Across all species and trait combinations, no one algorithm performed best; however, predictions based on a combination of results from multiple algorithms (i.e., ensemble predictions) performed consistently well. While linear and non-linear algorithms performed best for a similar number of traits, the performance of non-linear algorithms varied more between traits. Although artificial neural networks did not perform best for any trait, we identified strategies (i.e., feature selection, seeded starting weights) that boosted their performance to near the level of other algorithms. Our results highlight the importance of algorithm selection for the prediction of trait values. 
Key Words: GenPred Genomic Prediction Genomic selection Shared Data Resources artificial neural network genotype-to-phenotype
MESH Headings: Benchmarking Genomics/methods Genotype Machine Learning Neural Networks, Computer Phenotype Plants/genetics
Grants: R01 GM099992 NIGMS NIH HHS
24	A Model-Based Approach for Identifying Functional Intergenic Transcribed Regions and Noncoding RNAs. Mol Biol Evol 2019;35:1422-1436. [PMID: 29554332 DOI: 10.1093/molbev/msy035] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022] Open Abstract With advances in transcript profiling, the presence of transcriptional activities in intergenic regions has been well established. However, whether intergenic expression reflects transcriptional noise or activity of novel genes remains unclear. We identified intergenic transcribed regions (ITRs) in 15 diverse flowering plant species and found that the amount of intergenic expression correlates with genome size, a pattern that could be expected if intergenic expression is largely nonfunctional. To further assess the functionality of ITRs, we first built machine learning models using Arabidopsis thaliana as a model that accurately distinguish functional sequences (benchmark protein-coding and RNA genes) and likely nonfunctional ones (pseudogenes and unexpressed intergenic regions) by integrating 93 biochemical, evolutionary, and sequence-structure features. Next, by applying the models genome-wide, we found that 4,427 ITRs (38%) and 796 annotated ncRNAs (44%) had features significantly similar to benchmark protein-coding or RNA genes and thus were likely parts of functional genes. Approximately 60% of ITRs and ncRNAs were more similar to nonfunctional sequences and were likely transcriptional noise. The predictive framework established here provides not only a comprehensive look at how functional, genic sequences are distinct from likely nonfunctional ones, but also a new way to differentiate novel genes from genomic regions with noisy transcriptional activities.
25	Expression and regulatory asymmetry of retained Arabidopsis thaliana transcription factor genes derived from whole genome duplication. BMC Evol Biol 2019;19:77. [PMID: 30866803 PMCID: PMC6416927 DOI: 10.1186/s12862-019-1398-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2018] [Accepted: 02/22/2019] [Indexed: 12/19/2022] Open Abstract Background Transcription factors (TFs) play a key role in regulating plant development and response to environmental stimuli. While most genes revert to single copy after whole genome duplication (WGD) events, transcription factors are retained at a significantly higher rate. Little is known about how TF duplicates have diverged in their expression and regulation, the answer to which may contribute to a better understanding of the elevated retention rate among TFs. Results Here we assessed what features may explain differences in the retention of TF duplicates and other genes using Arabidopsis thaliana as a model. We integrated 34 expression, sequence, and conservation features to build a linear model for predicting the extent of duplicate retention following WGD events among TFs and 19 groups of genes with other functions. We found that TFs were the least well predicted, demonstrating that the features of TFs deviate substantially from those of duplicate genes in other function groups. Consistent with this, the evolution of TF expression patterns and cis-regulatory sites favors the partitioning of ancestral states among the resulting duplicates: one “ancestral” TF duplicate retains most ancestral expression and cis-regulatory sites, while the “non-ancestral” duplicate is enriched for novel regulatory sites. 
By modeling the retention of ancestral expression and cis-regulatory states in duplicate pairs using a system of differential equations, we found that TF duplicate pairs in a partitioned state are preferentially maintained. Conclusions These TF duplicates with asymmetrically partitioned ancestral states are likely maintained because one copy retains ancestral functions while the other, at least in some cases, acquires novel cis-regulatory sites that may be important for novel, adaptive traits.
Key Words: Duplicate retention Expression divergence cis-regulatory evolution
26	Factors Influencing Gene Family Size Variation Among Related Species in a Plant Family, Solanaceae. Genome Biol Evol 2018;10:2596-2613. [PMID: 30239695 PMCID: PMC6171734 DOI: 10.1093/gbe/evy193] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 08/29/2018] [Indexed: 12/13/2022] Open Abstract Gene duplication and loss contribute to gene content differences as well as phenotypic divergence across species. However, the extent to which gene content varies among closely related plant species and the factors responsible for such variation remain unclear. Here, using the Solanaceae family as a model and Pfam domain families as a proxy for gene families, we investigated variation in gene family sizes across species and the likely factors contributing to the variation. We found that genes in highly variable families have high turnover rates and tend to be involved in processes that have diverged between Solanaceae species, whereas genes in low-variability families tend to have housekeeping roles. In addition, genes in high- and low-variability gene families tend to be duplicated by tandem and whole genome duplication, respectively. This finding, together with the observation that genes duplicated by different mechanisms experience different selection pressures, suggests that duplication mechanism impacts gene family turnover. We explored using pseudogene number as a proxy for gene loss but discovered that a substantial number of pseudogenes are actually products of pseudogene duplication, contrary to the expectation that most plant pseudogenes are remnants of once-functional duplicates. Our findings reveal complex relationships between variation in gene family size, gene functions, duplication mechanism, and evolutionary rate. 
The patterns of lineage-specific gene family expansion within the Solanaceae provide the foundation for a better understanding of the genetic basis underlying phenotypic diversity in this economically important family.
27	Regulatory Divergence in Wound-Responsive Gene Expression between Domesticated and Wild Tomato. THE PLANT CELL 2018;30:1445-1460. [PMID: 29743197 PMCID: PMC6096591 DOI: 10.1105/tpc.18.00194] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/01/2018] [Revised: 04/20/2018] [Accepted: 05/07/2018] [Indexed: 05/20/2023] Abstract The evolution of transcriptional regulatory mechanisms is central to how stress response and tolerance differ between species. However, it remains largely unknown how divergence in cis-regulatory sites and, subsequently, transcription factor (TF) binding specificity contribute to stress-responsive expression divergence, particularly between wild and domesticated species. By profiling wound-responsive gene transcriptomes in wild Solanum pennellii and domesticated S. lycopersicum, we found extensive wound response divergence and identified 493 S. lycopersicum and 278 S. pennellii putative cis-regulatory elements (pCREs) that were predictive of wound-responsive gene expression. Only 24-52% of these wound response pCREs (depending on wound response patterns) were consistently enriched in the putative promoter regions of wound-responsive genes across species. In addition, between these two species, their differences in pCRE site sequences were significantly and positively correlated with differences in wound-responsive gene expression. Furthermore, ∼11-39% of pCREs were specific to only one of the species and likely bound by TFs from different families. These findings indicate substantial regulatory divergence in these two plant species that diverged ∼3-7 million years ago. 
Our study provides insights into the mechanistic basis of how the transcriptional response to wounding is regulated and, importantly, the contribution of cis-regulatory components to variation in wound-responsive gene expression between a wild and a domesticated plant species.
MESH Headings: Gene Expression Profiling Gene Expression Regulation, Plant/genetics Solanum lycopersicum/genetics Plant Proteins/genetics Plant Proteins/metabolism Transcription Factors/genetics Transcription Factors/metabolism
Grants: National Science Foundation U.S. Department of Energy
28	Differential Cross Section and Photon-Beam Asymmetry for the $\vec{\gamma}p \to \pi^{-}\Delta^{++}(1232)$ Reaction at Forward $\pi^{-}$ Angles for $E_{\gamma}=1.5$-$2.95$ GeV. PHYSICAL REVIEW LETTERS 2018;120:202004. [PMID: 29864366 DOI: 10.1103/physrevlett.120.202004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/24/2018] [Revised: 03/11/2018] [Indexed: 06/08/2023] Abstract Differential cross sections and photon-beam asymmetries for the $\vec{\gamma}p \to \pi^{-}\Delta^{++}(1232)$ reaction have been measured for $0.7<\cos\theta_{\pi}^{\mathrm{c.m.}}<1$ and $E_{\gamma}=1.5$-$2.95$ GeV at SPring-8/LEPS. The first-ever high statistics cross-section data are obtained in this kinematical region, and the asymmetry data for $1.5<E_{\gamma}\,(\mathrm{GeV})<2.8$ are obtained for the first time. This reaction has a unique feature for studying the production mechanisms of a pure $u\bar{u}$ quark pair in the final state from the proton. Although there is no distinct peak structure in the cross sections, a non-negligible excess over the theoretical predictions is observed at $E_{\gamma}=1.5$-$1.8$ GeV. The asymmetries are found to be negative in most of the present kinematical regions, suggesting the dominance of $\pi$ exchange in the $t$ channel. The negative asymmetries at forward meson production angles are different from the asymmetries previously measured for the photoproduction reactions producing a $d\bar{d}$ or an $s\bar{s}$ quark pair in the final state. Advanced theoretical models introducing nucleon resonances and additional unnatural-parity exchanges are needed to reproduce the present data.
29	Recovery from N Deprivation Is a Transcriptionally and Functionally Distinct State in Chlamydomonas. PLANT PHYSIOLOGY 2018;176:2007-2023. [PMID: 29288234 PMCID: PMC5841715 DOI: 10.1104/pp.17.01546] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/26/2017] [Accepted: 12/26/2017] [Indexed: 05/20/2023] Abstract Facing adverse conditions such as nitrogen (N) deprivation, microalgae enter cellular quiescence, a reversible cell cycle arrest with drastic changes in metabolism allowing cells to remain viable. Recovering from N deprivation and quiescence is an active and orderly process as we are showing here for Chlamydomonas reinhardtii We conducted comparative transcriptomics on this alga to discern processes relevant to quiescence in the context of N deprivation and recovery following refeeding. A mutant with slow recovery from N deprivation, compromised hydrolysis of triacylglycerols7 (cht7), was included to better define the regulatory processes governing the respective transitions. We identified an ordered set of biological processes with expression patterns that showed sequential reversal following N resupply and uncovered acclimation responses specific to the recovery phase. Biochemical assays and microscopy validated selected inferences made based on the transcriptional analyses. These comprise (1) the restoration of N source preference and cellular bioenergetics during the early stage of recovery; (2) flagellum-based motility in the mid to late stage of recovery; and (3) recovery phase-specific gene groups cooperating in the rapid replenishment of chloroplast proteins. In the cht7 mutant, a large number of programmed responses failed to readjust in a timely manner. Finally, evidence is provided for the involvement of the cAMP-protein kinase A pathway in gating the recovery. 
We conclude that the recovery from N deprivation represents not simply a reversal of processes directly following N deprivation, but a distinct cellular state.
MESH Headings: Acclimatization; Cell Cycle; Chlamydomonas/genetics; Chlamydomonas/metabolism; Chlamydomonas/ultrastructure; Cyclic AMP/metabolism; Cyclic AMP-Dependent Protein Kinases/metabolism; Galactolipids/metabolism; Gene Expression Profiling; Gene Expression Regulation; Lipid Metabolism/genetics; Metabolome/genetics; Mutation/genetics; Nitrogen/deficiency; Oxidation-Reduction; Sequence Analysis, RNA; Transcription, Genetic; Transcriptome/genetics
Grants: National Science Foundation; U.S. Department of Energy
30	Defining Functional Genic Regions in the Human Genome through Integration of Biochemical, Evolutionary, and Genetic Evidence. Mol Biol Evol 2017;34:1788-1798. [PMID: 28398576 DOI: 10.1093/molbev/msx101] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022] Open Abstract The human genome is dominated by large tracts of DNA with extensive biochemical activity but no known function. In particular, it is well established that transcriptional activities are not restricted to known genes. However, whether this intergenic transcription represents activity with functional significance or noise is under debate, highlighting the need for an effective method of defining functional genomic regions. Moreover, these discoveries raise the question whether genomic regions can be defined as functional based solely on the presence of biochemical activities, without considering evolutionary (conservation) and genetic (effects of mutations) evidence. Here, computational models integrating genetic, evolutionary, and biochemical evidence are established that provide reliable predictions of human protein-coding and RNA genes. Importantly, in addition to sequence conservation, biochemical features allow accurate predictions of genic sequences with phenotypic evidence under strong purifying selection, suggesting that they can be used as an alternative measure of selection. Moreover, 18.5% of annotated noncoding RNAs exhibit higher degrees of similarity to phenotype genes and, thus, are likely functional. However, 64.5% of noncoding RNAs appear to belong to a sequence class of their own, and the remaining 17% are more similar to pseudogenes and random intergenic sequences that may represent noisy transcription. 
Key Words: chromatin state; conservation; functional genomic region; random forest classification
31	Elevated auxin biosynthesis and transport underlie high vein density in C₄ leaves. Proc Natl Acad Sci U S A 2017;114:E6884-E6891. [PMID: 28761000 PMCID: PMC5565467 DOI: 10.1073/pnas.1709171114] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022] Open Abstract High vein density, a distinctive trait of C4 leaves, is central to both C3-to-C4 evolution and conversion of C3 to C4-like crops. We tested the hypothesis that high vein density in C4 leaves is due to elevated auxin biosynthesis and transport in developing leaves. Up-regulation of genes in auxin biosynthesis pathways and higher auxin content were found in developing C4 leaves compared with developing C3 leaves. The same observation held for maize foliar (C4) and husk (C3) leaf primordia. Moreover, auxin content and vein density were increased in loss-of-function mutants of Arabidopsis MYC2, a suppressor of auxin biosynthesis. Treatment with an auxin biosynthesis inhibitor or an auxin transport inhibitor led to much fewer veins in new leaves. Finally, both Arabidopsis thaliana auxin efflux transporter pin1 and influx transporter lax2 mutants showed reduced vein numbers. Thus, development of high leaf vein density requires elevated auxin biosynthesis and transport. 
Key Words: C4 plants; auxin biosynthesis; auxin transport; vein density
MESH Headings: Arabidopsis/genetics; Arabidopsis/growth & development; Arabidopsis/metabolism; Biological Transport/genetics; Biosynthetic Pathways/genetics; Gene Expression Regulation, Developmental; Gene Expression Regulation, Plant; Indoleacetic Acids/metabolism; Membrane Transport Proteins/genetics; Membrane Transport Proteins/metabolism; Mutation; Plant Development/genetics; Plant Leaves/genetics; Plant Leaves/growth & development; Plant Leaves/metabolism; Plant Proteins/genetics; Plant Proteins/metabolism; Plants/classification; Plants/genetics; Plants/metabolism; Species Specificity; Zea mays/genetics; Zea mays/growth & development; Zea mays/metabolism
32	A rare case of plastid protein-coding gene duplication in the chloroplast genome of Euglena archaeoplastidiata (Euglenophyta). JOURNAL OF PHYCOLOGY 2017;53:493-502. [PMID: 28295310 DOI: 10.1111/jpy.12531] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/19/2016] [Accepted: 02/15/2017] [Indexed: 06/06/2023] Abstract Gene duplication is an important evolutionary process that allows duplicate functions to diverge, or, in some cases, allows for new functional gains. However, in contrast to the nuclear genome, gene duplications within the chloroplast are extremely rare. Here, we present the chloroplast genome of the photosynthetic protist Euglena archaeoplastidiata. Upon annotation, it was found that the chloroplast genome contained a novel tandem direct duplication that encoded a portion of RuBisCO large subunit (rbcL) followed by a complete copy of ribosomal protein L32 (rpl32), as well as the associated intergenic sequences. Analyses of the duplicated rpl32 were inconclusive regarding selective pressures, although it was found that substitutions in the duplicated region, all non-synonymous, likely had a neutral functional effect. The duplicated region did not exhibit patterns consistent with previously described mechanisms for tandem direct duplications, and demonstrated an unknown mechanism of duplication. In addition, a comparison of this chloroplast genome to other previously characterized chloroplast genomes from the same family revealed characteristics that indicated E. archaeoplastidiata was probably more closely related to taxa in the genera Monomorphina, Cryptoglena, and Euglenaria than it was to other Euglena taxa. Taken together, the chloroplast genome of E. 
archaeoplastidiata demonstrated multiple characteristics unique to the euglenoid world, and has justified the longstanding curiosity regarding this enigmatic taxon.
Key Words: Euglena archaeoplastidiata; Euglenophyta; chloroplast; gene duplication; genomics; rpl32
MESH Headings: Amino Acid Sequence; Base Sequence; Euglena/classification; Euglena/genetics; Gene Duplication; Genome, Chloroplast; Phylogeny; Plastids/chemistry; Plastids/genetics; Ribulose-Bisphosphate Carboxylase/chemistry; Ribulose-Bisphosphate Carboxylase/genetics
33	Correction: Genome, Functional Gene Annotation, and Nuclear Transformation of the Heterokont Oleaginous Alga Nannochloropsis oceanica CCMP1779. PLoS Genet 2017;13:e1006802. [PMID: 28542203 PMCID: PMC5441573 DOI: 10.1371/journal.pgen.1006802] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022] Open Abstract [This corrects the article DOI: 10.1371/journal.pgen.1003064.].
34	Predictive Models of Spatial Transcriptional Response to High Salinity. PLANT PHYSIOLOGY 2017;174:450-464. [PMID: 28373393 PMCID: PMC5411138 DOI: 10.1104/pp.16.01828] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/02/2016] [Accepted: 03/27/2017] [Indexed: 05/12/2023] Abstract Plants are exposed to a variety of environmental conditions, and their ability to respond to environmental variation depends on the proper regulation of gene expression in an organ-, tissue-, and cell type-specific manner. Although our knowledge of how stress responses are regulated is accumulating, a genome-wide model of how plant transcription factors (TFs) and cis-regulatory elements control spatially specific stress response has yet to emerge. Using Arabidopsis (Arabidopsis thaliana) as a model, we identified a set of 1,894 putative cis-regulatory elements (pCREs) that are associated with high-salinity (salt) up-regulated genes in the root or the shoot. We used these pCREs to develop computational models that can better predict salt up-regulated genes in the root and shoot compared with models based on known TF binding motifs. In addition, we incorporated TF binding sites identified via large-scale in vitro assays, chromatin accessibility, evolutionary conservation, and pCRE combinatorial relationships in machine learning models and found that only consideration of pCRE combinations led to better performance in salt up-regulation prediction in the root and shoot. Our results suggest that the plant organ transcriptional response to high salinity is regulated by a core set of pCREs and provide a genome-wide view of the cis-regulatory code of plant spatial transcriptional responses to environmental stress. 
MESH Headings: Arabidopsis/genetics; Arabidopsis/metabolism; Arabidopsis Proteins/metabolism; Base Sequence; Binding Sites/genetics; Computer Simulation; Gene Expression Regulation, Plant; Gene Regulatory Networks; Genome, Plant/genetics; Models, Genetic; Plant Roots/genetics; Plant Roots/metabolism; Plant Shoots/genetics; Plant Shoots/metabolism; Protein Binding; Regulatory Elements, Transcriptional/genetics; Salinity; Stress, Physiological; Transcription Factors/metabolism
35	Utility and Limitations of Using Gene Expression Data to Identify Functional Associations. PLoS Comput Biol 2016;12:e1005244. [PMID: 27935950 PMCID: PMC5147789 DOI: 10.1371/journal.pcbi.1005244] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2016] [Accepted: 11/13/2016] [Indexed: 01/25/2023] Open Abstract Gene co-expression has been widely used to hypothesize gene function through guilt-by-association. However, it is not clear to what degree co-expression is informative, whether it can be applied to genes involved in different biological processes, and how the type of dataset impacts inferences about gene functions. Here our goal is to assess the utility and limitations of using co-expression as a criterion to recover functional associations between genes. By determining the percentage of gene pairs in a metabolic pathway with significant expression correlation, we found that many genes in the same pathway do not have similar transcript profiles and the choice of dataset, annotation quality, gene function, expression similarity measure, and clustering approach significantly impacts the ability to recover functional associations between genes using Arabidopsis thaliana as an example. Some datasets are more informative in capturing coordinated expression profiles and larger data sets are not always better. In addition, to recover the maximum number of known pathways and identify candidate genes with similar functions, it is important to explore rather exhaustively multiple dataset combinations, similarity measures, clustering algorithms and parameters. 
Finally, we validated the biological relevance of co-expression cluster memberships with an independent phenomics dataset and found that genes that consistently cluster with leucine degradation genes tend to have similar leucine levels in mutants. This study provides a framework for obtaining gene functional associations by maximizing the information that can be obtained from gene expression datasets. There remain genes with no known function even in the most well studied, model species. One common way to hypothesize gene function is based on the assumption that genes with similar expression profiles tend to have similar functions. However, using datasets and biological pathway information from the model plant Arabidopsis thaliana as an example, we discovered that, although genes in the same pathways are functionally related, genes in only a subset of the pathways have highly similar expression patterns. In addition, our ability to hypothesize gene functions based on expression is significantly impacted by how the dataset is processed and combined as well as the methodology used to identify genes with similar expression. Therefore, multiple datasets and methods should be tested to maximize the functional information that we can get based on similarity in gene expression.
MESH Headings: Arabidopsis Proteins/classification; Arabidopsis Proteins/genetics; Arabidopsis Proteins/metabolism; Computational Biology/methods; Databases, Genetic; Gene Expression Profiling/methods; Molecular Sequence Annotation; Oligonucleotide Array Sequence Analysis/methods; Proteins/classification; Proteins/genetics; Proteins/metabolism; Stress, Physiological/genetics
Grants: National Science Foundation
36	Diversity, expansion, and evolutionary novelty of plant DNA-binding transcription factor families. BIOCHIMICA ET BIOPHYSICA ACTA-GENE REGULATORY MECHANISMS 2016;1860:3-20. [PMID: 27522016 DOI: 10.1016/j.bbagrm.2016.08.005] [Citation(s) in RCA: 58] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/15/2016] [Revised: 07/21/2016] [Accepted: 08/06/2016] [Indexed: 12/19/2022] Abstract Plant transcription factors (TFs) that interact with specific sequences via DNA-binding domains are crucial for regulating transcriptional initiation and are fundamental to plant development and environmental response. In addition, expansion of TF families has allowed functional divergence of duplicate copies, which has contributed to novel, and in some cases adaptive, traits in plants. Thus, TFs are central to the generation of the diverse plant species that we see today. Major plant agronomic traits, including those relevant to domestication, have also frequently arisen through changes in TF coding sequence or expression patterns. Here our goal is to provide an overview of plant TF evolution by first comparing the diversity of DNA-binding domains and the sizes of these domain families in plants and other eukaryotes. Because TFs are among the most highly expanded gene families in plants, the birth and death process of TFs as well as the mechanisms contributing to their retention are discussed. We also provide recent examples of how TFs have contributed to novel traits that are important in plant evolution and in agriculture. This article is part of a Special Issue entitled: Plant Gene Regulatory Mechanisms and Networks, edited by Dr. Erich Grotewold and Dr. Nathan Springer. 
Key Words: Evolutionary conservation; Family expansion; Plant novelties; Retention mechanisms; Transcription factor domain families
MESH Headings: DNA, Plant/genetics; DNA-Binding Proteins/genetics; DNA-Binding Proteins/metabolism; Evolution, Molecular; Gene Expression Regulation, Plant/genetics; Plant Proteins/genetics; Plant Proteins/metabolism; Plants/genetics; Plants/metabolism; Transcription Factors/genetics; Transcription Factors/metabolism; Transcription, Genetic/genetics
37	Evolution of Gene Duplication in Plants. PLANT PHYSIOLOGY 2016;171:2294-316. [PMID: 27288366 PMCID: PMC4972278 DOI: 10.1104/pp.16.00523] [Citation(s) in RCA: 737] [Impact Index Per Article: 92.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/02/2016] [Accepted: 05/17/2016] [Indexed: 05/18/2023] Abstract Ancient duplication events and a high rate of retention of extant pairs of duplicate genes have contributed to an abundance of duplicate genes in plant genomes. These duplicates have contributed to the evolution of novel functions, such as the production of floral structures, induction of disease resistance, and adaptation to stress. Additionally, recent whole-genome duplications that have occurred in the lineages of several domesticated crop species, including wheat (Triticum aestivum), cotton (Gossypium hirsutum), and soybean (Glycine max), have contributed to important agronomic traits, such as grain quality, fruit shape, and flowering time. Therefore, understanding the mechanisms and impacts of gene duplication will be important to future studies of plants in general and of agronomically important crops in particular. In this review, we survey the current knowledge about gene duplication, including gene duplication mechanisms, the potential fates of duplicate genes, models explaining duplicate gene retention, the properties that distinguish duplicate from singleton genes, and the evolutionary impact of gene duplication.
MESH Headings: Evolution, Molecular; Gene Duplication; Genome, Plant/genetics; Phylogeny; Plants/genetics
38	A novel method for identifying polymorphic transposable elements via scanning of high-throughput short reads. DNA Res 2016;23:241-51. [PMID: 27098848 PMCID: PMC4909310 DOI: 10.1093/dnares/dsw011] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2015] [Accepted: 02/21/2016] [Indexed: 11/16/2022] Open Abstract Identification of polymorphic transposable elements (TEs) is important because TE polymorphism creates genetic diversity and influences the function of genes in the host genome. However, de novo scanning of polymorphic TEs remains a challenge. Here, we report a novel computational method, called PTEMD (polymorphic TEs and their movement detection), for de novo discovery of genome-wide polymorphic TEs. PTEMD searches for highly identical sequences using read-supported breakpoint evidence. Using PTEMD, we identified 14 polymorphic TE families (905 sequences) in the rice blast fungus Magnaporthe oryzae, and 68 (10,618 sequences) in maize. We validated one polymorphic TE family experimentally, MoTE-1; all MoTE-1 family members are located in different genomic loci in the three tested isolates. We found that 57.1% (8 of 14) of the PTEMD-detected polymorphic TE families in M. oryzae are active. Furthermore, our data indicate that there are more polymorphic DNA transposons than polymorphic retrotransposons in maize, despite the fact that retrotransposons occupy the largest fraction of genomic mass. We demonstrated that PTEMD is an effective tool for identifying polymorphic TEs in M. oryzae and maize genomes. PTEMD and the genome-wide polymorphic TEs in M. oryzae and maize are publicly available at http://www.kanglab.cn/blast/PTEMD_V1.02.htm.
Key Words: high-throughput sequencing; maize; polymorphic transposon; rice blast fungus
39	A G-Box-Like Motif Is Necessary for Transcriptional Regulation by Circadian Pseudo-Response Regulators in Arabidopsis. PLANT PHYSIOLOGY 2016;170:528-39. [PMID: 26586835 PMCID: PMC4704597 DOI: 10.1104/pp.15.01562] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/05/2015] [Accepted: 11/17/2015] [Indexed: 05/18/2023] Abstract PSEUDO-RESPONSE REGULATORs (PRRs) play overlapping and distinct roles in maintaining circadian rhythms and regulating diverse biological processes, including the photoperiodic control of flowering, growth, and abiotic stress responses. PRRs act as transcriptional repressors and associate with chromatin via their conserved C-terminal CCT (CONSTANS, CONSTANS-like, and TIMING OF CAB EXPRESSION 1 [TOC1/PRR1]) domains by a still-poorly understood mechanism. Here, we identified genome-wide targets of PRR9 using chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) and compared them with PRR7, PRR5, and TOC1/PRR1 ChIP-seq data. We found that PRR binding sites are located within genomic regions of low nucleosome occupancy and high DNase I hypersensitivity. Moreover, conserved noncoding regions among Brassicaceae species are enriched around PRR binding sites, indicating that PRRs associate with functionally relevant cis-regulatory regions. The PRRs shared a significant number of binding regions, and our results indicate that they coordinately restrict the expression of target genes to around dawn. A G-box-like motif was overrepresented at PRR binding regions, and we showed that this motif is necessary for mediating transcriptional regulation of CIRCADIAN CLOCK ASSOCIATED 1 and PRR9 by the PRRs. Our results further our understanding of how PRRs target specific promoters and provide an extensive resource for studying circadian regulatory networks in plants. 
MESH Headings: Arabidopsis/genetics; Arabidopsis Proteins/genetics; Arabidopsis Proteins/metabolism; Binding Sites; Chromatin Immunoprecipitation; Circadian Rhythm/genetics; Gene Expression Regulation, Plant; Genome, Plant; Nucleotide Motifs; Plants, Genetically Modified; Promoter Regions, Genetic; Regulatory Sequences, Nucleic Acid; Repressor Proteins/genetics; Transcription Factors/genetics; Transcription Factors/metabolism
40	The Impact of the Branched-Chain Ketoacid Dehydrogenase Complex on Amino Acid Homeostasis in Arabidopsis. PLANT PHYSIOLOGY 2015;169:1807-20. [PMID: 25986129 PMCID: PMC4634046 DOI: 10.1104/pp.15.00461] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/26/2015] [Accepted: 05/15/2015] [Indexed: 05/05/2023] Abstract The branched-chain amino acids (BCAAs) Leu, Ile, and Val are among nine essential amino acids that must be obtained from the diet of humans and other animals, and can be nutritionally limiting in plant foods. Despite genetic evidence of its importance in regulating seed amino acid levels, the full BCAA catabolic network is not completely understood in plants, and limited information is available regarding its regulation. In this study, transcript coexpression analyses revealed positive correlations among BCAA catabolism genes in stress, development, diurnal/circadian, and light data sets. A core subset of BCAA catabolism genes, including those encoding putative branched-chain ketoacid dehydrogenase subunits, is highly expressed during the night in plants on a diel cycle and in prolonged darkness. Mutants defective in these subunits accumulate higher levels of BCAAs in mature seeds, providing genetic evidence for their function in BCAA catabolism. In addition, prolonged dark treatment caused the mutants to undergo senescence early and overaccumulate leaf BCAAs. These results extend the previous evidence that BCAAs can be catabolized and serve as respiratory substrates at multiple steps. Moreover, comparison of amino acid profiles between mature seeds and dark-treated leaves revealed differences in amino acid accumulation when BCAA catabolism is perturbed. Together, these results demonstrate the consequences of blocking BCAA catabolism during both normal growth conditions and under energy-limited conditions. 
MESH Headings: 3-Methyl-2-Oxobutanoate Dehydrogenase (Lipoamide)/genetics; 3-Methyl-2-Oxobutanoate Dehydrogenase (Lipoamide)/metabolism; Amino Acids, Branched-Chain/metabolism; Arabidopsis/enzymology; Arabidopsis/genetics; Arabidopsis/physiology; Arabidopsis/radiation effects; Arabidopsis Proteins/genetics; Arabidopsis Proteins/metabolism; Darkness; Energy Metabolism; Homeostasis; Light; Metabolic Networks and Pathways; Mutation; Seeds/enzymology; Seeds/genetics; Seeds/physiology; Seeds/radiation effects
41	Transcriptional coordination of physiological responses in Nannochloropsis oceanica CCMP1779 under light/dark cycles. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2015. [PMID: 26216534 DOI: 10.1111/tpj.12944] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/09/2023] Abstract Nannochloropsis oceanica CCMP1779 is a marine unicellular stramenopile and an emerging reference species for basic research on oleogenic microalgae with biotechnological relevance. We investigated its physiology and transcriptome under light/dark cycles. We observed oscillations in lipid content and a predominance of cell division in the first half of the dark phase. Globally, more than 60% of the genes cycled in N. oceanica CCMP1779, with gene expression peaking at different times of the day. Interestingly, the phase of expression of genes involved in certain biological processes was conserved across photosynthetic lineages. Furthermore, in agreement with our physiological studies we found the processes of lipid metabolism and cell division enriched in cycling genes. For example, there was tight coordination of genes involved in the lower part of glycolysis, fatty acid synthesis and lipid production at dawn preceding lipid accumulation during the day. Our results suggest that diel lipid storage plays a key role for N. oceanica CCMP1779 growth under natural conditions making this alga a promising model to gain a basic mechanistic understanding of triacylglycerol production in photosynthetic cells. Our data will help the formulation of new hypotheses on the role of cyclic gene expression in cell growth and metabolism in Nannochloropsis. 
Key Words: Nannochloropsis oceanica; cell cycle; diel; lipids; metabolism; transcriptome
MESH Headings: Acetyl Coenzyme A/metabolism; Carbon/metabolism; Cell Cycle/genetics; Citric Acid Cycle/physiology; Fatty Acids/genetics; Fatty Acids/metabolism; Gene Expression Regulation; Glycolysis; Lipid Metabolism/genetics; Photoperiod; Stramenopiles/genetics; Stramenopiles/metabolism; Stramenopiles/physiology
42	Molecular Evidence for Functional Divergence and Decay of a Transcription Factor Derived from Whole-Genome Duplication in Arabidopsis thaliana. PLANT PHYSIOLOGY 2015;168:1717-34. [PMID: 26103993 PMCID: PMC4528766 DOI: 10.1104/pp.15.00689] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/20/2015] [Accepted: 06/03/2015] [Indexed: 05/23/2023] Abstract Functional divergence between duplicate transcription factors (TFs) has been linked to critical events in the evolution of land plants and can result from changes in patterns of expression, binding site divergence, and/or interactions with other proteins. Although plant TFs tend to be retained post polyploidization, many are lost within tens to hundreds of million years. Thus, it can be hypothesized that some TFs in plant genomes are in the process of becoming pseudogenes. Here, we use a pair of salt tolerance-conferring transcription factors, DWARF AND DELAYED FLOWERING1 (DDF1) and DDF2, that duplicated through paleopolyploidy 50 to 65 million years ago, as examples to illustrate potential mechanisms leading to duplicate retention and loss. We found that the expression patterns of Arabidopsis thaliana (At)DDF1 and AtDDF2 have diverged in a highly asymmetric manner, and AtDDF2 has lost most inferred ancestral stress responses. Consistent with promoter disablement, the AtDDF2 promoter has fewer predicted cis-elements and a methylated repetitive element. Through comparisons of AtDDF1, AtDDF2, and their Arabidopsis lyrata orthologs, we identified significant differences in binding affinities and binding site preference. In particular, an AtDDF2-specific substitution within the DNA-binding domain significantly reduces binding affinity. Cross-species analyses indicate that both AtDDF1 and AtDDF2 are under selective constraint, but among A. 
thaliana accessions, AtDDF2 has a higher level of nonsynonymous nucleotide diversity compared with AtDDF1. This may be the result of selection in different environments or may point toward the possibility of ongoing functional decay despite retention for millions of years after gene duplication.
MESH Headings: Amino Acid Sequence; Arabidopsis/genetics; Arabidopsis Proteins/chemistry; Arabidopsis Proteins/genetics; Arabidopsis Proteins/metabolism; Binding Sites/genetics; Cold Temperature; Evolution, Molecular; Gene Duplication; Gene Expression Regulation, Plant/drug effects; Genetic Variation; Genome, Plant/genetics; Models, Molecular; Molecular Sequence Data; Phylogeny; Plant Roots/genetics; Plant Shoots/genetics; Protein Binding; Protein Structure, Tertiary; Sequence Homology, Amino Acid; Sodium Chloride/pharmacology; Transcription Factors/classification; Transcription Factors/genetics; Transcription Factors/metabolism
Grants: R01 GM084953 NIGMS NIH HHS
43	Characteristics of Plant Essential Genes Allow for within- and between-Species Prediction of Lethal Mutant Phenotypes. THE PLANT CELL 2015;27:2133-47. [PMID: 26286535 PMCID: PMC4568498 DOI: 10.1105/tpc.15.00051] [Citation(s) in RCA: 65] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/19/2015] [Revised: 06/22/2015] [Accepted: 07/25/2015] [Indexed: 05/18/2023] Abstract Essential genes represent critical cellular components whose disruption results in lethality. Characteristics shared among essential genes have been uncovered in fungal and metazoan model systems. However, features associated with plant essential genes are largely unknown and the full set of essential genes remains to be discovered in any plant species. Here, we show that essential genes in Arabidopsis thaliana have distinct features useful for constructing within- and cross-species prediction models. Essential genes in A. thaliana are often single copy or derived from older duplications, highly and broadly expressed, slow evolving, and highly connected within molecular networks compared with genes with nonlethal mutant phenotypes. These gene features allowed the application of machine learning methods that predicted known lethal genes as well as an additional 1970 likely essential genes without documented phenotypes. Prediction models from A. thaliana could also be applied to predict Oryza sativa and Saccharomyces cerevisiae essential genes. Importantly, successful predictions drew upon many features, while any single feature was not sufficient. Our findings show that essential genes can be distinguished from genes with nonlethal phenotypes using features that are similar across kingdoms and indicate the possibility for translational application of our approach to species without extensive functional genomic and phenomic resources. 
MESH Headings: Arabidopsis/genetics; Evolution, Molecular; Gene Dosage; Gene Expression Regulation, Plant; Gene Ontology; Genes, Essential/genetics; Genes, Lethal/genetics; Genes, Plant/genetics; Mutation; Oryza/genetics; Phenotype; Saccharomyces cerevisiae; Species Specificity; Support Vector Machine
44	Determinants of nucleosome positioning and their influence on plant gene expression. Genome Res 2015;25:1182-95. [PMID: 26063739 PMCID: PMC4510002 DOI: 10.1101/gr.188680.114] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.7] [Received: 12/19/2014] [Accepted: 06/10/2015] [Indexed: 01/05/2023] Abstract Nucleosome positioning influences the access of transcription factors (TFs) to their binding sites and gene expression. Studies in plant, animal, and fungal models demonstrate similar nucleosome positioning patterns along genes and correlations between occupancy and expression. However, the relationships among nucleosome positioning, cis-regulatory element accessibility, and gene expression in plants remain undefined. Here we showed that plant nucleosome depletion occurs on specific 6-mer motifs and this sequence-specific nucleosome depletion is predictive of expression levels. Nucleosome-depleted regions in Arabidopsis thaliana tend to have higher G/C content, unlike yeast, and are centered on specific G/C-rich 6-mers, suggesting that intrinsic sequence properties, such as G/C content, cannot fully explain plant nucleosome positioning. These 6-mer motif sites showed higher DNase I hypersensitivity and are flanked by strongly phased nucleosomes, consistent with known TF binding sites. Intriguingly, this 6-mer-specific nucleosome depletion pattern occurs not only in promoter but also in genic regions and is significantly correlated with higher gene expression level, a phenomenon also found in rice but not in yeast. Among the 6-mer motifs enriched in genes responsive to treatment with the defense hormone jasmonate, there are no significant changes in nucleosome occupancy, suggesting that these sites are potentially preconditioned to enable rapid response without changing chromatin state significantly.
Our study provides a global assessment of the joint contribution of nucleosome occupancy and motif sequences that are likely cis-elements to the control of gene expression in plants. Our findings pave the way for further understanding the impact of chromatin state on plant transcriptional regulatory circuits.
45	Retained duplicate genes in green alga Chlamydomonas reinhardtii tend to be stress responsive and experience frequent response gains. BMC Genomics 2015;16:149. [PMID: 25880851 PMCID: PMC4364661 DOI: 10.1186/s12864-015-1335-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 06/03/2014] [Accepted: 02/09/2015] [Indexed: 01/08/2023] Abstract Background Green algae belong to a group of photosynthetic organisms that occupy diverse habitats, are closely related to land plants, and have been studied as sources of food and biofuel. Although multiple green algal genomes are available, a global comparative study of algal gene families has not been carried out. To investigate how gene families and gene expression have evolved, particularly in the context of stress responses that have been shown to correlate with gene family expansion in multiple eukaryotes, we characterized the expansion patterns of gene families in nine green algal species, and examined evolution of stress response among gene duplicates in Chlamydomonas reinhardtii. Results Substantial variation in domain family sizes exists among green algal species. Lineage-specific expansion of families occurred throughout the green algal lineage but inferred gene losses occurred more often than gene gains, suggesting a continuous reduction of algal gene repertoire. Retained duplicates tend to be involved in stress response, similar to land plant species. However, stress responsive genes tend to be pseudogenized as well. When comparing ancestral and extant gene stress response state, we found that response gains occur in 13% of duplicate gene branches, much higher than 6% in Arabidopsis thaliana.
Conclusion The frequent gains of stress response among green algal duplicates potentially reflect a high rate of innovation, resulting in a species-specific gene repertoire that contributed to adaptive response to stress. This could be further explored towards deciphering the mechanism of stress response, and identifying suitable green algal species for oil production. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1335-5) contains supplementary material, which is available to authorized users.
46	Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes. PLANT PHYSIOLOGY 2015;167:25-39. [PMID: 25384563 PMCID: PMC4280997 DOI: 10.1104/pp.114.245027] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 06/19/2014] [Accepted: 11/02/2014] [Indexed: 05/18/2023] Abstract The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-P to update and revise the maize (Zea mays) B73 RefGen_v3 annotation build (5b+) in less than 3 h using the iPlant Cyberinfrastructure. MAKER-P identified and annotated 4,466 additional, well-supported protein-coding genes not present in the 5b+ annotation build, added additional untranslated regions to 1,393 5b+ gene models, identified 2,647 5b+ gene models that lack any supporting evidence (despite the use of large and diverse evidence data sets), identified 104,215 pseudogene fragments, and created an additional 2,522 noncoding gene annotations. We also describe a method for de novo training of MAKER-P for the annotation of newly sequenced grass genomes. Collectively, these results lead to the 6a maize genome annotation and demonstrate the utility of MAKER-P for rapid annotation, management, and quality control of grasses and other difficult-to-annotate plant genomes.
MESH Headings: Databases, Genetic/standards; Exons/genetics; Genes, Plant/genetics; Genome, Plant/genetics; Introns/genetics; Models, Genetic; Molecular Sequence Annotation/methods; Molecular Sequence Annotation/standards; Pseudogenes/genetics; Quality Control; RNA, Untranslated/genetics; Zea mays/genetics
47	The causes and molecular consequences of polyploidy in flowering plants. Ann N Y Acad Sci 2014;1320:16-34. [PMID: 24903334 DOI: 10.1111/nyas.12466] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.9] [Indexed: 01/05/2023] Abstract Polyploidy is an important force shaping plant genomes. All flowering plants are descendants of an ancestral polyploid species, and up to 70% of extant vascular plant species are believed to be recent polyploids. Over the past century, a significant body of knowledge has accumulated regarding the prevalence and ecology of polyploid plants. In this review, we summarize our current understanding of the causes and molecular consequences of polyploidization in angiosperms. We also provide a discussion on the relationships between polyploidy and adaptation and suggest areas where further research may provide a better understanding of polyploidy.
Key Words: adaptation; expression divergence; fractionation; molecular consequences of polyploidy; plants; whole-genome duplication
48	Consequences of Whole-Genome Triplication as Revealed by Comparative Genomic Analyses of the Wild Radish Raphanus raphanistrum and Three Other Brassicaceae Species. THE PLANT CELL 2014;26:1925-1937. [PMID: 24876251 PMCID: PMC4079359 DOI: 10.1105/tpc.114.124297] [Citation(s) in RCA: 86] [Impact Index Per Article: 8.6] [Received: 02/12/2014] [Revised: 03/30/2014] [Accepted: 04/30/2014] [Indexed: 05/18/2023] Abstract Polyploidization events are frequent among flowering plants, and the duplicate genes produced via such events contribute significantly to plant evolution. We sequenced the genome of wild radish (Raphanus raphanistrum), a Brassicaceae species that experienced a whole-genome triplication event prior to diverging from Brassica rapa. Despite substantial gene gains in these two species compared with Arabidopsis thaliana and Arabidopsis lyrata, ∼70% of the orthologous groups experienced gene losses in R. raphanistrum and B. rapa, with most of the losses occurring prior to their divergence. The retained duplicates show substantial divergence in sequence and expression. Based on comparison of A. thaliana and R. raphanistrum ortholog floral expression levels, retained radish duplicates diverged primarily via maintenance of ancestral expression level in one copy and reduction of expression level in others. In addition, retained duplicates differed significantly from genes that reverted to singleton state in function, sequence composition, expression patterns, network connectivity, and rates of evolution. Using these properties, we established a statistical learning model for predicting whether a duplicate would be retained postpolyploidization. Overall, our study provides new insights into the processes of plant duplicate loss, retention, and functional divergence and highlights the need for further understanding factors controlling duplicate gene fate.
49	MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations. PLANT PHYSIOLOGY 2014;164:513-24. [PMID: 24306534 PMCID: PMC3912085 DOI: 10.1104/pp.113.230144] [Citation(s) in RCA: 275] [Impact Index Per Article: 27.5] [Received: 10/08/2013] [Accepted: 11/26/2013] [Indexed: 05/18/2023] Abstract We have optimized and extended the widely used annotation engine MAKER in order to better support plant genome annotation efforts. New features include better parallelization for large repeat-rich plant genomes, noncoding RNA annotation capabilities, and support for pseudogene identification. We have benchmarked the resulting software tool kit, MAKER-P, using the Arabidopsis (Arabidopsis thaliana) and maize (Zea mays) genomes. Here, we demonstrate the ability of the MAKER-P tool kit to automatically update, extend, and revise the Arabidopsis annotations in light of newly available data and to annotate pseudogenes and noncoding RNAs absent from The Arabidopsis Information Resource 10 build. Our results demonstrate that MAKER-P can be used to manage and improve the annotations of even Arabidopsis, perhaps the best-annotated plant genome. We have also installed and benchmarked MAKER-P on the Texas Advanced Computing Center. We show that this public resource can de novo annotate the entire Arabidopsis and maize genomes in less than 3 h and produce annotations of comparable quality to those of the current The Arabidopsis Information Resource 10 and maize V2 annotation builds.
MESH Headings: Alternative Splicing/genetics; Arabidopsis/genetics; Computational Biology/methods; Exons/genetics; Genes, Plant/genetics; Genome, Plant/genetics; Molecular Sequence Annotation/methods; Pseudogenes/genetics; Repetitive Sequences, Nucleic Acid/genetics; Reproducibility of Results; Software; Zea mays/genetics
50	Characteristics and significance of intergenic polyadenylated RNA transcription in Arabidopsis. PLANT PHYSIOLOGY 2013;161:210-24. [PMID: 23132786 PMCID: PMC3532253 DOI: 10.1104/pp.112.205245] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 08/08/2012] [Accepted: 10/21/2012] [Indexed: 05/23/2023] Abstract The Arabidopsis (Arabidopsis thaliana) genome is the most well-annotated plant genome. However, transcriptome sequencing in Arabidopsis continues to suggest the presence of polyadenylated (polyA) transcripts originating from presumed intergenic regions. It is not clear whether these transcripts represent novel noncoding or protein-coding genes. To understand the nature of intergenic polyA transcription, we first assessed its abundance using multiple messenger RNA sequencing data sets. We found 6,545 intergenic transcribed fragments (ITFs) occupying 3.6% of Arabidopsis intergenic space. In contrast to transcribed fragments that map to protein-coding and RNA genes, most ITFs are significantly shorter, are expressed at significantly lower levels, and tend to be more data set specific. A surprisingly large number of ITFs (32.1%) may be protein coding based on evidence of translation. However, our results indicate that these "translated" ITFs tend to be close to and are likely associated with known genes. To investigate if ITFs are under selection and are functional, we assessed ITF conservation through cross-species as well as within-species comparisons. Our analysis reveals that 237 ITFs, including 49 with translation evidence, are under strong selective constraint and relatively distant from annotated features. These ITFs are likely parts of novel genes. However, the selective pressure imposed on most ITFs is similar to that of randomly selected, untranscribed intergenic sequences.
Our findings indicate that despite the prevalence of ITFs, apart from the possibility of genomic contamination, many may be background or noisy transcripts derived from "junk" DNA, whose production may be inherent to the process of transcription and which, on rare occasions, may act as catalysts for the creation of novel genes.
MESH Headings: Arabidopsis/genetics; Arabidopsis/metabolism; Base Sequence; Conserved Sequence; DNA, Intergenic/genetics; DNA, Intergenic/metabolism; DNA, Plant/genetics; DNA, Plant/metabolism; Evolution, Molecular; Gene Expression Regulation, Plant; Genes, Plant; Molecular Sequence Annotation; Plants, Genetically Modified/genetics; Plants, Genetically Modified/metabolism; Protein Biosynthesis; Pseudogenes; RNA, Messenger/genetics; RNA, Messenger/metabolism; RNA, Plant/genetics; RNA, Plant/metabolism; Ribosomes/genetics; Ribosomes/metabolism; Selection, Genetic; Sequence Analysis, RNA; Transcription, Genetic