miRWoods: Enhanced precursor detection and stacked random forests for the sensitive detection of microRNAs.
PLoS Comput Biol 2019; 15:e1007309. [PMID: 31596843 PMCID: PMC6785219 DOI: 10.1371/journal.pcbi.1007309]
[Received: 01/31/2019] [Accepted: 08/05/2019] [Indexed: 12/29/2022]
Abstract
MicroRNAs are conserved, endogenous small RNAs with critical post-transcriptional regulatory functions throughout Eukaryota, including prominent roles in development and disease. Despite much effort, microRNA annotations still contain errors and remain incomplete, largely because of three challenges: identifying valid microRNAs supported by few reads, correctly locating hairpin precursors, and balancing precision and recall. Here, we present miRWoods, which addresses these challenges with a duplex-focused precursor detection method and stacked random forests whose specialized layers detect mature and precursor microRNAs. The pipeline has been tuned to optimize the harmonic mean of precision and recall. We trained and tuned our discovery pipeline on data sets from the well-annotated human genome, and evaluated its performance on data from mouse. Compared to existing approaches, miRWoods better identifies precursor spans and balances sensitivity and specificity for greater overall prediction accuracy, recalling on average 10% more annotated microRNAs and correctly predicting substantially more microRNAs supported by only one read. We applied this method to the under-annotated genomes of Felis catus (domestic cat) and Bos taurus (cow), and identified hundreds of novel microRNAs in small RNA sequencing data sets from cat muscle and skin, from 10 cow tissues, and from human and mouse cells. Our novel predictions include a microRNA in an intron of tyrosine kinase 2 (TYK2) that is present in both cat and cow, as well as a family of mirtrons with two instances in the human genome. Our predictions support an expanded miR-2284 family in the bovine genome, a larger mir-548 family in the human genome, and a larger let-7 family in the feline genome.
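The tuning target named in the abstract, the harmonic mean of precision and recall, is commonly called the F1 score. The sketch below shows how that quantity is computed from prediction counts. It is a generic illustration, not code from miRWoods: the function name `precision_recall_f1` and the counts in the usage lines are hypothetical and do not come from the paper.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from prediction counts.

    tp: predicted loci that match an annotated microRNA (true positives)
    fp: predicted loci with no matching annotation (false positives)
    fn: annotated microRNAs that were not predicted (false negatives)
    """
    # Guard against division by zero when nothing was predicted or annotated.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Harmonic mean: a low value in either term pulls the score down,
    # so a predictor cannot score well by maximizing only one of them.
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
balanced = precision_recall_f1(tp=80, fp=20, fn=20)
high_precision_low_recall = precision_recall_f1(tp=50, fp=0, fn=50)
```

With these made-up counts, the second predictor has perfect precision but misses half of the annotated loci, and its F1 falls below that of the balanced predictor. This is the trade-off that tuning on the harmonic mean is meant to penalize.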
While the computational prediction of microRNA loci from high-throughput sequence data is well-studied, challenges persist in defining the minimum number of reads required for a locus to be evaluated, as well as in defining the precursor span. We present a new method, “miRWoods”, which has greater recall of known microRNAs while achieving overall performance as good as or better than that of existing approaches. Our approach uses improved duplex-based methods of precursor detection and a pair of random forest layers that sensitively detect mature products and precursors. We trained our model on data from human, and confirmed that it can be applied across species by evaluating its predictions for the mouse genome. We then applied our approach to new sequencing data mapped to the under-annotated genomes of cow and cat. Using miRWoods, we improved annotations for cat and cow microRNAs, found novel microRNAs in human and mouse, and identified errors in current annotations.