1
Long W, Li T, Yang Y, Shen HB. FlyIT: Drosophila Embryogenesis Image Annotation based on Image Tiling and Convolutional Neural Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:194-204. PMID: 31425122. DOI: 10.1109/tcbb.2019.2935723.
Abstract
With the rise of image-based transcriptomics, spatial gene expression data have become increasingly important for understanding gene regulation from the tissue level down to the cell level. In particular, the gene expression images of Drosophila embryos provide a new data source for the study of Drosophila embryogenesis. It is imperative to develop automatic annotation tools, since manual annotation is labor-intensive and requires professional knowledge. Although many image annotation methods have been proposed in the computer vision field, they may not work well for gene expression images because of the great differences between the two annotation tasks. Beyond the apparent differences in the images themselves, the annotation is performed at the gene level rather than the image level, where the expression patterns of a gene are recorded in multiple images. Moreover, the annotation terms often correspond to local expression patterns of images, yet they are assigned collectively to groups of images, and the relations between the terms and single images are unknown. In order to learn the spatial expression patterns of genes comprehensively, we propose a new method, called FlyIT (image annotation based on Image Tiling and convolutional neural networks for fruit Fly). We implement two versions of FlyIT, learning at the image level and the gene level, respectively. The gene-level version employs an image tiling strategy to obtain a combined image feature representation for each gene. FlyIT uses a pre-trained ResNet model to obtain feature representations and a new loss function to deal with the class imbalance problem. As the annotation of Drosophila images is a multi-label classification problem, the new loss function considers the difficulty levels of recognizing different labels of the same sample and adjusts the sample weights accordingly.
The experimental results on the FlyExpress database show that both the image tiling strategy and the deep architecture greatly enhance the annotation performance. FlyIT outperforms the existing annotators by a large margin (over 9 percent on AUC and 12 percent on macro F1 for predicting the top 10 terms). It also shows advantages over other deep learning models, including both single-instance and multi-instance learning frameworks.
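The difficulty-aware weighting described in the abstract can be illustrated with a short sketch. The exact FlyIT loss is not reproduced here, so this is only a plausible reconstruction: a focal-style per-label weight on binary cross-entropy. The function name `weighted_multilabel_loss` and the `gamma` exponent are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def weighted_multilabel_loss(y_true, y_prob, gamma=2.0):
    """Difficulty-aware multi-label loss (sketch, not the paper's exact form).

    Labels the model gets badly wrong (large |y_true - y_prob|) receive
    larger weights, so hard labels dominate the average.
    """
    eps = 1e-7
    y_prob = np.clip(y_prob, eps, 1 - eps)
    # per-label binary cross-entropy
    bce = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    # focal-style difficulty weight per label (assumption)
    difficulty = np.abs(y_true - y_prob) ** gamma
    return float(np.mean(difficulty * bce))
```

Under this weighting, confidently correct labels contribute almost nothing, while labels predicted near chance dominate the loss.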
2
Zhang W, Li R, Zeng T, Sun Q, Kumar S, Ye J, Ji S. Deep Model Based Transfer and Multi-Task Learning for Biological Image Analysis. IEEE Transactions on Big Data 2020; 6:322-333. PMID: 36846743. PMCID: PMC9957557. DOI: 10.1109/tbdata.2016.2573280.
Abstract
A central theme in learning from image data is to develop appropriate representations for the specific task at hand. Thus, a practical challenge is to determine what features are appropriate for specific tasks. For example, in the study of gene expression patterns in Drosophila, texture features were particularly effective for determining the developmental stages from in situ hybridization (ISH) images. Such an image representation, however, is not suitable for controlled vocabulary term annotation. Here, we developed feature extraction methods to generate hierarchical representations for ISH images. Our approach is based on deep convolutional neural networks that can act on image pixels directly. To make the extracted features generic, the models were trained on a natural image set with millions of labeled examples. These models were then transferred to the ISH image domain. To account for the differences between the source and target domains, we proposed a partial transfer learning scheme in which only part of the source model is transferred. We employed a multi-task learning method to fine-tune the pre-trained models with labeled ISH images. Results showed that feature representations computed by deep models based on transfer and multi-task learning significantly outperformed other methods for annotating gene expression patterns at different stage ranges.
Affiliation(s)
- Wenlu Zhang
- Department of Computer Science, Old Dominion University, Norfolk, VA 23529
- Rongjian Li
- Department of Computer Science, Old Dominion University, Norfolk, VA 23529
- Tao Zeng
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99163
- Qian Sun
- Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287
- Sudhir Kumar
- Institute for Genomics and Evolutionary Medicine and the Department of Biology, Temple University, Philadelphia, PA 19122
- Jieping Ye
- Department of Electrical Engineering and Computer Science and the Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109
- Shuiwang Ji
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99163
3
Yang Y, Fang Q, Shen HB. Predicting gene regulatory interactions based on spatial gene expression data and deep learning. PLoS Comput Biol 2019; 15:e1007324. PMID: 31527870. PMCID: PMC6764701. DOI: 10.1371/journal.pcbi.1007324.
Abstract
Reverse engineering of gene regulatory networks (GRNs) is a central task in systems biology. Most existing methods for GRN inference rely on gene co-expression analysis or TF-target binding information, where the determination of co-expression is often unreliable when based merely on gene expression levels, and the TF-target binding data from high-throughput experiments may be noisy, leading to a high ratio of false and missed links, especially for large-scale networks. In recent years, microscopy images recording spatial gene expression have become a new resource for GRN reconstruction, as the spatial and temporal expression patterns contain abundant gene interaction information. To date, the spatial expression resources have been largely underexploited, and only a few traditional image processing methods have been employed in image-based GRN reconstruction. Moreover, co-expression analysis using conventional measurements based on image similarity may be inaccurate, because it is local-pattern consistency rather than global image similarity that determines gene-gene interactions. Here we present GripDL (Gene regulatory interaction prediction via Deep Learning), which incorporates high-confidence TF-gene regulation knowledge from previous studies and constructs GRNs for Drosophila eye development based on Drosophila embryonic gene expression images. Benefiting from the powerful representation ability of deep neural networks and the supervision information of known interactions, the new method outperforms traditional methods by a large margin and reveals intriguing new knowledge about Drosophila eye development.
Affiliation(s)
- Yang Yang
- Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
- Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, China
- Qingwei Fang
- School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China
- Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai Jiao Tong University, Shanghai, China
4
Yang Y, Zhou M, Fang Q, Shen HB. AnnoFly: annotating Drosophila embryonic images based on an attention-enhanced RNN model. Bioinformatics 2019; 35:2834-2842. DOI: 10.1093/bioinformatics/bty1064.
Abstract
Motivation
In the post-genomic era, image-based transcriptomics has received considerable attention, because the visualization of gene expression distributions can reveal spatial and temporal expression patterns, which is critically important for understanding biological mechanisms. The Berkeley Drosophila Genome Project has collected a large-scale spatial gene expression database for studying Drosophila embryogenesis. Given the expression images, annotating them for the study of Drosophila embryonic development is the next urgent task. In order to speed up the labor-intensive labeling work, automatic tools are highly desired. However, conventional image annotation tools are not applicable here, because the labeling is at the gene level rather than the image level, where each gene is represented by a bag of multiple related images, showing a multi-instance phenomenon, and the image quality varies with image orientation and experiment batch. Moreover, different local regions of an image correspond to different controlled vocabulary (CV) annotation terms, i.e. an image has multiple labels. Designing an accurate annotation tool in such a multi-instance multi-label scenario is a very challenging task.
Results
To address these challenges, we develop a new annotator for the fruit fly embryonic images, called AnnoFly. Driven by an attention-enhanced RNN model, it can weight images of different qualities, so as to focus on the most informative image patterns. We assess the new model on three standard datasets. The experimental results reveal that the attention-based model provides a transparent approach for identifying the important images for labeling, and it substantially enhances the accuracy compared with the existing annotation methods, including both single-instance and multi-instance learning methods.
Availability and implementation
http://www.csbio.sjtu.edu.cn/bioinf/annofly/
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yang Yang
- Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
- Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, China
- Mingyu Zhou
- Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
- Qingwei Fang
- School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China
- Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China
- Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
5
Jug F, Pietzsch T, Preibisch S, Tomancak P. Bioimage Informatics in the context of Drosophila research. Methods 2014; 68:60-73. PMID: 24732429. DOI: 10.1016/j.ymeth.2014.04.004.
Abstract
Modern biological research relies heavily on microscopic imaging. The advanced genetic toolkit of Drosophila makes it possible to label molecular and cellular components with an unprecedented level of specificity, necessitating the application of the most sophisticated imaging technologies. Imaging in Drosophila spans all scales, from single molecules to entire populations of adult organisms, from electron microscopy to live imaging of developmental processes. As imaging approaches become more complex and ambitious, there is an increasing need for quantitative, computer-mediated image processing and analysis to make sense of the imagery. Bioimage Informatics is an emerging research field that covers all aspects of biological image analysis, from data handling, through processing, to quantitative measurements, analysis, and data presentation. Some of the most advanced, large-scale projects, combining cutting-edge imaging with complex bioimage informatics pipelines, are realized in the Drosophila research community. In this review, we discuss the current research in biological image analysis specifically relevant to the type of systems-level image datasets that are uniquely available for the Drosophila model system. We focus on how state-of-the-art computer vision algorithms are impacting the ability of Drosophila researchers to analyze biological systems in space and time. We pay particular attention to how these algorithmic advances from computer science are made usable to practicing biologists through open source platforms, and how biologists can themselves participate in their further development.
Affiliation(s)
- Florian Jug
- Max Planck Institute of Molecular Cell Biology and Genetics, 01307 Dresden, Germany
- Tobias Pietzsch
- Max Planck Institute of Molecular Cell Biology and Genetics, 01307 Dresden, Germany
- Stephan Preibisch
- Janelia Farm Research Campus, Howard Hughes Medical Institute, Ashburn, VA 20147, USA; Department of Anatomy and Structural Biology, Gruss Lipper Biophotonics Center, Albert Einstein College of Medicine, Bronx, NY 10461, USA
- Pavel Tomancak
- Max Planck Institute of Molecular Cell Biology and Genetics, 01307 Dresden, Germany
6
Zhang W, Feng D, Li R, Chernikov A, Chrisochoides N, Osgood C, Konikoff C, Newfeld S, Kumar S, Ji S. A mesh generation and machine learning framework for Drosophila gene expression pattern image analysis. BMC Bioinformatics 2013; 14:372. PMID: 24373308. PMCID: PMC3879658. DOI: 10.1186/1471-2105-14-372.
Abstract
BACKGROUND Multicellular organisms consist of cells of many different types that are established during development. Each type of cell is characterized by a unique combination of expressed gene products resulting from spatiotemporal gene regulation. Currently, a fundamental challenge in regulatory biology is to elucidate the gene expression controls that generate the complex body plans during development. Recent advances in high-throughput biotechnologies have generated spatiotemporal expression patterns for thousands of genes in the model organism fruit fly Drosophila melanogaster. Complementing existing qualitative methods with the quantitative computational tools we present in this paper offers a promising way to address key scientific questions. RESULTS We develop a set of computational methods and open source tools for identifying co-expressed embryonic domains and the associated genes simultaneously. To map the expression patterns of many genes into the same coordinate space and account for embryonic shape variations, we develop a mesh generation method that deforms a meshed generic ellipse to each individual embryo. We then develop a co-clustering formulation to cluster the genes and the mesh elements, thereby identifying co-expressed embryonic domains and the associated genes simultaneously. Experimental results indicate that the gene and mesh co-clusters can be correlated to key developmental events during the stages of embryogenesis we study. The open source software tool has been made available at http://compbio.cs.odu.edu/fly/. CONCLUSIONS Our mesh generation and machine learning methods and tools improve upon the flexibility, ease of use, and accuracy of existing methods.
Affiliation(s)
- Shuiwang Ji
- Department of Computer Science, Old Dominion University, Norfolk, VA 23529, USA
7
Sun Q, Muckatira S, Yuan L, Ji S, Newfeld S, Kumar S, Ye J. Image-level and group-level models for Drosophila gene expression pattern annotation. BMC Bioinformatics 2013; 14:350. PMID: 24299119. PMCID: PMC3924186. DOI: 10.1186/1471-2105-14-350.
Abstract
Background Drosophila melanogaster has been established as a model organism for investigating developmental gene interactions. The spatio-temporal gene expression patterns of Drosophila melanogaster can be visualized by in situ hybridization and documented as digital images. Automated and efficient tools for analyzing these expression images will provide biological insights into gene functions, interactions, and networks. To facilitate pattern recognition and comparison, many web-based resources have been created to conduct comparative analysis based on body part keywords and the associated images. With the fast accumulation of images from high-throughput techniques, manual inspection of images will impose a serious impediment on the pace of biological discovery. It is thus imperative to design an automated system for efficient image annotation and comparison. Results We present a computational framework to perform anatomical keyword annotation for Drosophila gene expression images. The spatial sparse coding approach is used to represent local patches of images, in comparison with the well-known bag-of-words (BoW) method. Three pooling functions, including max pooling, average pooling, and Sqrt (square root of mean squared statistics) pooling, are employed to transform the sparse codes into image features. Based on the constructed features, we develop both an image-level scheme and a group-level scheme to tackle the key challenges in annotating Drosophila gene expression pattern images automatically. To deal with the imbalanced data distribution inherent in image annotation tasks, the undersampling method is applied together with majority vote. Results on Drosophila embryonic expression pattern images verify the efficacy of our approach. Conclusion In our experiment, the three pooling functions perform comparably well in feature dimension reduction. Undersampling with majority vote is shown to be effective in tackling the problem of imbalanced data. Moreover, combining sparse coding with the image-level scheme leads to consistent performance improvement in keyword annotation.
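The three pooling functions named in the abstract are standard operations, and a minimal sketch shows how each turns a matrix of per-patch sparse codes into a single image feature vector. The `pool_codes` helper is illustrative (not from the paper), and taking absolute values inside max pooling is an assumption, since sparse codes may be signed.

```python
import numpy as np

def pool_codes(codes, method="max"):
    """Pool per-patch sparse codes into one feature vector.

    codes: array of shape (n_patches, n_atoms), one sparse code per patch.
    Returns a vector of length n_atoms.
    """
    if method == "max":
        # strongest activation per dictionary atom (abs is an assumption)
        return np.max(np.abs(codes), axis=0)
    if method == "average":
        # mean activation per atom
        return np.mean(codes, axis=0)
    if method == "sqrt":
        # square root of mean squared statistics per atom
        return np.sqrt(np.mean(codes ** 2, axis=0))
    raise ValueError(f"unknown pooling method: {method}")
```

All three reduce an (n_patches, n_atoms) matrix to an n_atoms vector, which is why the abstract reports them performing comparably for feature dimension reduction.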
Affiliation(s)
- Jieping Ye
- Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, Tempe, AZ 85287, USA
8
Puniyani K, Xing EP. GINI: from ISH images to gene interaction networks. PLoS Comput Biol 2013; 9:e1003227. PMID: 24130465. PMCID: PMC3794902. DOI: 10.1371/journal.pcbi.1003227.
Abstract
Accurate inference of molecular and functional interactions among genes, especially in multicellular organisms such as Drosophila, often requires statistical analysis of correlations not only between the magnitudes of gene expressions, but also between their temporal-spatial patterns. The ISH (in situ hybridization)-based gene expression micro-imaging technology offers an effective approach to perform large-scale spatial-temporal profiling of whole-body mRNA abundance. However, analytical tools for discovering gene interactions from such data remain an open challenge for various reasons, including difficulties in extracting canonical representations of gene activities from images, and in inferring statistically meaningful networks from such representations. In this paper, we present GINI, a machine learning system for inferring gene interaction networks from Drosophila embryonic ISH images. GINI builds on a computer-vision-inspired vector-space representation of the spatial pattern of gene expression in ISH images, enabled by our recently developed system, and a new multi-instance-kernel algorithm that learns a sparse Markov network model in which every gene (i.e., node) in the network is represented by a vector-valued spatial pattern rather than a scalar-valued gene intensity, as in conventional approaches such as a Gaussian graphical model. By capturing the notion of spatial similarity of gene expression, and at the same time properly taking into account the presence of multiple images per gene via multi-instance kernels, GINI is well positioned to infer statistically sound and biologically meaningful gene interaction networks from image data. Using both synthetic data and a small manually curated data set, we demonstrate the effectiveness of our approach in network building.
Furthermore, we report results on a large publicly available collection of Drosophila embryonic ISH images from the Berkeley Drosophila Genome Project, where GINI makes novel and interesting predictions of gene interactions. Software for GINI is available at http://sailing.cs.cmu.edu/Drosophila_ISH_images/. As high-throughput technologies for molecular abundance profiling are becoming more inexpensive and accessible, computational inference of gene interaction networks from such data based on well-founded statistical principles is imperative to advance the understanding of regulatory mechanisms in various biological systems. Reverse engineering of gene networks has traditionally relied on analysis of whole-genome microarray data; here we present a new method, GINI, to infer gene networks from ISH images, thereby enabling exploration of spatial characteristics of gene expression for network inference. Our method generates a Markov network, which encapsulates globally meaningful statistical dependencies from vector-valued gene spatial patterns. In other words, we advance the state of the art both in the usage of richer forms of expression data and in the employment of principled statistical methodology for sound network inference on such a new form of data. Our results show that analyzing the spatial distribution of gene expression enables us to capture information not available from microarray data. Such an analysis is especially important for genes involved in the embryonic development of Drosophila, revealing the specific spatial patterning that determines the development of the 14 segments of the adult fly.
Affiliation(s)
- Kriti Puniyani
- School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Eric P. Xing
- School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
9
Ye J, Liu J. Sparse Methods for Biomedical Data. SIGKDD Explorations 2012; 14:4-15. PMID: 24076585. PMCID: PMC3783968. DOI: 10.1145/2408736.2408739.
Abstract
Following recent technological revolutions, the investigation of massive biomedical data of growing scale, diversity, and complexity has taken center stage in modern data analysis. Although complex, the underlying representations of many biomedical data are often sparse. For example, for a certain disease such as leukemia, even though humans have tens of thousands of genes, only a few genes are relevant to the disease; a gene network is sparse, since a regulatory pathway involves only a small number of genes; and many biomedical signals are sparse or compressible in the sense that they have concise representations when expressed in a proper basis. Therefore, finding sparse representations is fundamentally important for scientific discovery. Sparse methods based on the ℓ1 norm have attracted a great amount of research effort in the past decade due to their sparsity-inducing property, convenient convexity, and strong theoretical guarantees. They have achieved great success in various applications such as biomarker selection, biological network construction, and magnetic resonance imaging. In this paper, we review state-of-the-art sparse methods and their applications to biomedical data.
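As a concrete instance of an ℓ1-norm sparse method, the lasso can be solved by iterative soft-thresholding (ISTA). The sketch below is a generic textbook implementation, not code from the reviewed paper; the function names are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of t * ||.||_1: shrinks each entry toward zero
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_lasso(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||Xw - y||^2 + lam * ||w||_1 via ISTA."""
    # Lipschitz constant of the gradient: squared spectral norm of X
    L = np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)          # gradient of the smooth part
        w = soft_threshold(w - grad / L, lam / L)  # gradient step + shrinkage
    return w
```

Because the soft-threshold step sets small coefficients exactly to zero, the recovered weight vector is sparse, mirroring the biomarker-selection use case described in the abstract.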
Affiliation(s)
- Jieping Ye
- Arizona State University, Tempe, AZ 85287
- Jun Liu
- Siemens Corporate Research, Princeton, NJ 08540