Modeling long-range regulatory interactions to predict gene expression using Graph Convolutional Networks

In this study, we propose a graph-based deep learning framework that integrates information about the 3D organization of the DNA and its environment to predict gene expression. Gene regulation is the process that controls whether a gene is expressed at high or low levels. Any disruption of this process can have severe downstream consequences and result in diseases such as cancer. Promoters and regulatory elements, such as enhancers and repressors, participate in gene regulation in a spatiotemporal manner. They have been shown to frequently affect gene expression over long distances, well beyond the immediate neighborhood of the transcribed gene. This long-range effect is attributed to the 3D organization of the DNA, which can control access to remote gene sites. Performing a genome-wide analysis of such interactions is challenging because of the sheer size of the search space, which calls for data-driven approaches that can capture the relevant information. Existing machine learning methods can model local interactions between regulatory elements to predict gene expression. However, by focusing on fixed-length regions around genes, they fail to incorporate the potential long-range interactions that play a crucial role in gene regulation. To overcome these challenges, we propose to apply a Graph Convolutional Network (GCN), a deep learning framework, to integrate the spatial structure of the DNA with signals from regulatory elements. The overall objective of this proposal is to computationally model global and local gene regulation from chromatin modification and three-dimensional interaction data. This model can then be used to identify which long-range or short-range features drive gene expression at a particular locus.
Therefore, the first aim of this project is to develop a graph-based deep learning model that takes as input the 3D organization of the DNA (as a graph G) and the values of the regulatory signals (as its node features) to predict gene expression. This task will allow the model to automatically capture, from the data, the interactions that are correlated with high or low gene expression. The second aim is to identify the interactions that are predictive of gene expression using interpretation methods, which explain which input features are important for a given prediction. Finally, for the third aim, we will validate these learned features through a literature survey as well as biological experiments.
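
As a minimal sketch of the kind of model described in the first aim, the Python example below runs a two-layer graph convolution over a toy contact graph. Everything in it is an assumption made for illustration: the four genomic bins, the contacts between them, the three regulatory signals per bin, the random untrained weights, and the propagation rule itself (symmetric normalization of the adjacency matrix with self-loops, a common GCN formulation). It is not the proposed architecture.

```python
import numpy as np


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a non-negative adjacency matrix A."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops so each node keeps its own signal
    deg = a_hat.sum(axis=1)                 # degree of every node (at least 1 because of self-loops)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_norm, features, weights):
    """One graph convolution: aggregate neighbor features, project, apply ReLU."""
    return np.maximum(a_norm @ features @ weights, 0.0)


def predict_expression(adj, features, w_hidden, w_out):
    """Two-layer GCN that returns one probability of high expression per node."""
    a_norm = normalize_adjacency(adj)
    hidden = gcn_layer(a_norm, features, w_hidden)
    logits = a_norm @ hidden @ w_out        # second convolution, no ReLU
    return (1.0 / (1.0 + np.exp(-logits))).ravel()


# Hypothetical toy graph G: 4 genomic bins; bins 0 and 3 share a long-range contact.
adj = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

rng = np.random.default_rng(0)
features = rng.random((4, 3))               # hypothetical: 3 regulatory signals per bin
w_hidden = rng.normal(scale=0.1, size=(3, 8))
w_out = rng.normal(scale=0.1, size=(8, 1))

probs = predict_expression(adj, features, w_hidden, w_out)
print(probs.shape)
```

Because the normalized adjacency matrix mixes each node's features with those of its contact partners, bin 0 in this toy graph receives signal from bin 3 even though the two are not sequence neighbors. In a trained model, weights like these would be fit to measured expression labels rather than drawn at random.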

Our proposed deep learning framework will capture and identify the important long-range regulatory interactions from the data. A differential analysis of these interactions between healthy and diseased cells will be essential for comprehensively understanding cell development and misregulation in human diseases.