Applying machine learning to regulatory genomics
Regulatory genomics assays based on high-throughput sequencing have given us unprecedented insight into the regulatory architecture of the cell. ChIP-seq and ChIP-exo allow us to profile transcription factor (TF) and histone modification occupancy at high resolution over the entire genome. RNA-seq lets us profile global transcriptional activities. ATAC-seq profiles the genome-wide accessibility landscape, while assays such as ChIA-PET and Hi-C open a window on the three-dimensional structure of the genome. Many of these assays have been adapted to provide single-cell resolution, yielding insight into heterogeneity and dynamics in regulatory systems. Research groups around the world have used these assays and more to characterize regulatory activities in a plethora of cell types and conditions, leading to an explosion in the amount of regulatory genomics data available in public databases.
We specialize in developing machine learning applications that can analyze and integrate massive regulatory genomics datasets to generate biological insight. We use a wide variety of computational approaches to accurately characterize biological events in noisy genomic data, to integrate analyses across disparate multi-omic data types, and to explain the biological sequences and events that underlie cellular phenotypes. Some examples of the types of machine learning approaches used in our research include:
- Generative mixture models for characterizing locations of protein-DNA binding events from ChIP-based assays.
- Dimensionality reduction methods for estimating genome structures from Hi-C data.
- Multi-label regression frameworks for estimating cell-specific sequence motif features.
- Non-parametric Bayesian approaches for multi-omic data integration.
- Topic models for characterizing protein-DNA complexes.
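To make the first item in the list above concrete, here is a minimal sketch of a generative mixture model for locating binding events from read positions. It is an illustration only, not the lab's published method: it assumes a one-dimensional mixture of Gaussians with a shared, fixed width, and the function name `fit_event_mixture`, the parameters `sigma` and `n_iter`, and the simulated read positions are all hypothetical.

```python
import math
import random


def fit_event_mixture(positions, init_centers, sigma=20.0, n_iter=50):
    """Fit a 1D Gaussian mixture (shared, fixed width) with EM.

    positions: read positions in bp; init_centers: starting event locations.
    Returns (centers, weights): estimated event locations and mixing weights.
    """
    centers = list(init_centers)
    k = len(centers)
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        sums = [0.0] * k    # responsibility-weighted sums of positions
        totals = [0.0] * k  # total responsibility held by each event
        # E-step: share each read among events in proportion to its density.
        for x in positions:
            dens = [
                w * math.exp(-0.5 * ((x - c) / sigma) ** 2)
                for w, c in zip(weights, centers)
            ]
            norm = sum(dens)
            if norm == 0.0:
                continue  # read is too far from every event to be assigned
            for j in range(k):
                r = dens[j] / norm
                totals[j] += r
                sums[j] += r * x
        # M-step: move each event to the weighted mean of its reads.
        total = sum(totals)
        if total == 0.0:
            break  # no read could be assigned; keep the current estimates
        for j in range(k):
            if totals[j] > 0.0:
                centers[j] = sums[j] / totals[j]
        weights = [t / total for t in totals]
    return centers, weights


# Simulated (hypothetical) data: two events at 1000 bp and 1200 bp.
rng = random.Random(0)
reads = [rng.gauss(1000, 20) for _ in range(300)]
reads += [rng.gauss(1200, 20) for _ in range(200)]

centers, weights = fit_event_mixture(reads, init_centers=[900, 1300])
print([round(c) for c in centers], [round(w, 2) for w in weights])
```

Real ChIP-based data would call for an empirical read distribution around each event, strand information, and a background component, none of which this sketch attempts to model.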
Our most recent directions focus on using neural networks to study transcription factor regulatory activities. Neural network-based approaches excel at integrating disparate data types and perform strongly in classification tasks.
Recent highlights
1. Domain adaptive neural networks improve cross-species prediction of transcription factor binding
   K Cochran, D Srivastava, A Shrikumar, A Balsubramani, RC Hardison, A Kundaje, S Mahony
2. Direct prediction of regulatory elements from partial data without imputation
   Y Zhang, S Mahony
3. Characterizing protein-DNA binding event subtypes in ChIP-exo data
   N Yamada, WKM Lai, N Farrell, BF Pugh, S Mahony
Characterizing determinants of protein-DNA interactions
We aim to understand why transcription factors (TFs) bind to specific regulatory targets in the genome. A typical TF has the potential to bind to millions of sites in the human genome, yet only a small fraction of these potential binding sites is bound in a given cell type. Furthermore, a TF’s binding sites can differ dramatically across cell types, allowing the TF to regulate different genes in distinct cellular contexts. Uncovering the mechanisms by which TFs choose their cell-specific binding sites is foundational to our knowledge of gene regulation, and provides a strong basis for finding potential therapies for the many diseases that result from misregulation of gene expression.
There are many forces that can affect a TF’s choice of binding targets once it is introduced into the nucleus. The inherent DNA-binding preference of the protein will specify the sites that could potentially be bound. But binding selectivity is further determined by the regulatory environment of the cell: chromatin accessibility, interactions with other regulators, DNA methylation, and histone post-translational modifications all play roles in specifying the TF’s binding sites. These forces are context-specific, which allows the same TF to target different binding sites in different cell types.
Our research develops machine learning approaches for understanding how cell-specific TF binding targets are determined by sequence and the regulatory environment. We develop tools that accurately characterize protein-DNA interactions by leveraging more information from biological assays like ChIP-seq and ChIP-exo, and we apply these approaches to understand how TF binding targets can change across cell types and conditions. Our recent work focuses on how the pre-existing chromatin environment impacts the regulatory activities of newly induced TFs, especially in the context of development and cellular programming.
Recent highlights
1. High resolution protein architecture of the budding yeast genome
   MJ Rossi, PK Kuntala, WKM Lai, N Yamada, N Badjatia, C Mittal, G Kuzu, K Bocklund, NP Farrell, TR Blanda, JD Mairose, AV Basting, KS Mistretta, DJ Rocco, ES Perkinson, GD Kellogg, S Mahony, BF Pugh
2. An interpretable bimodal neural network characterizes the sequence and preexisting chromatin predictors of induced transcription factor binding
   D Srivastava, B Aydin, EO Mazzoni, S Mahony
3. Alignment and quantification of ChIP-exo crosslinking patterns reveal the spatial organization of protein-DNA complexes
   N Yamada, MJ Rossi, N Farrell, BF Pugh, S Mahony
Understanding cell fate decisions
While our computational tools can be applied to a broad range of biological datasets, our efforts are ultimately motivated by understanding how transcription factors establish different cell fates. We work collaboratively with several groups of researchers to understand the regulatory networks that underlie developmental and trans-differentiation systems. In particular, our close and long-running collaboration with Esteban Mazzoni’s group at NYU has examined how TFs determine neuronal cell types. Our collaborative work has illustrated that neuronal subtype specification depends on synergistic interactions between regulatory proteins. We have also shown that related TFs with highly similar DNA-binding preferences can nonetheless drive cell fates in different directions, depending on how they interact with established regulatory environments.
Recent highlights
1. Hox binding specificity is directed by DNA sequence preferences and differential abilities to engage inaccessible chromatin
   M Bulajić*, D Srivastava*, JS Dasen, H Wichterle, S Mahony, EO Mazzoni
2. Proneural factors Ascl1 and Neurog2 contribute to neuronal subtype identities by establishing distinct chromatin landscapes
   B Aydin, A Kakumanu, M Rossillo, M Moreno-Estelles, G Garipler, N Ringstad, N Flames, S Mahony†, EO Mazzoni†
3. A multi-step transcriptional and chromatin state cascade underlies motor neuron programming
   S Velasco*, MM Ibrahim*, A Kakumanu*, G Garipler, B Aydin, MA Al-Sayegh, A Hirsekorn, F Abdul-Rahman, R Satija, U Ohler†, S Mahony†, EO Mazzoni†
FUNDING SOURCES
NSF DBI CAREER 2045500
CAREER: Predicting transcription factor binding dynamics across cell types and species
NIH NIGMS MIRA R35-GM144135
Understanding the predeterminants of transcription factor regulatory activity
PREVIOUS FUNDING SUPPORT
NIH NIGMS R01-GM125722
Genome-wide structural organization of proteins within human gene regulatory complexes (01/2018 – 12/2021)
NIH NIGMS R01-GM121613
A 2D segmentation method for jointly characterizing epigenetic dynamics in multiple cell lines (08/2018 – 07/2021)
NSF DBI 1564466
ABI INNOVATION: Characterizing protein-DNA interactions from high-resolution assays (06/2016 – 05/2020)