Over the past eight years, the research interests of Professor Basilis Gidas have been in transcriptional regulatory networks, signal transduction pathways, and ab initio protein folding, using Bayesian statistics and Chomsky-type grammars. The work emphasizes: Myc regulatory networks and pathways in cell growth, cell proliferation, and apoptosis, using microarray and ChIP-chip data and cross-species comparison; finding phosphorylation site motifs via tandem mass spectrometry and structural information about kinases and substrates; and ab initio protein folding using compositional/syntactic representations of proteins.
Past Research Interests
Probability Theory/Mathematical Physics
Probability theory on spaces of generalized functions. Gibbs distributions on spaces of tempered distributions. Construction of 2-D and 3-D quantum field theories. Renormalization of quantum field theory Hamiltonians. Spectral properties of Quantum Hamiltonians. Borel summability of ground states asymptotic expansions.
Elliptic Partial Differential Equations/Differential Geometry
Singular Solutions of the Yang-Mills equations. Free boundary problems and Quark confinement. Symmetry properties, uniqueness, and a priori bounds of solutions of elliptic partial differential equations. Classification of singularities of conformal deformations of Riemannian metrics and other nonlinear elliptic equations.
Present Research Interests
Bayesian Statistics/Computer Vision/Speech Recognition
Metropolis-type Monte Carlo simulation algorithms and simulated annealing. Simulation and optimization via the Langevin equation. Markov Random Field (MRF) estimation and consistency of pseudo-likelihood estimators, and of maximum likelihood estimators from complete or incomplete data. A variational method for estimating MRFs. Nonparametric estimation for continuous-time stochastic processes arising in speech recognition. Object identification via classification trees and stochastic grammars. Renormalization group methods for multiscale/multilevel image processing. Texture representation via MRFs with polynomial interactions. Tracking of moving objects via particle filters. Speech signal representation via nonlinear transformations and wavelets. Classification and clustering of stop consonants via nonlinear transformation and nonlinear discriminant analysis.
Computational Molecular Biology
Probabilistic hierarchical/syntactic models (analogous to Chomsky grammars) for identifying, representing, and analyzing transcription regulatory networks and signal transduction pathways. Identification of genes regulated directly and indirectly by combining microarray expression data, ChIP-chip data, and cross-species comparison information; identification of downstream pathways through which Myc functions in cell growth, cell-cycle proliferation, and apoptosis. Identifying phosphorylation site motifs on the basis of tandem mass spectrometry data, protein-protein interactions, and structural information about kinases and substrates. Protein representation and ab initio folding via hierarchical/syntactic (also known as compositional) models.
Current Projects in Computational Molecular Biology
Cellular processes such as the cell cycle, cell proliferation, apoptosis, cell growth, cell differentiation, genome instability, cellular communication, and responses to external stimuli are governed by interactions among DNA, proteins, RNAs, and a host of other molecules. Understanding the principles and the regulatory mechanisms underlying these processes is a central goal in biology. Our research addresses two aspects of the problem that have been studied extensively and seem to be within reach: (i) transcription regulatory networks and the downstream pathways through which transcription factors (TFs) function in specific cellular processes, and (ii) signal transduction pathways that transmit, process, and integrate external and internal signals. Our research also addresses structural proteomics, especially the ab initio protein folding problem. Advances on these problems over the past few decades have been made possible by the genome sequencing of several species and the rapid development of experimental technologies (such as microarrays, tile-arrays, ChIP-chip, real-time PCR, the yeast two-hybrid assay, tandem mass spectrometry, NMR, and crystallography), as well as by the development of recent tools such as RNAi screening and fluorescent proteins.
A complete understanding of the regulatory networks and signaling pathways entails mathematical/probabilistic models that articulate complex biochemical phenomena and integrate multiple sources of biological knowledge and experimental data from more than one technology. The models need to represent phenomena at multiple levels. At the local level, the models must articulate the spatio-temporal cooperation and coherence of complex interactions of DNA, proteins, RNAs, and signal transducers, as well as the spatio-temporal distributions and abundance profiles of the molecules; these dependencies underlie the regulatory controls that determine, for example, gene expression profiles and cellular decisions such as apoptosis and transitions from one cell-cycle phase to the next. At the global level, the models must articulate global regularities or patterns that represent the "syntax" or overall architecture of a network, pathway, or 3-D structure of a protein. The precise nature of the global and local aspects of a model is problem dependent. For example, a gene-finding model at the global level must represent the "syntax" of the concept "gene" as a collection of "motifs" or genomic sequences (e.g. TATA box, 5'UTR region, initial exon, alternating exon/intron, 3'UTR, poly-A tail, intergenic regions, etc.) concatenated according to precise but "random" rules that allow, for example, the absence of a TATA box, a single exon, or an arbitrary number of exons; at the local level, the model must articulate the local variability of each motif or signal. Similar two-level descriptions are necessary for models predicting the secondary structure of rRNAs or the 3-D structure of a protein. In transcription regulatory networks and pathways, the global representation includes the concatenation of a hierarchy of entities, e.g. small motifs or patterns that concatenate to form a module, modules that concatenate to form larger modules, which in turn concatenate to form networks.
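As a rough illustration of the global "syntax" of a gene described above, the following Python sketch encodes the allowed orderings of motifs as a small transition table and checks whether a sequence of motif labels is admissible. The label names and the transition table are assumptions made for this example only; a probabilistic version would attach a probability to each transition (as in an HMM), which this sketch omits.

```python
# Illustrative only: label names and allowed transitions are assumptions,
# not the model used in the research described here.
TRANSITIONS = {
    "START": {"TATA", "5UTR"},    # the TATA box may be absent
    "TATA": {"5UTR"},
    "5UTR": {"EXON"},
    "EXON": {"INTRON", "3UTR"},   # a single exon is allowed
    "INTRON": {"EXON"},           # exons and introns alternate
    "3UTR": {"POLYA"},
    "POLYA": {"END"},
}


def is_valid_gene(labels):
    """Return True if the label sequence follows the toy gene syntax."""
    state = "START"
    for label in list(labels) + ["END"]:
        if label not in TRANSITIONS.get(state, set()):
            return False
        state = label
    return True


single_exon = ["5UTR", "EXON", "3UTR", "POLYA"]
three_exons = ["TATA", "5UTR", "EXON", "INTRON", "EXON",
               "INTRON", "EXON", "3UTR", "POLYA"]
print(is_valid_gene(single_exon), is_valid_gene(three_exons))
```

The table captures only the global level; the local variability of each motif (for example, the actual bases of a TATA box) would be modeled separately.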
Bayesian statistics and probability provide a natural framework for designing both the local and global aspects of the models, and for accommodating multiple sources of data. The framework supports powerful computational algorithms such as dynamic programming and Monte Carlo type simulation and optimization algorithms. In many ways, the study of transcription regulation, signaling pathways, and the structure of proteins and RNAs has a great deal in common with the study of computer vision, speech recognition, and other cognition problems. Our research aims at exploring existing, and developing novel, hierarchical/syntactic models similar to Chomsky grammars (which include HMMs and context-free grammars) for articulating the global properties of specific tasks in genomics, proteomics, and structural proteomics. Our current focus is on the following three projects:
1. Myc Network and Pathways:
The c-MYC protein (a transcription factor) has been implicated in a number of biological processes including cell growth, apoptosis, cell proliferation, and cancer. It is believed that MYC regulates the expression of about 10-15% of human genes (more than any typical transcription factor). Some of these genes are regulated directly (MYC binds in the promoter or somewhere in the vicinity of a gene), while others are regulated indirectly (MYC regulates, directly or indirectly, genes of other transcription factors which, in turn, regulate a particular gene directly). Myc functions both as a transcription enhancer and as a transcription repressor; moreover, there are indications that there exist regulated switches between "activation" and "repressor" Myc states, depending on the physiological state of a cell. As an enhancer, Myc typically acts by forming a heterodimer with Max, and the Myc/Max dimer binds to the canonical E-box; but Myc is also known to bind to non-canonical motifs, and it is believed to do so through other partners. The Myc/Max dimer does not bind to all canonical E-boxes of the genome, but prefers canonical E-boxes in CpG island regions; moreover, Myc-bound loci are highly acetylated before binding. Enhancement by Myc/Max is antagonized by Max/Mad and Max/Mnt dimers, which bind to the same E-boxes as Myc/Max and inhibit transcription. As a repressor, Myc acts via the Myc/Max dimer forming a complex with Miz1 (and possibly with other proteins as well), which binds near the INR point.
Finding the genes targeted by Myc, correlating and quantifying the effect of Myc binding on gene expression level, identifying the crucial targets of Myc, and assigning target genes involved in the cell cycle and apoptosis are problems of fundamental interest. In our work we study these problems by exploring hierarchical models and employing Bayesian computational algorithms that integrate three types of information or data: (i) Cross-species DNA sequence comparison (especially human and mouse) to identify genome segments that have been conserved by evolution. Such regions typically have a functional role, and MYC binding sites tend to be conserved by evolution; (ii) Chromatin immunoprecipitation array (ChIP-chip) data; this high-throughput technology localizes MYC (or any specific transcription factor) binding sites to within 1000-2000 DNA base pairs; we combine this information with known MYC motifs (E-box) and cross-species comparison information to find potential binding sites for MYC via a Monte Carlo type procedure; (iii) Gene expression microarray data; these data are employed to cluster genes into Myc target genes and genes that are not affected by MYC, as well as to group genes according to their expression profiles over time.
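To make the combination of data types (i) and (ii) concrete, here is a minimal Python sketch that scans a DNA sequence for the canonical E-box and keeps only the hits that fall inside a ChIP-chip region and lie entirely on conserved positions. It is a simple deterministic filter, not the Monte Carlo type procedure mentioned above. The function name, the half-open `(start, end)` region format, and the per-base boolean conservation mask are assumptions for illustration; the E-box sequence CACGTG is the commonly cited canonical Myc/Max motif and is not stated in the text above.

```python
import re

EBOX = "CACGTG"  # commonly cited canonical E-box (assumption: not given in the text)


def candidate_sites(sequence, chip_regions, conserved):
    """Return start positions of E-boxes that lie inside a ChIP region
    (half-open intervals) and cover only conserved bases."""
    hits = []
    # A lookahead is used so that overlapping matches would not be skipped.
    for match in re.finditer("(?=%s)" % EBOX, sequence.upper()):
        start = match.start()
        end = start + len(EBOX)
        in_chip = any(lo <= start and end <= hi for lo, hi in chip_regions)
        is_conserved = all(conserved[i] for i in range(start, end))
        if in_chip and is_conserved:
            hits.append(start)
    return hits


# Toy data, invented for this example.
seq = "AAACACGTGTTTTCACGTGAA"
regions = [(0, 12)]            # one hypothetical ChIP-chip region
mask = [True] * len(seq)       # pretend every base is conserved
print(candidate_sites(seq, regions, mask))
```

In this toy input, the E-box at position 3 is kept, while the one at position 13 is discarded because it falls outside the ChIP region.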
2. Signal Transduction Pathways in Mast Cells:
Mast cells have a physiological role (they contribute positively to the immune system) and a pathological role (they play a central role in allergies, including asthma). Our project focuses primarily on their pathological role. Upon activation by an allergen, mast cell signaling pathways have three main branches: (a) one towards degranulation (and the associated production of toxic molecules such as histamine), (b) another towards gene transcription of cytokines and chemokines, and (c) yet another towards production of eicosanoids (lipid-type mediators). Tandem mass spectrometry (MS/MS) is the most promising high-throughput technology for collecting data on mast cell signaling, and on signaling pathways in general. MS/MS produces time series data for phosphorylated proteins. The project addresses three fundamental mathematical/computational problems: (i) identification of the proteins that participate in the pathways, (ii) clustering of the proteins on the basis of their phosphorylation profiles, and (iii) determining the topology of the pathway network and analyzing its dynamical behavior. A key tool for solving problem (i) is a Bayesian statistical model for peptide fragmentation and generation of "theoretical" MS/MS spectra. Clustering of proteins (problem (ii)) is based on a generalization of the well-known K-means clustering algorithm; the generalization involves a mixture of Gaussian probability densities whose parameters are learned via the EM algorithm. Problem (iii) is the most challenging and is far from being fully understood, primarily because the current data do not contain sufficient information to determine the topology of the pathway network. For this reason, we focus mainly on a sub-network near the receptor that contains a universal motif or module. To study this sub-network we devise a stochastic dynamical system.
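The clustering step in problem (ii) can be illustrated with a minimal Gaussian mixture fitted by EM. The Python sketch below is a one-dimensional toy version under stated assumptions: real phosphorylation profiles are time series (vectors), and the function name, the initialization by user-supplied means, and the variance floor are choices made for this example, not details of the project's model. Where K-means assigns each point to exactly one cluster, the E-step here computes soft responsibilities.

```python
import math


def gmm_em_1d(xs, init_means, n_iter=50):
    """Fit a 1-D Gaussian mixture by EM; return weights, means, variances, labels."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300  # guard against division by zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a degenerate variance
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels


# Toy data, invented for this example: two well-separated groups.
data = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
weights, means, variances, labels = gmm_em_1d(data, init_means=[1.0, 4.0])
print(labels)
```

With equal weights and a shared variance that shrinks toward zero, the soft assignments become hard and the procedure reduces to K-means, which is the sense in which the mixture model generalizes it.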
3. Ab Initio Protein Folding:
Our program views the ab initio prediction of a protein's 3-D structure as a "coding" problem (which is often referred to in the biology literature as the "second code" of biology). We believe that the protein folding problem is analogous to the cortical representation (or "code") of languages, objects, scenes, and actions; these representations are believed to be hierarchical/syntactic or compositional in the sense pioneered by Chomsky. Proteins exhibit some natural hierarchies: atoms combine to form backbones and side chains; amino acids combine to form secondary structure elements, which, in turn, combine to form the overall tertiary and quaternary structures; moreover, helices have beginnings (N-caps or N-termini), cores, and ends (C-caps or C-termini). Beyond these hierarchies, proteins contain motifs or patterns that are central to their function; these include protein motifs (such as Helix-Turn-Helix or b/HLH/Zip patterns) involved in the recognition of DNA binding sites, and phosphorylation site motifs (on substrates) recognized by kinases and other substrate-binding proteins or molecules. Identifying the repertoire of protein motifs, understanding the rules by which they are concatenated in a protein, and understanding their interactions with DNA or other proteins and molecules are problems far from being understood, and may require new experimental techniques and the design of suitable libraries of "peptides". Our research focuses on the design of appropriate syntactic rules that articulate contextual constraints, such as: secondary structure elements need to be compatible with the hydrophobic core and the hydrophilic exterior of the overall tertiary structure; edge and interior strands in β-sheets have distinct properties; the expected link length between turns depends on the protein class (α/α, β/β, α/β); for example, α-helical segments bounded by turns contain twice as many residues as similar β-strand segments. The underpinning probabilistic model for incorporating these hierarchies is a compositional/syntactic model that contains Chomsky's context-free grammars, but is more computationally feasible than context-sensitive grammars. The computational algorithm involves a coarse-to-fine implementation whereby we start with a simplified representation of proteins and proceed to higher and higher resolution representations in which more and more atomic details of protein and solvent are incorporated.
Many grants have been awarded.