PhD Dissertation: On the identification and investigation of homologous gene families, with particular emphasis on the accuracy of multidomain families

Presented orally on 1 August 2012.

Abstract

This dissertation addresses the identification and characterization of homologous gene families in large-scale genomic data. Particular emphasis is placed on multidomain gene families: multidomain sequences represent at least half of the sequence universe, yet they present an especially challenging case for family classification. These sequences are often excluded from analyses because they tend to interfere with classification performed with existing methods. This thesis develops the theoretical context for family classification of datasets that contain multidomain sequences, and demonstrates the implementation necessary for performing classification on large datasets.

Five primary results are presented in this work. First, a definition of homology that encompasses the evolutionary scenarios that result in multidomain families is formulated. Second, the techniques and implementation of family classification are presented. The methodology developed takes protein sequence data as input, and, by explicitly considering the evolutionary signal of gene duplication inherent in a sequence similarity network, derives a network that is an accurate estimate of homology. Third, the structure of this network is examined, and compared to the theoretical construct of a network of homology. Fourth, an approach for predicting families from this network is developed. Importantly, a statistical framework is presented for evaluation of the result using a limited set of curated families. Finally, the interplay between domains and the clustering result is examined using an information-theoretic approach.

Dissertation document

PDF

Committee

Dannie Durand (advisor)
Takis Benos
Tanya Berger-Wolf
Russell Schwartz
Mona Singh

Supplementary Material

All software developed as part of this dissertation is supplied under the terms of the GNU General Public License, Version 3. This includes implementations of all methods described in this document, the schema of the relational database, and all code used to produce the data presentation (i.e., figures) in this document. Copyright to all of the above is retained by myself, Jacob Joseph, and Carnegie Mellon University.

Additionally, and central to the use of this data, I will supply the contents of the relational database used as the central data source in this dissertation work. The source data in this database are drawn from a variety of sources, including NCBI, SwissProt, Ygob, and Panther. I place all derived data in the public domain. Not all of these sources, however, provide a clear license for reuse or distribution. To the best of my knowledge, these data sources nonetheless exist to facilitate broad use.

Source code

Much of the software in this dissertation has been written in the Python programming language. A variety of additional Python packages have been employed, including the following. The numerical packages SciPy and NumPy have been used for efficient numerical computation. The NetworkX package has been used for calculation of network metrics. The Biopython package has been used to parse Blast and source database XML formats. Most plots in this document have been produced with Matplotlib; R has been used for the remainder.

The following source code is partitioned into modules that loosely correspond to the chapters of this dissertation. For each method, functionality is divided between a base library and a set of programs that utilize that library.

Python modules

JJutil (JJutil.tar.gz)

Utility libraries.

DurandDB (DurandDB.tar.gz)

Basic interface layer between the DurandLab2 SQL database and all Python code.

JJcluster (JJcluster.tar.gz)

Sequence clustering library and database layer.

JJnetstat (JJnetstat.tar.gz)

Measurement library for sequence networks.

Methods

Neighborhood Correlation

See www.neighborhoodcorrelation.org.

durandlab2 (durandlab2.tar.gz)

DurandLab2 SQL database schema and code for insertion of sequence data.

pfam (pfam.tar.gz)

Wrapper to run Pfam models against sequences in the DurandLab2 database.

pfam_run.py

blast (blast.tar.gz)

Wrapper to run Blast against sequences in the DurandLab2 database.

blast_execute.py

seq_networks (seq_networks.tar.gz)

Tools for analysis of sequence network structure.

cluster_eval (cluster_eval.tar.gz)

Tools for performing clustering and analyzing the result.

domain_eval (domain_eval.tar.gz)

Tools for analysis of the correspondence between domains and clusters using an information-theoretic approach.

misc_utils (misc_utils.tar.gz)

Data output and plotting tools.

DurandLab2 SQL database

The relational database structure described in this document has been implemented using PostgreSQL. The database schema is found in durandlab2_schema.sql, also listed above.

Below, I provide a SQL dump of the DurandLab2 database, in which all source, intermediate, and output data are stored. This database contains all data used in the course of this dissertation work, which is much more than is reported in the dissertation document. As a consequence, it is quite large. On disk, PostgreSQL can be expected to require approximately 80 GB to store this database.

Please contact me if you have more targeted interests within this data. I'd be happy to provide any subset, or could arrange for a limited account on my SQL server.

DurandLab2_24aug2012.dump
Warning: This file is very large, totaling approximately 51 GB.
MD5 Checksum: 37b20511d9dfe5604c9a81d7c97c0075
This dump is compressed output from pg_dump -Fc.
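Because the dump is so large, it is worth checking the download against the MD5 checksum above before attempting a restore. The following is a minimal sketch, not part of the dissertation software: the function names file_md5 and verify_dump are hypothetical, and the file is assumed to have been saved locally under the name given above. The file is read in fixed-size chunks so that it never has to fit in memory.

```python
import hashlib

# Checksum quoted in the listing above.
EXPECTED_MD5 = "37b20511d9dfe5604c9a81d7c97c0075"


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dump(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return file_md5(path) == expected.lower()
```

Calling verify_dump("DurandLab2_24aug2012.dump") should return True for an intact download. Note that a dump in pg_dump's custom format (-Fc) is restored with pg_restore rather than by feeding it to psql.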

Created: 03 Aug 2012
Last Modified: 30 Aug 2012