Guide to The BioCyc Database Collection

This document provides an overview of the BioCyc collection of Pathway/Genome Databases.

This guide currently covers a limited set of topics; it will expand over time to cover additional aspects of BioCyc.

The information in this document pertains to all BioCyc databases (DBs), and to most other DBs created using the Pathway Tools software. More detailed information about specific members of the BioCyc family is available as follows:

The BioCyc Databases

The BioCyc collection of Pathway/Genome Databases (PGDBs) provides electronic reference sources on the pathways and genomes of different organisms. The databases (DBs) within the BioCyc collection are organized into tiers according to the amount of manual review and updating they have received.

We encourage scientists to adopt the Tier 2 and Tier 3 DBs for ongoing curation and updating.

BioCyc databases describe organisms with completely sequenced genomes (not all genomes are closed). Most BioCyc databases are for microbes. In addition, BioCyc contains databases for humans; for important model organisms such as yeast, fly, and mouse; and for other organisms whose PGDBs have been developed by groups outside SRI. One reason for collecting all these PGDBs together within BioCyc is to enable the comparative analyses that become possible when multiple PGDBs are available within one site (see Tools → Comparative Analysis).

Most microbial PGDBs within BioCyc have been generated computationally by SRI and are regenerated every 6-12 months to take advantage of improvements in our pathway prediction algorithms and in the MetaCyc pathway database. The PGDBs within BioCyc that have been provided by outside groups are updated with variable frequencies. Usually the date on which a PGDB was generated or last updated can be determined by selecting that PGDB as the current PGDB and then viewing the page at Tools → Reports → Summary Statistics or Tools → Reports → History of Updates.

Looking for pathway databases for other organisms? PGDBs have been created outside SRI for many organisms, including microbes, fungi, plants and animals [details].

What Mechanisms Exist for Accessing BioCyc Data?

BioCyc data is accessible in several ways, which are described in more detail on the downloads page.

Important Concepts

This section introduces a number of concepts that are important to understanding PGDBs.

How are Pathway Boundaries Defined?

Pathway boundaries are defined heuristically, using the judgement of expert curators. Curators consider the following aspects of a pathway when defining its boundaries.

The preceding philosophy toward pathway boundary definition contrasts sharply with KEGG maps. KEGG maps are on average 4.2 times larger than BioCyc pathways because KEGG tends to group into a single map multiple biological pathways that converge on a single metabolite [Pathway05].

Super Pathways and Base Pathways

We define a super-pathway as a cluster of related pathways. Typically, a super-pathway consists of a linked set of smaller pathways that share a common metabolite. For example, the super-pathway named "superpathway of phenylalanine, tyrosine, and tryptophan biosynthesis" consists of several pathways that converge at the metabolite chorismate.

The components of a super-pathway can include base pathways (pathways that are not themselves super-pathways), other super-pathways, and individual reactions that have not necessarily been assigned to base pathways. Those reactions typically serve to connect the component pathways within a super-pathway.

Super-pathways are stored within each BioCyc PGDB -- they are not computed dynamically.
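The nesting described above can be pictured with a small Python sketch. The `Pathway` class, its field names, and the helper functions are hypothetical illustrations, not part of the Pathway Tools schema or API. The sketch only assumes that a super-pathway holds a list of component pathways plus some directly attached connecting reactions.

```python
from dataclasses import dataclass, field


@dataclass
class Pathway:
    """Hypothetical stand-in for a stored pathway record."""
    name: str
    # Reactions attached directly to this pathway (for a super-pathway,
    # these are the connecting reactions).
    reactions: list[str] = field(default_factory=list)
    # Component pathways; empty for a base pathway.
    sub_pathways: list["Pathway"] = field(default_factory=list)

    @property
    def is_super_pathway(self) -> bool:
        return bool(self.sub_pathways)


def base_pathways(pathway: Pathway) -> list[Pathway]:
    """Return every base pathway nested under `pathway`, depth first."""
    if not pathway.is_super_pathway:
        return [pathway]
    found: list[Pathway] = []
    for sub in pathway.sub_pathways:
        found.extend(base_pathways(sub))
    return found


def all_reactions(pathway: Pathway) -> list[str]:
    """Collect reactions from the pathway and all nested components, without duplicates."""
    seen: dict[str, None] = {}
    stack = [pathway]
    while stack:
        current = stack.pop()
        for rxn in current.reactions:
            seen.setdefault(rxn, None)
        stack.extend(current.sub_pathways)
    return list(seen)
```

Because a super-pathway may contain other super-pathways, the traversal has to recurse (or use a stack) rather than look only one level down.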

Do We Force a Pathway View of the Metabolic Network?

No. Pathways form a layer defined on top of the metabolic network. Users can compute with the metabolic (reaction) network directly and ignore the pathway layer if they wish. Note also that in most PGDBs some metabolic reactions are not assigned to any metabolic pathway.
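As a minimal sketch of working at the reaction level, the Python below builds a compound graph from reactions alone and lists the reactions that have no pathway assignment. The data layout, the reaction and compound names, and the function names are all hypothetical. A real PGDB would be read through one of the access mechanisms described on the downloads page.

```python
from collections import defaultdict

# Hypothetical toy data: reaction id -> (left compounds, right compounds, pathway or None).
REACTIONS = {
    "R1": (["A"], ["B"], "pathway-1"),
    "R2": (["B"], ["C"], "pathway-1"),
    "R3": (["C"], ["D"], None),  # not assigned to any pathway
}


def metabolite_graph(reactions):
    """Link compounds that share a reaction, in both directions.

    Edges are undirected here because the stored left/right order
    is not assumed to carry physiological direction.
    """
    graph = defaultdict(set)
    for left, right, _pathway in reactions.values():
        for a in left:
            for b in right:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def connected(graph, start):
    """Return all compounds connected to `start` through any chain of reactions."""
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return seen


def unassigned_reactions(reactions):
    """Return ids of reactions that belong to no pathway."""
    return sorted(rid for rid, (_l, _r, pathway) in reactions.items() if pathway is None)
```

Nothing in the graph construction consults the pathway field, so reactions without a pathway assignment take part in the network like any other.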

Reaction Direction

How do PGDBs handle reaction direction?

The direction in which a reaction is stored in a PGDB has no implication for the physiological directionality of that reaction. Each reaction is stored as an instance of the Reactions class that includes two slots, Left and Right. A stored reaction may be bidirectional, may proceed physiologically in the left-to-right direction, or may proceed in the right-to-left direction.

The equilibrium constant and change in Gibbs free energy stored for the reaction (if any) refer to the direction of the reaction as stored.
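One practical consequence: if a reaction is needed in the direction opposite to the stored one, the stored thermodynamic values must be converted. Reversing a reaction inverts its equilibrium constant and negates its Gibbs free energy change. The Python sketch below illustrates this. The `StoredReaction` class and its field names are hypothetical and only mirror the Left and Right slots in spirit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredReaction:
    """Hypothetical record; not the actual Pathway Tools frame layout."""
    left: list[str]
    right: list[str]
    keq: Optional[float] = None      # equilibrium constant for left -> right as stored
    delta_g: Optional[float] = None  # Gibbs free energy change for left -> right as stored


def as_right_to_left(rxn: StoredReaction) -> StoredReaction:
    """Return the same reaction written in the opposite direction.

    Swapping sides inverts the equilibrium constant and flips the sign
    of the free energy change; missing values stay missing.
    """
    return StoredReaction(
        left=list(rxn.right),
        right=list(rxn.left),
        keq=None if rxn.keq is None else 1.0 / rxn.keq,
        delta_g=None if rxn.delta_g is None else -rxn.delta_g,
    )
```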

Currently, the best way to query the direction of a reaction is via an internal Pathway Tools Lisp function called get-rxn-direction. In the future, a field will be added to the Pathway Tools schema to record this information.

Reaction Balancing and Protonation State in BioCyc

Background and Motivations

This section addresses the state of reaction mass balance and protonation state of chemical compounds in the BioCyc databases. Because these issues are still evolving and are influenced to a large degree by history, we include a historical discussion of these issues.

Our long-term goal is for all reactions in BioCyc to be fully mass balanced and charge balanced, and for all chemical compounds to be properly protonated at cellular pH. Although in some cases such a treatment may yield reactions or chemical structures that look non-traditional to biochemists, we believe this approach provides the most consistent and correct treatment. In addition, it provides a treatment that will facilitate automatic generation of flux-balance models from PGDBs.

Historically, the chemical structure data within BioCyc databases has been obtained from many different sources, including textbooks, articles from the primary research literature, and certain open databases. In the early years of the project we developed programs to check the mass balance and element balance of reactions within BioCyc databases. These programs proved extremely valuable because identifying unbalanced reactions allowed us to find errors in both the reaction equations and the chemical structures. However, we also found that, because of the diverse sources from which we obtained chemical structure data, the structures were protonated inconsistently. Therefore, for many years we ignored element imbalances due to hydrogen only, while correcting imbalances due to other elements.

In 2008, we began to address the problem of inconsistent protonation to facilitate automatic generation of flux-balance models. Work was completed on ensuring that reactions in the MetaCyc and EcoCyc PGDBs are completely mass-balanced. The first fully mass-balanced releases of MetaCyc and EcoCyc were version 13.0 in early 2009. In time, other BioCyc PGDBs will become mass balanced as well. For example, because we periodically regenerate the Tier 3 BioCyc PGDBs, the next time these PGDBs are generated from version 13.0 or higher of MetaCyc, they will be based on the consistently protonated compounds and the fully mass-balanced reactions.

The following sections describe the methodology by which the protonation-state normalization and reaction mass balancing were achieved.

Protonation State Normalization

A chemical compound can contain atoms that bind a variable number of hydrogen atoms, depending on the chemical structure and the pH of the environment. Isomers of a compound that differ in the number of hydrogens bound to these atoms are called proto-isomers. The atoms with variable numbers of bonded hydrogens are called the proto-isomerization centers of the compound. Oxygen, sulfur, phosphorus, and nitrogen are typical proto-isomerization centers.

To bring greater consistency to our PGDBs, we protonated the compounds of EcoCyc at a reference pH of 7.3; that is, we assigned the correct number of bound hydrogens to each compound's proto-isomerization centers. We used the Marvin computational chemistry software (version 5.1.02), available from ChemAxon, Ltd [1]. The pH value of 7.3 was selected based on a paper on the measurement of the cytoplasmic pH of E. coli [2]. So that compound data can be exchanged easily between MetaCyc and EcoCyc, MetaCyc was also protonated at a reference pH of 7.3. For MetaCyc this step is an approximation, since MetaCyc contains reactions and compounds from many organisms and many cellular compartments.

The Marvin software calculates the protonation state of a compound's proto-isomerization centers by first determining their pKa values. The pKa values are obtained by computing the partial charge distribution, which in turn is calculated by a numerical partial differential equation solver from the structure of the compound and the known electronegativities of its constituent atoms. Although we have worked with ChemAxon to improve the accuracy of these calculations against the experimentally verified pKa values of many biochemically relevant compounds, the calculation is still an approximation and will not necessarily yield fully correct pKa values for every substance.
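To see how a pKa value translates into a protonation state at the reference pH, consider a single acidic site in isolation. The Python sketch below applies the textbook Henderson-Hasselbalch relation. It is not Marvin's algorithm: it assumes the pKa is already known, it ignores interactions between multiple sites, and the function names are illustrative only.

```python
REFERENCE_PH = 7.3  # reference pH used for EcoCyc and MetaCyc, per the text above


def fraction_deprotonated(pka: float, ph: float = REFERENCE_PH) -> float:
    """Fraction of an acidic site in its deprotonated form at the given pH.

    Henderson-Hasselbalch for a single site: [A-]/[HA] = 10 ** (pH - pKa).
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def dominant_form(pka: float, ph: float = REFERENCE_PH) -> str:
    """Name the majority form of the site; an exact tie is reported as protonated."""
    return "deprotonated" if fraction_deprotonated(pka, ph) > 0.5 else "protonated"


# A carboxylic acid group with pKa near 4.76 (acetic acid) is almost entirely
# deprotonated at pH 7.3, while a site with pKa 10.5 stays protonated.
carboxyl_state = dominant_form(4.76)
high_pka_state = dominant_form(10.5)
```

A site whose pKa lies well below 7.3 is drawn deprotonated, and a site whose pKa lies well above 7.3 is drawn protonated. Sites with a pKa close to the reference pH are where errors in the computed pKa matter most.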

Some caveats about our protonation of compounds:

Computational Reaction Balancing for Hydrogen

Once the compounds of EcoCyc and MetaCyc were protonated, all reactions that had a mass-imbalance due only to hydrogen atoms were computationally balanced. This balancing procedure added or removed instances of the proton from the appropriate side of a reaction to achieve mass-balance.
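The balancing step can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the code used for EcoCyc and MetaCyc. The function names and the data layout (a dictionary of compound to coefficient per side, with element counts per compound) are hypothetical, and charge is not tracked. The example formulas are written for the charged forms of the compounds.

```python
from collections import Counter

PROTON = "H+"

# Element counts per compound, written for the charged forms (illustrative data).
FORMULAS = {
    "ATP": {"C": 10, "H": 12, "N": 5, "O": 13, "P": 3},
    "ADP": {"C": 10, "H": 12, "N": 5, "O": 10, "P": 2},
    "phosphate": {"H": 1, "O": 4, "P": 1},
    "H2O": {"H": 2, "O": 1},
    PROTON: {"H": 1},
}


def element_totals(side, formulas):
    """Sum element counts over one side; `side` maps compound -> coefficient."""
    totals = Counter()
    for compound, coeff in side.items():
        for element, count in formulas[compound].items():
            totals[element] += coeff * count
    return totals


def _shift_protons(heavy, light, surplus):
    """Remove protons from the hydrogen-heavy side first, then add the rest to the other side."""
    removable = min(heavy.get(PROTON, 0), surplus)
    if removable:
        heavy[PROTON] -= removable
        if heavy[PROTON] == 0:
            del heavy[PROTON]
    remaining = surplus - removable
    if remaining:
        light[PROTON] = light.get(PROTON, 0) + remaining


def balance_hydrogen(left, right, formulas):
    """Return new (left, right) sides balanced for hydrogen by adjusting protons.

    Raises ValueError if any element other than hydrogen is imbalanced,
    since such reactions need manual curation rather than proton fixes.
    """
    lt = element_totals(left, formulas)
    rt = element_totals(right, formulas)
    diff = {e: lt[e] - rt[e] for e in set(lt) | set(rt) if lt[e] != rt[e]}
    other = sorted(set(diff) - {"H"})
    if other:
        raise ValueError(f"imbalance in elements other than hydrogen: {other}")
    left, right = dict(left), dict(right)
    surplus = diff.get("H", 0)  # positive: the left side carries extra hydrogen
    if surplus > 0:
        _shift_protons(left, right, surplus)
    elif surplus < 0:
        _shift_protons(right, left, -surplus)
    return left, right
```

For example, calling `balance_hydrogen({"ATP": 1, "H2O": 1}, {"ADP": 1, "phosphate": 1}, FORMULAS)` adds one proton to the right side, because the left side carries one more hydrogen than the right.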

Some caveats about our computational reaction balancing:

Statistics on Reaction Balance and Protonation circa 2009

This table provides information on the small-molecule reaction balance state for both EcoCyc and MetaCyc as of early 2009. The categories below represent reactions that are balanced, unbalanced, and those for which it is not possible to determine the balance state.

Reactions that remain unbalanced have non-trivial imbalances (i.e., imbalances not due solely to hydrogens or protons). These imbalances usually stem from omissions or errors in the structures and/or reaction composition obtained from the literature. Our curation staff are actively researching such compounds and reactions and correcting the data whenever possible.

Reactions for which it is not possible to determine the balance state are mainly due to:

                                              Number of Reactions
EcoCyc: Balanced Reactions                    801
EcoCyc: Unbalanced Reactions                  3
EcoCyc: Reactions that cannot be balanced     160
MetaCyc: Balanced Reactions                   5,098
MetaCyc: Unbalanced Reactions                 317
MetaCyc: Reactions that cannot be balanced    1,143
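For readers who want proportions rather than raw counts, the short Python sketch below computes the balanced share of small-molecule reactions in each database from the figures in the table. The dictionary layout and function name are illustrative only.

```python
# Counts copied from the table above (early 2009).
COUNTS = {
    "EcoCyc": {"balanced": 801, "unbalanced": 3, "cannot_be_balanced": 160},
    "MetaCyc": {"balanced": 5098, "unbalanced": 317, "cannot_be_balanced": 1143},
}


def balanced_share(database: str) -> float:
    """Fraction of a database's counted reactions that are balanced."""
    counts = COUNTS[database]
    return counts["balanced"] / sum(counts.values())


for name in COUNTS:
    print(f"{name}: {balanced_share(name):.1%} balanced")
```

By this arithmetic, roughly 83% of the counted EcoCyc reactions and roughly 78% of the counted MetaCyc reactions were balanced at that time.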

Comparison of BioCyc to Other Pathway Databases

Please see the comparison section of the MetaCyc Guide.

How Do I Learn More About PGDBs and BioCyc?

The following information resources are available.

Acknowledgements

BioCyc is grateful to the following groups:

References

  1. ChemAxon's Marvin software for computational chemistry
  2. Wilks, J.C., Slonczewski, J.L. pH of the cytoplasm and periplasm of Escherichia coli: rapid measurement by green fluorescent protein fluorimetry. J Bacteriol. 2007 Aug;189(15):5601-7. Epub 2007 Jun 1.