Difference between revisions of "Plant Structure Ontology Principles"
Latest revision as of 02:20, 4 December 2008
Plant Ontology: Principles and rationales
Objectives
The main objective of the POC project is to create a set of defined terms that can be uniformly applied to describe the anatomy and morphology of flowering plants, providing a semantic framework for meaningful cross-species queries across databases. In order to make meaningful queries, the terms themselves must be organized in a way that reflects their known biological relationships. The purpose of such structure is to integrate existing species-specific vocabulary terms into a unified ontology for flowering plants, which will facilitate functional annotation efforts, such as annotation of gene expression data and phenotypes.
The first task of the POC is to efficiently integrate the diverse vocabularies currently in use to describe Arabidopsis, maize and rice anatomy and morphology. Thus, the first version of the Plant Structure Ontology spans two major taxonomic divisions: monocots and dicots. In coming years, the POC will extend this controlled vocabulary to encompass legumes, Solanaceae and other plant families.
It is important to emphasize that the ontologies are not an extensive collection of botanical terms, but rather a complex hierarchical structure in which botanical concepts are described by their meaning and by their relationships to each other. The educational value of the plant ontology is therefore limited to some extent; this limit is imposed by the structure of the ontology itself and by the limitations of the current software.
Organizing principles and rationales
The Plant Ontology represents a first step toward a unified vocabulary for flowering plants. Plant phenotypic descriptors (e.g., gynoecium, leaf) are often common English words that have been applied with varying degrees of precision; the same term can be applied to quite different structures (e.g., floret in Compositae and in Poaceae), or conversely different terms can be applied to quite similar structures (e.g., some legumes and follicles). To simplify the problem, we have focused on only a few model organisms (Arabidopsis, rice and maize), even though the ultimate goal is to incorporate all flowering plants. Current tools for constructing ontologies are strictly hierarchical, whereas the description of a phenotype is fundamentally non-hierarchical. This creates a built-in tension between the lack of hierarchy in the terms and the formal hierarchy of the ontology. We recognize that this problem will get worse as more taxa are added. The current version of the Ontology is thus meant to be a stepping stone, so that annotation of genes can proceed, but it is far from a final product.
I. General considerations
The following principles were adopted by POC ontology developers:
- To create a biologically accurate ontology, while at the same time keeping in mind its applications as a driving force (i.e., annotations and query results). We have come to the realization that the practical use of the ontology, for both annotation and querying, will in many ways dictate its structure.
- To keep the ontology simple. We feel it is important to avoid the tendency to be too inclusive, which leads to a massive proliferation of terms. Such over-population of terms defeats the purpose of having a simple, broadly applicable ontology.
- Relationships between entities are not defining. In order to avoid 'term proliferation', we consider all 'generic' instances of terms to include all possible components. For certain high nodes (for instance, inflorescence), this required us to compromise the biology itself a great deal, i.e., to 'ignore' the enormous morphological diversity of these structures in the flowering plants, and to take full advantage of different types of synonymy instead (see the Rationale for inflorescence and fruit terms).
- Avoid creation of 'species specific' terminology. Rather than create species-specific terms we choose to take advantage of filtering options available in the current ontology browsing software (allowing for species specific queries).
- Reuse existing tools and resources as much as possible. For most of the plant ontology, we have adopted the current structure of the Gene Ontology (GO), as well as the software tools developed by the Gene Ontology Consortium (GOC). However, considering that the PO is a 'generic' plant anatomical ontology, while GO operates through three biological domains (molecular function, biological process and cellular component), certain differences were unavoidable, such as the 'sometimes part_of' and 'develops_from' relationship types.
II. Ontology structure rationale
What constitutes a term in the PO?
The following four criteria are considered when creating terms: morphology, anatomy, derivation and position. 'Generic' terms describing anatomical parts, ranging from organs and tissues to cell types, are generally included in PO. In addition, a number of 'grouping' higher-node terms are created with the purpose of classifying the main 'branches' of the ontology (terms such as infructescence or phyllome).
Subcellular structures are excluded from PO (terms like filiform apparatus, sieve plate, primary endosperm nucleus); these terms belong to the cellular component domain of the Gene Ontology. The following terms are exceptions: pollen tube (PO:0006345) and polar nucleus (PO:0020095). As a general rule, we consider as an exception any botanical term for a subcellular structure that is currently missing from the Gene Ontology AND is required for annotation purposes.
Attributes of the anatomical parts are to a large extent avoided in the PO. Examples: the obsoleted terms 'lacunar collenchyma' and 'lamellar collenchyma', both attributes of the term 'collenchyma'.
True path rule
The true path rule (TPR) states that "the pathway from a child term all the way up to its top-level parent(s) must always be true". One of the implications is that the part_of relationship type used in PO (outlined in more detail in the developers' style guide) is restricted to cases where a child term must always be part_of its parent. However, the morphological diversity of flowering plants is enormous: in the inflorescence node, for example, no floral part is necessarily part of every instance of the flower in all flowering plants (i.e., always part_of). The part_of relationship type in PO is therefore adopted as 'sometimes part_of', in which case the TPR does not hold in a strict sense. The third relationship type, 'develops_from', is a more radical example of a violation of the true path rule. One example is the term 'axial cell' (PO:0000081). This cell type is not meristematic; it is a differentiated cell of the vascular system. However, it occurs as a child of a meristematic cell. Fully aware of the limitations of this relationship type, PO developers agreed that the 'develops_from' relationship (and consequently gene annotations associated with terms that have this relationship) should not be propagated beyond the first parental node (in this case, fusiform initial):
% cell
 % meristematic cell
  % initial cell
   % cambial initial
    % fusiform initial
     ~ axial cell
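The propagation rule described above can be sketched in a few lines of Python. This is only an illustration, not the POC's own tooling: the function name `propagation_targets` and the dictionary layout are hypothetical, and reading `%` as is_a and `~` as develops_from is an assumption based on the symbols used in the listing.

```python
# Each term maps to its (relationship, parent) edges.
# Edges follow the axial cell listing; "%" is read as is_a and "~" as
# develops_from (an assumption, see the lead-in).
EDGES = {
    "axial cell": [("develops_from", "fusiform initial")],
    "fusiform initial": [("is_a", "cambial initial")],
    "cambial initial": [("is_a", "initial cell")],
    "initial cell": [("is_a", "meristematic cell")],
    "meristematic cell": [("is_a", "cell")],
    "cell": [],
}


def propagation_targets(term, edges=EDGES):
    """Return every term that an annotation made to `term` is propagated to."""
    targets = {term}
    for relationship, parent in edges.get(term, []):
        if relationship == "develops_from":
            # Stop at the first parental node: do not climb any further.
            targets.add(parent)
        else:
            # Other relationship types propagate all the way to the top.
            targets |= propagation_targets(parent, edges)
    return targets


if __name__ == "__main__":
    print(sorted(propagation_targets("axial cell")))
    print(sorted(propagation_targets("fusiform initial")))
```

Under these assumptions, an annotation to axial cell reaches only axial cell and fusiform initial, whereas an annotation made directly to fusiform initial reaches every term up to cell.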
Issue of granularity (synonyms, instantiations, species-specific terms, 'sensu' terms)
Species-specific terms are included as separate entities only when they are required for annotation purposes. In many cases, granular terms are included as synonyms of the generic term, i.e., instances were 'converted' into synonyms. The best example is the term 'inflorescence' (PO:0009049), which currently has 14 synonyms (cob, cyme, panicle, raceme, etc.). The same rationale is used for species-specific terms. We have therefore taken full advantage of the different concepts of synonymy in the ontology, previously described by GO.
However, in some cases, species-specific terms are necessary to accommodate gene annotation. In such cases, extensive instantiation was required. Since a node should never be more species-specific than any of its children, special care was taken when creating more granular terms to make sure that a generic parent exists. The current GO structure prohibits use of the same generic term under multiple instances of general terms. Instead, each use of the generic term must be specified as a particular instance such that the hierarchy above it is embedded in the term itself (i.e., child nodes can be at the same level of specificity as the parent node(s), or more specific).
'Sensu' terms
Taxon-specific high nodes are generally avoided. To avoid massive and unnecessary proliferation of sensu terms, the decision was made to include 'sensu' terms in only a few special cases. The best examples are the terms 'floret' and 'floret (sensu Poaceae)'. By our current convention, any generic term applicable to a broader range of flowering plants, or any term common to the three 'model' species (Arabidopsis, maize and rice), can be included in PO (if such a term is required). However, there are cases where a term has different meanings when applied to different taxa. Such terms are distinguished from one another by their definitions and by the sensu designation (sensu means 'in the sense of'), for instance, the term floret (sensu Poaceae). Using the 'sensu' reference makes the node available to other species that use the same term. A node should be divided into sensu sub-trees where the children are or are likely to be different. Since the grass floret is different from that of Asteraceae, the term floret (PO:0009082) was instantiated to floret (sensu Poaceae). Furthermore, a sub-tree is generated by creating another, more granular child term, ear floret, because the ear floret in maize (with two instances, upper and lower florets) is different from florets in other grasses. Consequently, all the children terms for the three instances of florets had to be instantiated as well.
% flower
 % floret
  % floret (sensu Poaceae)
   % ear floret
Coverage extent
Granular botanical terms that are not used/needed for gene annotations and for gene expression data were excluded (examples are obsoleted terms: pyrene, contact cell, haustorial root).
Cell type terms (grouped under separate cell node) were not propagated to the respective tissue nodes, to avoid redundancy and allow for easier browsing of the ontology. Exceptions were made when necessary to accommodate gene annotations (terms: guard cell, PO:0000256; root hair cell, PO:0000293).
Limitations of the current software
Major compromises were made to make the structure of PO simpler and more 'compliant' to the current annotation methodology and available ontology browsing and editing tools.
III. The basic structures of the plant ontology (3 patterns)
We realized that a single 'pattern' for ontology structure could not be followed consistently throughout the ontology without proliferating a large number of terms and running into absurd situations (due to the nature of the subject we are dealing with, i.e., the morphology and anatomy of flowering plants). Therefore, we adopted 3 structures that have been used interchangeably, as needed. In 'Structure 2' (to be used as the default), we decided to make extensive use of synonymy and of the species-specific filtering options provided in the AmiGO browser.
Structure 1:
In this model, instances of higher-node terms are added as needed (if they cannot be merged as synonyms, which is 'dictated' by annotation requirements). A 'part_of' child is therefore added under the specific instance of the term, and an instance is also added under the generic term, with all the children (not shown).
% fruit (synonyms: achene, capsule...)
 < seed (generic term)
  < berry seed
 % berry
  < berry seed
Structure 2 (default):
All the instances of fruit were made synonyms of fruit (creating multiple synonyms of a single higher-node term). This eliminates or largely reduces the need for term proliferation (i.e., instantiations of the granular 'part_of' children terms under each instance of the term). Children are added only to the generic term 'seed' (not shown).
% fruit (synonyms: achene, capsule)
 < seed (generic term)
The best example is indeed the fruit node, which now has multiple synonyms and no instances.
Structure 3:
This model implies full instantiation, i.e., having instances under the generic term seed and also adding new, more granular terms under each instance of fruit. To limit the proliferation of terms, only instances that are needed for annotation purposes are included.
% fruit
 < seed (generic term)
  % berry seed
 % berry
  < berry seed
 % capsule
  < berry seed
The best use-case where this structure was necessary is the root node with elongation zone, where a generic term (a) was instantiated with two additional terms (b and c):
% root
 < elongation zone (PO:0020125) (a)
  % primary root elongation zone
  % lateral root elongation zone
 % primary root
  < primary root elongation zone (b)
 % lateral root
  < lateral root elongation zone (c)
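As a rough illustration of what full instantiation means, the elongation zone example can be written as explicit edges and checked mechanically. The sketch below is hypothetical: the function name `is_fully_instantiated` and the edge table are not part of any PO tool, and the nesting (primary root and lateral root as is_a children of root, `<` read as part_of) is an assumption about how the listing is meant to be read.

```python
# term -> list of (relationship, parent); "%" is read as is_a, "<" as part_of.
EDGES = {
    "elongation zone": [("part_of", "root")],
    "primary root": [("is_a", "root")],
    "lateral root": [("is_a", "root")],
    "primary root elongation zone": [
        ("is_a", "elongation zone"),
        ("part_of", "primary root"),
    ],
    "lateral root elongation zone": [
        ("is_a", "elongation zone"),
        ("part_of", "lateral root"),
    ],
}


def is_fully_instantiated(term, generic, edges=EDGES):
    """Check that `term` is an instance of `generic` and is also part_of
    an instance of the structure that `generic` itself is part_of."""
    relations = edges.get(term, [])
    has_generic_parent = ("is_a", generic) in relations
    # Structures that the generic term is part_of (here: root).
    wholes = {p for rel, p in edges.get(generic, []) if rel == "part_of"}
    has_specific_whole = any(
        rel == "part_of"
        and any(("is_a", whole) in edges.get(parent, []) for whole in wholes)
        for rel, parent in relations
    )
    return has_generic_parent and has_specific_whole


if __name__ == "__main__":
    for name in ("primary root elongation zone", "lateral root elongation zone"):
        print(name, is_fully_instantiated(name, "elongation zone"))
```

Both instantiated terms (b) and (c) pass the check because each has the generic term (a) as an is_a parent and a specific kind of root as a part_of parent.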
IV. Organization of the top nodes (sporophyte and gametophyte, cell and tissue)
There are only four top nodes under the main plant structure node: the cell and tissue nodes, gametophyte and sporophyte. To avoid redundancy, we did not include a specific organ node to group all organs, although the individual organs are included within the ontology. The two top nodes, sporophyte and gametophyte, are separated because they represent the diploid and haploid generations of the plant life cycle. The largest node, sporophyte, was also simplified (only seed plants were considered); it now has only four children: seed (as an instance_of), and root, shoot and infructescence (as a part_of). The infructescence node, originally placed under the shoot node, was moved up one level to avoid a violation of the 'true path rule' that occurred because some of the children of the term 'seed' (for instance, radicle) appeared as children of the term 'shoot'. (Following the true path rule, any path from the most granular term all the way to the top of the ontology must be 'true'.)
Rationale for inflorescence and fruit terms
The current structure of GO requires that each term used be unique. Because of the many terms for fruits and inflorescences, we found that when we tried to include even a carefully edited list of inflorescence and fruit terms in the Plant Ontology, terms proliferated rapidly. For example, if we listed cyme, panicle, and raceme as instances of inflorescence, then we needed to create three additional terms, flower of cyme, flower of panicle, and flower of raceme, as parts of each inflorescence type. Because in all cases the term flower has its own children (androecium, gynoecium, petals, stamens), special terms then had to be created for each of these (androecium of flower of cyme, androecium of flower of raceme, etc.). In other words, each use of a generic term must be specified as a particular instance such that the hierarchy above it is embedded in the term itself. This process of carrying the hierarchy into the terms themselves then propagates downward, leading potentially to terms such as "microsporangium of theca of anther of androecium of flower of cyme".
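The growth described above is easy to see with a small sketch. The code below is illustrative only: the function name `instantiated_terms` is hypothetical, and the part_of chain is simply read off the example term quoted in the text, ignoring all sibling parts such as gynoecium or petals.

```python
# part_of chain read off the example term in the text, from the top down.
CHAIN = ["flower", "androecium", "anther", "theca", "microsporangium"]
INFLORESCENCE_TYPES = ["cyme", "panicle", "raceme"]


def instantiated_terms(chain, inflorescence_types):
    """Build the term names needed when every part in `chain` must carry
    the hierarchy above it for each inflorescence type."""
    names = []
    for kind in inflorescence_types:
        current = kind
        for part in chain:
            # Embed the whole path above the part into the term name.
            current = f"{part} of {current}"
            names.append(current)
    return names


if __name__ == "__main__":
    terms = instantiated_terms(CHAIN, INFLORESCENCE_TYPES)
    print(len(CHAIN), "generic terms become", len(terms), "instantiated terms")
    print(terms[4])
```

Even this single chain of five generic terms turns into fifteen instantiated terms for three inflorescence types; a real ontology with branching children and more inflorescence types would grow correspondingly faster.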
To mitigate this problem, we decided to make use of synonymy as much as possible. Thus the terms "cyme," "raceme," "panicle," "spike" all become synonyms of "inflorescence," and fruit types all become synonyms of "fruit", effectively removing one hierarchical level from the ontology. This will still allow the user to find genes that affect cymes, because a search on "cyme" will pull up all inflorescence genes. For cross-species comparisons, generic searches, and coarsely annotated genes, the synonymy will be helpful. In addition, we have deliberately limited our list of inflorescence (and fruit?) terms to the grasses, Arabidopsis, and tomato, ignoring all other plants for the time being.
Synonymy, however, loses information for more detailed searches. Using the PO alone, for example, it would be impossible to find genes expressed only in cymes, or only in spikes. For the taxa currently incorporated into the PO, specificity can be achieved at the moment using a taxonomic filter, which is available in the current Gene Ontology AmiGO browser. Genes from "spikes sensu Triticeae" could be found by searching only among Triticeae genes, genes from "panicles" by searching rice, and genes from "racemes" by searching Arabidopsis. The only current genus for which a taxonomic filter will not work is Zea, which has physically separate and morphologically distinct inflorescences. The two sorts of inflorescence often have different phenotypes in single-gene mutants, and identical genes are often deployed differently in each. Maize geneticists thus often want to be able to distinguish the two. Therefore, the maize ear and tassel are the only two inflorescence types that are treated as instances of "inflorescence". This permits annotation of genes and phenotypes that differ between the two inflorescence types.
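The combination of synonym lookup and taxonomic filtering can be sketched as follows. This is not how AmiGO is implemented; it is a minimal model under stated assumptions. The gene identifiers (`gene_a` to `gene_d`), the annotation table and the `search` function are all hypothetical and serve only to show the behaviour described in the text.

```python
# Inflorescence types stored as synonyms of the generic term.
SYNONYMS = {
    "cyme": "inflorescence",
    "raceme": "inflorescence",
    "panicle": "inflorescence",
    "spike": "inflorescence",
}
# Ear and tassel are kept as instances (is_a children) of inflorescence.
IS_A = {"ear": "inflorescence", "tassel": "inflorescence"}

# Hypothetical annotations: (gene, annotated term, taxon).
ANNOTATIONS = [
    ("gene_a", "inflorescence", "Arabidopsis"),
    ("gene_b", "inflorescence", "rice"),
    ("gene_c", "ear", "maize"),
    ("gene_d", "tassel", "maize"),
]


def search(query, taxon=None):
    """Resolve a synonym to its term, then return matching genes,
    optionally restricted to one taxon."""
    term = SYNONYMS.get(query, query)
    hits = []
    for gene, annotated, species in ANNOTATIONS:
        # A gene matches if annotated to the term or to an instance of it.
        matches = annotated == term or IS_A.get(annotated) == term
        if matches and (taxon is None or species == taxon):
            hits.append(gene)
    return hits


if __name__ == "__main__":
    print(search("cyme"))                         # all inflorescence genes
    print(search("raceme", taxon="Arabidopsis"))  # taxon filter restores specificity
    print(search("ear"))                          # maize ear is a separate term
```

In this model a search on "cyme" returns every inflorescence gene, the taxon filter narrows "raceme" to Arabidopsis genes, and ear and tassel stay distinguishable within maize because they are terms in their own right rather than synonyms.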
The strategy we have followed will become more difficult (if not impossible) to apply once the PO is extended to include other species. Our decision to synonymize most inflorescence types and all fruit types allows the PO to move forward and to be implemented, but does not actually solve the underlying tension between precision of annotation versus uncontrolled proliferation of terms.