Annotation Association File Format

From Plant Ontology Wiki

Overview

Collaborating databases and projects provide the POC project with a tab-delimited file, known informally as an "association file". This file carries links between database objects and PO terms. A database object may represent a gene, transcript, protein, protein_structure, complex, germplasm (stock/cultivar), mutant, QTL, etc. The columns in the file are described below. Here is a sample file containing associations from the Gramene database.

File Name

po_aspect_objecttype_organism_organization.assoc

aspect: growth/anatomy/development.
objecttype: gene/mutant/germplasm/qtl etc.
organism: always the genus, e.g. arabidopsis/oryza/zea.
organization: the institute/project contributing the association files.

For example:
po_anatomy_gene_arabidopsis_tair.assoc
po_growth_gene_arabidopsis_tair.assoc
po_anatomy_gene_oryza_gramene.assoc
po_growth_gene_oryza_gramene.assoc
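
The naming convention above can be sketched in Python. This is only an illustration, not an official PO tool: the function names build_assoc_filename and parse_assoc_filename are hypothetical, and the sketch assumes that none of the four parts contains an underscore.

```python
def build_assoc_filename(aspect, objecttype, organism, organization):
    """Assemble po_<aspect>_<objecttype>_<organism>_<organization>.assoc."""
    parts = [aspect, objecttype, organism, organization]
    # Assumption: parts are written in lower case and contain no underscores.
    return "po_" + "_".join(p.lower() for p in parts) + ".assoc"


def parse_assoc_filename(name):
    """Split an association file name back into its four parts."""
    if not (name.startswith("po_") and name.endswith(".assoc")):
        raise ValueError(f"not an association file name: {name!r}")
    fields = name[len("po_"):-len(".assoc")].split("_")
    if len(fields) != 4:
        raise ValueError(f"expected 4 parts, found {len(fields)}: {name!r}")
    keys = ("aspect", "objecttype", "organism", "organization")
    return dict(zip(keys, fields))


print(build_assoc_filename("anatomy", "gene", "oryza", "gramene"))
print(parse_assoc_filename("po_growth_gene_arabidopsis_tair.assoc"))
```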

File Format

The GO Annotation File (GAF) 2.0 format comprises 17 tab-delimited fields, several of which are not mandatory. This includes two new columns (16 and 17) that were not part of the GAF 1.0 format.

Make sure the column order is strictly followed. A column that is left blank must still be present as an empty field between its tab delimiters, so that every line has all 17 fields.

Also see the Gene Ontology Annotation Format web page for more information.

(* denotes required fields)


Column  Content                  Example
1.  *   DB                       GR
2.  *   DB_Object_ID             0060905
3.  *   DB_Object_Symbol         lrd10
4.      Qualifier
5.  *   PO ID                    PO:0007014
6.  *   DB:Reference             PMID:2676709
7.  *   Evidence                 IMP
8.      With (or) From
9.  *   Aspect                   G
10.     DB_Object_Name           lesion resembling disease-10
11.     Synonym                  bl5|spotted leaf-4
12. *   DB_Object_Type           gene
13. *   Taxon                    taxon:4527
14. *   Date                     20050303
15. *   Assigned_by              GR
16.     Annotation_extension     part_of(PO:0028002)
17.     Gene Product Form ID     UniProtKB:P12345-2

Additional information can be found on the GO page for the GAF 2.0 format.
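
As an illustration of the 17-column layout, the following Python sketch splits one line of an association file into named fields and checks that the required (starred) columns are filled. The names GAF_COLUMNS, REQUIRED and parse_assoc_line are hypothetical, and the sample line is assembled from the example values in the table above.

```python
GAF_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "PO_ID",
    "DB_Reference", "Evidence", "With_or_From", "Aspect", "DB_Object_Name",
    "Synonym", "DB_Object_Type", "Taxon", "Date", "Assigned_by",
    "Annotation_extension", "Gene_Product_Form_ID",
]
# Column numbers marked with * in the table.
REQUIRED = {1, 2, 3, 5, 6, 7, 9, 12, 13, 14, 15}


def parse_assoc_line(line):
    """Return a dict of the 17 fields, or raise ValueError."""
    # Strip only the newline, so trailing empty columns are preserved.
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GAF_COLUMNS):
        raise ValueError(f"expected 17 fields, found {len(values)}")
    for number in sorted(REQUIRED):
        if not values[number - 1].strip():
            name = GAF_COLUMNS[number - 1]
            raise ValueError(f"column {number} ({name}) is required")
    return dict(zip(GAF_COLUMNS, values))


sample = "\t".join([
    "GR", "0060905", "lrd10", "", "PO:0007014", "PMID:2676709", "IMP", "",
    "G", "lesion resembling disease-10", "bl5|spotted leaf-4", "gene",
    "taxon:4527", "20050303", "GR", "part_of(PO:0028002)",
    "UniProtKB:P12345-2",
])
record = parse_assoc_line(sample)
print(record["PO_ID"], record["Evidence"])
```

Blank optional columns (here Qualifier and With) come back as empty strings, which is why the tab delimiters must be kept even when a column has no value.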

Description of the content

1. DB

  • The database contributing the association file.
  • One of the values in the table of database abbreviations. [Database abbreviations explanation]
  • This field is mandatory, cardinality 1.

This column refers to the database from which the identifier in DB object ID (column 2) is drawn. This is not necessarily the group submitting the file. For example, if a UniProtKB ID is the DB object ID (column 2), DB (column 1) should be UniProtKB.

2. DB_Object_ID

  • A unique identifier in DB (column 1) for the item being annotated.
  • This field is mandatory, cardinality 1.

In GAF 2.0 format, the identifier must reference a top-level primary gene or gene product identifier: either a gene, or a protein that has a 1:1 correspondence to a gene. Identifiers referring to particular protein isoforms or post-translationally cleaved or modified proteins are not legal values in this field.

The DB object ID (column 2) is the identifier for the database object, which may or may not correspond exactly to what is described in a paper. For example, a paper describing a protein may support annotations to the gene encoding the protein (gene ID in DB object ID field) or annotations to a protein object (protein ID in DB object ID field).

3. DB_Object_Symbol

  • A (unique and valid) symbol to which DB_Object_ID is matched.
  • Can use ORF name for otherwise unnamed gene or protein.
  • If gene products are annotated, can use gene product symbol if available. Many gene product annotation entries can share a gene symbol.
  • This field is mandatory, cardinality 1.

The DB Object Symbol field should be a symbol that means something to a biologist wherever possible (a gene symbol, for example). It is not an ID or an accession number (DB object ID [column 2] provides the unique identifier), although IDs can be used as a DB Object Symbol if there is no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated).

4. Qualifier

  • Flags that modify the interpretation of an annotation.
  • One (or more) of NOT, contributes_to, colocalizes_with.
  • This field is not mandatory; cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. NOT|contributes_to).
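
A minimal Python sketch of how the Qualifier column could be read, assuming the three flags listed above are the only legal values. The names ALLOWED_QUALIFIERS and parse_qualifier are hypothetical.

```python
ALLOWED_QUALIFIERS = {"NOT", "contributes_to", "colocalizes_with"}


def parse_qualifier(field):
    """Return the list of qualifier flags in column 4 (possibly empty)."""
    if not field:
        return []  # cardinality 0 is allowed
    flags = field.split("|")
    unknown = [flag for flag in flags if flag not in ALLOWED_QUALIFIERS]
    if unknown:
        raise ValueError(f"unknown qualifier(s): {unknown}")
    return flags


print(parse_qualifier("NOT|contributes_to"))
```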

5. POid

  • The PO identifier for the term attributed to the DB Object ID.
  • This field is mandatory, cardinality 1.

6. DB:Reference

  • The unique identifier appropriate to DB for the authority for the attribution of the POid to the DB_Object_ID. This may be a literature reference or a database record. The syntax is DB:accession_number.
  • Note that only one reference can be cited on a single line. If a reference has identifiers in more than one database, multiple identifiers can be included on a single line, separated by a pipe character.
  • For example, if the reference is a published paper that has a PubMed ID, we strongly recommend that the PubMed ID be included, as well as an identifier within a model organism database.
  • Note: with the current version of the AmiGO browser code, it is impossible to have multiple references for a single association. Whichever one is last in column 6 in the assoc file will be displayed.
  • This field is mandatory, cardinality 1, >1; for cardinality >1 use a pipe to separate entries (e.g. GR:8789|PMID:2676709), but only the last one will be displayed.
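
The pipe-separated syntax of column 6 can be sketched as follows. The function names parse_references and displayed_reference are hypothetical; the second one only mirrors the display behaviour described in the note above, it is not AmiGO code.

```python
def parse_references(field):
    """Split column 6 into (DB, accession_number) pairs."""
    references = []
    for entry in field.split("|"):
        db, sep, accession = entry.partition(":")
        if not (db and sep and accession):
            raise ValueError(f"not in DB:accession_number form: {entry!r}")
        references.append((db, accession))
    return references


def displayed_reference(field):
    """Return the identifier that is last in column 6."""
    db, accession = parse_references(field)[-1]
    return f"{db}:{accession}"


print(parse_references("GR:8789|PMID:2676709"))
print(displayed_reference("GR:8789|PMID:2676709"))
```

Because only the last identifier is shown, one option is to place the PubMed ID last when a reference has both a PubMed ID and a model organism database ID.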

7. Evidence

  • One of IMP, IGI, IPI, IAGP, ISS, IDA, IEP, IEA, TAS, NAS, ND, IC, RCA.
  • This field is mandatory, cardinality 1.
  • See GO Evidence codes.

8. With (or) From

  • One of:
    • DB:gene_symbol
    • DB:gene_symbol[allele_symbol]
    • DB:gene_id
    • DB:protein_name
    • DB:sequence_id
    • PO:PO_id
  • This field is not mandatory (except in the case of IC evidence code), cardinality 0, 1, >1.
  • Note: This field is used to hold an additional identifier for annotations using certain evidence codes (IC, IEA, IGI, IPI, ISS). Cardinality = 0 is not recommended, but is permitted because cases can be found in literature where no database identifier can be found (e.g. physical interaction or sequence similarity to a protein, but no ID provided). Annotations where evidence is IGI, IPI, or ISS and 'with' cardinality = 0 should link to an explanation of why there is no entry in 'with.' Cardinality may be >1 for any of the evidence codes that use 'with'; for IPI and IGI cardinality >1 has a special meaning (see evidence documentation for more information).
  • For cardinality >1 use a pipe to separate entries (e.g. TAIR:Atg111111|TAIR:Atg222222).
  • Note that a gene/locus ID may be used in the 'with' column for an IPI annotation, or for an ISS annotation based on amino acid sequence or protein structure similarity, if the database does not have identifiers for individual gene products.
  • 'PO:PO_id' is used only when the evidence code is 'IC', and refers to the PO term(s) used as the basis of a curator inference. In these cases the entry in the DB:Reference (column 6) will be that used to assign the PO term(s) from which the inference is made. This field is mandatory for evidence code IC.
  • The ID is usually an identifier for an individual entry in a database (such as a sequence ID, gene ID, PO ID, etc.). Identifiers from the Center for Biological Sequence Analysis (CBS), however, represent tools used to find homology or sequence similarity; these identifiers can be used in the 'with' column for ISS annotations.
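
The rules above that tie the With column to the evidence code can be expressed as a small check. This is a sketch under stated assumptions: check_with_field is a hypothetical name, it returns human-readable messages rather than raising, and it only covers the rules spelled out in this section (IC requires an entry, PO IDs appear only with IC, and an empty With for IGI, IPI or ISS is discouraged).

```python
def check_with_field(evidence, with_field):
    """Return a list of problems found in column 8 (empty list means OK)."""
    entries = with_field.split("|") if with_field else []
    problems = []
    if evidence == "IC" and not entries:
        problems.append("IC requires at least one PO ID in With")
    if evidence in {"IGI", "IPI", "ISS"} and not entries:
        problems.append(f"{evidence} without With: link to an explanation")
    for entry in entries:
        db, sep, accession = entry.partition(":")
        if not (db and sep and accession):
            problems.append(f"not in DB:identifier form: {entry!r}")
        elif db == "PO" and evidence != "IC":
            problems.append(f"PO ID in With is only used with IC: {entry!r}")
    return problems


print(check_with_field("IC", ""))
print(check_with_field("ISS", "TAIR:Atg111111|TAIR:Atg222222"))
```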

9. Aspect

  • Indicates the branch of the PO to which the PO ID (column 5) belongs.
  • Either A (plant anatomical entity) or G (plant growth stage and development stage).
  • This field is mandatory, cardinality 1.

10. DB_Object_Name

  • Name of the object, e.g. the name of the gene or gene product.
  • This field is not mandatory, cardinality 0, 1 [white space allowed]

11. Synonym

  • Any aliases, e.g. Gene_symbol [or other text].
  • Note that we strongly recommend that gene synonyms are included in the association file, as this aids the searching of PO.
  • This field is not mandatory, cardinality 0, 1, >1 [white space allowed]

12. DB_Object_Type

  • A description of the type of gene product being annotated.
  • If a Gene Product Form ID (column 17) is supplied, the DB object type will refer to that entity; if no gene product form ID is present, it will refer to the entity that the DB Object ID (column 2) is believed to produce and which actively carries out the function or localization described.
  • One of the following: protein_complex; protein; protein_structure; transcript; ncRNA; rRNA; tRNA; snRNA; snoRNA; any subtype of ncRNA in the Sequence Ontology; germplasm (stock/cultivar); mutant; QTL. If the precise product type is unknown, gene_product should be used.
  • This field is mandatory, cardinality 1

The object type (gene_product, transcript, protein, protein_complex, etc.) listed in the DB Object Type field must match the database entry identified by the gene product form ID, or, if this is absent, the expected product of the DB Object ID. Note that DB Object Type refers to the database entry (i.e. it represents a protein, functional RNA, etc.); this column does not reflect anything about the PO term or the evidence on which the annotation is based. For example, if your database entry represents a protein-encoding gene, then protein goes in the DB Object Type column. The text entered in the DB Object Name and DB Object Symbol should refer to the entity in DB Object ID. For example, several alternative transcripts from one gene may be annotated separately, each with the same gene ID in DB Object ID, and specific gene product identifiers in Gene Product Form ID, but list the same gene symbol in the DB Object Symbol column.

13. Taxon

  • Taxonomic identifier(s)
  • For cardinality 1, the ID of the species encoding the gene product.
  • For cardinality 2, the first ID is that of the species encoding the gene product; the second ID is that of the other organism in the interaction, such as the species using the gene product.
  • This field is mandatory, cardinality 1, 2; for cardinality 2 use a pipe to separate entries (e.g. taxon:1|taxon:1000).
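
A short Python sketch of how the Taxon column could be parsed, following the cardinality rule above. The name parse_taxon is hypothetical.

```python
import re

TAXON_RE = re.compile(r"taxon:([0-9]+)")


def parse_taxon(field):
    """Return one or two numeric taxon IDs from column 13."""
    entries = field.split("|")
    if len(entries) not in (1, 2):
        raise ValueError(f"expected 1 or 2 taxon IDs, found {len(entries)}")
    ids = []
    for entry in entries:
        match = TAXON_RE.fullmatch(entry)
        if match is None:
            raise ValueError(f"not in taxon:<number> form: {entry!r}")
        ids.append(int(match.group(1)))
    return ids


print(parse_taxon("taxon:4527"))
print(parse_taxon("taxon:1|taxon:1000"))
```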

14. Date

  • Date on which the annotation was made; format is YYYYMMDD.
  • This field is mandatory, cardinality 1.
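
The YYYYMMDD format can be checked with the Python standard library. This is a minimal sketch; parse_annotation_date is a hypothetical name.

```python
from datetime import datetime


def parse_annotation_date(field):
    """Convert a YYYYMMDD string from column 14 into a date object."""
    # Require exactly eight ASCII digits before handing over to strptime.
    if len(field) != 8 or not (field.isascii() and field.isdigit()):
        raise ValueError(f"date must be eight digits (YYYYMMDD): {field!r}")
    return datetime.strptime(field, "%Y%m%d").date()


print(parse_annotation_date("20050303"))
```

The explicit length check is there because the goal is to accept only the exact eight-digit form, not every string that strptime is able to interpret.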

15. Assigned_by

  • The database which made the annotation.
  • One of the values in the table of database abbreviations.
  • Used for tracking the source of an individual annotation.
  • Default value is the value entered in column 1 (DB).
  • The value will differ from column 1 for any annotation that is made by one database and incorporated into another.
  • This field is mandatory, cardinality 1.

16. Annotation_extension

NOTE: Usage specifications for column 16 for the PO are under development. Check with curators before using this column.

  • Contains cross references to a PO term that can be used to qualify or enhance the annotation. The cross-reference is prefaced by an appropriate PO relationship (for now, part_of or participates_in; use of other relations may be allowed in the future).
  • One or more of: relation(PO:id)
  • Example 1: If a gene product is localized to the leaf tip of a vascular leaf, the PO ID (column 5) would be leaf tip (PO:0025142), and the annotation extension column would contain a cross-reference to part_of vascular leaf (PO:0009025).
  • Example 2: If a gene product is localized in a leaf during senescence, the PO ID (column 5) would be leaf (PO:0009025), and the annotation extension column would contain a cross-reference to participates_in leaf senescence stage (PO:0001054).
  • This field is not mandatory, cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries, e.g., part_of(PO:0009025)|participates_in(PO:0001054).
  • See additional information and discussion on the PO Annotation Extensions (column 16) page.
  • See also the Google Docs spreadsheet with specific suggestions of what terms and relations to put in column 16.
  • More information about column 16 is available on the GO Wiki.
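
The relation(PO:id) syntax described above can be parsed with a regular expression. This sketch assumes that only part_of and participates_in are allowed for now, as stated above, and that PO IDs have seven digits as in the examples on this page. The names EXTENSION_RE and parse_annotation_extension are hypothetical.

```python
import re

# Assumption: seven-digit PO IDs, and only the two relations named above.
EXTENSION_RE = re.compile(r"(part_of|participates_in)\((PO:[0-9]{7})\)")


def parse_annotation_extension(field):
    """Return a list of (relation, PO ID) pairs from column 16."""
    if not field:
        return []  # the column may be left blank
    pairs = []
    for entry in field.split("|"):
        match = EXTENSION_RE.fullmatch(entry)
        if match is None:
            raise ValueError(f"not in relation(PO:id) form: {entry!r}")
        pairs.append((match.group(1), match.group(2)))
    return pairs


print(parse_annotation_extension(
    "part_of(PO:0009025)|participates_in(PO:0001054)"))
```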

17. Gene Product Form ID

Because the DB Object ID (column 2) entry must be a canonical entity (a gene, or an abstract protein that has a 1:1 correspondence to a gene), this field allows the annotation of specific variants of that gene or gene product. Contents will frequently include protein sequence identifiers: for example, identifiers that specify distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage or post-translational modification. Identifiers for functional RNAs can also be included in this column.

  • The identifier used must be a standard 2-part global identifier, e.g. UniProtKB:OK0206-2
  • When the Gene Product Form ID (column 17) is filled with a protein identifier, the value in DB Object Type (column 12) must be protein. Protein identifiers can include UniProtKB accession numbers, NCBI NP identifiers or Protein Ontology (PRO) identifiers.
  • When the Gene Product Form ID (column 17) is filled with a functional RNA identifier, the DB Object Type (column 12) must be either ncRNA, rRNA, tRNA, snRNA, or snoRNA.
  • This column may be left blank; if so, the value in DB Object Type (column 12) will provide a description of the expected gene product.
  • This field is not mandatory, cardinality 0 or 1.
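
The dependency between column 17 and column 12 can be sketched as a consistency check. The function check_gene_product_form is hypothetical. Whether an identifier denotes a protein or a functional RNA cannot be read from the identifier alone, so in this sketch the caller supplies that as form_kind ("protein" or "RNA"); this parameter is an assumption made for illustration.

```python
FUNCTIONAL_RNA_TYPES = {"ncRNA", "rRNA", "tRNA", "snRNA", "snoRNA"}


def check_gene_product_form(object_type, form_id, form_kind=None):
    """Return a list of problems relating column 12 to column 17."""
    problems = []
    if not form_id:
        return problems  # column 17 may be left blank
    db, sep, accession = form_id.partition(":")
    if not (db and sep and accession):
        problems.append("column 17 must be a two-part DB:accession ID")
    if form_kind == "protein" and object_type != "protein":
        problems.append("column 12 must be 'protein' for a protein ID")
    if form_kind == "RNA" and object_type not in FUNCTIONAL_RNA_TYPES:
        problems.append("column 12 must be a functional RNA type")
    return problems


print(check_gene_product_form("protein", "UniProtKB:P12345-2", "protein"))
print(check_gene_product_form("transcript", "UniProtKB:P12345-2", "protein"))
```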

Note

Several fields contain database cross-references (dbxrefs) in the format dbname:dbaccession. The fields are: POid (where dbname is always PO), DB:Reference, With, Taxon (where dbname is always taxon), and Annotation extension. For PO IDs, do not repeat the 'PO:' prefix (i.e. always use PO:0000000, not PO:PO:0000000).
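
A repeated prefix is easy to detect mechanically. The following sketch assumes that a PO ID is the prefix PO: followed by exactly seven digits, as in the examples on this page; is_valid_po_id is a hypothetical name.

```python
import re

# Assumption: PO IDs are "PO:" followed by exactly seven digits.
PO_ID_RE = re.compile(r"PO:[0-9]{7}")


def is_valid_po_id(value):
    """True for IDs such as PO:0007014, False for PO:PO:0007014."""
    return PO_ID_RE.fullmatch(value) is not None


print(is_valid_po_id("PO:0007014"))
print(is_valid_po_id("PO:PO:0007014"))
```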