Annotation Association File Format
Overview
Collaborating databases and projects provide the POC project with a tab-delimited file, known informally as an "association file". This file carries links between database objects and PO terms. The database object may represent a gene, transcript, protein, protein_structure, complex, germplasm (stock/cultivar), mutant, QTL, etc. The columns in the file are described below. Here is a sample file containing associations from the Gramene database.
File Name
po_aspect_objecttype_organism_organization.assoc
aspect: growth/anatomy/development.
objecttype: gene/mutant/germplasm/qtl etc.
organism: always the genus name, e.g. arabidopsis/oryza/zea.
organization: the institute/project that is contributing the association files.
For example:
po_anatomy_gene_arabidopsis_tair.assoc
po_growth_gene_arabidopsis_tair.assoc
po_anatomy_gene_oryza_gramene.assoc
po_growth_gene_oryza_gramene.assoc
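The naming convention above can be checked mechanically. The following is a minimal sketch, not part of the PO specification: the function name `parse_assoc_filename` is hypothetical, and the pattern assumes that none of the four name parts contains an underscore (an object type such as protein_structure would need a looser pattern).

```python
import re

# Pattern for po_<aspect>_<objecttype>_<organism>_<organization>.assoc
# Assumption: none of the four parts contains an underscore.
ASSOC_NAME = re.compile(
    r"^po_(?P<aspect>[^_]+)_(?P<objecttype>[^_]+)"
    r"_(?P<organism>[^_]+)_(?P<organization>[^_]+)\.assoc$"
)


def parse_assoc_filename(name):
    """Return the four name parts as a dict, or None if the name does not match."""
    match = ASSOC_NAME.match(name)
    return match.groupdict() if match else None


parts = parse_assoc_filename("po_anatomy_gene_oryza_gramene.assoc")
print(parts["organism"], parts["organization"])
```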
File Format
The flat file format comprises 17 tab-delimited fields (see also a further example containing several annotations). Make sure the column order is strictly followed.
(* denotes required fields)
Column | Content | Example
1. * | DB | GR
2. * | DB_Object_ID | GR:0060905
3. * | DB_Object_Symbol | lrd10
4. | Qualifier |
5. * | PO ID | PO:0007014
6. * | DB:Reference | PMID:2676709
7. * | Evidence | IMP
8. | With (or) From |
9. * | Aspect | G
10. | DB_Object_Name | lesion resembling disease-10
11. | Synonym | bl5|spotted leaf-4
12. * | DB_Object_Type | gene
13. * | Taxon | taxon:4527
14. * | Date | 20050303
15. * | Assigned_by | GR
16. | Annotation_extension | part_of(PO:0028002)
17. | Gene Product Form ID | UniProtKB:P12345-2
Additional information can be found on the GO page for the GAF 2.0 format.
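As an illustration of the column layout, the sketch below splits one line of an association file into named fields and checks that the required (*) columns are filled. The dictionary keys and the function name `parse_line` are hypothetical labels chosen for this example. Padding short lines, so that trailing optional columns may be omitted, is likewise an assumption and not something the format guarantees.

```python
# Column labels in file order; the labels themselves are illustrative.
COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "PO_ID",
    "DB_Reference", "Evidence", "With_From", "Aspect", "DB_Object_Name",
    "Synonym", "DB_Object_Type", "Taxon", "Date", "Assigned_by",
    "Annotation_extension", "Gene_Product_Form_ID",
]

# Columns marked with * in the table.
REQUIRED = {
    "DB", "DB_Object_ID", "DB_Object_Symbol", "PO_ID", "DB_Reference",
    "Evidence", "Aspect", "DB_Object_Type", "Taxon", "Date", "Assigned_by",
}


def parse_line(line):
    """Split one tab-delimited line into a dict keyed by column label."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) > len(COLUMNS):
        raise ValueError(f"expected at most {len(COLUMNS)} fields, got {len(values)}")
    # Assumption: trailing optional columns may be left off the line.
    values += [""] * (len(COLUMNS) - len(values))
    record = dict(zip(COLUMNS, values))
    missing = sorted(name for name in REQUIRED if not record[name])
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return record


# The example values from the table, joined with tabs.
sample = "\t".join([
    "GR", "GR:0060905", "lrd10", "", "PO:0007014", "PMID:2676709", "IMP", "",
    "G", "lesion resembling disease-10", "bl5|spotted leaf-4", "gene",
    "taxon:4527", "20050303", "GR", "", "",
])
record = parse_line(sample)
print(record["DB_Object_Symbol"], record["PO_ID"])
```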
Description of the content
1. DB
- The database contributing the association file.
- One of the values in the table of database abbreviations. [Database abbreviations explanation]
- This field is mandatory, cardinality 1
2. DB_Object_ID
- A unique identifier in DB for the item being annotated.
- This field is mandatory, cardinality 1.
3. DB_Object_Symbol
- A (unique and valid) symbol to which DB_Object_ID is matched.
- Can use ORF name for otherwise unnamed gene or protein.
- If gene products are annotated, can use gene product symbol if available, or many gene product annotation entries can share a gene symbol.
- This field is mandatory, cardinality 1
4. Qualifier
- Flags that modify the interpretation of an annotation.
- One (or more) of NOT, contributes_to, colocalizes_with.
- This field is not mandatory; cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. NOT|contributes_to).
5. POid
- The PO identifier for the term attributed to the DB_Object_ID.
- This field is mandatory, cardinality 1.
6. DB:Reference
- The unique identifier appropriate to DB for the authority for the attribution of the POid to the DB_Object_ID. This may be a literature reference or a database record. The syntax is DB:accession_number.
- Note that only one reference can be cited on a single line. If a reference has identifiers in more than one database, multiple identifiers can be included on a single line. For example, if the reference is a published paper that has a PubMed ID, we strongly recommend that the PubMed ID be included, as well as an identifier within a model organism database.
- This field is mandatory, cardinality 1, >1; for cardinality >1 use a pipe to separate entries (e.g. GR:8789|PMID:2676709).
7. Evidence
- One of IMP, IGI, IPI, IAGP, ISS, IDA, IEP, IEA, TAS, NAS, ND, IC, RCA.
- This field is mandatory, cardinality 1.
8. With (or) From
- One of:
- DB:gene_symbol
- DB:gene_symbol[allele_symbol]
- DB:gene_id
- DB:protein_name
- DB:sequence_id
- PO:PO_id
- This field is not mandatory (except in the case of IC evidence code), cardinality 0, 1, >1
- Note: This field is used to hold an additional identifier for annotations using certain evidence codes (IC, IEA, IGI, IPI, ISS). Cardinality = 0 is not recommended, but is permitted because cases can be found in literature where no database identifier can be found (e.g. physical interaction or sequence similarity to a protein, but no ID provided). Annotations where evidence is IGI, IPI, or ISS and 'with' cardinality = 0 should link to an explanation of why there is no entry in 'with.' Cardinality may be >1 for any of the evidence codes that use 'with'; for IPI and IGI cardinality >1 has a special meaning (see evidence documentation for more information). For cardinality >1 use a pipe to separate entries (e.g. TAIR:Atg111111|TAIR:Atg222222).
- Note that a gene/locus ID may be used in the 'with' column for an IPI annotation, or for an ISS annotation based on amino acid sequence or protein structure similarity, if the database does not have identifiers for individual gene products.
- 'PO:PO_id' is used only when the evidence code is 'IC', and refers to the PO term(s) used as the basis of a curator inference. In these cases the entry in the 'DB:Reference' column will be that used to assign the PO term(s) from which the inference is made. This field is mandatory for evidence code IC.
- The ID is usually an identifier for an individual entry in a database (such as a sequence ID, gene ID, PO ID, etc.). Identifiers from the Center for Biological Sequence Analysis (CBS), however, represent tools used to find homology or sequence similarity; these identifiers can be used in the 'with' column for ISS annotations.
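The interaction between the Evidence and With (or) From columns can be expressed as a small check. This is a sketch under stated assumptions: the function name and the error/warning split are illustrative, and the check only looks at whether the With field is empty, not at whether each entry is a valid identifier.

```python
# Evidence codes listed for column 7.
EVIDENCE_CODES = {
    "IMP", "IGI", "IPI", "IAGP", "ISS", "IDA", "IEP",
    "IEA", "TAS", "NAS", "ND", "IC", "RCA",
}

# Codes for which an empty With field is permitted but should be explained.
WITH_EXPECTED = {"IGI", "IPI", "ISS"}


def check_evidence_and_with(evidence, with_field):
    """Return (errors, warnings) for an Evidence / With (or) From pair."""
    errors, warnings = [], []
    # Multiple With entries are pipe-separated.
    entries = [entry for entry in with_field.split("|") if entry]
    if evidence not in EVIDENCE_CODES:
        errors.append(f"unknown evidence code: {evidence!r}")
    elif evidence == "IC" and not entries:
        errors.append("IC requires an entry in With (or) From")
    elif evidence in WITH_EXPECTED and not entries:
        warnings.append(
            f"{evidence} without a With entry should link to an explanation"
        )
    return errors, warnings


print(check_evidence_and_with("IC", ""))
```

Treating the IGI, IPI and ISS case as a warning and not an error follows the text above, which permits cardinality 0 but does not recommend it.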
9. Aspect
- Either A (plant structure) or G (growth stage and development stage)
- This field is mandatory; cardinality 1
10. DB_Object_Name
- Name of the object, e.g. the name of the gene or gene product.
- This field is not mandatory, cardinality 0, 1 [white space allowed]
11. Synonym
- Any aliases. e.g. Gene_symbol [or other text]
- This field is not mandatory, cardinality 0, 1, >1 [white space allowed]
12. DB_Object_Type
- What kind of thing is being annotated
- One of gene, transcript, protein, protein_structure, complex, germplasm (stock/cultivar), mutant, QTL etc.
- This field is mandatory, cardinality 1
13. Taxon
- Taxonomic identifier(s)
- For cardinality 1, the ID of the species representing the Object.
- For cardinality 2, the first ID is that of the species encoding the gene product; the second ID is that of the species using the gene product.
- This field is mandatory, cardinality 1, 2; for cardinality 2 use a pipe to separate entries (e.g. taxon:1|taxon:1000)
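A minimal sketch of reading the Taxon column, assuming the numeric part of each entry is a plain decimal ID; the function name `parse_taxon` is hypothetical.

```python
def parse_taxon(field):
    """Return the one or two numeric taxon IDs in a Taxon field."""
    entries = field.split("|")
    if not 1 <= len(entries) <= 2:
        raise ValueError(f"expected 1 or 2 taxon entries, got {len(entries)}")
    ids = []
    for entry in entries:
        prefix, _, number = entry.partition(":")
        # dbname is always 'taxon' for this column.
        if prefix != "taxon" or not (number.isascii() and number.isdigit()):
            raise ValueError(f"malformed taxon entry: {entry!r}")
        ids.append(int(number))
    return ids


print(parse_taxon("taxon:1|taxon:1000"))
```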
14. Date
- Date on which the annotation was made; format is YYYYMMDD
- This field is mandatory, cardinality 1
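The YYYYMMDD format can be validated with the Python standard library. In this sketch (the function name is hypothetical) the explicit length check is needed because `strptime` on its own would also accept shorter strings such as 2005033.

```python
from datetime import date, datetime


def parse_annotation_date(value):
    """Parse a YYYYMMDD string into a date, rejecting any other layout."""
    if len(value) != 8 or not (value.isascii() and value.isdigit()):
        raise ValueError(f"date must be 8 digits (YYYYMMDD): {value!r}")
    # strptime raises ValueError for impossible dates such as month 13.
    return datetime.strptime(value, "%Y%m%d").date()


print(parse_annotation_date("20050303").isoformat())
```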
15. Assigned_by
- The database which made the annotation
- One of the values in the table of database abbreviations. [Database abbreviations explanation]
- Used for tracking the source of an individual annotation.
- Default value is value entered in column 1 (DB).
- Value will differ from column 1 for any annotation that is made by one database and incorporated into another.
- This field is mandatory, cardinality 1
16. Annotation_extension
NOTE: Usage specifications for column 16 for the PO are under development. Check with curators before using this column.
- Contains cross references to a PO term that can be used to qualify or enhance the annotation. The cross-reference is prefaced by an appropriate PO relationship (for now, part_of or participates_in; use of other relations may be allowed in the future).
- One or more of: relation(PO:id)
- Example 1: If a gene product is localized to the leaf tip of a vascular leaf, the PO ID (column 5) would be leaf tip (PO:0025142), and the annotation extension column would contain a cross-reference to part_of vascular leaf (PO:0009025).
- Example 2: If a gene product is localized in a leaf during senescence, the PO ID (column 5) would be leaf (PO:0009025), and the annotation extension column would contain a cross-reference to participates_in leaf senescence stage (PO:0001054).
- This field is not mandatory, cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries, e.g., part_of(PO:0009025)|participates_in(PO:0001054).
- See additional information and discussion on the PO Annotation Extensions (column 16) page.
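The relation(PO:id) syntax for column 16 can be parsed as sketched below. The sketch makes two assumptions: only the two relations named above are accepted, and PO identifiers have seven digits, as in every example in this document. The function name is hypothetical, and since the usage of this column is still under development, the pattern may need to change.

```python
import re

# Assumptions: only part_of and participates_in are allowed for now,
# and PO identifiers are 'PO:' followed by seven digits.
EXTENSION = re.compile(
    r"^(?P<relation>part_of|participates_in)\((?P<term>PO:\d{7})\)$"
)


def parse_annotation_extension(field):
    """Return a list of (relation, PO id) pairs from a column 16 value."""
    pairs = []
    for entry in field.split("|"):
        entry = entry.strip()
        if not entry:
            continue  # an empty field means cardinality 0
        match = EXTENSION.match(entry)
        if match is None:
            raise ValueError(f"malformed annotation extension: {entry!r}")
        pairs.append((match["relation"], match["term"]))
    return pairs


print(parse_annotation_extension("part_of(PO:0009025)|participates_in(PO:0001054)"))
```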
17. Gene Product Form ID
As the DB Object ID (column 2) entry must be a canonical entity (a gene OR an abstract protein that has a 1:1 correspondence to a gene), this field allows the annotation of specific variants of that gene or gene product. Contents will frequently include protein sequence identifiers: for example, identifiers that specify distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage or post-translational modification. Identifiers for functional RNAs can also be included in this column.
- The identifier used must be a standard 2-part global identifier, e.g. UniProtKB:OK0206-2
- When the gene product form ID (column 17) is filled with a protein identifier, the value in DB object type (column 12) must be protein. Protein identifiers can include UniProtKB accession numbers, NCBI NP identifiers or Protein Ontology (PRO) identifiers.
- When the gene product form ID (column 17) is filled with a functional RNA identifier, the DB object type (column 12) must be either ncRNA, rRNA, tRNA, snRNA, or snoRNA.
- This column may be left blank; if so, the value in DB object type (column 12) will provide a description of the expected gene product.
- More information and examples are available from the GO wiki page on column 17.
Note
Several fields contain database cross-references (dbxrefs) in the format dbname:dbaccession. The fields are: POid (where dbname is always PO), DB:Reference, With, Taxon (where dbname is always taxon), and Annotation extension. For PO ids, do not repeat the 'PO:' prefix (i.e. always use PO:0000000, not PO:PO:0000000).
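The dbname:dbaccession convention and the repeated-prefix mistake described in this note can be handled as sketched below. Both function names are hypothetical. Splitting at the first colon is an assumption, so any further colon is treated as part of the accession.

```python
def split_dbxref(value):
    """Split 'dbname:dbaccession' at the first colon."""
    dbname, sep, accession = value.partition(":")
    if not sep or not dbname or not accession:
        raise ValueError(f"not a dbname:dbaccession cross-reference: {value!r}")
    return dbname, accession


def normalize_po_id(value):
    """Collapse a repeated 'PO:' prefix, e.g. PO:PO:0007014 -> PO:0007014."""
    while value.startswith("PO:PO:"):
        value = value[len("PO:"):]
    return value


print(split_dbxref("PMID:2676709"))
print(normalize_po_id("PO:PO:0007014"))
```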