Difference between revisions of "Annotation Association File Format"

Revision as of 21:21, 25 August 2011

Overview

Collaborating databases and projects provide the POC project with a tab-delimited file, known informally as an "association file". This file carries links between database objects and PO terms. The database object may represent a gene, transcript, protein, protein_structure, complex, germplasm (stock/cultivar), mutant, QTL, etc. The columns in the file are described below. Here is a sample file containing associations from the Gramene database.

File Name

po_aspect_objecttype_organism_organization.assoc

aspect: growth/anatomy/development.
objecttype: gene/mutant/germplasm/qtl etc.
organism: always the genus, e.g. arabidopsis/oryza/zea.
organization: the institute/project that is contributing the association files.

For example:
po_anatomy_gene_arabidopsis_tair.assoc
po_growth_gene_arabidopsis_tair.assoc
po_anatomy_gene_oryza_gramene.assoc
po_growth_gene_oryza_gramene.assoc
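
The naming pattern above can be checked mechanically. The following is a minimal sketch, not part of the POC tooling: the helper name parse_assoc_filename is hypothetical, and it assumes that none of the four parts contains an underscore itself, which this page does not state.

```python
def parse_assoc_filename(filename):
    """Split po_aspect_objecttype_organism_organization.assoc into its parts."""
    stem, dot, extension = filename.rpartition(".")
    if dot != "." or extension != "assoc":
        raise ValueError(f"expected a .assoc file name: {filename!r}")
    parts = stem.split("_")
    # Assumption: exactly five underscore-separated parts, the first being "po".
    if len(parts) != 5 or parts[0] != "po":
        raise ValueError(f"unexpected file name pattern: {filename!r}")
    _, aspect, objecttype, organism, organization = parts
    return {
        "aspect": aspect,
        "objecttype": objecttype,
        "organism": organism,
        "organization": organization,
    }


info = parse_assoc_filename("po_anatomy_gene_oryza_gramene.assoc")
print(info["organism"], info["organization"])
```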

File Format

The flat file format comprises 16 tab-delimited fields (see also a further example containing several annotations). Make sure the column order is strictly followed.

(* denotes required fields)

Column	Content	Example
1. *	DB	GR
2. *	DB_Object_ID	GR:0060905
3. *	DB_Object_Symbol	lrd10
4.	Qualifier	
5. *	PO ID	PO:0007014
6. *	DB:Reference	PMID:2676709
7. *	Evidence	IMP
8.	With (or) From	
9. *	Aspect	G
10.	DB_Object_Name	lesion resembling disease-10
11.	Synonym	bl5|spotted leaf-4
12. *	DB_Object_Type	gene
13. *	Taxon	taxon:4527
14. *	Date	20050303
15. *	Assigned_by	GR
16.	Annotation_extension	part_of PO:0028002
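
As an illustration of the column layout, a line can be split on tabs and mapped to named fields. This is only a sketch: the key names, the REQUIRED set and the function parse_assoc_line are hypothetical choices made here, and accepting either 16 or 17 fields is an assumption (column 17 is described further down this page). The sample line is built from the Example column of the table above.

```python
COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "PO_ID",
    "DB_Reference", "Evidence", "With_or_From", "Aspect", "DB_Object_Name",
    "Synonym", "DB_Object_Type", "Taxon", "Date", "Assigned_by",
    "Annotation_extension", "Gene_Product_Form_ID",
]

# Columns marked with * in the table above.
REQUIRED = {
    "DB", "DB_Object_ID", "DB_Object_Symbol", "PO_ID", "DB_Reference",
    "Evidence", "Aspect", "DB_Object_Type", "Taxon", "Date", "Assigned_by",
}


def parse_assoc_line(line):
    """Map one tab-delimited association line to a dict keyed by column name."""
    values = line.rstrip("\r\n").split("\t")
    # Assumption: 16 fields, or 17 when the Gene Product Form ID is present.
    if len(values) not in (16, 17):
        raise ValueError(f"expected 16 or 17 fields, got {len(values)}")
    values += [""] * (len(COLUMNS) - len(values))
    record = dict(zip(COLUMNS, values))
    missing = [name for name in COLUMNS if name in REQUIRED and not record[name]]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    return record


sample = "\t".join([
    "GR", "GR:0060905", "lrd10", "", "PO:0007014", "PMID:2676709", "IMP", "",
    "G", "lesion resembling disease-10", "bl5|spotted leaf-4", "gene",
    "taxon:4527", "20050303", "GR", "part_of PO:0028002",
])
record = parse_assoc_line(sample)
print(record["DB_Object_Symbol"], record["PO_ID"])
```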

Additional information can be found at the GO page for the GAF 2.0 format.

Description of the content

1. DB

The database contributing the association file.
One of the values in the table of database abbreviations. [Database abbreviations explanation]
This field is mandatory, cardinality 1

2. DB_Object_ID

A unique identifier in DB for the item being annotated.
This field is mandatory, cardinality 1.

3. DB_Object_Symbol

A (unique and valid) symbol to which DB_Object_ID is matched.
Can use ORF name for otherwise unnamed gene or protein.
If gene products are annotated, can use gene product symbol if available, or many gene product annotation entries can share a gene symbol.
This field is mandatory, cardinality 1

4. Qualifier

Flags that modify the interpretation of an annotation.
One (or more) of NOT, contributes_to, colocalizes_with.
This field is not mandatory; cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. NOT|contributes_to).
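
A minimal sketch of how the Qualifier column could be read, assuming the three flags listed above are the only permitted values; parse_qualifiers is a hypothetical helper name.

```python
ALLOWED_QUALIFIERS = {"NOT", "contributes_to", "colocalizes_with"}


def parse_qualifiers(field):
    """Return the list of qualifier flags in a column 4 value (may be empty)."""
    if not field:
        return []  # cardinality 0 is allowed
    qualifiers = field.split("|")
    unknown = [q for q in qualifiers if q not in ALLOWED_QUALIFIERS]
    if unknown:
        raise ValueError(f"unknown qualifier(s): {', '.join(unknown)}")
    return qualifiers


print(parse_qualifiers("NOT|contributes_to"))
```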

5. POid

The PO identifier for the term attributed to the DB_Object_ID.
This field is mandatory, cardinality 1.

6. DB:Reference

The unique identifier appropriate to DB for the authority for the attribution of the POid to the DB_Object_ID. This may be a literature reference or a database record. The syntax is DB:accession_number.
Note that only one reference can be cited on a single line. If a reference has identifiers in more than one database, multiple identifiers can be included on a single line. For example, if the reference is a published paper that has a PubMed ID, we strongly recommend that the PubMed ID be included, as well as an identifier within a model organism database.
This field is mandatory, cardinality 1, >1; for cardinality >1 use a pipe to separate entries (e.g. GR:8789|PMID:2676709).
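
The DB:accession_number syntax and the pipe separator can be handled as in the sketch below. The function name parse_references is hypothetical, and splitting on the first colon is an assumption about how accessions that themselves contain colons should be treated.

```python
def parse_references(field):
    """Split a column 6 value into (database, accession) pairs."""
    if not field:
        raise ValueError("DB:Reference is mandatory")
    references = []
    for entry in field.split("|"):
        # Assumption: the database name ends at the first colon.
        database, colon, accession = entry.partition(":")
        if not (database and colon and accession):
            raise ValueError(f"not in DB:accession_number form: {entry!r}")
        references.append((database, accession))
    return references


print(parse_references("GR:8789|PMID:2676709"))
```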

7. Evidence

One of IMP, IGI, IPI, IAGP, ISS, IDA, IEP, IEA, TAS, NAS, ND, IC, RCA.
This field is mandatory, cardinality 1.

8. With (or) From

One of:
- DB:gene_symbol
- DB:gene_symbol[allele_symbol]
- DB:gene_id
- DB:protein_name
- DB:sequence_id
- PO:PO_id
This field is not mandatory (except in the case of IC evidence code), cardinality 0, 1, >1.
Note: This field is used to hold an additional identifier for annotations using certain evidence codes (IC, IEA, IGI, IPI, ISS). Cardinality = 0 is not recommended, but is permitted because cases can be found in literature where no database identifier can be found (e.g. physical interaction or sequence similarity to a protein, but no ID provided). Annotations where evidence is IGI, IPI, or ISS and 'with' cardinality = 0 should link to an explanation of why there is no entry in 'with.' Cardinality may be >1 for any of the evidence codes that use 'with'; for IPI and IGI cardinality >1 has a special meaning (see evidence documentation for more information). For cardinality >1 use a pipe to separate entries (e.g. TAIR:Atg111111|TAIR:Atg222222).
Note that a gene/locus ID may be used in the 'with' column for an IPI annotation, or for an ISS annotation based on amino acid sequence or protein structure similarity, if the database does not have identifiers for individual gene products.
'PO:PO_id' is used only when the evidence code is 'IC', and refers to the PO term(s) used as the basis of a curator inference. In these cases the entry in the 'DB:Reference' column will be that used to assign the PO term(s) from which the inference is made. This field is mandatory for evidence code IC.
The ID is usually an identifier for an individual entry in a database (such as a sequence ID, gene ID, PO ID, etc.). Identifiers from the Center for Biological Sequence Analysis (CBS), however, represent tools used to find homology or sequence similarity; these identifiers can be used in the 'with' column for ISS annotations.
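
The interplay between the Evidence and With (or) From columns can be sketched as a small check. This is an illustration under assumptions, not a validator used by the POC: the function name and the "error"/"warning" levels are hypothetical, and treating an empty 'with' for IEA, IGI, IPI and ISS as a warning is one reading of "not recommended, but permitted".

```python
EVIDENCE_CODES = {
    "IMP", "IGI", "IPI", "IAGP", "ISS", "IDA", "IEP",
    "IEA", "TAS", "NAS", "ND", "IC", "RCA",
}
# Evidence codes for which an entry in 'with' is expected.
USES_WITH = {"IC", "IEA", "IGI", "IPI", "ISS"}


def check_evidence_and_with(evidence, with_field):
    """Return a list of (level, message) tuples; an empty list means no problems."""
    if evidence not in EVIDENCE_CODES:
        return [("error", f"unknown evidence code: {evidence!r}")]
    problems = []
    entries = with_field.split("|") if with_field else []
    if evidence == "IC" and not entries:
        problems.append(("error", "'with' is mandatory for evidence code IC"))
    elif evidence in USES_WITH and not entries:
        problems.append(("warning", f"'with' is empty for evidence code {evidence}"))
    for entry in entries:
        database, colon, identifier = entry.partition(":")
        if not (database and colon and identifier):
            problems.append(("error", f"not in DB:identifier form: {entry!r}"))
    return problems


print(check_evidence_and_with("IGI", "TAIR:Atg111111|TAIR:Atg222222"))
```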

9. Aspect

Either A (plant structure) or G (growth stage and development stage)
This field is mandatory; cardinality 1

10. DB_Object_Name

Name of the object, e.g. the name of the gene or gene product.
This field is not mandatory, cardinality 0, 1 [white space allowed]

11. Synonym

Any aliases. e.g. Gene_symbol [or other text]
This field is not mandatory, cardinality 0, 1, >1 [white space allowed]; for cardinality >1 use a pipe to separate entries (e.g. bl5|spotted leaf-4).

12. DB_Object_Type

What kind of thing is being annotated
One of gene, transcript, protein, protein_structure, complex, germplasm (stock/cultivar), mutant, QTL etc.
This field is mandatory, cardinality 1

13. Taxon

Taxonomic identifier(s)
For cardinality 1, the ID of the species representing the Object.
For cardinality 2, the first ID is that of the species encoding the gene product; the second ID is that of the species using the gene product.
This field is mandatory, cardinality 1, 2; for cardinality 2 use a pipe to separate entries (e.g. taxon:1|taxon:1000)
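
A minimal sketch of reading the Taxon column. It assumes the part after "taxon:" is always a plain number, as in the examples on this page; parse_taxon is a hypothetical helper name.

```python
def parse_taxon(field):
    """Return one or two numeric taxon IDs from a column 13 value."""
    entries = field.split("|")
    if len(entries) not in (1, 2):
        raise ValueError(f"expected 1 or 2 taxon IDs, got {len(entries)}")
    taxon_ids = []
    for entry in entries:
        prefix, _, number = entry.partition(":")
        # Assumption: IDs are written as taxon:<digits>.
        if prefix != "taxon" or not (number.isascii() and number.isdigit()):
            raise ValueError(f"not in taxon:<number> form: {entry!r}")
        taxon_ids.append(int(number))
    return taxon_ids


print(parse_taxon("taxon:1|taxon:1000"))
```

With two IDs, the first element is the species encoding the gene product and the second is the species using it, following the description above.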

14. Date

Date on which the annotation was made; format is YYYYMMDD
This field is mandatory, cardinality 1
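
The YYYYMMDD format can be checked with the Python standard library, as in this sketch; parse_annotation_date is a hypothetical helper name.

```python
from datetime import datetime


def parse_annotation_date(field):
    """Parse a column 14 value in YYYYMMDD form into a date object."""
    if len(field) != 8 or not (field.isascii() and field.isdigit()):
        raise ValueError(f"expected eight digits (YYYYMMDD): {field!r}")
    # strptime raises ValueError for impossible dates such as month 13.
    return datetime.strptime(field, "%Y%m%d").date()


print(parse_annotation_date("20050303").isoformat())
```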

15. Assigned_by

The database which made the annotation
One of the values in the table of database abbreviations. Database abbreviations explanation
Used for tracking the source of an individual annotation.
Default value is value entered in column 1 (DB).
Value will differ from column 1 for any annotation that is made by one database and incorporated into another.
This field is mandatory, cardinality 1

16. Annotation_extension

NOTE: Usage specifications for column 16 for the PO are under development. Check with curators before using this column.

Contains cross references to a PO term that can be used to qualify or enhance the annotation. The cross-reference is prefaced by an appropriate PO relationship (for now, part_of or participates_in; use of other relations may be allowed in the future).
One or more of: relation(PO:id)
Example 1: If a gene product is localized to the leaf tip of a vascular leaf, the PO ID (column 5) would be leaf tip (PO:0025142), and the annotation extension column would contain a cross-reference to part_of vascular leaf (PO:0009025).
Example 2: If a gene product is localized in a leaf during senescence, the PO ID (column 5) would be leaf (PO:0009025), and the annotation extension column would contain a cross-reference to participates_in leaf senescence stage (PO:0001054).
This field is not mandatory, cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries, e.g., part_of(PO:0009025)|participates_in(PO:0001054).
See additional information and discussion on the PO Annotation Extensions (column 16) page.
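
The relation(PO:id) syntax can be parsed with a regular expression, as sketched below. Since usage of this column is still under development, treat this purely as an illustration: the helper name is hypothetical, only the two relations named above are accepted, and PO IDs are assumed to have seven digits, as in every example on this page. A space-separated entry such as "part_of PO:0028002" would be rejected by this sketch.

```python
import re

# Assumptions: only part_of/participates_in are allowed; PO IDs have 7 digits.
EXTENSION_RE = re.compile(r"(part_of|participates_in)\((PO:\d{7})\)")


def parse_annotation_extension(field):
    """Split a column 16 value into (relation, PO ID) pairs."""
    if not field:
        return []  # cardinality 0 is allowed
    extensions = []
    for entry in field.split("|"):
        match = EXTENSION_RE.fullmatch(entry)
        if match is None:
            raise ValueError(f"not in relation(PO:id) form: {entry!r}")
        extensions.append((match.group(1), match.group(2)))
    return extensions


print(parse_annotation_extension("part_of(PO:0009025)|participates_in(PO:0001054)"))
```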

17. Gene Product Form ID

As the DB Object ID (column 2) entry must be a canonical entity (a gene, or an abstract protein that has a 1:1 correspondence to a gene), this field allows the annotation of specific variants of that gene or gene product. Contents will frequently include protein sequence identifiers: for example, identifiers that specify distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage or post-translational modification. Identifiers for functional RNAs can also be included in this column.

The identifier used must be a standard 2-part global identifier, e.g. UniProtKB:OK0206-2.

When the gene product form ID (column 17) is filled with a protein identifier, the value in DB object type (column 12) must be protein. Protein identifiers can include UniProtKB accession numbers, NCBI NP identifiers or Protein Ontology (PRO) identifiers.

When the gene product form ID (column 17) is filled with a functional RNA identifier, the DB object type (column 12) must be either ncRNA, rRNA, tRNA, snRNA, or snoRNA.

This column may be left blank; if so, the value in DB object type (column 12) will provide a description of the expected gene product.

More information and examples are available from the GO wiki page on column 17.
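
The dependency between column 17 and column 12 can be sketched as a consistency check. This is an illustration only: the function name is hypothetical, and the sketch does not try to decide whether an identifier denotes a protein or an RNA. It only checks that, when column 17 is filled, it has two parts and column 12 holds one of the object types named above.

```python
ALLOWED_TYPES_WITH_FORM = {"protein", "ncRNA", "rRNA", "tRNA", "snRNA", "snoRNA"}


def check_gene_product_form(form_id, object_type):
    """Return a list of problem messages for columns 17 and 12 taken together."""
    if not form_id:
        return []  # column 17 may be left blank
    problems = []
    database, colon, accession = form_id.partition(":")
    if not (database and colon and accession):
        problems.append(f"not a 2-part identifier: {form_id!r}")
    if object_type not in ALLOWED_TYPES_WITH_FORM:
        problems.append(
            f"DB object type {object_type!r} is not protein or a functional RNA type"
        )
    return problems


print(check_gene_product_form("UniProtKB:OK0206-2", "protein"))
```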

Note

Several fields contain database cross-references (dbxrefs) in the format dbname:dbaccession. The fields are: POid (where dbname is always PO), DB:Reference, With, Taxon (where dbname is always taxon), and Annotation extension. For PO IDs, do not repeat the 'PO:' prefix (i.e. always use PO:0000000, not PO:PO:0000000).
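
A repeated 'PO:' prefix is easy to detect before a file is submitted. The sketch below strips the repetition and then checks the result; the helper name normalize_po_id is hypothetical, and the seven-digit pattern is an assumption based on the examples on this page. Whether to repair such IDs silently or reject them is a choice for the submitting database.

```python
import re

# Assumption: a PO ID is "PO:" followed by exactly seven digits.
PO_ID_RE = re.compile(r"PO:\d{7}")


def normalize_po_id(value):
    """Collapse a repeated 'PO:' prefix and verify the resulting PO ID."""
    while value.startswith("PO:PO:"):
        value = value[len("PO:"):]
    if PO_ID_RE.fullmatch(value) is None:
        raise ValueError(f"not a valid PO ID: {value!r}")
    return value


print(normalize_po_id("PO:PO:0007014"))
```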
