Difference between revisions of "Annotation Association File Format"

From Plant Ontology Wiki
Line 166:
  *This field is mandatory, cardinality 1
- ===Assigned_by===
+ ===15. Assigned_by===
  *The database which made the annotation
  *One of the values in the table of database abbreviations. [http://plantontology.org/docs/dbxref/PO_DBXref.txt Database abbreviations explanation]
Line 174:
  *This field is mandatory, cardinality 1
- <br />
+ ===16. Annotation_extension===
- Note that several fields contain database cross-reference (dbxrefs) in the format dbname:dbaccession. The fields are: POid (where dbname is always PO), DB:Reference, With, Taxon (where dbname is always taxon). For PO ids, do not repeat the 'PO:' prefix (i.e. always use PO:0000000, not PO:PO:00000000)</blockquote>
+
+
+
+ Note that several fields contain database cross-reference (dbxrefs) in the format dbname:dbaccession. The fields are: POid (where dbname is always PO), DB:Reference, With, Taxon (where dbname is always taxon), Annotation extension. For PO ids, do not repeat the 'PO:' prefix (i.e. always use PO:0000000, not PO:PO:00000000)
  [[Category:SOP]]

Revision as of 05:13, 13 June 2011

Overview

Collaborating databases and projects provide the POC project with a tab-delimited file, known informally as an "association file". This file carries links between database objects and PO terms. A database object may be a gene, transcript, protein, protein_structure, complex, germplasm (stock/cultivar), mutant, QTL, etc. The columns in the file are described below. Here is a sample file containing associations from the Gramene database.

File Name

po_aspect_objecttype_organism_organization.assoc

aspect: growth/anatomy/development.
objecttype: gene/mutant/germplasm/qtl etc.
organism: always the genus, e.g. arabidopsis/oryza/zea.
organization: the institute/project contributing the association file.

For example:
po_anatomy_gene_arabidopsis_tair.assoc
po_growth_gene_arabidopsis_tair.assoc
po_anatomy_gene_oryza_gramene.assoc
po_growth_gene_oryza_gramene.assoc
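
The naming convention above can be checked mechanically. The following is a minimal sketch, not part of the PO tooling: the function name `parse_assoc_filename` and the `AssocFileName` type are hypothetical, and the sketch assumes that no component of the name contains an underscore of its own (an object type such as protein_structure would need different handling).

```python
from typing import NamedTuple

# Aspect values listed in the naming convention above.
ASPECTS = {"growth", "anatomy", "development"}


class AssocFileName(NamedTuple):
    """Hypothetical container for the parts of an association file name."""
    aspect: str
    object_type: str
    organism: str
    organization: str


def parse_assoc_filename(name: str) -> AssocFileName:
    """Split po_aspect_objecttype_organism_organization.assoc into its parts."""
    suffix = ".assoc"
    if not name.endswith(suffix):
        raise ValueError(f"expected a {suffix} file, got {name!r}")
    parts = name[: -len(suffix)].split("_")
    # Assumption: exactly five underscore-separated parts, starting with "po".
    if len(parts) != 5 or parts[0] != "po":
        raise ValueError(f"unexpected file name layout: {name!r}")
    _, aspect, object_type, organism, organization = parts
    if aspect not in ASPECTS:
        raise ValueError(f"unknown aspect {aspect!r} in {name!r}")
    return AssocFileName(aspect, object_type, organism, organization)


parsed = parse_assoc_filename("po_anatomy_gene_oryza_gramene.assoc")
print(parsed.organism, parsed.organization)
```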

File Format

The flat file format comprises 16 tab-delimited fields (see also a further example containing several annotations). The column order must be followed strictly.
An asterisk (*) denotes a required field:

Column  Content               Example
1.  *   DB                    GR
2.  *   DB_Object_ID          GR:0060905
3.  *   DB_Object_Symbol      lrd10
4.      Qualifier
5.  *   PO ID                 PO:0007014
6.  *   DB:Reference          PMID:2676709
7.  *   Evidence              IMP
8.      With (or) From
9.  *   Aspect                G
10.     DB_Object_Name        lesion resembling disease-10
11.     Synonym               bl5|spotted leaf-4
12. *   DB_Object_Type        gene
13. *   Taxon                 taxon:4527
14. *   Date                  20050303
15. *   Assigned_by           GR
16.     Annotation_extension  PO:0009025
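
As an illustration of the column layout, the sketch below splits one line into named fields and checks that the required ones are filled. It assumes every line carries all 16 columns listed in the table, with optional columns left empty; the dictionary keys and the function name `parse_assoc_line` are hypothetical choices for this example, not names defined by the format.

```python
# Column names in file order; keys are illustrative, adapted from the table.
FIELD_NAMES = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "PO_ID",
    "DB_Reference", "Evidence", "With_From", "Aspect", "DB_Object_Name",
    "Synonym", "DB_Object_Type", "Taxon", "Date", "Assigned_by",
    "Annotation_extension",
]

# Columns marked with * in the table.
REQUIRED = {
    "DB", "DB_Object_ID", "DB_Object_Symbol", "PO_ID", "DB_Reference",
    "Evidence", "Aspect", "DB_Object_Type", "Taxon", "Date", "Assigned_by",
}


def parse_assoc_line(line: str) -> dict[str, str]:
    """Split one tab-delimited line into a field-name -> value mapping."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} columns, got {len(values)}")
    record = dict(zip(FIELD_NAMES, values))
    missing = [name for name in FIELD_NAMES if name in REQUIRED and not record[name]]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    return record


# Sample line assembled from the example values in the table above.
sample_line = "\t".join([
    "GR", "GR:0060905", "lrd10", "", "PO:0007014", "PMID:2676709", "IMP", "",
    "G", "lesion resembling disease-10", "bl5|spotted leaf-4", "gene",
    "taxon:4527", "20050303", "GR", "PO:0009025",
])
record = parse_assoc_line(sample_line)
print(record["DB_Object_ID"], record["PO_ID"])
```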


Description of the content

DB

  • The database contributing the association file.
  • One of the values in the table of database abbreviations. [Database abbreviations explanation]
  • This field is mandatory, cardinality 1

DB_Object_ID

  • A unique identifier in DB for the item being annotated.
  • This field is mandatory, cardinality 1.

DB_Object_Symbol

  • A (unique and valid) symbol to which DB_Object_ID is matched.
  • Can use ORF name for otherwise unnamed gene or protein.
  • If gene products are annotated, can use gene product symbol if available, or many gene product annotation entries can share a gene symbol.
  • This field is mandatory, cardinality 1

Qualifier

  • Flags that modify the interpretation of an annotation.
  • One (or more) of NOT, contributes_to, colocalizes_with.
  • This field is not mandatory; cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. NOT|contributes_to).
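
A short sketch of how the Qualifier column might be read, assuming the three values listed above are the complete set of allowed flags; the function name `parse_qualifiers` is hypothetical.

```python
ALLOWED_QUALIFIERS = {"NOT", "contributes_to", "colocalizes_with"}


def parse_qualifiers(field: str) -> list[str]:
    """Return the qualifiers in a Qualifier column value (possibly none)."""
    if not field:
        return []  # cardinality 0: the column is empty
    qualifiers = field.split("|")  # cardinality >1 is pipe-separated
    unknown = [q for q in qualifiers if q not in ALLOWED_QUALIFIERS]
    if unknown:
        raise ValueError(f"unknown qualifier(s): {', '.join(unknown)}")
    return qualifiers


print(parse_qualifiers("NOT|contributes_to"))
```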

POid

  • The PO identifier for the term attributed to the DB_Object_ID.
  • This field is mandatory, cardinality 1.

DB:Reference

  • The unique identifier appropriate to DB for the authority for the attribution of the POid to the DB_Object_ID. This may be a literature reference or a database record. The syntax is DB:accession_number.
  • Note that only one reference can be cited on a single line. If a reference has identifiers in more than one database, multiple identifiers can be included on a single line. For example, if the reference is a published paper that has a PubMed ID, we strongly recommend that the PubMed ID be included, as well as an identifier within a model organism database.
  • This field is mandatory, cardinality 1, >1; for cardinality >1 use a pipe to separate entries (e.g. GR:8789|PMID:2676709).

Evidence

  • One of IMP, IGI, IPI, IAGP, ISS, IDA, IEP, IEA, TAS, NAS, ND, IC, RCA.
  • This field is mandatory, cardinality 1.

With (or) From

  • One of:
    • DB:gene_symbol
    • DB:gene_symbol[allele_symbol]
    • DB:gene_id
    • DB:protein_name
    • DB:sequence_id
    • PO:PO_id
  • This field is not mandatory (except in the case of IC evidence code), cardinality 0, 1, >1
  • Note: This field is used to hold an additional identifier for annotations using certain evidence codes (IC, IEA, IGI, IPI, ISS). Cardinality = 0 is not recommended, but is permitted because cases can be found in literature where no database identifier can be found (e.g. physical interaction or sequence similarity to a protein, but no ID provided). Annotations where evidence is IGI, IPI, or ISS and 'with' cardinality = 0 should link to an explanation of why there is no entry in 'with.' Cardinality may be >1 for any of the evidence codes that use 'with'; for IPI and IGI cardinality >1 has a special meaning (see evidence documentation for more information). For cardinality >1 use a pipe to separate entries (e.g. TAIR:Atg111111|TAIR:Atg222222).
  • Note that a gene/locus ID may be used in the 'with' column for an IPI annotation, or for an ISS annotation based on amino acid sequence or protein structure similarity, if the database does not have identifiers for individual gene products.
  • 'PO:PO_id' is used only when the evidence code is 'IC', and refers to the PO term(s) used as the basis of a curator inference. In these cases the entry in the 'DB:Reference' column will be that used to assign the PO term(s) from which the inference is made. This field is mandatory for evidence code IC.
  • The ID is usually an identifier for an individual entry in a database (such as a sequence ID, gene ID, PO ID, etc.). Identifiers from the Center for Biological Sequence Analysis (CBS), however, represent tools used to find homology or sequence similarity; these identifiers can be used in the 'with' column for ISS annotations.
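
The interaction between the Evidence and With (or) From columns can be sketched as a small check. This is an illustrative reading of the rules above, not an official validator: the function name `check_evidence_and_with` and the "error:"/"warning:" message prefixes are assumptions made for this example.

```python
# Evidence codes listed in the Evidence section.
EVIDENCE_CODES = {
    "IMP", "IGI", "IPI", "IAGP", "ISS", "IDA", "IEP",
    "IEA", "TAS", "NAS", "ND", "IC", "RCA",
}
# Codes for which an empty With column is permitted but should be explained.
WITH_RECOMMENDED = {"IGI", "IPI", "ISS"}


def check_evidence_and_with(evidence: str, with_field: str) -> list[str]:
    """Return a list of problems found for one Evidence / With pair."""
    messages = []
    if evidence not in EVIDENCE_CODES:
        messages.append(f"error: unknown evidence code {evidence!r}")
    entries = with_field.split("|") if with_field else []
    if evidence == "IC" and not entries:
        # With is mandatory for IC annotations.
        messages.append("error: IC annotations require a With entry")
    elif evidence in WITH_RECOMMENDED and not entries:
        messages.append(f"warning: {evidence} annotation has no With entry")
    return messages


print(check_evidence_and_with("IC", ""))
```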

Aspect

  • Either A (plant structure) or G (growth stage and development stage)
  • This field is mandatory; cardinality 1

DB_Object_Name

  • Name of the object, e.g. a gene or gene product
  • This field is not mandatory, cardinality 0, 1 [white space allowed]

Synonym

  • Any aliases, e.g. Gene_symbol [or other text]
  • This field is not mandatory, cardinality 0, 1, >1 [white space allowed]

DB_Object_Type

  • What kind of thing is being annotated
  • One of gene, transcript, protein, protein_structure, complex, germplasm (stock/cultivar), mutant, QTL etc.
  • This field is mandatory, cardinality 1

Taxon

  • Taxonomic identifier of the organism to which the annotated object belongs.
  • A dbxref in which dbname is always taxon (e.g. taxon:4527).
  • This field is mandatory, cardinality 1

Date

  • Date on which the annotation was made; format is YYYYMMDD
  • This field is mandatory, cardinality 1
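
The YYYYMMDD format can be validated with the Python standard library. In this sketch the function name `parse_annotation_date` is hypothetical; the explicit length check is there because `strptime` on its own also accepts shorter inputs such as 2005303.

```python
from datetime import date, datetime


def parse_annotation_date(field: str) -> date:
    """Parse a Date column value in YYYYMMDD format."""
    # Require exactly eight ASCII digits before handing over to strptime.
    if len(field) != 8 or not (field.isascii() and field.isdigit()):
        raise ValueError(f"date must be 8 digits (YYYYMMDD), got {field!r}")
    # strptime raises ValueError for impossible dates such as month 13.
    return datetime.strptime(field, "%Y%m%d").date()


print(parse_annotation_date("20050303").isoformat())
```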

15. Assigned_by

  • The database which made the annotation
  • One of the values in the table of database abbreviations. Database abbreviations explanation
  • Used for tracking the source of an individual annotation.
  • Default value is value entered in column 1 (DB).
  • Value will differ from column 1 for any annotation that is made by one database and incorporated into another.
  • This field is mandatory, cardinality 1

16. Annotation_extension

Note that several fields contain database cross-references (dbxrefs) in the format dbname:dbaccession. The fields are: POid (where dbname is always PO), DB:Reference, With, Taxon (where dbname is always taxon), and Annotation_extension. For PO ids, do not repeat the 'PO:' prefix (i.e. always use PO:0000000, not PO:PO:0000000).
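
The dbxref convention and the repeated-prefix mistake can be illustrated with a minimal sketch. Both function names (`parse_dbxref`, `check_po_id`) are hypothetical, and the sketch only checks the dbname:dbaccession shape; it does not verify that the accession exists in any database.

```python
def parse_dbxref(value: str) -> tuple[str, str]:
    """Split a dbxref of the form dbname:dbaccession at the first colon."""
    dbname, sep, accession = value.partition(":")
    if not sep or not dbname or not accession:
        raise ValueError(f"not a dbname:dbaccession pair: {value!r}")
    return dbname, accession


def check_po_id(value: str) -> str:
    """Return value unchanged if it looks like a PO id with a single prefix."""
    dbname, accession = parse_dbxref(value)
    if dbname != "PO":
        raise ValueError(f"PO ids must use the PO prefix: {value!r}")
    if accession.startswith("PO:"):
        # Catches the repeated prefix, e.g. PO:PO:0000000.
        raise ValueError(f"repeated 'PO:' prefix: {value!r}")
    return value


print(parse_dbxref("taxon:4527"))
print(check_po_id("PO:0007014"))
```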