Genome annotation is the process of attaching structural and functional information to sequences. Annotations include sequence similarities, biological functions, locations of regulatory motifs, expression data and interactions.
Recommendations
Target user(s): Data managers, Bioinformaticians
Summary
- Recommended data format: GFF3 format.
- Provide a comprehensive content description for column 9 in the GFF3 file.
- Consistent use of external database cross references (Dbxref).
- Consistent use of ontologies.
1. Data format
We recommend the GFF3 file format for the representation of genome annotations.
Description
The GFF3 file format is widely used by the community and is a good option for representing genome annotations. However, the description of some columns needs attention. Column 9 (“attributes”) varies in the type of information it contains, whereas the other columns hold specific values (position, chromosome…). Column 9 therefore needs guidelines: currently the ID and Name of the feature are the mandatory information, and the other attributes are not specified, so adopters use them in different ways.
Specifics
Guidelines for describing content for Column 9 “attributes”:
- ID
- Indicates the ID of the feature. IDs for each feature must be unique within the scope of the GFF file. In the case of discontinuous features (i.e. a single feature that exists over multiple genomic locations) the same ID may appear on multiple lines. All lines that share an ID collectively represent a single feature.
- Name
- Display name for the feature. This is the name to be displayed to the user. Unlike IDs, there is no requirement that the Name be unique within the file.
- Alias
- A secondary name for the feature. It is suggested that this tag be used whenever a secondary identifier for the feature is needed, such as locus names and accession numbers. Unlike ID, there is no requirement that Alias be unique within the file.
- Parent
- Indicates the parent of the feature. A parent ID can be used to group exons into transcripts, transcripts into genes, and so forth. A feature may have multiple parents. Parent can *only* be used to indicate a part-of relationship.
- Target
- Indicates the target of a nucleotide-to-nucleotide or protein-to-nucleotide alignment. The format of the value is “target_id start end [strand]”, where strand is optional and may be “+” or “-”. If the target_id contains spaces, they must be escaped as hex escape %20.
- Gap
- The alignment of the feature to the target if the two are not collinear (e.g. contain gaps). The alignment format is taken from the CIGAR format described in the Exonerate documentation. See “THE GAP ATTRIBUTE” in the GFF3 specification for a description of this format.
- Derives_from
- Used to disambiguate the relationship between one feature and another when the relationship is a temporal one rather than a purely structural “part of” one. This is needed for polycistronic genes. See “PATHOLOGICAL CASES” in the GFF3 specification for further discussion.
- Note
- A free text note.
- Dbxref
- A database cross reference. See the section “Ontology Associations and Db Cross References” of the GFF3 specification for details on the format.
- Ontology_term
- A cross reference to an ontology term. See the section “Ontology Associations and Db Cross References” of the GFF3 specification for details.
- Is_circular
- A flag to indicate whether a feature is circular. See the GFF3 specification for an extended discussion.
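The attribute guidelines above can be illustrated with a small parser for column 9. This is a minimal sketch, not a reference implementation: `parse_attributes` and `MULTI_VALUE_TAGS` are hypothetical names, the example record is made up for illustration, and the sketch assumes the GFF3 conventions that `tag=value` pairs are separated by semicolons, that Parent, Alias, Note, Dbxref and Ontology_term may hold several comma-separated values, and that reserved characters are percent-encoded.

```python
from urllib.parse import unquote

# Tags assumed to accept several comma-separated values.
MULTI_VALUE_TAGS = {"Parent", "Alias", "Note", "Dbxref", "Ontology_term"}


def parse_attributes(column9):
    """Parse a GFF3 column 9 string into a dict (hypothetical helper)."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        tag, _, value = field.partition("=")
        if tag in MULTI_VALUE_TAGS:
            # Split on commas first, then decode escapes such as %2C or %20.
            attributes[tag] = [unquote(v) for v in value.split(",")]
        else:
            attributes[tag] = unquote(value)
    return attributes


# Made-up record for illustration only.
attrs = parse_attributes("ID=exon00001;Parent=mRNA00001,mRNA00002;Name=my%20exon")
print(attrs["ID"])      # exon00001
print(attrs["Parent"])  # ['mRNA00001', 'mRNA00002']
print(attrs["Name"])    # my exon
```

Splitting on commas before decoding keeps an escaped comma (%2C) inside a value from being mistaken for a separator.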
2. Good practices
- Use homogeneous abbreviation tags for databases.
- Use column 9 consistently: record cross references in the Dbxref attribute.
Dbxref is the ID of the cross-referenced object in the form DBTAG:ID. The DBTAG indicates which database the referenced object can be found in, and ID indicates the identifier of the object within that database. IDs can contain unescaped colons but DBTAGs cannot, so parsing code should split on the first colon encountered in the attribute value.
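The rule of splitting on the first colon can be sketched as follows. The function name `split_dbxref` is hypothetical, and the second example value is constructed only to show an ID that itself contains a colon.

```python
def split_dbxref(value):
    """Split one Dbxref value into (DBTAG, ID) on the first colon."""
    dbtag, sep, identifier = value.partition(":")
    if not sep or not dbtag or not identifier:
        raise ValueError(f"not in DBTAG:ID form: {value!r}")
    return dbtag, identifier


print(split_dbxref("UniProt:M0Y5H6"))
# ('UniProt', 'M0Y5H6')
print(split_dbxref("EnsemblPlants:gene:MLOC_65880"))
# ('EnsemblPlants', 'gene:MLOC_65880')
```

Using `str.partition` rather than `str.split(":")` keeps any further colons inside the ID, which is the behaviour described above.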
Here is a suggestion for a homogeneous GFF3 format:
Original GFF3 for Hordeum_vulgare (EnsemblPlants):
1 ensembl gene 3656 4845 . - . ID=gene:MLOC_65880;assembly_name=082214v1;biotype=protein_coding;description=Uncharacterized protein [Source:UniProtKB/TrEMBL%3BAcc:M0Y5H6];logic_name=ibsc;version=1
Proposed format:
1 ensembl gene 3656 4845 . - . ID=gene:MLOC_65880;assembly_name=082214v1;biotype=protein_coding;description=Uncharacterized protein [Source:UniProtKB/TrEMBL%3BAcc:M0Y5H6];logic_name=ibsc;version=1;Dbxref=UniProt:M0Y5H6,EnsemblPlants:MLOC_65880
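One way to produce the proposed form programmatically is to append the cross references to column 9. The sketch below is illustrative only: `add_dbxref` is a hypothetical helper, the record is a shortened version of the Hordeum_vulgare line, and the code assumes tab-separated columns as the GFF3 specification requires.

```python
def add_dbxref(record, xrefs):
    """Return the GFF3 record with the DBTAG:ID values added to Dbxref."""
    columns = record.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated columns")
    fields = [f for f in columns[8].split(";") if f]
    for i, field in enumerate(fields):
        if field.startswith("Dbxref="):
            # Extend an existing Dbxref attribute without duplicating entries.
            existing = field[len("Dbxref="):].split(",")
            merged = existing + [x for x in xrefs if x not in existing]
            fields[i] = "Dbxref=" + ",".join(merged)
            break
    else:
        # No Dbxref attribute yet: append one at the end of column 9.
        fields.append("Dbxref=" + ",".join(xrefs))
    columns[8] = ";".join(fields)
    return "\t".join(columns)


# Shortened record, tab-separated (an assumption of this sketch).
record = "1\tensembl\tgene\t3656\t4845\t.\t-\t.\tID=gene:MLOC_65880;biotype=protein_coding"
updated = add_dbxref(record, ["UniProt:M0Y5H6", "EnsemblPlants:MLOC_65880"])
print(updated.split("\t")[8])
# ID=gene:MLOC_65880;biotype=protein_coding;Dbxref=UniProt:M0Y5H6,EnsemblPlants:MLOC_65880
```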
3. Metadata and Vocabularies
We recommend using ontologies, such as the Gene Ontology and the Sequence Ontology, for functional annotation in column 9.
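A simple consistency check on Ontology_term values might look like the sketch below. It assumes that only Gene Ontology and Sequence Ontology accessions are expected, written as `GO:` or `SO:` followed by seven digits; the names `TERM_PATTERN` and `invalid_ontology_terms` are hypothetical, and a project using other ontologies would need to extend the pattern.

```python
import re

# Assumed accession shape: GO or SO prefix, colon, seven digits.
TERM_PATTERN = re.compile(r"(GO|SO):\d{7}")


def invalid_ontology_terms(value):
    """Return the entries of an Ontology_term value that do not look like GO/SO accessions."""
    return [term for term in value.split(",") if not TERM_PATTERN.fullmatch(term)]


print(invalid_ontology_terms("GO:0005634,SO:0000704"))  # []
print(invalid_ontology_terms("GO:5634,nucleus"))        # ['GO:5634', 'nucleus']
```

The check only tests the syntax of each accession; it does not confirm that the term exists in the ontology.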
4. Tools
Convert data format
You can convert different formats to GFF3 using the Bioconvert tool.
GFF3 validator (GenomeTools):
http://genometools.org/cgi-bin/gff3validator.cgi
5. Examples
traes3bPseudomoleculeV1 GDEC marker 82454936 82455352 . - . ID=XwPt1159-3B;Name=XwPt1159-3B;marker=wPt1159;type=darts
traes3bPseudomoleculeV1 GDEC marker 771172313 771172855 . - . ID=XwPt2416-3B;Name=XwPt2416-3B;marker=wPt2416;type=darts
traes3bPseudomoleculeV1 GDEC marker 12174851 12175713 . + . ID=XwPt2757-3B;Name=XwPt2757-3B;marker=wPt2757;type=darts
traes3bPseudomoleculeV1 GDEC marker 586057169 586057670 . - . ID=XwPt3327-3B;Name=XwPt3327-3B;marker=wPt3327;type=darts
traes3bPseudomoleculeV1 GDEC marker 295038909 295039410 . - . ID=XwPt3327-3B.2;Name=XwPt3327-3B;marker=wPt3327;type=darts
v443_0484 GDEC marker 134945 135646 . + . ID=XwPt4933-3B;Name=XwPt4933-3B;marker=wPt4933;type=darts
traes3bPseudomoleculeV1 GDEC marker 755916365 755916938 . - . ID=XwPt5295-3B;Name=XwPt5295-3B;marker=wPt5295;type=darts
traes3bPseudomoleculeV1 GDEC marker 236794223 236794836 . + . ID=XwPt5390-3B;Name=XwPt5390-3B;marker=wPt5390;type=darts
traes3bPseudomoleculeV1 GDEC marker 749409255 749409819 . + . ID=XwPt5947-3B;Name=XwPt5947-3B;marker=wPt5947;type=darts
traes3bPseudomoleculeV1 GDEC marker 736342105 736342613 . - . ID=XwPt7301-3B;Name=XwPt7301-3B;marker=wPt7301;type=darts
traes3bPseudomoleculeV1 GDEC marker 614658212 614659360 . + . ID=XwPt7502-3B;Name=XwPt7502-3B;marker=wPt7502;type=darts
traes3bPseudomoleculeV1 GDEC marker 765686199 765687128 . + . ID=XwPt7514-3B;Name=XwPt7514-3B;marker=wPt7514;type=darts
traes3bPseudomoleculeV1 GDEC marker 765009795 765010398 . + . ID=XwPt8845-3B;Name=XwPt8845-3B;marker=wPt8845;type=darts
traes3bPseudomoleculeV1 GDEC marker 9584806 9585578 . + . ID=XwPt8855-3B;Name=XwPt8855-3B;marker=wPt8855;type=darts
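The marker records above show the difference between ID and Name: the two wPt3327 records share a Name but carry distinct IDs. The sketch below checks this on those two records. The helper names `attribute_values` and `repeated` are hypothetical, and the code splits on any whitespace (at most eight splits) so that it also handles the space-separated display used on this page; real GFF3 files are tab-separated. A repeated ID is not always an error, since discontinuous features legitimately reuse one ID across lines.

```python
from collections import Counter


def attribute_values(lines, tag):
    """Collect the values of one column 9 attribute across GFF3 records."""
    values = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        column9 = line.split(None, 8)[8]
        for field in column9.strip().split(";"):
            key, _, value = field.partition("=")
            if key == tag:
                values.append(value)
    return values


def repeated(values):
    """Return the values that occur more than once, sorted."""
    return sorted(v for v, n in Counter(values).items() if n > 1)


records = [
    "traes3bPseudomoleculeV1 GDEC marker 586057169 586057670 . - . "
    "ID=XwPt3327-3B;Name=XwPt3327-3B;marker=wPt3327;type=darts",
    "traes3bPseudomoleculeV1 GDEC marker 295038909 295039410 . - . "
    "ID=XwPt3327-3B.2;Name=XwPt3327-3B;marker=wPt3327;type=darts",
]
print(repeated(attribute_values(records, "ID")))    # []
print(repeated(attribute_values(records, "Name")))  # ['XwPt3327-3B']
```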
Writing: WDI working group
Creation date: 02 October 2014
Update: 31 July 2015