Genome annotations

Genome annotation is the process of attributing structural and functional information to sequences. Annotations cover sequence similarities, biological functions, locations of regulatory motifs, expression, and interactions.

Recommendations

Target user(s): Data managers, Bioinformaticians

Summary

Recommended data format: GFF3 format.
Provide comprehensive content description for column 9 in the GFF3 file.
Consistent use of external database cross references (Dbxref).
Consistent use of ontologies.

1. Data format

We recommend the GFF3 file format for the representation of genome annotations.

Description

The GFF3 file format is widely used by the community and is a good option for representing genome annotations. However, some columns need a more precise description. Most columns hold specific information (chromosome, position, …), whereas column 9 “attributes” varies widely in the type of information it contains. The content of column 9 needs guidelines: currently the ID and Name of the feature are the mandatory information and the other attributes are not specified, so adopters use them in different ways.

Specifics

Guidelines for describing content for Column 9 “attributes”:

ID: Indicates the ID of the feature. IDs for each feature must be unique within the scope of the GFF file. In the case of discontinuous features (i.e. a single feature that exists over multiple genomic locations) the same ID may appear on multiple lines. All lines that share an ID collectively represent a single feature.
Name: Display name for the feature. This is the name to be displayed to the user. Unlike IDs, there is no requirement that the Name be unique within the file.
Alias: A secondary name for the feature. It is suggested that this tag be used whenever a secondary identifier for the feature is needed, such as locus names and accession numbers. Unlike ID, there is no requirement that Alias be unique within the file.
Parent: Indicates the parent of the feature. A parent ID can be used to group exons into transcripts, transcripts into genes, and so forth. A feature may have multiple parents. Parent can *only* be used to indicate a part-of relationship.
Target: Indicates the target of a nucleotide-to-nucleotide or protein-to-nucleotide alignment. The format of the value is “target_id start end [strand]”, where strand is optional and may be “+” or “-”. If the target_id contains spaces, they must be escaped as the hex escape %20.
Gap: The alignment of the feature to the target if the two are not collinear (e.g. contain gaps). The alignment format is taken from the CIGAR format described in the Exonerate documentation. See the section “THE GAP ATTRIBUTE” of the GFF3 specification for a description of this format.
Derives_from: Used to disambiguate the relationship between one feature and another when the relationship is a temporal one rather than a purely structural “part of” one. This is needed for polycistronic genes. See the section “PATHOLOGICAL CASES” of the GFF3 specification for further discussion.
Note: A free text note.
Dbxref: A database cross reference. See the section “Ontology Associations and Db Cross References” of the GFF3 specification for details on the format.
Ontology_term: A cross reference to an ontology term. See the section “Ontology Associations and Db Cross References” of the GFF3 specification for details.
Is_circular: A flag to indicate whether a feature is circular. See the GFF3 specification for an extended discussion.
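
To make the attribute guidelines above concrete, here is a minimal Python sketch of how column 9 could be read into a dictionary. The function name parse_attributes is hypothetical and not part of any library. The sketch assumes the usual GFF3 conventions: tag=value pairs separated by semicolons, multiple values separated by commas, and percent-escaped reserved characters.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse a GFF3 column 9 string into a dict of tag -> list of values.

    Hypothetical helper for illustration only.
    """
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing or doubled semicolon
        tag, _, value = pair.partition("=")
        # Split on commas first, then decode percent-escapes such as %3B,
        # so that an escaped comma (%2C) inside a value is not split.
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attributes


attrs = parse_attributes(
    "ID=XwPt3327-3B.2;Name=XwPt3327-3B;marker=wPt3327;type=darts"
)
print(attrs["ID"], attrs["Name"])
```

Every value is returned as a list, even for single-valued tags such as ID, so that multi-valued tags (Parent, Alias, Dbxref, …) can be handled in the same way.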

2. Good practices

Use homogeneous abbreviation tags for databases.

Use column 9 consistently: put cross references in the Dbxref attribute.

Dbxref is the ID of the cross-referenced object in the form DBTAG:ID. The DBTAG indicates which database the referenced object can be found in, and ID indicates the identifier of the object within that database. IDs can contain unescaped colons but DBTAGs cannot, so parsing code should split on the first colon encountered in the attribute value.
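
The "split on the first colon" rule can be sketched in Python as follows. The function name split_dbxref is hypothetical, and the second example value is only an illustration of an ID that itself contains a colon.

```python
def split_dbxref(dbxref):
    """Split one Dbxref value into (DBTAG, ID) on the first colon only."""
    dbtag, separator, identifier = dbxref.partition(":")
    if not separator or not dbtag or not identifier:
        raise ValueError(f"not a DBTAG:ID cross reference: {dbxref!r}")
    return dbtag, identifier


print(split_dbxref("UniProt:M0Y5H6"))
# An ID may contain further colons; only the first one separates the DBTAG.
print(split_dbxref("EnsemblPlants:gene:MLOC_65880"))
```

Using str.partition rather than str.split(":") keeps any later colons inside the ID untouched.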

Here are some suggestions for a homogeneous GFF3-format:

original GFF3 for Hordeum_vulgare (EnsemblPlants):

1	ensembl	gene	3656	4845	.	-	.	ID=gene:MLOC_65880;assembly_name=082214v1;biotype=protein_coding;description=Uncharacterized protein [Source:UniProtKB/TrEMBL%3BAcc:M0Y5H6];logic_name=ibsc;version=1

proposed format is:

1	ensembl	gene	3656	4845	.	-	.	ID=gene:MLOC_65880;assembly_name=082214v1;biotype=protein_coding;description=Uncharacterized protein [Source:UniProtKB/TrEMBL%3BAcc:M0Y5H6];logic_name=ibsc;version=1;Dbxref=UniProt:M0Y5H6,EnsemblPlants:MLOC_65880
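
As an illustration of how the proposed format could be produced, the following Python sketch appends a Dbxref attribute to an existing GFF3 line. The function name add_dbxref is hypothetical. The sketch assumes tab-separated columns and that the caller already knows which cross references to add; it does not try to derive them from the description field.

```python
def add_dbxref(gff3_line, dbxrefs):
    """Return the GFF3 line with the given DBTAG:ID values in its Dbxref attribute.

    Hypothetical helper: existing Dbxref values are kept and duplicates skipped.
    """
    columns = gff3_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated columns")
    kept, merged = [], []
    for pair in columns[8].split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        if tag == "Dbxref":
            merged.extend(value.split(","))  # keep existing cross references
        else:
            kept.append(pair)
    for dbxref in dbxrefs:
        if dbxref not in merged:
            merged.append(dbxref)
    kept.append("Dbxref=" + ",".join(merged))
    columns[8] = ";".join(kept)
    return "\t".join(columns)


line = "1\tensembl\tgene\t3656\t4845\t.\t-\t.\tID=gene:MLOC_65880;version=1"
print(add_dbxref(line, ["UniProt:M0Y5H6", "EnsemblPlants:MLOC_65880"]))
```

Applying the function twice with the same cross references leaves the line unchanged, which makes it safe to rerun on a file that has already been processed.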

3. Metadata and Vocabularies

We recommend the use of ontologies, such as the Gene Ontology and the Sequence Ontology, for functional annotation in column 9.
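
One way to check that ontology terms are written consistently is sketched below in Python. The function name invalid_ontology_terms is hypothetical. The sketch assumes that only Gene Ontology (GO) and Sequence Ontology (SO) identifiers are expected, each written as the prefix followed by a colon and seven digits; other ontologies would need their own pattern.

```python
import re

# Assumption: only GO and SO terms are expected, in the form PREFIX:7 digits.
ONTOLOGY_ID = re.compile(r"^(GO|SO):\d{7}$")


def invalid_ontology_terms(column9):
    """Return the Ontology_term values in column 9 that do not match the pattern."""
    invalid = []
    for pair in column9.split(";"):
        tag, _, value = pair.partition("=")
        if tag == "Ontology_term":
            invalid.extend(t for t in value.split(",") if not ONTOLOGY_ID.match(t))
    return invalid


# Illustrative attribute string, not taken from a real annotation file.
print(invalid_ontology_terms("ID=g1;Ontology_term=GO:0046703,go:46703"))
```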

4. Tools

Convert data format
You can convert different formats to GFF3 using the Bioconvert tool.

GFF3 validator – Genome tools:

http://genometools.org/cgi-bin/gff3validator.cgi

5. Examples

GFF3 sample from the 3B annotation browser:

traes3bPseudomoleculeV1	GDEC	marker	82454936	82455352	.	-	.	ID=XwPt1159-3B;Name=XwPt1159-3B;marker=wPt1159;type=darts
traes3bPseudomoleculeV1	GDEC	marker	771172313	771172855	.	-	.	ID=XwPt2416-3B;Name=XwPt2416-3B;marker=wPt2416;type=darts
traes3bPseudomoleculeV1	GDEC	marker	12174851	12175713	.	+	.	ID=XwPt2757-3B;Name=XwPt2757-3B;marker=wPt2757;type=darts
traes3bPseudomoleculeV1	GDEC	marker	586057169	586057670	.	-	.	ID=XwPt3327-3B;Name=XwPt3327-3B;marker=wPt3327;type=darts
traes3bPseudomoleculeV1	GDEC	marker	295038909	295039410	.	-	.	ID=XwPt3327-3B.2;Name=XwPt3327-3B;marker=wPt3327;type=darts
v443_0484	GDEC	marker	134945	135646	.	+	.	ID=XwPt4933-3B;Name=XwPt4933-3B;marker=wPt4933;type=darts
traes3bPseudomoleculeV1	GDEC	marker	755916365	755916938	.	-	.	ID=XwPt5295-3B;Name=XwPt5295-3B;marker=wPt5295;type=darts
traes3bPseudomoleculeV1	GDEC	marker	236794223	236794836	.	+	.	ID=XwPt5390-3B;Name=XwPt5390-3B;marker=wPt5390;type=darts
traes3bPseudomoleculeV1	GDEC	marker	749409255	749409819	.	+	.	ID=XwPt5947-3B;Name=XwPt5947-3B;marker=wPt5947;type=darts
traes3bPseudomoleculeV1	GDEC	marker	736342105	736342613	.	-	.	ID=XwPt7301-3B;Name=XwPt7301-3B;marker=wPt7301;type=darts
traes3bPseudomoleculeV1	GDEC	marker	614658212	614659360	.	+	.	ID=XwPt7502-3B;Name=XwPt7502-3B;marker=wPt7502;type=darts
traes3bPseudomoleculeV1	GDEC	marker	765686199	765687128	.	+	.	ID=XwPt7514-3B;Name=XwPt7514-3B;marker=wPt7514;type=darts
traes3bPseudomoleculeV1	GDEC	marker	765009795	765010398	.	+	.	ID=XwPt8845-3B;Name=XwPt8845-3B;marker=wPt8845;type=darts
traes3bPseudomoleculeV1	GDEC	marker	9584806	9585578	.	+	.	ID=XwPt8855-3B;Name=XwPt8855-3B;marker=wPt8855;type=darts
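
In the sample above, the marker wPt3327 appears at two locations with distinct IDs (XwPt3327-3B and XwPt3327-3B.2) but the same Name. The following Python sketch shows one way to list which lines share an ID or a Name; the function name index_by_tag is hypothetical, and the sketch assumes tab-separated columns.

```python
from collections import defaultdict


def index_by_tag(gff3_lines, tag):
    """Map each value of `tag` in column 9 to the line numbers that carry it."""
    index = defaultdict(list)
    for line_number, line in enumerate(gff3_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a feature line; a stricter tool would report this
        for pair in columns[8].split(";"):
            key, _, value = pair.partition("=")
            if key == tag:
                index[value].append(line_number)
    return dict(index)


sample = [
    "traes3bPseudomoleculeV1\tGDEC\tmarker\t586057169\t586057670\t.\t-\t.\t"
    "ID=XwPt3327-3B;Name=XwPt3327-3B;marker=wPt3327;type=darts",
    "traes3bPseudomoleculeV1\tGDEC\tmarker\t295038909\t295039410\t.\t-\t.\t"
    "ID=XwPt3327-3B.2;Name=XwPt3327-3B;marker=wPt3327;type=darts",
]
print(index_by_tag(sample, "ID"))
print(index_by_tag(sample, "Name"))
```

An ID that maps to several lines is not necessarily an error, since discontinuous features legitimately repeat their ID, so the result is best read as a list of candidates to review rather than as a validation failure.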

Writing: WDI working group
Creation date: 02 October 2014
Update: 31 July 2015

Wheat Data Interoperability Guidelines