{"id":19,"date":"2014-11-21T14:47:32","date_gmt":"2014-11-21T13:47:32","guid":{"rendered":"http:\/\/ist.blogs.inra.fr\/wdi\/?page_id=19"},"modified":"2018-11-12T16:40:44","modified_gmt":"2018-11-12T15:40:44","slug":"genome-annotations","status":"publish","type":"page","link":"https:\/\/ist.blogs.inrae.fr\/wdi\/genome-annotations\/","title":{"rendered":"Genome annotations"},"content":{"rendered":"<p>Genome annotation is a process of attributing structural\u00a0and\u00a0functional information to sequences. These annotations range from sequence similarities, biological functions, location of regulatory motifs, expression and interactions.<\/p>\n<h2>Recommendations<\/h2>\n<p><strong>Target user(s):<\/strong> Data managers, Bioinformaticians<\/p>\n<h2><span style=\"color: #ff0000\">Summary<\/span><\/h2>\n<ol>\n<li style=\"color: #ff0000\"><span style=\"color: #ff0000\"><a style=\"color: #ff0000\" href=\"#format\">Recommended data format: GFF3 format<\/a>.<\/span><\/li>\n<li style=\"color: #ff0000\"><span style=\"color: #ff0000\"><a style=\"color: #ff0000\" href=\"#specs\">Provide comprehensive content description for column 9 in the GFF3 file<\/a>.<\/span><\/li>\n<li style=\"color: #ff0000\"><span style=\"color: #ff0000\"><a style=\"color: #ff0000\" href=\"#goodpractices\">Consistent use of external database cross references\u00a0(Dbxref)<\/a>.<\/span><\/li>\n<li style=\"color: #ff0000\"><span style=\"color: #ff0000\"><a style=\"color: #ff0000\" href=\"#metadata\">Consistent use of ontologies<\/a>.<\/span><\/li>\n<\/ol>\n<h3 id=\"format\"><span style=\"color: #3366ff\">1. 
Data format<\/span><\/h3>\n<p>We recommend the GFF3 file format for the representation of genome annotations.<\/p>\n<div class=\"zone_txt ezoe\">\n<h5><strong>Description<\/strong><\/h5>\n<p>The\u00a0<a style=\"color: #33aabd\" href=\"https:\/\/fairsharing.org\/FAIRsharing.dnk0f6\" target=\"_blank\" rel=\"noopener\">GFF3<\/a>\u00a0file format is widely used by the community and is a good option for representing genome annotations. However, the descriptions of some columns need attention. For instance, column 9 \u201cattributes\u201d varies in the type of information it contains, whereas the other columns are specific (position, chromosome\u2026). The information contained in column 9 needs guidelines: currently ID and Name of the feature are the mandatory information, and the other attributes are not specified, so adopters use them in\u00a0different ways.<\/p>\n<h5 id=\"specs\"><strong>Specifics<\/strong><\/h5>\n<p>Guidelines for describing\u00a0the content of column 9 \u201cattributes\u201d:<\/p>\n<dl>\n<dt><strong>ID<\/strong><\/dt>\n<dd>Indicates the ID of the feature. IDs for each feature must be unique within the scope of the GFF file. In the case of discontinuous features (i.e. a single feature that exists over multiple genomic locations) the same ID may appear on multiple lines. All lines that share an ID collectively represent a single feature.<\/dd>\n<dt><strong>Name<\/strong><\/dt>\n<dd>Display name for the feature. This is the name to be displayed to the user. Unlike IDs, there is no requirement that the Name be unique within the file.<\/dd>\n<dt><strong>Alias<\/strong><\/dt>\n<dd>A secondary name for the feature. It is suggested that this tag be used whenever a secondary identifier for the feature is needed, such as locus names and accession numbers. Unlike ID, there is no requirement that Alias be unique within the file.<\/dd>\n<dt><strong>Parent<\/strong><\/dt>\n<dd>Indicates the parent of the feature. 
A parent ID can be used to group exons into transcripts, transcripts into genes, and so forth. A feature may have multiple parents. Parent can <em>only<\/em> be used to indicate a <em>part of<\/em> relationship.<\/dd>\n<dt><strong>Target<\/strong><\/dt>\n<dd>Indicates the target of a nucleotide-to-nucleotide or protein-to-nucleotide alignment. The format of the value is &#8220;target_id start end [strand]&#8221;, where strand is optional and may be &#8220;+&#8221; or &#8220;-&#8221;. If the target_id contains spaces, they must be escaped as the hex escape %20.<\/dd>\n<dt><strong>Gap<\/strong><\/dt>\n<dd>The alignment of the feature to the target if the two are not collinear (e.g. contain gaps). The alignment format is taken from the CIGAR format described in the <a href=\"http:\/\/cvs.sanger.ac.uk\/cgi-bin\/viewvc.cgi\/exonerate\/?root=ensembl\">Exonerate documentation<\/a>. See the section &#8220;THE GAP ATTRIBUTE&#8221; of the GFF3 specification for a description of this format.<\/dd>\n<dt><strong>Derives_from<\/strong><\/dt>\n<dd>Used to disambiguate the relationship between one feature and another when the relationship is a temporal one rather than a purely structural &#8220;part of&#8221; one. This is needed for polycistronic genes. See the section &#8220;PATHOLOGICAL CASES&#8221; of the GFF3 specification for further discussion.<\/dd>\n<dt><strong>Note<\/strong><\/dt>\n<dd>A free text note.<\/dd>\n<dt><strong>Dbxref<\/strong><\/dt>\n<dd>A database cross reference. See the section &#8220;Ontology Associations and Db Cross References&#8221; of the GFF3 specification for details on the format.<\/dd>\n<dt><strong>Ontology_term<\/strong><\/dt>\n<dd>A cross reference to an ontology term. See the section &#8220;Ontology Associations and Db Cross References&#8221; of the GFF3 specification for details.<\/dd>\n<dt><strong>Is_circular<\/strong><\/dt>\n<dd>A flag to indicate whether a feature is circular. See the GFF3 specification for an extended discussion.<\/dd>\n<\/dl>\n<h3 id=\"goodpractices\"><span style=\"color: #3366ff\">2. 
Good practices<\/span><\/h3>\n<\/div>\n<ol>\n<li>Use homogeneous abbreviation tags for databases.<\/li>\n<\/ol>\n<ol start=\"2\">\n<li>Use the column 9 format consistently: put cross references in the Dbxref attribute.<\/li>\n<\/ol>\n<p>The value of Dbxref identifies the cross-referenced object in the form:<\/p>\n<p style=\"padding-left: 30px\"><em>DBTAG:ID<\/em>, where DBTAG indicates which database the referenced object can be found in, and ID indicates the identifier of the object within that database. IDs can contain unescaped colons but DBTAGs cannot, so parsing code should split on the first colon encountered in the attribute value.<\/p>\n<p>Here is a suggestion for a homogeneous GFF3 format:<\/p>\n<p style=\"padding-left: 30px\">Original GFF3 for Hordeum_vulgare (EnsemblPlants):<\/p>\n<p style=\"padding-left: 30px\">1\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ensembl\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 gene\u00a0\u00a0\u00a0\u00a0 3656\u00a0\u00a0\u00a0\u00a0\u00a0 4845\u00a0\u00a0\u00a0\u00a0 .\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 -\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 . ID=gene:MLOC_65880;assembly_name=082214v1;biotype=protein_coding;description=Uncharacterized protein [Source:UniProtKB\/TrEMBL%3BAcc:M0Y5H6];logic_name=ibsc;version=1<\/p>\n<p style=\"padding-left: 30px\">Proposed format:<\/p>\n<p style=\"padding-left: 30px\">1\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ensembl\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 gene\u00a0\u00a0\u00a0\u00a0 3656\u00a0\u00a0\u00a0\u00a0\u00a0 4845\u00a0\u00a0\u00a0\u00a0 .\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 -\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 . 
ID=gene:MLOC_65880;assembly_name=082214v1;biotype=protein_coding;description=Uncharacterized protein [Source:UniProtKB\/TrEMBL%3BAcc:M0Y5H6];logic_name=ibsc;version=1;<strong>Dbxref=UniProt:M0Y5H6,EnsemblPlants:MLOC_65880<\/strong><\/p>\n<h3 id=\"metadata\"><span style=\"color: #3366ff\">3.\u00a0Metadata and Vocabularies<\/span><\/h3>\n<div class=\"zone_txt ezoe\">\n<p><span style=\"color: #000000\">We recommend the use of ontologies, such as\u00a0<a style=\"color: #000000\" href=\"http:\/\/geneontology.org\/page\/ontology-documentation\" target=\"_blank\" rel=\"noopener\">Gene Ontology<\/a> and <a style=\"color: #000000\" href=\"http:\/\/www.sequenceontology.org\/browser\/obob.cgi\" target=\"_blank\" rel=\"noopener\">Sequence Ontology<\/a>, for functional annotation in column 9.<\/span><\/p>\n<\/div>\n<h3 id=\"tools\"><span style=\"color: #3366ff\">4. Tools<\/span><\/h3>\n<p><strong>Convert data format<\/strong><br \/>\nYou can convert different formats to GFF3 using the <a href=\"https:\/\/bioconvert.readthedocs.io\/en\/master\/\">Bioconvert tool<\/a>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"transparent\" src=\"https:\/\/bioconvert.readthedocs.io\/en\/master\/_images\/conversion.png\" alt=\"https:\/\/bioconvert.readthedocs.io\/en\/master\/_images\/conversion.png\" width=\"570\" height=\"389\" \/><\/p>\n<p><strong>Validate data format<\/strong><br \/>\nYou can validate GFF3 files with the GenomeTools GFF3 validator:<\/p>\n<p><a href=\"http:\/\/genometools.org\/cgi-bin\/gff3validator.cgi\" target=\"_blank\" rel=\"noopener\">http:\/\/genometools.org\/cgi-bin\/gff3validator.cgi<\/a><\/p>\n<h3 id=\"examples\"><span style=\"color: #3366ff\">5. 
Examples<\/span><\/h3>\n<div class=\"zone_pictos\">GFF3 sample from the <a href=\"https:\/\/urgi.versailles.inra.fr\/gb2\/gbrowse\/wheat_annot_3B\/\">3B annotation browser<\/a>:<\/div>\n<div class=\"zone_pictos\">\n<pre>traes3bPseudomoleculeV1\tGDEC\tmarker\t82454936\t82455352\t.\t-\t.\tID=XwPt1159-3B;Name=XwPt1159-3B;marker=wPt1159;type=darts\r\ntraes3bPseudomoleculeV1\tGDEC\tmarker\t771172313\t771172855\t.\t-\t.\tID=XwPt2416-3B;Name=XwPt2416-3B;marker=wPt2416;type=darts\r\ntraes3bPseudomoleculeV1\tGDEC\tmarker\t12174851\t12175713\t.\t+\t.\tID=XwPt2757-3B;Name=XwPt2757-3B;marker=wPt2757;type=darts\r\ntraes3bPseudomoleculeV1\tGDEC\tmarker\t586057169\t586057670\t.\t-\t.\tID=XwPt3327-3B;Name=XwPt3327-3B;marker=wPt3327;type=darts\r\ntraes3bPseudomoleculeV1\tGDEC\tmarker\t295038909\t295039410\t.\t-\t.\tID=XwPt3327-3B.2;Name=XwPt3327-3B;marker=wPt3327;type=darts\r\nv443_0484\tGDEC\tmarker\t134945\t135646\t.\t+\t.\tID=XwPt4933-3B;Name=XwPt4933-3B;marker=wPt4933;type=darts\r\ntraes3bPseudomoleculeV1\tGDEC\tmarker\t755916365\t755916938\t.\t-\t.\tID=XwPt5295-3B;Name=XwPt5295-3B;marker=wPt5295;type=darts\r\ntraes3bPseudomoleculeV1\tGDEC\tmarker\t236794223\t236794836\t.\t+\t.\tID=XwPt5390-3B;Name=XwPt5390-3B;marker=wPt5390;type=darts\r\ntraes3bPseudomoleculeV1\tGDEC\tmarker\t749409255\t749409819\t.\t+\t.\tID=XwPt5947-3B;Name=XwPt5947-3B;marker=wPt5947;type=darts\r\ntraes3bPseudomoleculeV1\tGDEC\tmarker\t736342105\t736342613\t.\t-\t.\tID=XwPt7301-3B;Name=XwPt7301-3B;marker=wPt7301;type=darts\r\ntraes3bPseudomoleculeV1\tGDEC\tmarker\t614658212\t614659360\t.\t+\t.\tID=XwPt7502-3B;Name=XwPt7502-3B;marker=wPt7502;type=darts\r\ntraes3bPseudomoleculeV1\tGDEC\tmarker\t765686199\t765687128\t.\t+\t.\tID=XwPt7514-3B;Name=XwPt7514-3B;marker=wPt7514;type=darts\r\ntraes3bPseudomoleculeV1\tGDEC\tmarker\t765009795\t765010398\t.\t+\t.\tID=XwPt8845-3B;Name=XwPt8845-3B;marker=wPt8845;type=darts\r\ntraes3bPseudomoleculeV1\tGDEC\tmarker\t9584806\t9585578\t.\t+\t.\tID=XwPt8855-3B;Name=XwPt8855-3B;m
arker=wPt8855;type=darts<\/pre>\n<\/div>\n<div class=\"zone_pictos\"><\/div>\n<pre class=\"zone_pictos\"><span style=\"color: #3366ff\">Writing:<\/span> WDI working group\r\n<span style=\"color: #3366ff\">Creation date:<\/span> 02 October 2014\r\n<span style=\"color: #3366ff\">Update:<\/span> 31 July 2015<\/pre>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Genome annotation is a process of attributing structural\u00a0and\u00a0functional information to sequences. These annotations range from sequence similarities, biological functions, location of regulatory motifs, expression and interactions. Recommendations Target user(s): Data managers, Bioinformaticians Summary Recommended data format: GFF3 format. Provide comprehensive content description for column 9 in the GFF3 file. Consistent use of external database cross [&hellip;]<\/p>\n","protected":false},"author":11,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"open","ping_status":"open","template":"","meta":{"footnotes":""},"class_list":["post-19","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/ist.blogs.inrae.fr\/wdi\/wp-json\/wp\/v2\/pages\/19","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ist.blogs.inrae.fr\/wdi\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/ist.blogs.inrae.fr\/wdi\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/ist.blogs.inrae.fr\/wdi\/wp-json\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"https:\/\/ist.blogs.inrae.fr\/wdi\/wp-json\/wp\/v2\/comments?post=19"}],"version-history":[{"count":0,"href":"https:\/\/ist.blogs.inrae.fr\/wdi\/wp-json\/wp\/v2\/pages\/19\/revisions"}],"wp:attachment":[{"href":"https:\/\/ist.blogs.inrae.fr\/wdi\/wp-json\/wp\/v2\/media?parent=19"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}