Webinar: Eukaryotic Genome Data Curation at NCBI

Welcome everyone to today’s webinar, on Eukaryotic Genome Data Curation at NCBI. A recording of this webinar will be available on the NCBI YouTube channel. If we do not answer your question during the webinar, we will compile the questions and the answers and post them on the webinar page at the URL that is listed there. When you see the bracket — NCBI bracket, you need to replace that with the complete NCBI webpage URL, and you will be seeing this abbreviation in slides later in the webinar. All of the materials and the recording will be posted approximately one to two weeks after today at the 1.usa.gov URL that is listed there. That is an FTP directory. The files are there now, the FT — excuse me, the PowerPoint files, and also a PDF file. My name is Bonnie Maidak. I’m the webinar moderator. We also have with us Kim Pruitt, who is the RefSeq project team leader. Peter Cooper is with us as well, and he will be helping with some of the administrative details of the webinar. We have three RefSeq scientist presenters today. Diana Haddad will be giving an overview of the RefSeq project and also mentioning some of the related NCBI resources. She will also be going through small eukaryotic genome curation. Nuala O’Leary will be going through higher eukaryotic genome curation. And then our third presenter is Brian Smith-White, who will be going through some of the curatorial evaluation of the sequence data that are used to create RefSeq records. When you do have a question for RefSeq, please go to the webpage that is listed there, complete the form, and that will get a message directly to the RefSeq curators. Why did we want to do this webinar? Well, RefSeq curators answer these questions on a daily basis. The webinar is designed to answer at least some of these questions. We don’t have enough time to go through each one of these questions. 
But this is an important consideration for you, the attendees of this webinar, and those of you who work with genomic data to understand the RefSeq process. This slide is showing you an overview of both the webinar and the RefSeq project. Each presenter will point out the part that they are going to be discussing, and you’ll be seeing in the webinar slides, or hearing, INSDC. That’s an abbreviation for nucleotide sequence databases, where people can deposit their sequences, submit their data, and that represents the GenBank database at NCBI, the ENA database in England, and then the DDBJ database in Japan. Now we’re going to have our first presenter, Diana Haddad, who is going to give us an overview of the RefSeq project. Thank you, Bonnie, for your introduction. This curation endeavor is part of the NCBI RefSeq project, so the very first part of my talk will introduce the RefSeq project, as well as some of the most relevant resources to RefSeq. The RefSeq database is a non-redundant set of curated and computationally derived genomic, transcriptomic, and protein sequences. It provides standards that are enriched with current knowledge of sequence and function, including publications, functional features, informative nomenclature, and feedback from outside users. Our primary aim is to provide consistency in genome sequence and annotation through a standard method of curation. You’ll find extensive information about the RefSeq project in the publication indicated in the bottom left corner. All forms of life are represented in RefSeq, including genomes from organisms that affect human health; high-profile organisms that are research models, such as Drosophila melanogaster, the fruit fly; or the yeast; or organisms that are targets from large-scale genome sequencing. RefSeq public releases occur every two months, and these are the stats for the latest RefSeq release. We’ll have another one coming very soon. 
In the first part of my presentation I’ll cover the initial steps involved in processing a eukaryotic genome into RefSeq, beginning with the submission of an INSDC genome to NCBI and its conversion to a RefSeq genome, and I’ll discuss the differences between the two alternative pipelines we use for eukaryotic genome processing. We denote these pipelines as large and small, as I’ll discuss in a moment. Now let’s look at what happens to the submitted genome in order to become a RefSeq genome. RefSeq relies on the submission of a genome assembly to GenBank, and in some cases also on the submission of raw sequence data to the Sequence Read Archive, or SRA. At GenBank, the assembly undergoes a thorough review process before it is assigned an accession number and is released to the public database as a GenBank genome. Here you see the GenBank assembly accession with its unique identifying GCA prefix. The GenBank assembly containing the nucleotides may then be copied into a RefSeq assembly with accessions that are different from the GenBank ones. The RefSeq assembly prefix is GCF. The difference between the two assemblies is that the GenBank assembly is owned by the submitter and, therefore, cannot be curated by NCBI. However, the RefSeq counterpart is owned by NCBI, and this data can be curated to improve, update, and/or correct the annotation, but we do not edit the genome sequence. This is, in short, the purpose of creating a RefSeq record or genome to curate. So before we can curate the data, the RefSeq assembly needs to be processed by either of two alternative processing flows. It can be annotated with the large pipeline, whereby the annotation is generated based on available transcript data and protein homology. 
The annotated genomes are then publicly linked to the Gene database, as you’ll hear later, or with links to the BLAST database to perform queries against the annotated genome, or to FTP sites to download the annotation data, or to access a detailed annotation report summary. In the case of genomes that cannot currently be processed by the annotation pipeline, the GenBank annotation is propagated for the creation of RefSeq records from where curation to standardize and improve the genome annotation is done, as I’ll discuss in a few moments. To expand on these two alternative RefSeq pipelines, it is the animal and the majority of plant and insect genomes that are annotated by the large genome annotation pipeline, while the alternative pipeline, whereby GenBank annotation is propagated into RefSeq, is mostly used to process smaller size genomes, such as algae, fungi, protozoa and nematodes. We call this pipeline the small eukaryotic pipeline. The main reason why the small genomes cannot be processed through the annotation pipeline is because they frequently lack sufficient transcript data to use for their annotation. The large annotation pipeline generates new annotation, new models, as we’ll see in a moment. So the annotation in these RefSeq records may or may not exist in GenBank. For the small eukaryotic pipelines, besides copying the GenBank record, mRNA features are added to the RefSeq record, as well as a unique gene identifier for each gene, protein, and transcript feature. Other format changes are also made to the record to make it consistent with the RefSeq standard form. I’ll give examples of such changes later in the presentation. All data thus generated by both pipelines exists as a collection of RefSeq gene, genome, and assembly data stored in databases. You can find more information about these pipelines in the corresponding publications. In this next slide we have outlined the basic steps of each pipeline in more detail, beginning with its submission. 
In the large annotation pipeline we start by aligning the known RefSeq transcripts and proteins that are subject to curation to the genome. You will see these known RefSeqs denoted with NM, NR, and NP accessions. GenBank mRNAs and proteins, as well as RNA-Seq reads, are also aligned to the genome. Then gene model prediction based on the transcript and protein alignments is performed by Gnomon. This is the NCBI gene prediction software that provides de novo annotation. So these Gnomon-predicted models are solely computationally derived, and those that are selected for a final annotation set are assigned model RefSeq accessions with XM, XR, and XP prefixes to distinguish them from known RefSeqs with the NM, NR, and NP prefixes. This clarification will be useful in the latter part of the presentation. From here on, RefSeq scientists devote significant effort to validating these models, as you will be learning from the next two speakers. The small eukaryotic pipeline is not an annotation pipeline; however, after duplicating the GenBank record there is a process of conversion and record cleanup, whereby the pipeline reports sequence and annotation errors. The curator reviews them and recommends fixes to incorporate into the pipeline so that those errors are fixed. There’s also a system of gene and accession tracking so that the data can be easily tracked between the different submissions. Here the RefSeq features also receive an X accession, but I’d like to clarify, to avoid confusion, that in the context of the small pipeline these are not predicted models, as they are in the context of the large annotation pipeline. Here in the small pipeline they are only copies of the annotation provided by INSDC. Now, what types of assemblies are incorporated into RefSeq? Before answering that, the simplest definition of an assembly is a collection of genome sequences that represents the genome of an organism. 
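The known versus model distinction described above can be sketched in a few lines of Python. This is only an illustration of the prefix convention mentioned in the talk, not NCBI code: the function name `classify_refseq`, the label wording, and the sample accession strings are all hypothetical. The "model" label also follows the large-pipeline meaning; as noted above, X accessions from the small pipeline are copies of INSDC annotation rather than predictions.

```python
import re

# Prefix meanings as described in the talk; the label wording is ours.
REFSEQ_PREFIXES = {
    "NM": ("known", "mRNA"),
    "NR": ("known", "non-coding RNA"),
    "NP": ("known", "protein"),
    "XM": ("model", "mRNA"),
    "XR": ("model", "non-coding RNA"),
    "XP": ("model", "protein"),
}

# Two letters, an underscore, digits, and an optional ".version" suffix.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return a small dict describing a RefSeq-style accession, or None."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None  # e.g. an INSDC accession without an underscore
    prefix, _number, version = match.groups()
    if prefix not in REFSEQ_PREFIXES:
        return None  # a prefix this sketch does not cover
    status, molecule = REFSEQ_PREFIXES[prefix]
    return {
        "prefix": prefix,
        "status": status,
        "molecule": molecule,
        "version": int(version) if version else None,
    }


if __name__ == "__main__":
    # Made-up accession numbers, used only to exercise the function.
    for acc in ["NM_000000001.2", "XP_000000002.1", "AB000000"]:
        print(acc, classify_refseq(acc))
```

The lookup table keeps the convention in one place, so other prefixes mentioned later in the webinar could be added without touching the parsing logic.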
We prioritize those organisms for which high-quality genome assembly is available. These are the minimum requirements curators look for in selecting an assembly; thus, we may exclude some categories of data that don’t meet those quality standards. For example, highly fragmented genomes or metagenomes, which are sequencing projects of whole ecological communities. Let’s look at a few resources relevant for the curator, as well as for submitters and NCBI users to know about. This is an example of the NCBI Assembly data page, which represents both the GenBank and RefSeq assembly data. This is the top part of the page. The database provides an assembly accession and version to identify unambiguously the assembly, and to track changes between update assemblies respectively. So here you see both GenBank and RefSeq assemblies with the respective version number as the suffix. The page also offers numerous links to different reports and files, such as FTP sites from where one can download sequences and annotation, links to the BLAST database page to perform queries against the annotated genomes, and links to corresponding INSDC or RefSeq nucleotide records, or to many other resources. There’s also a statistical data section pertaining to the assembly composition and quality, and there’s a history section to show the current GenBank and RefSeq assembly versions, as well as previous versions of either assembly. The bottom half of the page presents an assembly definition tab that provides the names and accessions for each chromosome or replicon in the assembly, with links to the data, as well as an assembly statistics tab with detailed statistics for the assembly components. This is part of the data that curators look at in analyzing and determining whether an assembly is of sufficient quality. I’d like to draw your attention to the genomes resource link at the top of the page, which leads to the Genomes database page. 
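As a rough illustration of how the assembly accession and version work together, the sketch below parses an accession.version string and checks whether one assembly is an update of another. It assumes only what is stated above (a GCA prefix for GenBank, a GCF prefix for RefSeq, and the version as a numeric suffix). The names `parse_assembly_accession` and `is_update_of` are hypothetical, and the accession numbers are made up.

```python
import re
from typing import NamedTuple, Optional


class AssemblyAccession(NamedTuple):
    source: str   # "GenBank" for GCA, "RefSeq" for GCF
    number: str   # the digits between the underscore and the dot
    version: int  # the numeric suffix after the dot


_ASSEMBLY_RE = re.compile(r"^(GCA|GCF)_(\d+)\.(\d+)$")
_SOURCES = {"GCA": "GenBank", "GCF": "RefSeq"}


def parse_assembly_accession(text: str) -> Optional[AssemblyAccession]:
    """Split an assembly accession.version string into its parts."""
    match = _ASSEMBLY_RE.match(text.strip())
    if match is None:
        return None
    prefix, number, version = match.groups()
    return AssemblyAccession(_SOURCES[prefix], number, int(version))


def is_update_of(newer: str, older: str) -> bool:
    """True when both strings name the same assembly and `newer` has a higher version."""
    a = parse_assembly_accession(newer)
    b = parse_assembly_accession(older)
    if a is None or b is None:
        return False
    return a.source == b.source and a.number == b.number and a.version > b.version


if __name__ == "__main__":
    # Illustrative values only, not real assemblies.
    print(parse_assembly_accession("GCF_000000000.2"))
    print(is_update_of("GCF_000000000.2", "GCF_000000000.1"))
```

The comparison deliberately refuses to relate a GCA accession to a GCF accession, since the talk only says that the two carry different accessions.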
This resource compiles extensive sequence and annotation data for genomes that have both GenBank and RefSeq assembly data. The database is organized into organism-specific overviews that function as portals from where all deriving bioprojects pertaining to that organism can be browsed and retrieved. There are different ways of accessing such data. For example, in this box at the top, you will find a link to all Plasmodium falciparum genomes in the link on this list, and in this box you will also find extensive sequence and annotation data. Much of it is curator reviewed data that has been propagated into it. There’s also a short biological description section, and the bottom portion presents statistical information on the composition of each chromosome or scaffold, as the case may be. And up here, a link to the genome assembly and annotation report page, which is another way of retrieving the list of genomes for its particular species with accompanying assembly data and statistics. So the main purpose of the Assembly database is to provide a means to identify the assembly and to track changes in the sequences that comprise it while the Genome resource page’s main purpose is to serve as a portal for retrieval of bioprojects and data relating to a species. Thus, by reviewing these RefSeq resources, we hope to encourage users to navigate through them and to take advantage of the wealth of data offered, and also to give submitters a better idea of the submission and RefSeq process at NCBI. The last few slides of my presentation will focus specifically on the small eukaryotic pipeline. Let’s look in more detail at the small euks pipeline with its validation checks and the curator’s role in procuring the RefSeq data. And besides what I have already said about this pipeline, I’ll add that these small genomes must be reasonably well annotated in order to be considered for representation in RefSeq. 
As part of the validation and curation process in this pipeline, there are different checkpoints to ensure that there aren’t any annotation or sequence problems, format issues, missing features, or even misspellings. This is an example of what a validator report generated by the pipeline would look like. There are three categories of issues that can be reported. The first is errors, which must be fixed before proceeding; pipeline fixes are made based on curator recommendations. The second category consists of warnings, and these also need to be evaluated by the curator to determine whether they represent annotation artifacts that must be corrected or if they are acceptable due to the biology of the genome. And then there’s a third category in the report, which is mostly informational, and nothing needs to be done about it. This is an example of three different errors in the submitted annotation, all in the same record. In the first two examples there are different features missing from the INSDC record. One is the chromosome from the source, which has been added to the RefSeq record. The other missing feature is the protein name. RefSeq requires one, and if none is provided in the INSDC record then we apply a default name for the protein, uncharacterized protein. Thirdly, the gene coordinates in the INSDC record start at base three, and the pipeline fix is to start at position one. And although it is not an error, I would like to point out what has been mentioned several times before, the addition of the mRNA feature to the RefSeq record with its transcript ID that leads to the mRNA record. Also, a GeneID is added to all three features: gene, mRNA, and protein, or CDS for coding sequence. The GeneID is a unique identifier assigned to each gene in the RefSeq genome that allows identification and tracking of each set of features between submissions. I’d like to show you another example of an error that needed to be fixed. 
In this case we have two gene variants with different locus tags. For RefSeq representation the pipeline code was adjusted to be able to join multiple gene variants with different locus tags into one gene feature. The first locus tag in sort order was picked for representation, and the other one is represented as old_locus_tag in the RefSeq record. These examples should give you an idea of the curation and validation process involved in the small eukaryotic pipeline. We are currently working on improving the small euks pipeline; for example, with a protein name cleanup and curation mechanism and a gene and accession tracking system to track the data between different assembly submissions, and we will be incorporating a tool that can detect and remove foreign contamination from the genome, such as that of bacterial origin. Future developments include a more automated pipeline, whereby the curator intervention is limited to revising egregious errors, and the ability to run several assemblies in batches. This will be particularly useful in cases where a problem is detected in many RefSeq assemblies. The overall goal of these improvements is to provide consistency and quality across datasets, as well as higher throughput. Here are a few examples of unacceptable or poorly formatted protein names that require correction. The first example has a PubMed ID as part of the title, a description of the evidence used for its inference, and a poorly formatted title displaying equal symbols; the second one has an overly long list of names or functions; and in this third case we have an organism name as part of the protein name, which is considered unacceptable by standard protein naming rules. It’s also followed by the term “genomic content,” which is in conflict with its protein nature. In such cases of uninformative, overly long, or poorly formatted names we assign uncharacterized protein to the protein name. 
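The name checks just described could be approximated along the following lines. This is a hedged sketch, not the RefSeq pipeline's actual rules. The function name `clean_protein_name`, the 100-character length cutoff, and the way organism names are passed in are all assumptions made for illustration. Only the default name, uncharacterized protein, and the kinds of problems flagged come from the talk.

```python
import re

DEFAULT_NAME = "uncharacterized protein"  # default name mentioned in the talk
MAX_NAME_LENGTH = 100                     # assumed cutoff for "overly long"

# Assumed pattern for a PubMed reference embedded in a name.
_PUBMED_RE = re.compile(r"\b(?:PMID|PubMed)\b", re.IGNORECASE)


def clean_protein_name(name, organism_names=()):
    """Return the name with whitespace normalized, or the default when a check fails."""
    if name is None or not name.strip():
        return DEFAULT_NAME              # missing name
    candidate = " ".join(name.split())   # collapse runs of whitespace
    lowered = candidate.lower()
    if _PUBMED_RE.search(candidate):
        return DEFAULT_NAME              # publication ID inside the name
    if "=" in candidate:
        return DEFAULT_NAME              # stray equal symbols
    if len(candidate) > MAX_NAME_LENGTH:
        return DEFAULT_NAME              # overly long list of names or functions
    if any(org.lower() in lowered for org in organism_names):
        return DEFAULT_NAME              # organism name inside the protein name
    return candidate


if __name__ == "__main__":
    print(clean_protein_name("kinase   domain protein"))
    print(clean_protein_name("similar to transporter; PMID 12345"))
    print(clean_protein_name("Plasmodium falciparum antigen",
                             organism_names=["Plasmodium falciparum"]))
```

A first pass like this only decides whether a name is acceptable; assigning a more informative name afterwards would still be a curator's job.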
Further examples of unacceptable protein names that should be cleaned up, along with the curator-proposed renaming, include the category of uninformative names or proteins with no known or predicted function. These should be assigned uncharacterized protein for a first pass, with the intention that at a later stage more informative protein names would be assigned. Another category, names that denote a certain homology or evolutionary relationship to other organisms, so there’s an indication that the protein may be real, would be reassigned as putative protein, again as a first pass, with the intention that at a later stage more informative names would be assigned. The purpose, then, of the protein name curation is to standardize the protein names, given the variable name input we receive, and also to be able to assign more useful protein names. To conclude this first part of the presentation, we currently have 236 small eukaryotic organisms represented in RefSeq, which have been propagated from INSDC annotation. So, although the pipeline has only recently reached the public production phase, it is already contributing to an increased number of small eukaryotic genomes represented in RefSeq. Let me now introduce you to Nuala O’Leary, who will be talking about the higher eukaryotic pipeline.
Thank you, Diana. In this section of the webinar I will discuss how NCBI curators review genome annotation for higher eukaryotes. This curation combines sequence analysis with literature review and collaboration to generate the most accurate reference sequences for a particular gene. I will also discuss how this data can be accessed in the NCBI databases. 
The curation efforts I will outline pertain to organisms that are annotated through the eukaryotic genome annotation pipeline, which to remind you, differs from the previously discussed small eukaryotic genome annotation pipeline, and that it involves de novo annotation based on the alignment of transcripts, proteins, including known RefSeqs, and RNA-Seq data to the assembled genome. This pipeline also produces computationally derived models. Organisms in scope for this annotation are presented in this chart. To date, we have annotated over 270 organisms using this pipeline that range from mammals to fish, invertebrates and, more recently, plants. Although all of these organisms are in scope for manual review, a smaller subset is the primary focus of manual curation due to their importance to biomedical research and agriculture. So how does manual curation fit into the annotation process? Curators generally review genes that have been targeted for manual review because they have data conflicts identified by quality assurance tests. We do in-depth sequence analysis, literature review, and consult with collaborators to determine the correct sequence, correct gene type, to ensure the correct genetic location, accurate nomenclature, and to create alternate splice variants. Curators follow a set of established guidelines in examining each gene to ensure accuracy and consistency among curators. Manual reviewers also apply important biological information to the reference sequence. The resulting known RefSeqs are fed back into the pipeline to improve the next annotation run. Curators also communicate with programmers to identify computational errors. I’d next like to give you a general overview of the sequence review process. RefSeqs are assembled from mRNA and EST data submitted to the GenBank archive. Curators use an in-house alignment program to identify high-quality full-length transcripts for each gene. 
We confirm that the transcripts have an accurate and complete open reading frame. If there are partial transcripts that suggest the five prime and three prime ends can be extended, these are used for UTR extensions. Our goal is to represent all potential variation for each gene; therefore, we look for full-length transcripts that support splice variation. We correct sequence errors in the transcript and protein sequence in assembling the final RefSeq. For genes undergoing our highest level of review we write gene descriptions and add biological information to the record, particularly in cases where the information is not easily predicted by computation. The RefSeqs undergo validation checks and the data is propagated to the RefSeq transcript and protein records. The RefSeq nucleotide record clearly displays the gene name and the unique RefSeq accession prefix. Curators ensure the association of relevant publications. Gene descriptions that are written by curators are found under the summary section. For genes that have multiple splice variants, we have a section that describes the splicing differences. We also report the GenBank mRNA or EST that supports the full exon combination of the RefSeq. Curators also provide coordinates for the poly(A) site and signal when available. The RefSeq protein record provides the protein name, along with the unique RefSeq NP accession. Protein records contain additional feature annotation that is provided either computationally or based on literature review. Features such as signal peptides are calculated using SignalP 4.0. We propagate protein features from SWISS-PROT when the SWISS-PROT record has a high-quality alignment to the RefSeq protein. When features such as these are imported from external sources we indicate this on the record. In cases where there is a conflict in the feature propagation, or the feature annotation is complex, curators apply protein feature annotation manually. 
RefSeq data can also be easily viewed in NCBI’s Gene Resource. This database provides a comprehensive and centralized view of gene-specific information. The information is organized in sections that provide specific functional information and related resource and has interactive tools such as a graphical overview of the annotated gene. In the next few slides I’ll go over how RefSeq data can be viewed in some of these sections. The first section is a summary section which provides the official name and symbol, along with a link to a nomenclature committee, when available. This section also indicates if the RefSeq record has been reviewed by manual curators and shows the gene description written by curators. The genomic region section shows the graphical display of the annotated genes. This section is an interactive tool that displays the intron-exon structure of all RefSeqs that were available at the time of the last genome annotation. The display includes both known and model RefSeqs. The genome sequence section indicates the current assembly that was annotated. The default tracks include RNA-Seq intron-exon support. However, I’d like to point out that there are numerous other data tracks that users can adjust to customize their view. For those interested in knowing more about the available tracks there will be a specific webinar on track management this Thursday, January 7th. You can follow the link indicated here to register for this webinar. The last section I’ll discuss is the reference sequence section. This section of the gene record displays all RefSeq transcripts and proteins for that gene including model RefSeq. The number of RefSeqs may differ from the graphical display because it will have RefSeq transcripts and proteins that have been added since the last annotation. These newer RefSeqs may not be visible in the graphical display. So I’d like to return to our curation process flow. I previously mentioned the basic curation for protein coding genes. 
However, RefSeq represents multiple gene types, including many non-coding genes, which can be identified by the NR accession prefix. A particular focus has been to expand our representation of long non-coding RNAs, which are loosely defined as transcripts greater than 200 nucleotides in length that lack coding potential. This effort involved a thorough review of the literature to first represent those lncRNAs with known function, many of which have been implicated in disease. For uncharacterized lncRNAs curators follow a set of guidelines similar to those for protein coding genes, which require full-length high-quality transcripts to represent a locus. We also curate pseudogenes, which are genes that have a similarity to a functional gene but have lost their protein coding potential. We classify two types of pseudogenes. The first is those that are transcribed but have lost a functional open reading frame. These are represented by NRs. The second is genomic regions that are similar to a functional gene but are not transcribed, and these are represented by the NG accession. As I mentioned, our sequence review process has historically been transcript based; however, we recognize that many genes have limited transcript support, and there are additional data resources that can be used to better represent these genes. In these cases curators are using RNA-Seq data to infer the full-length structure of the transcribed product. Curators confirm through protein BLAST analysis that the inferred RefSeqs are full length and supported by orthology. When the data is available we also use promoter-associated histone marks as evidence of a complete five prime end. We strive to be transparent to the user when a curated RefSeq is partially inferred from RNA-Seq data. For these RefSeqs the records are flagged with an inferred exon combination attribute, and users can also see the RNA-Seq alignments that support the inferred exon in the previously mentioned gene graphics display. 
In the next few slides I will discuss how curators apply additional biological information to the RefSeq record. One of the unique contributions of RefSeq is that curators are able to manually integrate functional features within the reference sequence. This is particularly useful in cases where biological features cannot be computationally propagated to a record and require in-depth sequence analysis and literature review. These RefSeqs are flagged with an attribute that highlights the particular biological feature. A list of these attributes is shown here, and in the next few slides I will discuss in more detail two of these in-depth annotation projects: our annotation of known regulatory upstream open reading frames, or uORFs, and genes that undergo ribosomal slippage. Regulatory uORFs are short open reading frames located in the five prime UTR, whose translation may negatively affect the translation of the downstream primary ORF. Although these uORFs could be predicted computationally, the effect any one uORF has on the translation of a primary ORF in a particular transcript can only be determined experimentally. Therefore, we reviewed the literature for genes with known, experimentally determined regulatory uORFs and updated the RefSeq to indicate the location of these uORFs on the RefSeq, indicate the relevant publication, and provide an attribute. The second effort was to accurately represent the gene product of the vertebrate antizyme genes. These genes do not follow standard decoding rules in that they require a plus-one frame shift to encode the full-length antizyme protein. As a result, the frame-shifted ORF cannot be easily calculated computationally and, therefore, required manual evaluation to be accurately annotated. This effort involved review of all the vertebrate antizyme genes. 
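To make the plus-one frame shift concrete, here is a minimal sketch of why a standard translation stops early while a translation that skips one nucleotide continues. It uses the standard genetic code and a made-up toy sequence, not a real antizyme transcript. The function names and the `slip_index` parameter are hypothetical, and the sketch says nothing about how RefSeq records store the shift position.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate from the first base until a stop codon or the end of the sequence."""
    seq = seq.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def translate_with_plus_one_shift(seq, slip_index):
    """Translate while skipping the single nucleotide at slip_index (0-based).

    slip_index must sit on a codon boundary of the original frame, so the
    skipped base is the first base of the codon where the shift happens.
    """
    if slip_index % 3 != 0:
        raise ValueError("slip_index must be a multiple of 3")
    shifted = seq[:slip_index] + seq[slip_index + 1:]
    return translate(shifted)


if __name__ == "__main__":
    toy = "ATGGCTTGATGGTTAA"  # made-up sequence: ATG GCT TGA ...
    print(translate(toy))                         # stops at the in-frame TGA
    print(translate_with_plus_one_shift(toy, 6))  # skips the T of TGA and reads on
```

In the toy sequence the in-frame stop codon ends ordinary translation after two residues, which is the same reason a purely computational ORF call would report a truncated product for such a gene.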
The example shown here is the zebrafish ornithine decarboxylase antizyme, where the RefSeq is flagged with a ribosomal slippage attribute, and the position of the frame shift is clearly indicated on the RefSeq record. This annotation can also be seen in the Nucleotide graphics display, where the position of the frame shift is indicated on the RefSeq transcript, whereas this annotation is not visible on the GenBank record. The last example I would like to discuss is how RefSeq curators work with both users and collaborators to provide the most accurate annotation. This is an example of a chicken ribonucleoprotein that a user pointed out was incorrectly labeled as an A3 family member when it should be an A1. We reviewed the data and communicated with the Chicken Gene Nomenclature Committee to coordinate renaming of the gene. These efforts led to correction of the gene name, which is now visible on the RefSeq with a link to the correct CGNC record. Thank you. And I will now present my colleague, Brian Smith-White. He will discuss in more detail how curators resolve more complex sequence problems.
I’m going to talk about the part of the slide that’s been highlighted in green, which is the manual curation of problems that we encounter. The first part I’d like to highlight is the in-house curation tool, which we will be seeing examples from many times in my talk. The very top part is the coordinate from the genome sequence and is correlated with the light green bar in the display. On the left-hand side are listed the INSDC and RefSeq accession numbers, which correspond to the aligned transcripts. The RefSeqs are shown as either known RefSeqs, with an N as the first letter, or model RefSeqs, with an X as the first letter, and all RefSeq accessions have an underscore. The open rectangles represent places where the transcript sequences align to the genome sequence. 
The top green bar with arrows in it is the annotated gene in the genome sequence. The next track is the coverage heat map. The next two tracks represent the RNA-Seq data. The top one is the exon coverage and the bottom is the intron-spanning coverage. Not shown in the large picture, but available, is the strand orientation. It will become important in a later example. The yellow region is the coding region, the CDS, or ORF, open reading frame, for that particular transcript. Secondly, the curation tool allows the display of errors in the aligned sequence relative to the genome sequence. The red tick is a SNP. The line between two aligned regions means that no sequence aligns to that region of the genome. The black triangle is an insertion in the transcript compared to the genome, and the open triangle is a deletion compared to the genome sequence. For the places where we have introns, the display marks the consensus splice donor and acceptor positions. We use this display to help with the correction of errors. As you can see, in the RefSeq there is an unaligned region for which there is no intron-spanning RNA-Seq support. So, clearly, in addition to lacking the proper donor and acceptor sites, that gap has no support, and we take another INSDC accession, which matches the genome appropriately, and create a new known RefSeq. We can merge two genes when the transcripts align to the same region of the genome. Here the two red ones are previous RefSeqs that have been suppressed because they have sequence errors, gaps without any support for introns, and in the process of merging, we use the one from this particular accession to become the accession for that gene. We can build the replacement transcript piece-wise from INSDC accessions. We can see why we did it. The red accession here has two problems in this part, and a SNP here, but otherwise it appears to be a good match for the introns and exons. 
So we start off with the five prime end. Using that accession, we add to it this part of this accession. We add to it this part of this accession. We add to it this part of that accession, and part of that accession, and we wind up with the black RefSeq that’s shown in the picture. The curation we do is transparent in the records when you get them from Nucleotide. The original sequence and the current sequence have the same accession, but the version has been incremented. The fact that there has been curatorial evaluation shows up when the provisional RefSeq becomes a validated RefSeq. We indicate explicitly when the original RefSeq has been replaced by the current RefSeq, and we show exactly how the current RefSeq was built piece-wise. We can correct truncated proteins. We have a region here where we have a model, the XM, and the NM, which we enlarged here to show exactly how the shortening of the NM eliminates the upstream start codon. The first question we have to ask is whether the encoded proteins are appropriate. We take the short CDS, which is from the known RefSeq, and compare it by blastp to a sorghum homolog, and we compare the protein from the XP, the model, to the same sorghum homolog. It’s very clear that the current known RefSeq is missing 86 residues at the N-terminus. So the extension is built piece-wise, as described earlier in the talk. We can clean up annotation in a region. Here is a region in which there is a single gene. It looks to be okay, except the NM, the known RefSeq, lacks the appropriate introns with the correct splice donor. But more importantly, it’s on the wrong strand. And the question becomes whether the protein is real. Using blastp, we compare both the protein from a transcript on the right strand and the protein from the known RefSeq to other plant proteins.
The minus-strand transcript encodes a protein that shows no relationship to any other plant protein, whereas the plus-strand transcript encodes a protein related to many plant proteins. We can use the RefSeq resources to identify genome problems. We store these in in-house databases, and when working with organism communities, we convey the information back to them. Here we have an assembly gap. Assembly gaps in introns are no problem. They do not interfere with the annotation display, so we pretty much ignore them, though we might keep track of them. However, assembly gaps in the middle of an exon cause problems. First, there are no donor and acceptor sites at that gap, and their absence is supported by the lack of intron-spanning RNA-seq. Additionally, we’ve noticed that introduced gaps tend to result in sequencing errors adjacent to the gap. Also, assembly gaps can truncate transcripts, both on the five prime end and on the three prime end, so the genome is missing this part of the gene even though the RefSeq accession does contain that sequence. We can also get some information about genome sequence quality. In this example, the same type of error occurs at the same location across the aligned transcripts, and it’s highly unlikely that all of these sequencing efforts made the same error while the genome had no error. It’s more likely that the transcripts were sequenced correctly and the genome has a mistake. Bonnie Maidak will now continue with the remainder of the talk. Thank you, Brian, and Nuala, and Diana. These were three RefSeq scientists, or RefSeq data curators. They are part of the group that is shown here, along with several other RefSeq data curators, and we wanted to make sure you recognize that this is a small but very powerful group in terms of its data analysis efforts here at NCBI.
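The genome-quality reasoning Brian described (the same difference showing up at the same genome position in many independently sequenced transcripts) can be sketched as a simple count. This is an illustrative sketch only: the input format, the function name, the accession labels, and the `min_transcripts` threshold are all assumptions, not the curators' actual method.

```python
from collections import Counter


def flag_genome_error_candidates(mismatches_by_transcript, min_transcripts=3):
    """Return differences seen in at least `min_transcripts` transcripts.

    `mismatches_by_transcript` maps a transcript accession to a set of
    (genome_position, kind) tuples, e.g. (1520, "insertion").
    """
    counts = Counter()
    for mismatches in mismatches_by_transcript.values():
        counts.update(set(mismatches))  # count each difference once per transcript
    return sorted(m for m, n in counts.items() if n >= min_transcripts)


# Hypothetical alignment differences for three transcripts.
observed = {
    "T1": {(100, "insertion"), (250, "snp")},
    "T2": {(100, "insertion")},
    "T3": {(100, "insertion"), (400, "deletion")},
}
print(flag_genome_error_candidates(observed))
```

Differences seen in only one transcript are more likely errors in that transcript; a difference shared by every transcript points back at the genome sequence.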
This webinar is also related to the Plant and Animal Genome meeting that starts this Saturday in San Diego, California, and both Brian and Kim will be at that meeting. We’ve compiled on this slide all of the links and PMIDs that were mentioned earlier in the webinar, so you have one slide with all of the links. And if we have any questions, we’ll take them at this time. I want to mention that if you have questions specifically about this webinar, please use the [email protected] e-mail address. If you want to ask about other NCBI resources that were not included in the webinar, you can send a message to [email protected]. If you have corrections for RefSeq data, then please go ahead and use that webpage. I’m going to read a couple of the questions that we received during the course of the webinar. One question was whether there are any RefSeq records that are predictions and yet have the NM or NP accession prefix. I’m going to ask Kim Pruitt to come to the microphone so that she can address this question. So in the higher eukaryotic annotation pipeline, the NM and NP accessions are reserved for those data that have primary evidence support within INSDC databases. They are supported by transcript cDNA and RNA-seq data. They are not predicted models. Within the context of the small eukaryotic pipeline, we are slowly shifting to using the X accession series because we don’t necessarily know for every genome how much curation versus computational prediction was used to provide the submitted annotation. Okay, thank you, Kim. A second question is whether NCBI resolves RefSeq and UniProt conflicts. And, again, I’ll ask Kim to address this. So RefSeq doesn’t resolve these conflicts alone, but we do coordinate closely with the UniProt group.
The degree of coordination varies by organism. Human and mouse are two organisms for which we have a deeper level of coordination through the Consensus CDS (CCDS) collaboration, which includes our partners in Ensembl, UCSC, the GENCODE Curation Group, UniProt, and RefSeq curators. We do coordinate on some other organisms, but more on a case-by-case basis for things that we happen to notice rather than as a consolidated effort on both sides. Thank you, Kim. Again, the questions and the answers will be posted in a PDF document and put in the FTP directory. If you need any additional information, you can go to the links that are there on the slide. This concludes today’s webinar on eukaryotic genome data curation at NCBI. I hope all of you who attended now better understand the process we follow to make sure that we have the best data available for you, the researchers and scientists.
