NCBI Minute: Using organism (taxonomic) information with standalone BLAST

Hello everyone, this is Peter Cooper from the
NCBI. Welcome to today’s webinar on standalone BLAST 2.9.0 and the version 5 database
format. This has had various titles; the short version is that we are going to talk mainly
about using organism information with standalone BLAST+ (2.9.0) and BLAST DBv5. This is a webinar I gave almost
a year ago, but we wanted to update you on some things about version 5, remind everybody, and
tell those who did not hear us the first time what you can do with BLAST DBv5. So what we are going to do today is give a
little motivation behind why we made the changes, talk about some features, show you how to
limit your search by taxonomy, search sequences by accession faster and we will show you how
to retrieve sequences by taxonomy from a BLAST database. It’s useful to understand why we made the
changes. The main reason is to prepare for changes
that are coming to sequence identifiers: we are moving to systems that use accession-only sequence
identifiers rather than the integer IDs, which are called GIs. There are also changes in the protein chain
identifiers coming from PDB, the structure database. So we need to make some changes in the database. That is all well and good and it sounds like
we are doing this because we have to do it for our own purposes but you get things out
of this. One of the things you get is something people
have wanted for a long time: taxonomic information built into the BLAST databases. This makes your organism searches much easier. The database system we are using, LMDB, enables
faster lookup of identifiers. You will also get access to proteins from
GI-less projects. We are now producing proteins that do not
have GIs. You cannot see those in the old BLAST databases.
And when we get the new PDB records with the long chain IDs, you will be able to access
those in the BLAST databases. Just let me point out something. All of the software relies on integer identifiers;
there are various ones for our databases. The ones for the sequence databases are called
GI numbers. They are integers, and all systems have traditionally depended on these identifiers. These are basically ints in programming-language
terms, so there is a limit, depending on whether they are signed or unsigned, of about 2 billion
or about 4 billion. The thing I want to point out is we are already
at about 1 billion GIs, so that means we are going to run out. This will be a problem, and we will have
to rewrite a lot of software, so we will move away from using GIs and use accession.version
identifiers instead. And because the number of GIs we have left
is somewhat limited, we are kind of rationing them. For some projects, in particular the whole
genome shotgun projects that are coming in now for which we do make proteins, we have stopped
assigning GI numbers. So there is a whole bunch of these that you
can see in the sequence set browser: nucleotide and protein sequences that do not
have GI numbers. You cannot find these by searching the resources with ordinary
queries, although you can find them by their identifiers, and these proteins in particular will not
be present in the v4 BLAST databases, the old versions. Just to point out, many of these proteins
are coming from the pathogen detection project, which is a collaboration between a number of
government agencies involved in identifying and analyzing foodborne illness outbreaks. This is one of the major sources of GI-less
proteins, and these are important because the organisms are important and so are the proteins,
so we need to have access to those. That’s why we need the new BLAST version and the
new database. What has been up for a while at NCBI is
not quite a full DBv5 database, which means it still has GI numbers. In order to work with the fully GI-less database
you will have to both upgrade the BLAST binaries and get the new DBv5. Executables for 2.9.0 are available in the
usual place in the BLAST FTP directory, at the /LATEST/ link. 2.9.0 will work with version 4 but I want
to point out we will stop updating version 4 eventually and we are thinking about stopping
after September 2019. In order to work with the newest databases
which are not yet on the FTP site, you will need to upgrade to 2.9.0. The updated GI-less databases for version
5 should be on the FTP site; we are thinking May 20. And I have a little footnote, although it’s
not something I want to advertise as a great thing: a side effect of using these on the
web is that you cannot use Entrez queries. The organism limits are still there and that’s
what most people need. So speaking of that, the thing I always tell
people when they do a BLAST search on the web is, most importantly, to limit by organism.
You can do that by typing the organism name at the bottom of the BLAST form where
it says organism to include or exclude. In standalone BLAST, the way to do it in the past
was with a GI list, and that will not work anymore with version 5 databases. So in order to do this you can do some other
things that do not require using GIs. One is using
the seqidlist. Those can be accessions, so you can use accession.version
format to limit to a set of sequences in the database. You can also exclude using a negative list;
again, that will have the accessions you don’t want to see. And then there is a set of options that have
to do with the baked-in taxonomy in the database. So you can search for a particular taxid
or a small set that you can pass as an argument, or pass as a negative argument so you can
exclude certain groups. Or you can provide a file of taxids to either
include or exclude. These options work with the search programs. With blastdbcmd,
the program used to extract things from the databases, you can retrieve sequences
for particular taxids. For any of the databases that are in version 5
format, you can use -tax_info to look at what is in there and how many sequences each taxon
has. So how do you get a list of taxids? You can
do that on the web easily: you can go to the Taxonomy database at NCBI and search for,
for example, a set of organisms. Here I am getting all taxids for green plants with the query green plants[subtree]. You can download that,
or I could do an exclusion if I want a particular set of taxids, in this case green plants
without the monocots: green plants[subtree] NOT liliopsida[subtree].
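The two Taxonomy queries above, and a reader for the taxid list file you download from them, might be sketched in Python roughly as follows. This is only an illustration: the function names are hypothetical, and it assumes the downloaded file holds one integer taxid per line.

```python
from pathlib import Path
from typing import List


def subtree_query(include: str, exclude: str = "") -> str:
    """Build a Taxonomy query such as 'green plants[subtree]'."""
    query = f"{include}[subtree]"
    if exclude:
        query += f" NOT {exclude}[subtree]"
    return query


def read_taxid_list(path: str) -> List[int]:
    """Read taxids from a text file, skipping blank lines and duplicates.

    Assumption: the file has one integer taxid per line.
    """
    taxids: List[int] = []
    seen = set()
    for line in Path(path).read_text().splitlines():
        text = line.strip()
        if not text:
            continue
        taxid = int(text)  # a non-numeric line raises ValueError
        if taxid not in seen:
            seen.add(taxid)
            taxids.append(taxid)
    return taxids


if __name__ == "__main__":
    print(subtree_query("green plants"))
    print(subtree_query("green plants", exclude="liliopsida"))
```

Counting the entries with `len(read_taxid_list(...))` is a quick check that the download is complete before you use the file to limit a search.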
You can create a file, download it, and use that to limit your search. In this case this is about 209,000 taxids,
which is not too bad; you could download it in a few seconds. You could also do it from the convenience
of your command line with a script that comes with the BLAST release; it is
a shell program. You can either input a name and get back the
taxid, or you can input a taxid to get all the taxids included in that taxon. This uses our EDirect package, so that has
to be installed, and for that reason it only works on Unix, Linux, or Cygwin. So you don’t have to use that; you could
do it on the web, but this is pretty convenient. So here are some sample command lines that
use these various new options that are available now. We will go over some of these on the command
line directly. I won’t go over these by reading them aloud
to you, since I don’t think that’s helpful, but if you look at the slides I gave
you, you will find some extra slides that I used when I gave my talk earlier. Okay, so let’s go back and we will do some
live demos, which is always exciting because they don’t always work, but that’s fine. We are going to look at a GI-less protein
so you can see these do exist, and we will get taxids for soybean and flowering plants. I actually think I changed that to the pea
family rather than flowering plants because that was such a large set. And then we will show how to limit a blastp
search against nr to those particular taxa. We will show you how to extract certain sequences
from nr using the taxid list, and we will show you how to use a seqidlist converted
to a binary form so that it works much faster. So I’m going to escape out of my slides and
I will move over to the command line. I’m going to switch back and forth because
I cannot type, so I will copy. I have the text file up on the FTP site and you can look at
it. It is also in your handout as a Word document, so
you can see the command lines. I will just copy and paste so we can see them
in action. Let me get a terminal window up. So I will switch back and forth between this
and my web browser. So the first thing I am going to do is to
show you there are GI-less proteins in the database. What I am doing is pointing at a copy of the database
on my system. So this is just a typical use of blastdbcmd
to get the FASTA format for this particular protein. This is one of these proteins that came from
the pathogen detection project, and this is one of the germ resistant proteins from
Salmonella. Now how do I know this is GI-less? Let’s
see if we can pull this out of the version 4 database. I just pointed to v4 here and you see
that doesn’t work. There are a couple of other command
lines that I could use, but you can see there really are not any GIs in the DBv5 database. And remember, version 5 is not yet available
on the FTP site. Let’s go ahead and move on. We will look at some of the basic things we
talked about a minute ago. So we will switch back and forth a little
bit. I will open a new tab over here. So what I am going to do is do a search and
I will go ahead and load this directly. This is what I can do in taxonomy on the web
to get taxids. You can take any organism name and go to the
Taxonomy database. This is a species-level taxid, which is what you will need to work
with the BLAST databases. Really it’s a leaf-node taxid. Here the taxid is 3847, so I can
easily find that on the web. I can also do this command-line style and
use the script provided with the BLAST distribution. We will see that in a minute. Notice I can get all of the taxids for a particular
taxon by searching for the pea family, for example. So that is the Entrez query that I need: pea
family and then subtree in square brackets, pea family[subtree].
It gives me about 13,000 taxids. I can then save these to a file with the Send to menu: I save this in the taxid list format and I can create
the file. I have already done that, and it is in the directory for the webinar, so to save
time I won’t download it. We can also do the same thing using the script. Let me go back over here. This is the get_species_taxids.sh shell script.
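If you drive this script from Python rather than typing it at the shell, building the argument list might look like the sketch below. The helper name is hypothetical; the sketch assumes the script's `-n` (organism name) and `-t` (taxid) options and that the script and EDirect are installed and on your PATH. Whether a common name such as "pea family" resolves the same way as a scientific name is also an assumption.

```python
import shlex
from typing import List, Optional


def taxid_script_args(name: Optional[str] = None,
                      taxid: Optional[int] = None) -> List[str]:
    """Build the argument list for get_species_taxids.sh.

    Pass a name to look up its taxid, or a taxid to list the
    species-level taxids below it. Exactly one must be given.
    """
    if (name is None) == (taxid is None):
        raise ValueError("pass exactly one of name or taxid")
    if name is not None:
        return ["get_species_taxids.sh", "-n", name]
    return ["get_species_taxids.sh", "-t", str(taxid)]


if __name__ == "__main__":
    # Print the command lines instead of running them, so this
    # works even where BLAST+ and EDirect are not installed.
    print(shlex.join(taxid_script_args(name="pea family")))
    print(shlex.join(taxid_script_args(taxid=3847)))
```

To actually run it, you could pass the list to `subprocess.run(args, capture_output=True, text=True, check=True)` and write the captured output to a file for later use as a taxid list.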
Like many of the programs, you can just type help at the end to get your options. This is a pretty simple script. The taxonomy ID is one of the arguments
I can give it, or I can give the name of the organism that I am interested in. So let’s say I want to do the pea family, which
is a fairly large group of plants. So that is the taxid; now I can use it to
get all of the taxids I would need to make my list, the same thing I did on the web a second ago.
I won’t redirect to a file right now. I will go ahead and just show you part of
the output. So you would just get the long list of taxids.
In fact that file is here, and I can count how
many are in there. And that is the same number we had on the
web, 13,176. Okay, let’s do some other more interesting
things. Let’s limit the BLAST search to a particular
taxid. So let’s search the nr database; I have a
bunch of options on the command line that are useful. We have a particular tabular output. This will be searching only that taxid,
the soybean, 3847. And I have this protein argo I will search
against that. I will go ahead and send the output to the
terminal. It will be a little faster. How fast this search runs depends on how many people are
using my BLAST machine, so let’s give it a shot. If I get tired of waiting I have the results
available that we can look at. We will give it about 15 or 20 seconds and
then I will go ahead and bail on it. Here it is. The total output is similar and it goes on
and on, but we have focused the search against nr on proteins from soybean only. Likewise, I can do the same thing for the pea family. In this case I am using the taxid list. Let’s just send the first 10 lines of
that to the output. Let’s see how fast this one is; it is a little bit more data. We will give that about 20 seconds, and if
it does not finish I will cancel and show you the output I have already saved. That was pretty fast. Admittedly, the machine I’m using has a lot of
memory. Again, we see other members of the pea family;
in this case the first few are from Glycine but a different species. Okay. So, suppose we wanted to get out all the soybean
proteins that are in the nr version 5 database. We can easily do that here. I am using blastdbcmd, and this was one
of the command lines that was in my slides. So I will just tell it to give me all of the
FASTA-formatted sequences. I am not going to show you the whole output; I have it in the file. This is
just to show that this does work. The first couple of them are right there. I can do exactly the same thing with the pea
family. It does not really add any additional information,
so I will not demo that one but it is in the handout. The last thing I want to show you is making
an accession list and using that to focus your search on a particular set of sequences. So I will get a list for human non-model RefSeqs. I have the Entrez query right there. This is actually a pretty fast way to get
them. I will just show you that on the web. So there are 53,000 of these. You could easily download these to a file. What we want to do is get the accession list, and
I have already done that, so the list is in my local directory. There is another way to do it: you could also
use EDirect, and that command line is also in the handout. I’m not going to bother running that, but EDirect gives you exactly the same thing. So I have the file here, and these will
be the NP_-style reference sequences. There are the first 10 of them, with about 50,000 in there. And then what I want to do is create a version
5 seqidlist. This uses a utility that comes with BLAST, blastdb_aliastool. It’s going to make this binary seqid file
that we will use to limit our search. Okay, that file is there. This is my non-model seqid list right there. The last thing I want to do is run the search, with the same kind of options. Again I will go ahead and send this to standard
output instead of putting it in a file. Okay. So that worked just fine. So now, let me go ahead and go back to my
slides here. So, we have shown you the improvements we
have made with BLAST DBv5. It works faster with string lookups, makes taxonomic filtering much
easier, and works faster with accession lists. And remember that this is
required for the GI-less databases that are coming. I will leave you with a few URLs for things
we talked about in the webinar and places to go for additional help, for example the
YouTube channel, which is where the recording will be. I will leave it open for a few questions. I realize we are pretty much at time, but
I can continue answering questions if there are any. Dave is going to read me the questions and
then Christiam and I will attempt to answer them.
[Dave Arndt] Someone asked about the Refseq_genome databases: when will they be in BLAST DBv5?
[Peter Cooper] That is a good question for Christiam to answer.
[no audio] We don’t have a date for when RefSeq Genomes will be available in v5. We will make an announcement when the nucleotide BLAST databases are available in v5 format.
[Peter Cooper] That means that the genome databases will not be available in version 5 for a while. The associated proteins are in the regular protein database.
[Dave Arndt] Another question. What would you recommend for finding or narrowing down taxids when you are looking for an order of animals, when the taxonomy list in the database is not correct or doesn’t seem to be correct for some of the entries when you do an alignment?
[Peter Cooper] I’m not sure I understand the question. One thing to say is that whether or not the taxonomy is correct, or what people think about the classification, does not really matter so much, because that’s just what we use. You would have to use the taxids that we use for that group. So that is all that I can recommend. You have to use our taxids because those are the ones we’ve assigned to the sequences. Whether or not that is a true classification is open for debate, but I’m not sure if that answered the question or not.
[Dave Arndt] The next one is, is it possible to filter by type species or type strains?
[Peter Cooper] That is possible to do by making an accession list. I don’t know that there is an option to search for those taxids in Taxonomy, but there may very well be. That’s a good question, and I will look into that and answer it in the Question/Answer document. I know you can do that kind of search in proteins, for example, to get sequences from type.
[Dave Arndt] [unintelligible] Why can’t you use the taxonomy from within the database to filter for higher level taxonomic categories?
[Christiam Camacho] [unintelligible] Only the terminal or leaf node taxids are stored in the database for each sequence. The BLAST database does not include the taxids for the entire taxonomic lineage, since that would inflate the size of the BLAST databases too much.
[Peter Cooper] That is really for efficiency, right Christiam? If we stored all of those lineages that would be too much. There would be too many nodes in there. But we did this in the demo: we limited to the pea family, for example. You do have to get all of the leaf node taxids.
[Peter Cooper] So anything in the question pod we will address in the document that’s going to be provided in the next few days. Okay. Thank you everybody for coming. That’s going to conclude the webinar for today. Nice talking with you and we will talk with you next time. Thank you.
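For readers who want to reproduce the demo, the three ways of limiting a search shown in the webinar (a taxid, a taxid list file, and a binary seqidlist), plus extraction by taxid, can be sketched in Python as argument lists. This is not the handout's command lines: the helper names and file names are hypothetical, the tabular format 7 is an assumed choice, and allowing only one filter per search is a simplification made by this sketch.

```python
import shlex
from typing import List, Optional, Sequence


def blastp_args(query: str, db: str = "nr",
                taxids: Optional[Sequence[int]] = None,
                taxidlist: Optional[str] = None,
                seqidlist: Optional[str] = None) -> List[str]:
    """Build a blastp command limited by taxids, a taxid file, or a seqidlist."""
    chosen = [f for f in (taxids, taxidlist, seqidlist) if f]
    if len(chosen) > 1:
        raise ValueError("this sketch applies one filter at a time")
    args = ["blastp", "-db", db, "-query", query, "-outfmt", "7"]
    if taxids:
        # A small set of taxids is passed as a comma-separated argument.
        args += ["-taxids", ",".join(str(t) for t in taxids)]
    elif taxidlist:
        args += ["-taxidlist", taxidlist]
    elif seqidlist:
        args += ["-seqidlist", seqidlist]
    return args


def seqidlist_convert_args(text_list: str, binary_list: str) -> List[str]:
    """Build the blastdb_aliastool command that makes a binary seqidlist."""
    return ["blastdb_aliastool",
            "-seqid_file_in", text_list,
            "-seqid_file_out", binary_list]


def extract_by_taxid_args(taxid: int, db: str = "nr") -> List[str]:
    """Build a blastdbcmd command that retrieves sequences for one taxid."""
    return ["blastdbcmd", "-db", db, "-taxids", str(taxid)]


if __name__ == "__main__":
    # File names below are placeholders, not the ones used in the webinar.
    for cmd in (
        blastp_args("query.fsa", taxids=[3847]),
        blastp_args("query.fsa", taxidlist="pea_family.txids"),
        seqidlist_convert_args("nonmodel.acc", "nonmodel.bsl"),
        blastp_args("query.fsa", seqidlist="nonmodel.bsl"),
        extract_by_taxid_args(3847),
    ):
        print(shlex.join(cmd))
```

Because only leaf-node taxids are stored in the database, a higher-level group such as the pea family has to go through the taxid list file rather than a single taxid.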

