分离NR NT库,快速blast本地比对同源注释基因

分离NR NT库,快速blast本地比对同源注释基因

我们需要对自己的基因做注释,需要blast同源比对NCBI当中的NR NT库;通常做无参转录组,会组织出10几万的unigene ,如果比对全库的话,就太浪费时间了,我们可以根据NCBI的分类数据库将数据库分开,可下载如下文件,然后利用下面的perl脚本就可以把NR或者NT库分开成小库:


wget -c ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz


perl脚本如下,由于脚本会把分类文件全部读入内存,所有脚本内存消耗较大,建议确保100G以上的内存空间:


#The gi_taxid_nucl.dmp is about 160 MB and contains  two columns: the nucleotide's gi and taxid.
#The gi_taxid_prot.dmp is about 17 MB and contains two columns:  the protein's gi  and taxid.

#Divisions file (division.dmp):
#	division id				-- taxonomy database division id
#	division cde				-- GenBank division code (three characters)
#	division name				-- e.g. BCT, PLN, VRT, MAM, PRI...
#	comments




#0       |       BCT     |       Bacteria        |               |
#1       |       INV     |       Invertebrates   |               |
#2       |       MAM     |       Mammals |               |
#3       |       PHG     |       Phages  |               |
#4       |       PLN     |       Plants and Fungi        |               |
#5       |       PRI     |       Primates        |               |
#6       |       ROD     |       Rodents |               |
#7       |       SYN     |       Synthetic and Chimeric  |               |
#8       |       UNA     |       Unassigned      |       No species nodes should inherit this division assignment        |
#9       |       VRL     |       Viruses |               |
#10      |       VRT     |       Vertebrates     |               |
#11      |       ENV     |       Environmental samples   |       Anonymous sequences cloned directly from the environment        |

#nodes.dmp file consists of taxonomy nodes. The description for each node includes the following
#fields:
#        tax_id                                  -- node id in GenBank taxonomy database
#        parent tax_id                           -- parent node id in GenBank taxonomy database
#        rank                                    -- rank of this node (superkingdom, kingdom, ...)
#        embl code                               -- locus-name prefix; not unique
#        division id                             -- see division.dmp file
#        inherited div flag  (1 or 0)            -- 1 if node inherits division from parent
#        genetic code id                         -- see gencode.dmp file
#        inherited GC  flag  (1 or 0)            -- 1 if node inherits genetic code from parent
#        mitochondrial genetic code id           -- see gencode.dmp file
#        inherited MGC flag  (1 or 0)            -- 1 if node inherits mitochondrial gencode from parent
#        GenBank hidden flag (1 or 0)            -- 1 if name is suppressed in GenBank entry lineage
#        hidden subtree root flag (1 or 0)       -- 1 if this subtree has no sequence data yet
#        comments                                -- free-text comments and citations




die "perl $0 <division.dmp> <nodes.dmp> <gi_taxid_n/p.dmp> <nr.fa/nt.fa> <od>" unless(@ARGV==5);

use Math::BigFloat;
use Bio::SeqIO;
use Bio::Seq;
use Data::Dumper;
use PerlIO::gzip;

use FileHandle;


use Cwd qw(abs_path getcwd);
if ($ARGV[3]=~/gz$/){
	open $Fa, "<:gzip", "$ARGV[3]" or die "$!";
}else{
	open $Fa, "$ARGV[3]" or die "$!";
}

my$od=$ARGV[4];

$od||=getcwd;
$od=abs_path($od);
unless(-d $od){	mkdir $od;}


my $in  = Bio::SeqIO->new(-fh => $Fa ,
                               -format => 'Fasta');


my %division2name=();
open IN,"$ARGV[0]" or die "$!";


while(<IN>){
	chomp;
	my($taxid,$name)=split(/\s+\|\s+/);
	$division2name{$taxid}=$name;
}
close(IN);


my%fout=();

for my$i (keys %division2name){
	
	
	
	my$FO=FileHandle->new("> $od/$division2name{$i}.fa");
	
	my $out = Bio::SeqIO->new(-fh => $FO ,  -format => 'Fasta');
	$fout{$i}=$out;

}

print "$ARGV[0] readed\n";

#print Dumper(\%fout);
#print Dumper(\%division2name);

open IN,"$ARGV[1]" or die "$!";

my%taxid2division=();


while(<IN>){
	chomp;
	my@tmp=split(/\s+\|\s+/);
	$taxid2division{$tmp[0]}=$tmp[4];
	#last if $.>100;
}
close(IN);

print "$ARGV[1] readed\n";


my %gi2taxid=();


open IN,"$ARGV[2]" or die "$!";

while(<IN>){
	chomp;
	my@tmp=split(/\s+/);
	$gi2taxid{$tmp[0]}=$tmp[1];
	#last if $.>100;
}
close(IN);


print "$ARGV[2] readed\n";


#print Dumper(\%gi2taxid);

while( my $seq = $in->next_seq()){
	
	my $id=$seq->id;
	my ($gi)=($id=~/gi\|(\d+)\|ref\|/);
	if(exists($gi2taxid{$gi}) and exists($taxid2division{$gi2taxid{$gi}})){
		$fout{$taxid2division{$gi2taxid{$gi}}}->write_seq($seq);
	}else{
		print "unknown gi:$gi\n";
	}
}



for my $i (keys %fout){
	
	$fout{$i}->close();
	
	
}




2021 NCBI又更新:https://www.omicsclass.com/article/1614

更多生物信息课程:https://study.omicsclass.com/index

  • 发表于 2019-06-14 15:24
  • 阅读 ( 7277 )
  • 分类:软件工具

1 条评论

请先 登录 后评论
omicsgene
omicsgene

生物信息

658 篇文章

作家榜 »

  1. omicsgene 658 文章
  2. 安生水 328 文章
  3. Daitoue 167 文章
  4. 生物女学霸 120 文章
  5. 红橙子 78 文章
  6. CORNERSTONE 72 文章
  7. xun 67 文章
  8. rzx 67 文章