blast XML结果解析,获得更多blast比对信息 perl

blast XML结果解析,获得更多比较信息 perl

blast format:  5 = BLAST XML,  格式 相比于  6 格式 包含更多的信息内容,包括比对的序列信息,长度信息等等;接下来我们可以用bioperl包来解析,获取自己想要的信息;

use strict;
use Bio::SearchIO;
use Data::Dumper;
my $fi     = $ARGV[0];
my $out = $ARGV[1];
my $format;

my $searchio     = new Bio::SearchIO(
    -format => "$format",
    -file   => "$fi"
open OUT,">$out" or die "$!";
print OUT "query_name\thit_name\tquery_length\thit_length\tidentity\talign_length\tmismatch\tgaps\tquery_start\tquery_end\thit_start\thit_end\tevalue\tbit_score\tquery_string\thit_string\tqfame\thfame\thit_description\n";
while(my $result = $searchio->next_result){
my  $algorithm_type =  $result->algorithm;
my $algorithm_version = $result ->algorithm_version;

while( my $hit = $result->next_hit ) {
# process the Bio::Search::Hit::HitI object
        while (my $hsp = $hit->next_hsp) {
        #print Dumper($hsp);
        # process the Bio::Search::HSP::HSPI object
            my $query_name   = $result->query_name;
            my $query_length = $result->query_length;
            my $query_description = $result->query_description;
            my $query_string   = $hsp->query_string;
            my $hit_string    = $hsp->hit_string;
            my $align        = $hsp->homology_string;
            $align           =~ s/ //g;
            my $qfame         = $hsp->query->frame;
            my $hfame           =$hsp->hit->frame;
            my $hit_name     = $hit->name;
            my $hit_description     = $hit->description;
            my $hit_length     = $hit->length;
            my $identity     = sprintf "%0.2f", $hsp->frac_identical;
            my $mismatch     =()= $hsp->seq_inds('hit','nomatch');
            my $gaps         = $hsp->gaps;
            my $align_length = $hsp->hsp_length;
            my $query_start  = $hsp->start('query');
            my $query_end    = $hsp->end('query');
            my $hit_start    = $hsp->start('hit');
            my $hit_end      = $hsp->end('hit');
            my $evalue       = $hsp->evalue;
            my $bit_score    = $hsp->bits;
            print OUT qq{$query_description\t$hit_name\t$query_length\t$hit_length\t$identity\t$align_length\t$mismatch\t$gaps\t$query_start\t$query_end\t$hit_start\t$hit_end\t$evalue\t$bit_score\t$query_string\t$hit_string\t$qfame\t$hfame\t$hit_description\n};


$result 对象 信息:  

Table of Methods

ResultalgorithmBLASTXalgorithm string   
Resultalgorithm_version2.2.4 [Aug-26-2002]algorithm version   
Resultquery_namegi20521485dbjAP004641.2query name
Resultquery_accessionAP004641.2query accession   
Resultquery_length3059query length   
Resultquery_descriptionOryza sativa … 977CE9AF checksum.query description   
Resultdatabase_nametest.fadatabase name   
Resultdatabase_letters1291number of residues in database   
Resultdatabase_entries5number of database entries   
Resultavailable_statisticseffectivespaceused … dblettersstatistics used   
Resultavailable_parametersgapext matrix allowgaps gapopenparameters used   
Resultnum_hits1number of hits   
Resulthits List of all Bio::Search::Hit::GenericHit objects for this Result   
Resultrewind Reset the internal iterator that dictates where next_hit() is pointing, useful for re-iterating through the list of hits   

Table 2.1: The data returned by the Result object methods when the report shown above is used as input.

$hit 对象包含的信息:

Note that many of the methods shown can be used to either get or set values, but we’re just showing what they get.

Hitnamegb443893124775hit name
Hitlength331Length of the Hit sequence  
Hitaccession443893accession (usually when this is a Genbank formatted id this will be an accession number - the part after the gb or emb )  
HitdescriptionLaForas sequencehit description  
Hitraw_score92hit raw score  
Hitsignificance2e-022hit significance  
Hitbits92.0hit bits  
Hithsps List of all Bio::Search::HSP::GenericHSP objects for this Hit  
Hitnum_hsps1number of HSPs in hit  
Hitlocus124775locus name  
Hitaccession_number443893accession number  
Hitrewind Resets the internal counter for next_hsp() so that the iterator will begin at the beginning of the list  

Table 2.2. The data returned by Hit object methods when the report shown above is used as input.

$hsp 包含的信息:

Many of the methods shown can be used to either get or set values, but we’re just showing what they get.

HSPexpect2e-022alias for evalue()
HSPfrac_identical0.884615384615385fraction identical
HSPfrac_conserved0.923076923076923fraction conserved (conservative and identical replacements aka “fraction similar”)
HSPgaps2number of gaps
HSPquery_stringDMGRCSSG ..query string from alignment
HSPhit_stringDIVQNSS …hit string from alignment
HSPthomology_stringD+ + SSGCN …string from alignment
HSPlength(‘total’)52length of HSP (including gaps)
HSPlength(‘hit’)50length of hit participating in alignment minus gaps
HSPlength(‘query’)t156length of query participating in alignment minus gaps
HSPthsp_length52Length of the HSP (including gaps) alias for length(‘total’)
HSPnum_conserved48number of conserved (conservative replacements, aka “similar”) residues
HSPnum_identical46number of identical residues
HSPtrank1rank of HSP
HSPseq_inds(‘query’,’identical’)(966,971,972,973,974,975 …)identical positions as array
HSPseq_inds(‘query’,’conserved-not-identical’)(967,969)conserved, but not identical positions as array
HSPseq_inds(‘query’,’conserved’)(966,967,969,971,973,974,975, …)conserved or identical positions as array
HSPseq_inds(‘hit’,’identical’)(197,202,203,204,205, …)identical positions as array
HSPseq_inds(‘hit’,’conserved-not-identical’)(198,200)conserved not identical positions as array
HSPseq_inds(‘hit’,’conserved’,1)(197,202-246)conserved or identical positions as array, with runs of consecutive numbers compressed
HSPbits92.0score in bits
HSPrange(‘query’)(2896,3051)start and end as array
HSPrange(‘hit’)(197,246)start and end as array
HSPpercent_identity88.4615384615385% identical
HSPstrand(‘hit’)1strand of the hit
HSPstrand(‘query’)1strand of the query
HSPstart(‘query’)2896start position from alignment
HSPend(‘query’)3051end position from alignment
HSPstart(‘hit’)197start position from alignment
HSPend(‘hit’)246end position from alignment
HSPmatches(‘hit’)(46,48)number of identical and conserved as array
HSPmatches(‘query’)(46,48)number of identical and conserved as array
HSPget_alnsequence alignmentBio::SimpleAlign object
HSPthsp_groupNot available in this reportGroup field from WU-BLAST reports run with -topcomboN or -topcomboE specified
HSPlinksNot available in this reportLinks field from WU-BLAST reports run with -links showing consistent HSP linking

Table 2.3. The data returned by HSP object methods when the report shown above is used as input.


  • 发表于 2022-05-13 13:16
  • 阅读 ( 39 )
  • 分类:perl

0 条评论

请先 登录 后评论


532 篇文章

作家榜 »

  1. omicsgene 532 文章
  2. 安生水 252 文章
  3. Daitoue 167 文章
  4. 生物女学霸 120 文章
  5. 红橙子 78 文章
  6. CORNERSTONE 72 文章
  7. 生信老顽童 48 文章
  8. landy 37 文章