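Amplicon annotation usually starts with clustering. But if you only have a small number of sequences, say a few thousand, clustering may not yield usable results; or you may simply want to know which species each individual sequence belongs to. In either case, you can align the sequences with blastn and then summarize the results.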
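For context, here is a minimal sketch of how the blastn report might be produced from Python. The file names, the database name, and the helper `build_blastn_command` are hypothetical placeholders, not taken from the original workflow. It assumes the default pairwise text output and limits the report to three hits per query with `-num_descriptions` and `-num_alignments`.

```python
def build_blastn_command(query_fasta, database, report_path, max_hits=3):
    """Build a blastn call that writes the default pairwise text report."""
    return [
        "blastn",
        "-query", query_fasta,                # FASTA file with the amplicon sequences
        "-db", database,                      # name of a local BLAST nucleotide database
        "-out", report_path,                  # plain-text report to be parsed later
        "-num_descriptions", str(max_hits),   # one-line summaries to keep per query
        "-num_alignments", str(max_hits),     # alignments to keep per query
    ]

# Hypothetical file and database names, for illustration only
cmd = build_blastn_command("amplicons.fasta", "my_16S_db", "blast_report.txt")
print(" ".join(cmd))

# To actually run it (requires BLAST+ on PATH and an existing database):
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the number of hits per query small makes the summary table easier to read, since every extra hit adds three more columns to each row.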
The summary script is as follows:
import re

# Read the whole plain-text blastn report (default pairwise output format)
with open("results_clean_batch1.txt", "r", encoding="utf-8") as file:
    text = file.read()

# Split the report into one block per query sequence
matches = re.findall(r'(Query=.*?)(?=Query=|$)', text, re.DOTALL)

results = []
for match in matches:
    query_id = re.search(r'Query= (\S+)', match).group(1)
    # Each hit starts with ">accession description" and ends at its "Length" line
    species_matches = re.findall(r'>([^ ]+) ([^>]+?)\nLength', match, re.DOTALL)
    # Caveat: identities are paired with hits by position, so a hit with
    # several HSPs (several "Identities" lines) shifts the later pairings
    identities = re.findall(r'Identities = (.*?)\,', match)
    query_results = [query_id]
    for accession, species in species_matches:
        # Collapse descriptions that wrap over several lines
        species_name = ' '.join(species.split())
        if identities:
            identity = identities.pop(0)
            query_results.extend([accession, f"Species: {species_name}", f"Identity: {identity}"])
    results.append(query_results)

# Output to a file
with open('output_batch2.txt', 'w', encoding="utf-8") as f:
    # Add header (covers three hits; rows with more hits extend past it)
    header = "Sequence ID\tMatch 1 Accession\tMatch 1 Species\tMatch 1 Identity\tMatch 2 Accession\tMatch 2 Species\tMatch 2 Identity\tMatch 3 Accession\tMatch 3 Species\tMatch 3 Identity"
    f.write(header + '\n')
    # Write the results, one tab-separated row per query
    for result in results:
        f.write('\t'.join(result) + '\n')
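One thing to watch for: the script pairs hits with "Identities" lines purely by position, so a hit that reports more than one alignment (HSP) would shift the identities of all later hits. Below is a minimal sketch of a variant that looks up the identity inside each hit's own text and keeps only the first HSP. The function name `summarize_query` and the sample report fragment are hypothetical, and the sketch assumes hit headers always begin with `>` at the start of a line.

```python
import re

def summarize_query(block):
    """Return [query_id, accession, species, identity, ...] for one 'Query=' block."""
    query_id = re.search(r'Query= (\S+)', block).group(1)
    row = [query_id]
    # Split at each hit header (a '>' at the start of a line) so that the
    # identity is searched for inside its own hit only.
    for hit in re.split(r'\n>', block)[1:]:
        header = re.match(r'(\S+) (.+?)\nLength', hit, re.DOTALL)
        identity = re.search(r'Identities = (.*?),', hit)  # first HSP of this hit
        if header and identity:
            accession, description = header.groups()
            row.extend([accession, ' '.join(description.split()), identity.group(1)])
    return row

# Hypothetical, heavily trimmed report fragment for illustration only
sample = """Query= seq1

>AB000001.1 Escherichia coli strain X 16S ribosomal RNA gene,
partial sequence
Length=1450

 Score = 100 bits (110),  Expect = 1e-20
 Identities = 250/253 (99%), Gaps = 0/253 (0%)

 Score = 50 bits (55),  Expect = 1e-05
 Identities = 40/45 (89%), Gaps = 0/45 (0%)

>AB000002.1 Shigella flexneri 16S ribosomal RNA gene
Length=1400

 Score = 95 bits (105),  Expect = 1e-19
 Identities = 248/253 (98%), Gaps = 0/253 (0%)
"""

print(summarize_query(sample))
```

In this fragment the first hit has two HSPs. With positional pairing, the second hit would receive 40/45 (89%), whereas the per-hit lookup gives it 248/253 (98%). The sketch omits the "Species:" and "Identity:" prefixes for brevity; they can be added back when building the row.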