Building a SCENIC Database for Single-Cell Transcription Factor Analysis in a Non-Model Species


Create the conda environment by following the official documentation: https://github.com/aertslab/create_cisTarget_databases

Building the duck-to-human ortholog gene mapping:

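The downloaded annotation file (GTF/GFF) may carry a gene_name record for each gene. This usually corresponds to the symbol of the human ortholog, so it gives the correspondence between this species' genes and human genes. A small script extracts that mapping: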

python gff_gene2symbol.py GCF_047663525.1.gff gene2symbol.txt
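The script gff_gene2symbol.py itself is not shown here. As a rough idea of what such an extraction could look like, here is a minimal sketch. It assumes a standard 9-column GFF3 file, that the gene ID lives in the ID attribute, and that the symbol is stored under gene_name, Name, or gene (tried in that order). The function names and that attribute order are my own assumptions, not the actual script.

```python
from urllib.parse import unquote


def parse_attributes(field):
    """Turn a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            # GFF3 percent-encodes reserved characters inside values.
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def gene_to_symbol(gff_lines, symbol_keys=("gene_name", "Name", "gene")):
    """Yield (gene_id, symbol) for each 'gene' feature that carries a symbol.

    The attribute names in `symbol_keys` are an assumption; adjust them to
    whatever your annotation actually uses.
    """
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = parse_attributes(cols[8])
        gene_id = attrs.get("ID")
        symbol = next((attrs[k] for k in symbol_keys if k in attrs), None)
        if gene_id and symbol:
            yield gene_id, symbol


if __name__ == "__main__":
    # Made-up record, for illustration only.
    demo = ["chr1\tGnomon\tgene\t100\t900\t.\t+\t.\tID=gene-SOX2;Name=SOX2\n"]
    for gene_id, symbol in gene_to_symbol(demo):
        print(f"{gene_id}\t{symbol}")
```

Writing the pairs as two tab-separated columns (gene ID, then symbol) matches how gene2symbol.txt is used further down, where the second column is treated as the human symbol.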

Using gene2symbol.txt, filter motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl so that only the rows whose gene has an ortholog in this species are kept:

python filter_tbl.py gene2symbol.txt /share/backup01/database/SCENIC/human/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl --output motifs-v10nr_clust-nr.duck-m0.001-o0.0.tbl
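filter_tbl.py is likewise not shown. The idea can be sketched as follows. This assumes that the .tbl file is tab-separated, that its first line is a header containing a column named gene_name, and that the symbols sit in the second column of gene2symbol.txt. The function names are hypothetical.

```python
def load_symbols(lines):
    """Collect the symbols from column 2 of gene2symbol-style lines."""
    symbols = set()
    for line in lines:
        cols = line.split()
        if len(cols) >= 2:
            symbols.add(cols[1])
    return symbols


def filter_tbl(tbl_lines, symbols, column="gene_name"):
    """Yield the header line plus every row whose gene column is in `symbols`."""
    col_idx = None
    for line in tbl_lines:
        cols = line.rstrip("\n").split("\t")
        if col_idx is None:
            # Assumption: the first line is the header and names the column.
            col_idx = cols.index(column)
            yield line
        elif len(cols) > col_idx and cols[col_idx] in symbols:
            yield line


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    symbols = load_symbols(["gene-SOX2\tSOX2\n"])
    tbl = ["#motif_id\tgene_name\n", "m1\tSOX2\n", "m2\tZNF999\n"]
    print("".join(filter_tbl(tbl, symbols)), end="")
```

Looking the column up by name rather than by position keeps the sketch from silently filtering on the wrong field if the table layout differs from what is assumed.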

Next, extract all motif IDs from motifs-v10nr_clust-nr.duck-m0.001-o0.0.tbl and drop any motif that is not present in the singletons directory:

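grep -v '^#' motifs-v10nr_clust-nr.duck-m0.001-o0.0.tbl | cut -f 1 | sort -u > motif_all.txt
# Strip only the .cb suffix; motif IDs can themselves contain dots.
ls ../v10nr_clust_public/singletons | sed 's/\.cb$//' > motif_nr.txt
# -x matches whole lines, so one ID cannot match as a substring of another.
grep -Fxf motif_all.txt motif_nr.txt > motifs_duck.txt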
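The same intersection can be sketched in Python, which makes the exact-match intent explicit. This assumes the motif files in the singletons directory are named <motif_id>.cb; the function names are hypothetical.

```python
def motif_id_from_filename(filename, suffix=".cb"):
    """Drop only the trailing suffix, keeping any dots inside the motif ID."""
    return filename[: -len(suffix)] if filename.endswith(suffix) else filename


def usable_motifs(tbl_motif_ids, singleton_filenames):
    """Return the sorted motif IDs that appear in the table and have a motif file."""
    available = {motif_id_from_filename(name) for name in singleton_filenames}
    return sorted(set(tbl_motif_ids) & available)


if __name__ == "__main__":
    # Made-up names, for illustration only.
    ids = ["metacluster_1.2", "src__A"]
    files = ["metacluster_1.2.cb", "metacluster_1.cb", "src__A.cb"]
    print("\n".join(usable_motifs(ids, files)))
```

A set intersection compares whole IDs, so an ID such as metacluster_1 is never confused with metacluster_1.2.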

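Extract the region from 1 kb upstream to 0 bp downstream of each gene's start, naming each sequence by its gene name (the Name attribute, selected with -n):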

python3 extract_upstream.py GCF_047663525.1.fna GCF_047663525.1.gff  gene_upstream1k.fasta 1000 0 -n Name
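extract_upstream.py is not shown either. The core coordinate logic might look like the sketch below. It assumes 1-based inclusive GFF coordinates, that the "gene start" on the minus strand is the feature's end coordinate, and that the downstream count begins at the start base itself. All names are hypothetical, and a real script would also need FASTA and GFF parsing.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T/N, either case)."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_window(start, end, strand, upstream, downstream, chrom_len):
    """Return a 0-based half-open (lo, hi) window around the gene start.

    `start` and `end` are 1-based inclusive GFF coordinates. The window is
    clipped to the chromosome, so it may be shorter near the ends.
    """
    if strand == "-":
        # Gene start is at `end`; upstream lies at higher coordinates.
        lo, hi = end - downstream, end + upstream
    else:
        # Gene start is at `start`; upstream lies at lower coordinates.
        lo, hi = start - 1 - upstream, start - 1 + downstream
    return max(lo, 0), min(hi, chrom_len)


def extract_upstream(chrom_seq, start, end, strand, upstream=1000, downstream=0):
    """Return the strand-aware sequence around the gene start."""
    lo, hi = upstream_window(start, end, strand, upstream, downstream, len(chrom_seq))
    seq = chrom_seq[lo:hi]
    return reverse_complement(seq) if strand == "-" else seq


if __name__ == "__main__":
    # Toy 12-bp "chromosome" with a plus-strand gene at positions 7-9.
    print(extract_upstream("AAACCCGGGTTT", 7, 9, "+", upstream=3))
```

Genes close to a scaffold end yield sequences shorter than 1 kb under this clipping, which is worth keeping in mind when checking the resulting FASTA.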

Finally, run create_cistarget_motif_databases.py to build the database:

# FASTA file with sequences per region IDs / gene IDs.
fasta_filename=gene_upstream1k.fasta
# Directory with motifs in Cluster-Buster format.
motifs_dir=../v10nr_clust_public/singletons
# File with motif IDs (base name of motif file in ${motifs_dir}).
motifs_list_filename=motifs_duck.txt
# cisTarget motif database output prefix.
db_prefix=duck_up1kb_down0kb
nbr_threads=24
conda activate create_cistarget_databases
create_cistarget_databases_dir=/share/work/biosoft/create_cisTarget_databases/create_cisTarget_databases/
"${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \
    -f "${fasta_filename}" \
    -M "${motifs_dir}" \
    -m "${motifs_list_filename}" \
    -o "${db_prefix}" \
    -t "${nbr_threads}"

The final output file: duck_up1kb_down0kb.regions_vs_motifs.rankings.feather

Transcription factor gene list:

awk 'NR==FNR {gene[$2]=1; next} $1 in gene' gene2symbol.txt allTFs_hg38.txt >allTFs_duck.txt
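For readers less familiar with awk, the command keeps each line of allTFs_hg38.txt whose first field appears in the second column of gene2symbol.txt. A minimal Python sketch of the same filter, with a hypothetical function name:

```python
def filter_tfs(gene2symbol_lines, tf_lines):
    """Keep the TF lines whose first field is a symbol in gene2symbol column 2."""
    symbols = set()
    for line in gene2symbol_lines:
        cols = line.split()
        if len(cols) >= 2:
            symbols.add(cols[1])
    kept = []
    for line in tf_lines:
        cols = line.split()
        if cols and cols[0] in symbols:
            kept.append(line.rstrip("\n"))
    return kept


if __name__ == "__main__":
    # Made-up entries, for illustration only.
    mapping = ["gene-SOX2\tSOX2\n", "gene-PAX6\tPAX6\n"]
    tfs = ["SOX2\n", "ZNF999\n", "PAX6\n"]
    print("\n".join(filter_tfs(mapping, tfs)))
```

Like the awk version, this compares symbols case-sensitively, so a symbol that differs only in letter case between the two files would be dropped.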

Published 2026-05-18 14:21