Single-cell transcriptome data mining workflow notes: LUAD (GSE131907)


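Data overview:

Summary (machine-translated with Google Translate in the original post):

Single-cell RNA sequencing (scRNA-seq) was performed on 208,506 cells derived from 58 lung adenocarcinoma samples from 44 patients, covering primary tumors, lymph node and brain metastases, and pleural effusions, as well as normal lung tissue and lymph nodes. The extensive single-cell profiles depict a complex cellular atlas of lung adenocarcinoma progression, including cancer, stromal, and immune cells in the surrounding tumor microenvironment.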

Data download page: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131907




Download the data:

wget -c "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE131nnn/GSE131907/suppl/GSE131907%5FLung%5FCancer%5Fraw%5FUMI%5Fmatrix.txt.gz" -O GSE131907_Lung_Cancer_raw_UMI_matrix.txt.gz
wget -c "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE131nnn/GSE131907/suppl/GSE131907%5FLung%5FCancer%5Fcell%5Fannotation.txt.gz" -O GSE131907_Lung_Cancer_cell_annotation.txt.gz
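
Large GEO downloads can end up truncated, so it may be worth testing the archives before going further. The following is a minimal sketch, assuming gzip is installed; check_gz is a hypothetical helper name and not part of the original workflow.

```shell
# Hypothetical helper: test a gzip archive without unpacking it to disk.
check_gz() {  # $1 = path to a .gz file
  if gzip -t "$1" 2>/dev/null; then
    echo "OK $1"
  else
    echo "CORRUPT $1"
    return 1
  fi
}

# Check the two files fetched above, if they are present in this directory.
for f in GSE131907_Lung_Cancer_raw_UMI_matrix.txt.gz \
         GSE131907_Lung_Cancer_cell_annotation.txt.gz; do
  if [ -f "$f" ]; then
    check_gz "$f"
  fi
done
```

If a file is reported as CORRUPT, rerunning the same wget -c command resumes the download instead of starting over.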


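All samples are merged into a single matrix, so it needs to be split by sample to make downstream loading and QC easier. The R script below does this. It expects the per-sample annotation files (LUNG_N01.txt and so on) produced by the annotation-splitting step further down, so run that step first: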

#!/usr/bin/env Rscript
# ===============================
# Parameters
# ===============================
anno_dir <- "./"   # directory containing the per-sample annotation files (LUNG_N01.txt, etc.)
umi_file <- "GSE131907_Lung_Cancer_raw_UMI_matrix.txt.gz"
out_dir  <- "./split"
dir.create(out_dir, showWarnings = FALSE)
# ===============================
# Read the UMI matrix
# ===============================
message("Reading UMI matrix...")
umi <- read.table(
  gzfile(umi_file),
  header = TRUE,
  sep = "\t",
  check.names = FALSE,
  stringsAsFactors = FALSE
)
gene_col <- umi[, 1, drop = FALSE]
umi_mat  <- umi[, -1, drop = FALSE]
message("UMI matrix dimensions:")
print(dim(umi_mat))
# ===============================
# Process each sample
# ===============================
anno_files <- list.files(anno_dir, pattern = "\\.txt$", full.names = TRUE)
for (f in anno_files) {
  sample_name <- sub("\\.txt$", "", basename(f))
  message("Processing ", sample_name)
  # Read the per-sample annotation
  anno <- read.table(
    f,
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE
  )
  idx <- anno$Index
  idx <- intersect(idx, colnames(umi_mat))
  if (length(idx) == 0) {
    warning("No matching cells for ", sample_name)
    next
  }
  sub_mat <- umi_mat[, idx, drop = FALSE]
  out <- cbind(gene_col, sub_mat)
  out_file <- file.path(out_dir, paste0(sample_name, "_UMI.txt.gz"))
  write.table(
    out,
    gzfile(out_file),
    sep = "\t",
    quote = FALSE,
    row.names = FALSE
  )
  message("  Saved: ", out_file,
          " (", ncol(sub_mat), " cells)")
}
message("Done.")
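
Because the R script keeps only the cells whose Index appears in the matrix header, a quick count comparison can show whether any annotated cells were dropped. This is a rough sketch, assuming the split matrices sit in ./split and the per-sample annotation files in the current directory; both helper names are hypothetical.

```shell
# Cells in a gzipped, tab-delimited matrix: header columns minus the gene column.
cells_in_matrix() {  # $1 = path to <sample>_UMI.txt.gz
  gzip -dc "$1" | head -n 1 | awk -F'\t' '{ print NF - 1 }'
}

# Cells in a per-sample annotation file: rows minus the header row.
cells_in_annotation() {  # $1 = path to <sample>.txt
  awk 'END { print NR - 1 }' "$1"
}

# Print sample, cells in the matrix, cells in the annotation.
for f in split/*_UMI.txt.gz; do
  [ -e "$f" ] || continue   # nothing to do if the glob matched no files
  s=$(basename "$f" _UMI.txt.gz)
  printf '%s\t%s\t%s\n' "$s" "$(cells_in_matrix "$f")" "$(cells_in_annotation "$s.txt")"
done
```

A sample whose two counts differ had annotation rows with no matching column in the UMI matrix.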

Split the cell annotation file by sample as well (the sample name is taken from column 3):

zcat GSE131907_Lung_Cancer_cell_annotation.txt.gz | \
awk -F'\t' '
NR==1 {
  header=$0
  next
}
{
  out=$3 ".txt"
  if (!(out in seen)) {
    print header > out
    seen[out]=1
  }
  print > out   # same redirection as the header; awk truncates only when the file is first opened
}'
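
The splitting logic can be tried on a toy file before running it on the full annotation. The sketch below wraps the same awk idea in a hypothetical function, split_by_sample, and adds an output directory argument; the barcodes and sample names are made up.

```shell
# Split a tab-delimited annotation file into one file per value of column 3,
# repeating the header line at the top of every output file.
split_by_sample() {  # $1 = annotation file, $2 = output directory
  awk -F'\t' -v dir="$2" '
    NR == 1 { header = $0; next }
    {
      out = dir "/" $3 ".txt"
      if (!(out in seen)) { print header > out; seen[out] = 1 }
      print > out
    }' "$1"
}

# Toy input: three cells from two invented samples, S1 and S2.
workdir=$(mktemp -d)
{
  printf 'Index\tBarcode\tSample\n'
  printf 'AAAC_S1\tAAAC\tS1\n'
  printf 'AAAG_S2\tAAAG\tS2\n'
  printf 'AACC_S1\tAACC\tS1\n'
} > "$workdir/anno.txt"

split_by_sample "$workdir/anno.txt" "$workdir"
wc -l "$workdir/S1.txt" "$workdir/S2.txt"
```

Each output file should contain the header plus the rows of that sample, so S1.txt has three lines and S2.txt has two.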



Data analysis:


Prepare the sample metadata file, meta.tsv (tab-separated, one row per sample):

Sample	Patient_id	Tissue	Histology	Smoking	Pathology	EGFR_MUT	EGFR_Type	Stages	Stages_Status
LUNG_N01	P0001	nLung	ADC	Never	MD	WT	WT	IA	Low
LUNG_N06	P0006	nLung	ADC	Ex	MD	NA	NA	IA	Low
LUNG_N08	P0008	nLung	ADC	Never	MD	L858R	MUT	IB	Low
LUNG_N09	P0009	nLung	ADC	Ex	PD	WT	WT	IIA	Low
LUNG_N18	P0018	nLung	ADC	Ex	MD	del19	MUT	IA	Low
LUNG_N19	P0019	nLung	ADC	Cur	WD	exon_20	MUT	IA	Low
LUNG_N20	P0020	nLung	ADC	Cur	PD	WT	WT	IA	Low
LUNG_N28	P0028	nLung	ADC(Double)	Cur	NA	WT	WT	IIIA	High
LUNG_N30	P0030	nLung	ADC	Never	NA	del19	MUT	IA	Low
LUNG_N31	P0031	nLung	ADC	Ex	NA	WT	WT	IIIA	High
LUNG_N34	P0034	nLung	ADC	Never	MD	WT	WT	IA3	Low
LUNG_T06	P0006	tLung	ADC	Ex	MD	NA	NA	IA	Low
LUNG_T08	P0008	tLung	ADC	Never	MD	L858R	MUT	IB	Low
LUNG_T09	P0009	tLung	ADC	Ex	PD	WT	WT	IIA	Low
LUNG_T18	P0018	tLung	ADC	Ex	MD	del19	MUT	IA	Low
LUNG_T19	P0019	tLung	ADC	Cur	WD	exon_20	MUT	IA	Low
LUNG_T20	P0020	tLung	ADC	Cur	PD	WT	WT	IA	Low
LUNG_T25	P0025	tLung	ADC(Double)	Ex	NA	WT	WT	IA	Low
LUNG_T28	P0028	tLung	ADC(Double)	Cur	NA	WT	WT	IIIA	High
LUNG_T30	P0030	tLung	ADC	Never	NA	del19	MUT	IA	Low
LUNG_T31	P0031	tLung	ADC	Ex	NA	WT	WT	IIIA	High
LUNG_T34	P0034	tLung	ADC	Never	MD	WT	WT	IA3	Low
EBUS_06	P1006	tL/B	ADC	Cur	PD	WT	WT	IV	High
EBUS_28	P1028	tL/B	ADC	Ex	NA	WT	WT	IV	High
EBUS_49	P1049	tL/B	ADC	Cur	PD	WT	WT	IV	High
BRONCHO_58	P1058	tL/B	ADC	Never	PD	NA	NA	IV	High
EBUS_10	P1010	mLN	ADC	Ex	NA	WT	WT	IV	High
BRONCHO_11	P1011	mLN	ADC	Never	NA	L858R	MUT	IV	High
EBUS_12	P1012	mLN	ADC	Never	NA	WT	WT	IV	High
EBUS_13	P1013	mLN	ADC	Cur	PD	WT	WT	IV	High
EBUS_15	P1015	mLN	ADC	Cur	NA	exon_18_(G719X)__exon_20_(S768I)	MUT	IIIA	High
EBUS_19	P1019	mLN	ADC	Never	NA	del19	MUT	IV	High
EBUS_51	P1051	mLN	ADC	Ex	NA	WT	WT	IV	High
LN_01	P2001	nLN	ADC	Cur	PD	WT	WT	IIB	Low
LN_02	P2002	nLN	ADC	Never	MD	L858R	MUT	IB	Low
LN_03	P2003	nLN	ADC	Ex	MD	L858R	MUT	IIB	Low
LN_04	P2004	nLN	ADC	Cur	MD	L858R	MUT	IA2	Low
LN_05	P2005	nLN	ADC	Never	MD	del19	MUT	IA3	Low
LN_06	P2006	nLN	ADC	Never	MD	NA	NA	IA3	Low
LN_07	P2007	nLN	ADC	Ex	MD	WT	WT	IA	Low
LN_08	P2008	nLN	ADC	Never	MD	WT	WT	IB	Low
LN_11	P2011	nLN	ADC	Never	MD	L858R	MUT	IB	Low
LN_12	P2012	nLN	ADC	Cur	MD	WT	WT	IA	Low
EFFUSION_06	P1006	PE	ADC	Cur	PD	WT	WT	IV	High
EFFUSION_11	P1011	PE	ADC	Never	NA	L858R	MUT	IV	High
EFFUSION_12	P1012	PE	ADC	Never	NA	WT	WT	IV	High
EFFUSION_13	P1013	PE	ADC	Cur	PD	WT	WT	IV	High
EFFUSION_64	P1064	PE	ADC	Ex	NA	NA	NA	IV	High
NS_02	P3002	mBrain	ADC	Never	NA	WT	WT	IV	High
NS_03	P3003	mBrain	ADC	Never	NA	p.L858R	MUT	IV	High
NS_04	P3004	mBrain	ADC	Ex	NA	WT	WT	IV	High
NS_06	P3006	mBrain	ADC	Ex	PD	WT	WT	IV	High
NS_07	P3007	mBrain	ADC	Never	NA	WT	WT	IV	High
NS_12	P3012	mBrain	ADC	Never	NA	del19_L858R	MUT	IV	High
NS_13	P3013	mBrain	ADC	Ex	NA	G719S_S768I	MUT	IV	High
NS_16	P3016	mBrain	ADC	Cur	PD	WT	WT	IIIA	High
NS_17	P3017	mBrain	ADC	Never	NA	NA	NA	IV	High
NS_19	P3019	mBrain	ADC	Never	NA	NA	NA	IV	High
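
Since the later commands pick metadata values by column position, a malformed row would silently shift values into the wrong field. The following sketch flags rows that do not have exactly 10 tab-separated fields and sample names that occur more than once; check_meta is a hypothetical helper, and the meta.tsv path in the usage lines is an assumption.

```shell
# Report rows with a field count other than 10 and repeated sample names.
# Exits non-zero if any problem is found.
check_meta() {  # $1 = path to the metadata TSV
  awk -F'\t' '
    NF != 10             { printf "line %d: %d fields\n", NR, NF; bad = 1 }
    NR > 1 && seen[$1]++ { printf "line %d: duplicate sample %s\n", NR, $1; bad = 1 }
    END                  { exit bad }
  ' "$1"
}

# Usage, if meta.tsv is in the current directory.
if [ -f meta.tsv ]; then
  check_meta meta.tsv && echo "meta.tsv looks consistent"
fi
```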



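Run QC on each sample. The split text matrices are passed with --count; according to the original note, data in h5 format can also be read in directly: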

cat ~/LUAD/data/meta.tsv | sed '1d' | \
parallel -j 10 --colsep '\t' '
Rscript $scripts/seurat_sc_qc.r \
  --count ~/LUAD/data/split/{1}_UMI.txt.gz \
  -p {1} --project {1} \
  --nUMI.min 100 \
  --nUMI.max 150000 \
  --nGene.min 200 \
  --nGene.max 10000 \
  --mito.gene.pattern "^MT.*-" \
  --percent_mito 20 \
  --log10GenesPerUMI 0.8 \
  --metadata ~/LUAD/data/{1}.txt \
  --metadata.col.name Patient_id Tissue Histology Smoking Pathology EGFR_MUT EGFR_Type Stages Stages_Status \
  --metadata.value {2} {3} {4} {5} {6} {7} {8} {9} {10}
'
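
Two of the thresholds above can be explored on toy values. The internals of seurat_sc_qc.r are not shown in this post, so the sketch assumes the common definition of the complexity score, log10(genes detected) / log10(UMIs), and only illustrates how the regular expression itself behaves. The helper names and gene lists are made up, and whether a symbol such as MTOR-AS1 occurs in this matrix has not been checked.

```shell
# Complexity score of one cell: log10(nGene) / log10(nUMI).
# Assumption: the usual definition; the QC script may compute it differently.
complexity() {  # $1 = nGene, $2 = nUMI
  awk -v g="$1" -v u="$2" 'BEGIN { printf "%.3f\n", log(g) / log(u) }'
}

# Print the gene symbols (one per line on stdin) matched by a regular expression.
match_mito() {  # $1 = extended regular expression
  grep -E "$1" || true   # no match is not an error here
}

complexity 500 1000     # 0.900, above a 0.8 cutoff
complexity 300 10000    # 0.619, below a 0.8 cutoff

# "^MT.*-" matches any symbol that starts with MT and contains a hyphen.
printf 'MT-ND1\nMT-CO1\nMTOR-AS1\nMTOR\nACTB\n' | match_mito '^MT.*-'
# The stricter "^MT-" keeps only symbols with the MT- prefix.
printf 'MT-ND1\nMT-CO1\nMTOR-AS1\nMTOR\nACTB\n' | match_mito '^MT-'
```

On this toy list the first pattern also returns MTOR-AS1, so it may be worth listing which symbols in the real matrix match before relying on the mitochondrial percentage.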
# Cell-cycle scoring and doublet removal
cat ~/LUAD/data/meta.tsv|sed '1d'|while read Sample Patient_id Tissue Histology Smoking Pathology EGFR_MUT EGFR_Type Stages Stages_Status;do

Rscript $scripts/seurat_sc_cluster.r --rds $Sample.CellCycleScoring.qs  \
 --resolution 0.5 -d 30 \
 -p $Sample   -o $Sample --cpu 20


## For SCT normalization, add the parameter: --sct
## To remove doublets, add the parameter: --removeDoubletCells

Rscript $scripts/DoubletFinder.r -i $Sample/$Sample.qs \
    -p  $Sample   --annotations seurat_clusters --removeDoubletCells
done
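
The loop above assumes that every sample listed in meta.tsv already has a QC output named <Sample>.CellCycleScoring.qs in the working directory. A small pre-flight check can list the samples for which that file is absent. This is a sketch only; missing_inputs is a hypothetical helper.

```shell
# Print the samples from a metadata TSV whose expected input file is absent.
missing_inputs() {  # $1 = metadata TSV with a header row, $2 = directory, $3 = file suffix
  tail -n +2 "$1" | while IFS=$(printf '\t') read -r sample _rest; do
    [ -e "$2/$sample$3" ] || echo "$sample"
  done
}

# Usage with the paths from this post, if the metadata file exists.
if [ -f "$HOME/LUAD/data/meta.tsv" ]; then
  missing_inputs "$HOME/LUAD/data/meta.tsv" . .CellCycleScoring.qs
fi
```

An empty result means every sample has its input; any names printed are samples whose QC step should be rerun or inspected first.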

# Merge samples
Rscript $scripts/merge_seurat_obj.r -i .doubletFinder.qs   -p all.sample.merged
# Clustering
Rscript $scripts/seurat_sc_cluster.r --cpu 10 --rds all.sample.merged.qs \
  --integrate.method harmony --batch.id Sample \
  --resolution 0.2 -d 50 \
  -p luad.harmony -o luad.harmony


Results:

[Figure: attachments-2026-01-OsZqEjbG6965e3b02a756.png]





  • Published 2026-01-13 14:05
  • Category: Transcriptomics
  • Author: omicsgene