Cell calling
To identify high-quality nuclei (a term used interchangeably with “barcodes”) from the filtered set of alignments, we implemented heuristic cutoffs on genomic context and sequencing depth. Specifically, we fit a smoothed spline to the log10-transformed unique Tn5 integration sites per nucleus (response) against the ordered log10 barcode rank (barcodes ranked by decreasing per-nucleus unique Tn5 integration site counts) using the smooth.spline function (spar = 0.01) from base R (R Core Team, 2013). We then used the fitted values from the smoothed spline model to estimate the first derivative (slope), taking the local minimum within the first 16,000 barcodes as a potential knee/inflection point (16,000 was selected to match the maximum number of input nuclei). We set the unique Tn5 library depth threshold to the lesser of 1,000 reads and the knee/inflection point, excluding all barcodes below the threshold. Spurious integration patterns throughout the genome can be representative of incomplete Tn5 integration, fragmented/low-quality nuclei, or poor sequence recovery, among other sources of technical noise. In contrast, high-quality nuclei often demonstrate a strong aggregate accessibility signal near TSSs. Therefore, we implemented two approaches for estimating signal-to-noise ratios in our scATAC-seq data. First, on a per-library basis, we removed nuclei whose fraction of reads mapping within 2 kb of TSSs fell more than two standard deviations below the library mean. Then, we estimated TSS enrichment scores by calculating the average per-bp coverage of 2-kb windows surrounding TSSs, scaling by the average per-bp coverage of the first and last 100 bp in the window (background estimate; average of positions 1-100 and 1,901-2,000), and smoothing the scaled signal with rolling means (R package zoo). Per-barcode TSS enrichment scores were taken as the maximum signal within 250 bp of the TSS. Lastly, for each library, we removed any barcode whose proportion of reads mapping to the chloroplast and mitochondrial genomes exceeded the library mean by more than two standard deviations.
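The filtering steps above can be illustrated with a minimal sketch. It is written in standard-library Python for illustration only and is not the analysis code, which used R (smooth.spline and zoo). All function names are hypothetical. Several choices are assumptions not stated in the text: a centered rolling mean stands in for the smoothed spline, the knee is taken as the steepest slope within the first 16,000 ranks, the rolling-mean width is 11 bp, and the sample standard deviation is used.

```python
import math
import statistics


def rolling_mean(values, width):
    """Centered rolling mean; windows are truncated at the edges."""
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def depth_threshold(site_counts, max_rank=16000, cap=1000, smooth_width=5):
    """Depth cutoff: the lesser of `cap` and the depth at the knee."""
    depths = sorted((c for c in site_counts if c > 0), reverse=True)
    if len(depths) < 2:
        return cap  # no curve to inspect; fall back to the fixed cap
    log_rank = [math.log10(r) for r in range(1, len(depths) + 1)]
    # Assumption: a rolling mean replaces the smooth.spline fit (spar = 0.01).
    log_depth = rolling_mean([math.log10(d) for d in depths], smooth_width)
    # First derivative by finite differences between neighboring ranks.
    slopes = [
        (log_depth[i + 1] - log_depth[i]) / (log_rank[i + 1] - log_rank[i])
        for i in range(len(depths) - 1)
    ]
    limit = min(max_rank, len(slopes))
    knee = min(range(limit), key=lambda i: slopes[i])  # steepest drop
    return min(cap, depths[knee])


def tss_enrichment(coverage, flank=100, smooth_width=11, core=250):
    """Enrichment score from per-bp coverage of a TSS-centered window.

    `coverage` is the average per-bp coverage across TSS windows
    (2,000 values for a 2-kb window). Returns 0.0 if the flanks are empty.
    """
    background = (sum(coverage[:flank]) + sum(coverage[-flank:])) / (2 * flank)
    if background == 0:
        return 0.0
    scaled = rolling_mean([c / background for c in coverage], smooth_width)
    center = len(coverage) // 2
    # Maximum smoothed signal within `core` bp of the TSS.
    return max(scaled[center - core: center + core + 1])


def outside_two_sd(values, side):
    """Flag values more than two SDs below ('low') or above ('high') the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample SD
    if side == "low":
        return [v < mean - 2 * sd for v in values]
    return [v > mean + 2 * sd for v in values]
```

In this sketch, barcodes with fewer unique integration sites than depth_threshold(...) would be dropped first. Then, within each library, outside_two_sd(tss_fractions, "low") and outside_two_sd(organelle_fractions, "high") would flag barcodes for removal, where tss_fractions and organelle_fractions are hypothetical per-barcode lists.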
https://www.cell.com/cell/fulltext/S0092-8674(21)00493-1