Usage¶
Pycashier has 4 subcommands to facilitate barcode extraction from illumina sequencing:
extract: extract sequences from standard illumina reads
merge: merge overlapping sequences PE sequencings prior to extracting
receipt: merge and summarize individual sample outputs of extract
scrna: extract UMI/Cell labeled expressed barcode reads from 10X unmapped sam
All of the above commands can be configured using the appropriate flags or additionally a config file.
Extract¶
The primary use case of pycashier is extracting 20bp sequences from illumina generated fastq files.
This can be accomplished with the below command where ./fastqs
is a directory containing all of your fastq files.
pycashier extract -i ./fastqs
Pycashier
will attempt to extract file names from your .fastq
files using the first string delimited by a period.
For example:
sample1.fastq
: sample1sample2.metadata_pycashier.will.ignore.fastq
: sample2
As pycashier extract
runs, two directories will be generated ./pipeline
and ./outs
, configurable with -p/--pipeline
and -o/--output
respectively.
Your pipeline
directory will contain all files and data generated while performing barcode extraction and clustering.
While outs
will contain a single .tsv
for each sample with the final barcode counts.
Expected output of pycashier extract
:
fastqs
└── sample.fastq
pipeline
├── pycashier.log
├── qc
│ ├── sample.html
│ └── sample.json
├── sample.q30.barcode.fastq
├── sample.q30.barcodes.r3d1.tsv
├── sample.q30.barcodes.tsv
└── sample.q30.fastq
outs
└── sample.q30.barcodes.r3d1.min12_off1.tsv
Note
If you wish to provide pycashier
with fastq files containing only your barcode you can supply the --skip-trimming
flag.
Receipt¶
Following a successful run of pycashier extract
, you can feed the outputs into pycashier receipt
to combine the data into one tsv
while
calculating the percent of total of each lineage within a sample.
By default pycashier
will also determine lineage overlap across samples.
Merge¶
In some cases your data may be from paired-end sequencing. If you have two fastq files per sample
that overlap on the barcode region they can be combined with pycashier merge
.
pycashier merge -i ./fastqgz
By default your output will be in mergedfastqs
. Which you can then pass back to pycashier
with pycashier extract -i mergedfastqs
.
For single read inputs, files are <sample>.fastq
now they should both contain R1 and R2.
For example:
sample.raw.R1.fastq.gz
,sample.raw.R2.fastq.gz
: samplesample.R1.fastq
,sample.R2.fastq
: samplesample.fastq
: fail, not R1 and R2
Scrna¶
If your DNA barcodes are expressed and detectable in 10X 3’-based transcriptomic sequencing,
then you can extract these tags with pycashier
and their associated umi/cell barcodes from the cellranger
output.
For pycashier scrna
we extract our reads from sam files.
This file can be generated using the output of cellranger count
.
For each sample you would run:
samtools view -f 4 $CELLRANGER_COUNT_OUTPUT/sample1/outs/possorted_genome_bam.bam > sams/sample1.unmapped.sam
This will generate a sam file containing only the unmapped reads.
Then similar to normal barcode extraction you can pass a directory of these unmapped sam files to pycashier and extract barcodes. You can also still specify extraction parameters that will be passed to cutadapt as usual.
Note
The default parameters passed to cutadapt are unlinked adapters and minimum barcode length of 10 bp.
pycashier scrna -i sams
When finished the outs
directory will have a .tsv
containing the following columns: Illumina Read Info, UMI Barcode, Cell Barcode, gRNA Barcode.
Note
This data can be noisy an it will be necessary to apply domain-specific ad-hoc filtering in order to confidently assign barcodes to cells. Typically, this can be a achieved with a combination of UMI and cell doublet filtering.
Configuration¶
Config File¶
You may generate and supply pycashier
with a toml config file using -c/--config
.
The expected structure is each command followed by key value pairs of flags with hypens replaced by underscores:
[global]
threads = 10
samples = "sample1,sample2,sample5"
[extract]
input = "fastqs"
unqualified_percent = 20
[merge]
input = "rawfastqgzs"
output = "mergedfastqs"
fastp_args = "-t 1"
You can also specify a global table as a fallback value for common flags such as threads
.
Tip
The order of precedence for arguments is command line > config file (command table) > config file (global table) > defaults.
For example if you were to use the above pycashier.toml
with pycashier extract -c pycashier.toml -t 15
.
The value used for threads would be 15.
All non-default values will be printed prior to execution. You can view all parameters with -v/--verbose
.
For convenience, you can update/create your config file with pycasher COMMAND --save-config [explicit|full]
.
“Explicit” will only save parameters already included in the config file or specified at runtime.
“Full” will include all parameters, again, maintaining preset values in config or specified at runtime.
See the cli reference for all options for each command.
Executables¶
Pycashier
depends on three executables (cutadapt
, starcode
, fastp
) existing on your $PATH
, you can force the use of a specific executable using environment variables of the form PYCASHIER_<NAME>
.
For example to override the cutadapt
used you could use something like the below command:
PYCASHIER_CUTADAPT="$HOME/important-software/cutadapt-v4" pycashier extract
Caveats¶
Pycashier will NOT overwrite intermediary files. If there is an issue in the process, please delete either the pipeline directory or the requisite intermediary files for the sample you wish to reprocess.
If there are reads from multiple lanes they should first be concatenated with
cat sample*R1*.fastq.gz > sample.R1.fastq.gz
Naming conventions:
Sample names are extracted from files using the first string delimited with a period. Please take this into account when naming sam or fastq files.
Each processing step will append information to the input file name to indicate changes, again delimited with periods.