Config Generator

scripts/generate_config.sh is an interactive helper that generates config.yaml files and samples.tsv templates for the three pipelines in the Scholl Lab processing stack:

Pipeline	Input → output	Repository
sm-alignment	FASTQ → analysis-ready BAM	sm-alignment
sm-vcf-calling	BAM → filtered VCF	sm-vcf-calling
hardnormly	VCF → normalized/hard-filtered VCF	this repo

Quick Start

# Interactive — configure all three pipelines, write to ./configs/
scripts/generate_config.sh -o configs/

# Configure only the hardnormly workflow
scripts/generate_config.sh -p hardnormly -o config/

# Non-interactive (accept all defaults, write templates)
scripts/generate_config.sh -p hardnormly -n -o config/

Options

Flag	Description
`-p, --pipeline PIPELINE`	Pipeline to configure: `alignment`, `calling`, `hardnormly`, `all` (default: `all`). Repeatable.
`-o, --output-dir DIR`	Directory to write generated files (default: current directory)
`-n, --non-interactive`	Accept all defaults without prompting
`-y, --yes`	Skip the write-confirmation prompt
`-h, --help`	Show help and exit

Output Files

File	Target
`sm-alignment-config.yaml`	Copy to `sm-alignment/config/config.yaml`
`sm-alignment-samples.tsv`	Fill in, copy to `sm-alignment/config/samples.tsv`
`sm-calling-config.yaml`	Copy to `sm-vcf-calling/config/config.yaml`
`sm-calling-samples.tsv`	Fill in, copy to `sm-vcf-calling/config/samples.tsv`
`hardnormly-config.yaml`	Copy to `config/config.yaml` in this repo
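
The copy steps in the table can be scripted. The following is a minimal sketch and not part of the repository: the function name `install_configs` and its three arguments are hypothetical, and it assumes you run it from this repo's root and pass the paths of your sm-alignment and sm-vcf-calling checkouts. Files the generator did not write are skipped.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy generated files to the targets listed in the table above.
# Usage: install_configs GENERATED_DIR ALIGNMENT_REPO CALLING_REPO
install_configs() {
    local src=$1 align_repo=$2 call_repo=$3
    local pair name target
    # "generated file:target path" pairs, taken from the Output Files table.
    for pair in \
        "sm-alignment-config.yaml:$align_repo/config/config.yaml" \
        "sm-alignment-samples.tsv:$align_repo/config/samples.tsv" \
        "sm-calling-config.yaml:$call_repo/config/config.yaml" \
        "sm-calling-samples.tsv:$call_repo/config/samples.tsv" \
        "hardnormly-config.yaml:config/config.yaml"
    do
        name=${pair%%:*}      # text before the first colon
        target=${pair#*:}     # text after the first colon
        if [ -f "$src/$name" ]; then
            mkdir -p "$(dirname "$target")"
            cp "$src/$name" "$target"
            echo "installed $name -> $target"
        else
            echo "skipped $name (not generated)" >&2
        fi
    done
}
```

For example, `install_configs configs ../sm-alignment ../sm-vcf-calling` would suit a layout where the three repositories sit side by side. Fill in the samples.tsv files before copying them.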

sm-alignment Settings

Prompted settings for the FASTQ → BAM alignment pipeline:

Reference

Setting	Description	Default
Reference genome FASTA	Uncompressed reference (required)	—
Genome build	`GRCh37` or `GRCh38`	`GRCh38`
Known variant sites	VCFs for BQSR (dbSNP, Mills, 1000G) — one per line	—

Paths

Setting	Description	Default
Samples TSV	Path to sample metadata file	`config/samples.tsv`
FASTQ directory	Directory containing input FASTQ files	—
Output directory	Output directory for BAMs	`results/alignment`
Log subdirectory	Subdirectory name for log files	`logs`

FASTQ / Read Group

Setting	Description	Default
R1 suffix	Filename suffix for read 1	`_R1_001.fastq.gz`
R2 suffix	Filename suffix for read 2	`_R2_001.fastq.gz`
Platform	Sequencing platform for `PL` read group tag	`ILLUMINA`

Trimming

Setting	Description	Default
Enable BBDuk trimming	Adapter/quality trimming with BBDuk	`false`

QC

Setting	Description	Default
QC master switch	Enable/disable all QC rules	`true`
FastQC	Run FastQC on raw FASTQs	`true`
samtools stats	Run `samtools stats` on final BAM	`true`
samtools flagstat	Run `samtools flagstat` on final BAM	`true`
Picard metrics	Run `CollectMultipleMetrics`	`true`
Qualimap bamqc	Deep coverage/quality analysis (high memory)	`false`

sm-alignment samples.tsv

Each row represents one FASTQ pair. Rows sharing the same project_sample value are merged into a single BAM.

Column	Required	Description
`fastq_files_basename`	Yes	Filename prefix before `_R1_001.fastq.gz`
`lane`	Yes	Sequencing lane (e.g., `L001`) — used in read group ID
`project_sample`	Yes	Logical sample name — rows are merged by this value
`mdc_project`	Yes	Project identifier for `PU` read group tag
`subfolder`	No	Subdirectory within FASTQ folder
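
As an illustration of the merge rule, the sketch below writes a small sample sheet and counts how many FASTQ pairs fall under each `project_sample`. The sample, project, and subfolder names are made up, and the header row and column order are assumptions; follow the layout of the generated sm-alignment-samples.tsv template in practice.

```shell
# Hypothetical sample sheet: sample S1 sequenced on two lanes, sample S2 on one.
# Header row and column order are assumed, not taken from the pipeline.
printf '%s\t%s\t%s\t%s\t%s\n' \
    fastq_files_basename lane project_sample mdc_project subfolder \
    S1_S1_L001 L001 S1 PROJ01 run1 \
    S1_S1_L002 L002 S1 PROJ01 run1 \
    S2_S2_L001 L001 S2 PROJ01 run1 \
    > example-alignment-samples.tsv

# Count FASTQ pairs per project_sample (column 3); each group becomes one merged BAM.
pairs_per_sample=$(awk -F'\t' 'NR > 1 { n[$3]++ } END { for (s in n) print s "\t" n[s] }' \
    example-alignment-samples.tsv | sort)
echo "$pairs_per_sample"
```

With the default R1 suffix, the first data row would correspond to a read 1 file named `S1_S1_L001_R1_001.fastq.gz`.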

sm-vcf-calling Settings

Caller

Choice	Description
`mutect2`	GATK Mutect2 (somatic, tumor-normal, tumor-only, germline)
`freebayes`	FreeBayes (germline)
`all`	Run both callers

Reference & GATK Resources

Setting	Description	Default
Reference genome FASTA	Path to reference	—
Genome build	`GRCh37` or `GRCh38`	`GRCh38`
Panel of Normals VCF	Required for Mutect2	—
gnomAD AF VCF	Required for Mutect2	—
gnomAD common biallelic VCF	Required for Mutect2	—

Paths

Setting	Description	Default
Samples TSV	Path to sample metadata file	`config/samples.tsv`
BAM directory	Directory containing input BAMs	—
Output directory	Output directory for VCFs	`results/calling`
BAM extension	BAM suffix from sm-alignment	`.merged.dedup.bqsr.bam`

Scatter Strategy

Mode	Description
`chromosome`	One Mutect2 job per chromosome (fast, recommended)
`interval`	Fixed-size interval scatter (configure `count`)
`none`	No scatter — single job per sample

PureCN (optional)

Setting	Description	Default
Enable PureCN	Copy number analysis using PureCN	`false`
Genome	`hg19` or `hg38`	`hg38`
Capture bait BED	Required when PureCN is enabled	—
normalDB.rds	Pre-built normal database (optional)	—

sm-vcf-calling samples.tsv

Column	Required	Description
`sample`	Yes	Unique identifier used in output filenames
`tumor_bam`	Yes	BAM basename without extension
`normal_bam`	No	Matched normal BAM basename, or `.` if none
`analysis_type`	Yes	`tumor_only`, `tumor_normal`, or `germline`
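
A quick sanity check before launching the pipeline can catch typos in the sheet. The sketch below is a hypothetical validator, not something the pipeline ships: it assumes a header row, the column order shown in the table, and that `tumor_normal` rows must name a matched normal (an inference from the column descriptions, not a documented rule).

```shell
# Hypothetical validator for a sm-vcf-calling sample sheet.
# Assumes a header row and the column order sample, tumor_bam, normal_bam, analysis_type.
check_calling_samples() {
    awk -F'\t' '
        NR == 1 { next }   # skip the header row
        $4 != "tumor_only" && $4 != "tumor_normal" && $4 != "germline" {
            printf "line %d: unknown analysis_type \"%s\"\n", NR, $4; bad = 1
        }
        $4 == "tumor_normal" && ($3 == "" || $3 == ".") {
            printf "line %d: tumor_normal needs a normal_bam\n", NR; bad = 1
        }
        seen[$1]++ { printf "line %d: duplicate sample \"%s\"\n", NR, $1; bad = 1 }
        END { exit bad }   # non-zero exit status if any problem was reported
    ' "$1"
}
```

Calling `check_calling_samples config/samples.tsv` prints one line per problem and returns a non-zero status if any were found, so it can gate a `snakemake` invocation with `&&`.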

hardnormly Workflow Settings

Reference

Setting	Description	Default
Reference FASTA	Path to reference FASTA	—
Genome file	Chromosome sizes for `bedtools slop`	`defaults/hg19.genome`
Genome build	`GRCh37` or `GRCh38`	`GRCh37`

Paths

Setting	Description	Default
VCF list	File with one input VCF path per line	`input/vcfs.txt`
Output directory	Output directory for processed VCFs	`results/hardnormly`
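
The VCF list is a plain text file, so it can be produced with standard tools. The following is a minimal sketch: the function name `make_vcf_list` is hypothetical, and it assumes the inputs are compressed VCFs ending in `.vcf.gz` somewhere below a single directory.

```shell
# Hypothetical helper: write one VCF path per line.
# Usage: make_vcf_list VCF_DIR LIST_FILE
make_vcf_list() {
    local vcf_dir=$1 list=$2
    mkdir -p "$(dirname "$list")"
    # find recurses without relying on bash globstar; sort keeps the order stable.
    find "$vcf_dir" -type f -name '*.vcf.gz' | sort > "$list"
    if [ ! -s "$list" ]; then
        echo "no .vcf.gz files found under $vcf_dir" >&2
        return 1
    fi
}
```

For example, `make_vcf_list ../sm-vcf-calling/results/calling input/vcfs.txt` would collect the output of a sibling sm-vcf-calling checkout that uses the default output directory. Index files such as `.vcf.gz.tbi` are not matched by the pattern.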

Regions

Setting	Description	Default
Include BED files	Target capture / exome regions (one per line)	—
Exclude BED files	Blacklist / low-complexity regions (one per line)	—
Slop	Base pairs to pad include regions	`100`
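
To make the Slop setting concrete: padding extends each include interval by the given number of bases on both sides, clamped to the chromosome bounds listed in the genome file. The awk sketch below imitates that behaviour for illustration only. The workflow itself relies on `bedtools slop`, and the function name `pad_bed` is hypothetical.

```shell
# Illustration only: pad BED intervals by PAD_BP on both sides, clamped to chromosome bounds.
# Usage: pad_bed REGIONS_BED GENOME_FILE PAD_BP
# GENOME_FILE is a two-column, tab-separated file: chromosome name and length.
pad_bed() {
    awk -F'\t' -v OFS='\t' -v pad="$3" '
        NR == FNR { size[$1] = $2; next }   # first file: chromosome sizes
        !($1 in size) {
            print "unknown chromosome: " $1 > "/dev/stderr"
            failed = 1
            exit 1
        }
        {
            start = $2 - pad; if (start < 0) start = 0
            end = $3 + pad;   if (end > size[$1]) end = size[$1]
            $2 = start; $3 = end
            print
        }
        END { exit failed }
    ' "$2" "$1"
}
```

With a 1000 bp chromosome and a pad of 100, the interval 50-200 becomes 0-300 and the interval 900-980 becomes 800-1000.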

Generating exclude BED files

To download and merge public exclusion sources, run the script with the flag that matches your build:

scripts/generate_exclusion_bed.sh -b hg19 -o ref/exclude_files/hg19_exclusion.bed
scripts/generate_exclusion_bed.sh -b hg38 -o ref/exclude_files/hg38_exclusion.bed

Use generate-exclusion-bed only when you already have source BED files to merge.

Filtering

Setting	Description	Default
Filter source	`caller` preset, `file`, or `none`	`caller`
Caller preset	`gatk` or `freebayes` — selects built-in filter set	`gatk`
Custom filters file	Path to a 3-column filter TSV (used when source=`file`)	—
Strip annotations	Comma-separated `INFO/` fields to remove before filtering (e.g. `INFO/CSQ,INFO/ANN`)	—
Only PASS	Remove soft-filtered variants from output	`false`

See Filters for the full filter expression format and built-in presets.

Processing

Setting	Description	Default
Generate stats	Run `bcftools stats` per sample	`true`
Auto-index	Index output VCF with `tabix`	`true`
Plot stats	Generate plots from stats (requires `plot-vcfstats`)	`false`
Plot output directory	Base directory for per-sample plots	`{output_folder}/plots`

Full Pipeline Example

# Assumes every command starts from this repo's root, with sm-alignment/ and
# sm-vcf-calling/ checked out beneath it; adjust the paths for other layouts.

# Step 1: generate all three configs interactively
scripts/generate_config.sh -o project-configs/

# Step 2: edit sample sheets
nano project-configs/sm-alignment-samples.tsv
nano project-configs/sm-calling-samples.tsv

# Step 3: run sm-alignment (the subshell leaves the working directory unchanged)
cp project-configs/sm-alignment-config.yaml sm-alignment/config/config.yaml
cp project-configs/sm-alignment-samples.tsv sm-alignment/config/samples.tsv
(cd sm-alignment && snakemake --snakefile workflow/Snakefile --workflow-profile profiles/default)

# Step 4: run sm-vcf-calling (after alignment finishes)
cp project-configs/sm-calling-config.yaml sm-vcf-calling/config/config.yaml
cp project-configs/sm-calling-samples.tsv sm-vcf-calling/config/samples.tsv
(cd sm-vcf-calling && snakemake --snakefile workflow/Snakefile --workflow-profile profiles/default)

# Step 5: run hardnormly normalization + filtering
mkdir -p input
find sm-vcf-calling/results/calling -name '*.vcf.gz' | sort > input/vcfs.txt
cp project-configs/hardnormly-config.yaml config/config.yaml
snakemake --snakefile workflow/Snakefile --configfile config/config.yaml