Config Generator
scripts/generate_config.sh is an interactive helper that generates ready-to-use config.yaml and samples.tsv files for the three pipelines in the Scholl Lab processing stack:
| Pipeline | Direction | Script |
|---|---|---|
| sm-alignment | FASTQ → analysis-ready BAM | sm-alignment |
| sm-vcf-calling | BAM → filtered VCF | sm-vcf-calling |
| hardnormly | VCF → normalized/hard-filtered VCF | this repo |
Quick Start
# Interactive — configure all three pipelines, write to ./configs/
scripts/generate_config.sh -o configs/
# Configure only the hardnormly workflow
scripts/generate_config.sh -p hardnormly -o config/
# Non-interactive (accept all defaults, write templates)
scripts/generate_config.sh -p hardnormly -n -o config/
Options
| Flag | Description |
|---|---|
| `-p, --pipeline PIPELINE` | Pipeline to configure: `alignment`, `calling`, `hardnormly`, `all` (default: `all`). Repeatable. |
| `-o, --output-dir DIR` | Directory to write generated files (default: current directory) |
| `-n, --non-interactive` | Accept all defaults without prompting |
| `-y, --yes` | Skip the write-confirmation prompt |
| `-h, --help` | Show help and exit |
Output Files
| File | Target |
|---|---|
| `sm-alignment-config.yaml` | Copy to `sm-alignment/config/config.yaml` |
| `sm-alignment-samples.tsv` | Fill in, copy to `sm-alignment/config/samples.tsv` |
| `sm-calling-config.yaml` | Copy to `sm-vcf-calling/config/config.yaml` |
| `sm-calling-samples.tsv` | Fill in, copy to `sm-vcf-calling/config/samples.tsv` |
| `hardnormly-config.yaml` | Copy to `config/config.yaml` in this repo |
sm-alignment Settings
Prompted settings for the FASTQ → BAM alignment pipeline:
Reference
| Setting | Description | Default |
|---|---|---|
| Reference genome FASTA | Uncompressed reference (required) | — |
| Genome build | GRCh37 or GRCh38 | GRCh38 |
| Known variant sites | VCFs for BQSR (dbSNP, Mills, 1000G), one per line | — |
Paths
| Setting | Description | Default |
|---|---|---|
| Samples TSV | Path to sample metadata file | `config/samples.tsv` |
| FASTQ directory | Directory containing input FASTQ files | — |
| Output directory | Output directory for BAMs | `results/alignment` |
| Log subdirectory | Subdirectory name for log files | `logs` |
FASTQ / Read Group
| Setting | Description | Default |
|---|---|---|
| R1 suffix | Filename suffix for read 1 | `_R1_001.fastq.gz` |
| R2 suffix | Filename suffix for read 2 | `_R2_001.fastq.gz` |
| Platform | Sequencing platform for PL read group tag | `ILLUMINA` |
Trimming
| Setting | Description | Default |
|---|---|---|
| Enable BBDuk trimming | Adapter/quality trimming with BBDuk | `false` |
QC
| Setting | Description | Default |
|---|---|---|
| QC master switch | Enable/disable all QC rules | `true` |
| FastQC | Run FastQC on raw FASTQs | `true` |
| samtools stats | Run samtools stats on final BAM | `true` |
| samtools flagstat | Run samtools flagstat on final BAM | `true` |
| Picard metrics | Run CollectMultipleMetrics | `true` |
| Qualimap bamqc | Deep coverage/quality analysis (high memory) | `false` |
sm-alignment samples.tsv
Each row represents one FASTQ pair. Rows sharing the same project_sample value are merged into a single BAM.
| Column | Required | Description |
|---|---|---|
| `fastq_files_basename` | Yes | Filename prefix before `_R1_001.fastq.gz` |
| `lane` | Yes | Sequencing lane (e.g., `L001`), used in read group ID |
| `project_sample` | Yes | Logical sample name; rows are merged by this value |
| `mdc_project` | Yes | Project identifier for PU read group tag |
| `subfolder` | No | Subdirectory within FASTQ folder |
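
To make the merge rule concrete, here is a minimal sketch that writes a two-lane sample sheet and counts how many FASTQ pairs would be merged per `project_sample`. The column order, the sample names, and the `PRJ01`/`run1` values are assumptions made up for illustration; the suffixes are the defaults listed under FASTQ / Read Group.

```shell
# Hypothetical two-lane example; column order and all values are assumptions.
workdir=$(mktemp -d)
samples="$workdir/samples.tsv"

# Header plus two rows that share project_sample "P1_tumor" (one row per lane).
printf 'fastq_files_basename\tlane\tproject_sample\tmdc_project\tsubfolder\n' > "$samples"
printf 'P1_tumor_S1_L001\tL001\tP1_tumor\tPRJ01\trun1\n' >> "$samples"
printf 'P1_tumor_S1_L002\tL002\tP1_tumor\tPRJ01\trun1\n' >> "$samples"

# Expected FASTQ file names: basename followed by the default R1/R2 suffix.
fastq_names=$(awk -F'\t' 'NR > 1 {
    print $1 "_R1_001.fastq.gz"
    print $1 "_R2_001.fastq.gz"
}' "$samples")

# Number of FASTQ pairs (rows) per project_sample, i.e. per merged BAM.
pairs_per_sample=$(awk -F'\t' 'NR > 1 { n[$3]++ }
    END { for (s in n) print s "\t" n[s] }' "$samples" | sort)

printf '%s\n' "$fastq_names"
printf '%s\n' "$pairs_per_sample"
```

Here both rows carry the same `project_sample`, so the sketch reports two pairs for `P1_tumor`, which corresponds to a single merged BAM.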
sm-vcf-calling Settings
Caller
| Choice | Description |
|---|---|
| `mutect2` | GATK Mutect2 (somatic, tumor-normal, tumor-only, germline) |
| `freebayes` | FreeBayes (germline) |
| `all` | Run both callers |
Reference & GATK Resources
| Setting | Description | Default |
|---|---|---|
| Reference genome FASTA | Path to reference | — |
| Genome build | GRCh37 or GRCh38 | GRCh38 |
| Panel of Normals VCF | Required for Mutect2 | — |
| gnomAD AF VCF | Required for Mutect2 | — |
| gnomAD common biallelic VCF | Required for Mutect2 | — |
Paths
| Setting | Description | Default |
|---|---|---|
| Samples TSV | Path to sample metadata file | `config/samples.tsv` |
| BAM directory | Directory containing input BAMs | — |
| Output directory | Output directory for VCFs | `results/calling` |
| BAM extension | BAM suffix from sm-alignment | `.merged.dedup.bqsr.bam` |
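
The BAM directory, the BAM basename from the sample sheet, and the BAM extension presumably combine into one input path. The sketch below shows that composition under the assumption that the pipeline simply concatenates them; `bam_path` is a hypothetical helper, and `results/alignment` is used as the BAM directory only because it is the sm-alignment output default.

```shell
# Assumed layout: <BAM directory>/<basename><BAM extension>.
bam_dir="results/alignment"
bam_ext=".merged.dedup.bqsr.bam"

# Hypothetical helper: compose the path for one BAM basename.
bam_path() {
    printf '%s/%s%s\n' "$bam_dir" "$1" "$bam_ext"
}

bam_path "P1_tumor"

# Going the other way: strip directory and extension from an existing file
# name to get the basename expected in the sample sheet.
basename "results/alignment/P1_tumor.merged.dedup.bqsr.bam" "$bam_ext"
```

The second command is a convenient way to derive sample sheet values from a directory listing of sm-alignment output.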
Scatter Strategy
| Mode | Description |
|---|---|
| `chromosome` | One Mutect2 job per chromosome (fast, recommended) |
| `interval` | Fixed-size interval scatter (configure count) |
| `none` | No scatter; single job per sample |
PureCN (optional)
| Setting | Description | Default |
|---|---|---|
| Enable PureCN | Copy number analysis using PureCN | `false` |
| Genome | hg19 or hg38 | hg38 |
| Capture bait BED | Required when PureCN is enabled | — |
| normalDB.rds | Pre-built normal database (optional) | — |
sm-vcf-calling samples.tsv
| Column | Required | Description |
|---|---|---|
| `sample` | Yes | Unique identifier used in output filenames |
| `tumor_bam` | Yes | BAM basename without extension |
| `normal_bam` | No | Matched normal BAM basename, or `.` if none |
| `analysis_type` | Yes | `tumor_only`, `tumor_normal`, or `germline` |
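
A quick sanity check on the sample sheet before launching the pipeline can catch typos early. The following is a sketch, not part of the pipeline: `check_calling_samples` is a hypothetical helper, the column order is assumed to match the table above, and the rule that `tumor_normal` rows need a real `normal_bam` is an assumption about how the pipeline behaves.

```shell
workdir=$(mktemp -d)
samples="$workdir/samples.tsv"

# Hypothetical rows; the last two are deliberately wrong.
printf 'sample\ttumor_bam\tnormal_bam\tanalysis_type\n' > "$samples"
printf 'P1\tP1_tumor\tP1_normal\ttumor_normal\n' >> "$samples"
printf 'P2\tP2_tumor\t.\ttumor_only\n' >> "$samples"
printf 'P3\tP3_tumor\t.\ttumor_normal\n' >> "$samples"
printf 'P4\tP4_blood\t.\tsomatic\n' >> "$samples"

# Print one line per problem row; exit status is 1 if any problem was found.
check_calling_samples() {
    awk -F'\t' '
        NR == 1 { next }    # skip the header
        $4 != "tumor_only" && $4 != "tumor_normal" && $4 != "germline" {
            print $1 ": unknown analysis_type " $4
            bad = 1
            next
        }
        $4 == "tumor_normal" && ($3 == "." || $3 == "") {
            print $1 ": tumor_normal needs a normal_bam"
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

status=0
report=$(check_calling_samples "$samples") || status=$?
printf '%s\n' "$report"
```

With the rows above, the check flags `P3` (no matched normal) and `P4` (unsupported type) and leaves `status` at 1.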
hardnormly Workflow Settings
Reference
| Setting | Description | Default |
|---|---|---|
| Reference FASTA | Path to reference FASTA | — |
| Genome file | Chromosome sizes for bedtools slop | `defaults/hg19.genome` |
| Genome build | GRCh37 or GRCh38 | GRCh37 |
Paths
| Setting | Description | Default |
|---|---|---|
| VCF list | File with one input VCF path per line | `input/vcfs.txt` |
| Output directory | Output directory for processed VCFs | `results/hardnormly` |
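
One way to build the VCF list and confirm that every entry exists is sketched below. It runs against placeholder files in a temporary directory so it is safe to try; the directory layout and sample names are made up for illustration, and the stale `P2` entry is added on purpose to show what the check reports.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/results/calling/P1" "$workdir/input"

# Hypothetical placeholder files standing in for real caller output.
: > "$workdir/results/calling/P1/P1.vcf.gz"
: > "$workdir/results/calling/P1/P1.vcf.gz.tbi"

vcf_list="$workdir/input/vcfs.txt"

# Collect VCFs recursively, one path per line, in a stable order.
# The pattern matches *.vcf.gz only, so index files are left out.
find "$workdir/results/calling" -type f -name '*.vcf.gz' | sort > "$vcf_list"

# Append an entry that does not exist to demonstrate the check.
printf '%s\n' "$workdir/results/calling/P2/P2.vcf.gz" >> "$vcf_list"

# Report every listed path that is not a regular file.
missing=0
while IFS= read -r vcf; do
    if [ ! -f "$vcf" ]; then
        printf 'missing: %s\n' "$vcf"
        missing=$((missing + 1))
    fi
done < "$vcf_list"

printf '%s path(s) missing\n' "$missing"
```

Using `find` rather than a `**` glob avoids depending on the shell's globstar setting.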
Regions
| Setting | Description | Default |
|---|---|---|
| Include BED files | Target capture / exome regions (one per line) | — |
| Exclude BED files | Blacklist / low-complexity regions (one per line) | — |
| Slop | Base pairs to pad include regions | 100 |
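
The effect of the slop setting can be illustrated without bedtools. The sketch below pads each include interval by 100 bp on both sides and clamps the result to the chromosome bounds taken from a genome file, which mirrors what `bedtools slop -b` does; the toy chromosome size and intervals are made up, and the workflow's actual bedtools invocation may differ.

```shell
workdir=$(mktemp -d)

# Hypothetical genome file (name, length) and include regions (BED, 0-based start).
printf 'chr1\t1000\n' > "$workdir/toy.genome"
printf 'chr1\t50\t200\nchr1\t900\t980\n' > "$workdir/include.bed"

slop=100

# Pad each interval by $slop on both sides, clamped to [0, chromosome length].
padded=$(awk -F'\t' -v OFS='\t' -v slop="$slop" '
    NR == FNR { size[$1] = $2; next }    # first file: chromosome sizes
    {
        start = $2 - slop
        if (start < 0) start = 0
        end = $3 + slop
        if (end > size[$1]) end = size[$1]
        print $1, start, end
    }
' "$workdir/toy.genome" "$workdir/include.bed")

printf '%s\n' "$padded"
# chr1    0    300
# chr1    800  1000
```

The first interval is clamped at 0 and the second at the chromosome end, which is why the genome file must match the build of the include BED files.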
Generating exclude BED files
To download and merge the public exclusion sources for your build, run `scripts/generate_exclusion_bed.sh -b hg19 -o ref/exclude_files/hg19_exclusion.bed` (GRCh37) or `scripts/generate_exclusion_bed.sh -b hg38 -o ref/exclude_files/hg38_exclusion.bed` (GRCh38). Use `generate-exclusion-bed` only when you already have source BED files to merge.
Filtering
| Setting | Description | Default |
|---|---|---|
| Filter source | `caller` preset, `file`, or `none` | `caller` |
| Caller preset | `gatk` or `freebayes`; selects built-in filter set | `gatk` |
| Custom filters file | Path to a 3-column filter TSV (used when source=file) | — |
| Strip annotations | Comma-separated `INFO/` fields to remove before filtering (e.g. `INFO/CSQ,INFO/ANN`) | — |
| Only PASS | Remove soft-filtered variants from output | `false` |
See Filters for the full filter expression format and built-in presets.
Processing
| Setting | Description | Default |
|---|---|---|
| Generate stats | Run bcftools stats per sample | `true` |
| Auto-index | Index output VCF with tabix | `true` |
| Plot stats | Generate plots from stats (requires plot-vcfstats) | `false` |
| Plot output directory | Base directory for per-sample plots | `{output_folder}/plots` |
Full Pipeline Example
# Run from the hardnormly repo root. Assumes sm-alignment/ and sm-vcf-calling/
# are checked out next to it; adjust the ../ paths if your layout differs.
# Step 1 — generate all three configs interactively
scripts/generate_config.sh -o project-configs/
# Step 2 — edit sample sheets
nano project-configs/sm-alignment-samples.tsv
nano project-configs/sm-calling-samples.tsv
# Step 3 — run sm-alignment (subshell keeps the working directory unchanged)
cp project-configs/sm-alignment-config.yaml ../sm-alignment/config/config.yaml
cp project-configs/sm-alignment-samples.tsv ../sm-alignment/config/samples.tsv
(cd ../sm-alignment && snakemake --snakefile workflow/Snakefile --workflow-profile profiles/default)
# Step 4 — run sm-vcf-calling (after alignment finishes)
cp project-configs/sm-calling-config.yaml ../sm-vcf-calling/config/config.yaml
cp project-configs/sm-calling-samples.tsv ../sm-vcf-calling/config/samples.tsv
(cd ../sm-vcf-calling && snakemake --snakefile workflow/Snakefile --workflow-profile profiles/default)
# Step 5 — run hardnormly normalization + filtering
mkdir -p input
find ../sm-vcf-calling/results/calling -type f -name '*.vcf.gz' | sort > input/vcfs.txt
cp project-configs/hardnormly-config.yaml config/config.yaml
snakemake --snakefile workflow/Snakefile --configfile config/config.yaml