Pipeline¶
hardnormly processes a VCF through seven steps. Several steps run only when you pass the corresponding flags; the only required inputs are a VCF file (-v) and a reference FASTA (-f).
run-pipeline¶
flowchart TD
A[Input VCF] --> C[1. Genome File<br><code>--genome</code> or UCSC fetch]
C --> B{BED files<br>provided?}
B -- Yes --> D[2. BED Processing<br>normalize → merge → pad → index]
D --> E[3. Region Annotation<br>INCLUDE_REGION / EXCLUDE_REGION]
B -- No --> F
E --> F{<code>--strip-annotations</code>?}
F -- Yes --> G[4. Strip Annotations<br><code>bcftools annotate -x</code>]
F -- No --> H
G --> H[5. Normalize VCF<br><code>bcftools norm -m-any -a</code><br>split, atomize, left-align]
H --> I[6. Hard Filter<br>fill-tags → region filters →<br>inline filters → file filters]
I --> J{<code>--only-pass</code>?}
J -- Yes --> K[Remove filtered variants]
J -- No --> L[Keep all with FILTER tags]
K --> O[Output VCF]
L --> O
O --> M{<code>--generate-stats</code>?}
M -- Yes --> N[7. Stats & Plots<br><code>bcftools stats</code> + <code>plot-vcfstats</code>]
M -- No --> P[Done]
N --> P
generate-inclusion-bed¶
flowchart LR
A["BED files<br>(-b, repeatable)"] --> B[Normalize<br>4-column BED]
B --> C[Intersect<br>regions in ALL files]
C --> D["Slop padding<br>(--slop bp)"]
D --> E[Output BED]
F["Genome file<br>(-g)"] --> D
generate-exclusion-bed¶
flowchart LR
A["BED files<br>(-e, repeatable)"] --> B[Normalize<br>4-column BED]
B --> C[Union<br>multiinter + sort]
C --> D[Output BED]
Step-by-Step Details¶
Step 1: Genome File¶
The genome file lists chromosome names and sizes, needed by bedtools slop for region padding.
- With --genome: Uses the file you provide directly.
- Without --genome: Queries the UCSC MySQL database for the genome build (default: hg19). Override the build with --genome-build hg38.
The auto-fetch retries up to 3 times if the UCSC connection fails.
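The retry behavior can be pictured with a small sketch. This is not hardnormly's actual code: the function name, the `delay_s` parameter, and the reading of "up to 3 times" as three attempts in total are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retries(fetch: Callable[[], T], attempts: int = 3, delay_s: float = 1.0) -> T:
    """Call `fetch` until it succeeds or `attempts` tries have been used.

    Hypothetical helper: assumes connection problems surface as ConnectionError.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except ConnectionError as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay_s)  # brief pause before the next try
    raise RuntimeError(f"genome fetch failed after {attempts} attempts") from last_error
```

If the UCSC query keeps failing, passing a local file with --genome avoids the network dependency entirely.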
Step 2: BED Region Processing¶
If you provide BED files via --include-bed or --exclude-bed, each goes through:
- Normalize: Standardize to 4-column BED with an annotation label
- Merge: Combine overlapping regions across multiple files
- Pad: Apply --slop padding in bp (default: 20) using the genome file (inclusion only)
- Compress and index: bgzip + tabix for fast region lookups
Inclusion BED files are intersected (only regions present in all files are kept), then padded. Exclusion BED files are unioned (any region in any file is excluded).
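The difference between the two merge strategies is easiest to see on toy intervals. The sketch below is a plain-Python illustration of the set logic only; the pipeline itself relies on bedtools, and every function name here is hypothetical. Intervals are (chrom, start, end) tuples in 0-based half-open BED coordinates.

```python
from functools import reduce
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open as in BED


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and fuse any that overlap or touch on the same chromosome."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Keep only the bases covered by both interval sets."""
    out: List[Interval] = []
    for chrom_a, start_a, end_a in merge(a):
        for chrom_b, start_b, end_b in merge(b):
            start, end = max(start_a, start_b), min(end_a, end_b)
            if chrom_a == chrom_b and start < end:
                out.append((chrom_a, start, end))
    return merge(out)


def pad(intervals: List[Interval], slop: int, chrom_sizes: Dict[str, int]) -> List[Interval]:
    """Extend each interval by `slop` bp on both sides, clamped to the chromosome."""
    return [(c, max(0, s - slop), min(chrom_sizes[c], e + slop)) for c, s, e in intervals]


def inclusion_regions(files: List[List[Interval]], slop: int, chrom_sizes: Dict[str, int]) -> List[Interval]:
    """Intersect all inclusion files, then pad the result."""
    return pad(reduce(intersect, files), slop, chrom_sizes)


def exclusion_regions(files: List[List[Interval]]) -> List[Interval]:
    """Union all exclusion files; no padding is applied."""
    return merge([interval for f in files for interval in f])


file_a = [("chr1", 100, 200), ("chr1", 300, 400)]
file_b = [("chr1", 150, 350)]
print(inclusion_regions([file_a, file_b], 20, {"chr1": 360}))
# [('chr1', 130, 220), ('chr1', 280, 360)]
print(exclusion_regions([file_a, file_b]))
# [('chr1', 100, 400)]
```

Note how the second padded inclusion interval stops at 360, the assumed chromosome length. That clamping is why padding needs the genome file.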
Step 3: Region Annotation¶
Annotates the VCF with INFO fields based on BED overlap:
| BED Type | INFO Field | Value |
|---|---|---|
| Inclusion | INCLUDE_REGION | 1 if variant falls within an inclusion region |
| Exclusion | EXCLUDE_REGION | 1 if variant falls within an exclusion region |
These fields are used downstream in Step 6 to apply region-based filters.
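As a rough illustration of the overlap test, the sketch below flags a variant by its POS alone. It assumes a variant counts as inside a region when its 1-based VCF POS falls in a 0-based half-open BED interval; how hardnormly treats variants that span a region boundary is not covered here, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open as in BED


def region_flags(chrom: str, pos: int, include: List[Interval], exclude: List[Interval]) -> Dict[str, int]:
    """Return the INFO flags a variant at 1-based `pos` would receive."""

    def hit(regions: List[Interval]) -> bool:
        # 1-based POS p corresponds to 0-based p - 1, so start <= p - 1 < end
        return any(c == chrom and start < pos <= end for c, start, end in regions)

    info: Dict[str, int] = {}
    if hit(include):
        info["INCLUDE_REGION"] = 1
    if hit(exclude):
        info["EXCLUDE_REGION"] = 1
    return info


targets = [("chr1", 100, 200)]
print(region_flags("chr1", 101, targets, []))  # {'INCLUDE_REGION': 1}
print(region_flags("chr1", 100, targets, []))  # {}
```

The second call returns no flag because POS 100 is base 99 in BED coordinates, one base before the interval starts.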
Step 4: Strip Annotations (optional)¶
When --strip-annotations is provided (e.g., --strip-annotations INFO/CSQ,INFO/ANN), the specified INFO fields are removed before normalization. This is useful for stripping VEP or SnpEff annotations that bloat the VCF and interfere with downstream processing.
Uses bcftools annotate -x under the hood.
Step 5: VCF Normalization¶
Runs bcftools norm to:
- Split multiallelic sites into biallelic records (-m-any)
- Atomize complex variants into primitive SNPs and indels (-a --atom-overlaps .)
- Left-align indels against the reference FASTA
This ensures each variant is represented in a canonical, atomic form before filtering.
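For reference, the flags above combine into a single bcftools norm call. The helper below only assembles the argument list; it is a sketch, not hardnormly's internal code, and the function name, the file paths, and the choice of compressed output (-Oz) are assumptions.

```python
from typing import List


def norm_command(vcf_in: str, reference_fasta: str, vcf_out: str) -> List[str]:
    """Build a bcftools norm invocation matching the three operations above."""
    return [
        "bcftools", "norm",
        "-m-any",                      # split multiallelic sites into biallelic records
        "-a", "--atom-overlaps", ".",  # atomize; overlapping alleles become missing (.)
        "-f", reference_fasta,         # reference used for left-alignment
        "-Oz", "-o", vcf_out,          # assumed: bgzip-compressed output
        vcf_in,
    ]


print(" ".join(norm_command("input.vcf.gz", "ref.fa", "normalized.vcf.gz")))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where bcftools is installed.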
Step 6: Hard Filtering¶
This is the core step. It applies filters sequentially, tagging variants in the FILTER column.
fill-tags¶
Before filtering, bcftools +fill-tags computes derived fields:
- FORMAT/VAF: Variant allele frequency
- TYPE: Variant type (SNP, INDEL, etc.)
- Other standard tags
These tags are available in filter expressions even if the original VCF doesn't include them.
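To make the VAF tag concrete, the sketch below computes it from a sample's FORMAT/AD values. It assumes VAF is each alternate allele's read depth divided by the total depth across all alleles; the function name is hypothetical, and the handling of zero depth is a choice made for this example.

```python
from typing import List, Optional


def vaf_from_ad(ad: List[int]) -> List[Optional[float]]:
    """Compute one VAF per alternate allele from AD = [ref_depth, alt1_depth, ...]."""
    total = sum(ad)
    if total == 0:
        # No reads at this site: report missing values rather than dividing by zero
        return [None] * (len(ad) - 1)
    return [alt_depth / total for alt_depth in ad[1:]]


print(vaf_from_ad([30, 10]))  # [0.25]
```

After Step 5 every record is biallelic, so each sample carries a single VAF value that filter expressions can compare against a threshold.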
Filter Application Order¶
- Region filters: Auto-generated from BED files:
  - NOT_IN_INCLUDE_REGION: Tags variants outside inclusion regions
  - IN_EXCLUDE_REGION: Tags variants inside exclusion regions
- Inline filters: From --filters flags
- File-based filters: From --filters-file or --caller
Each filter either excludes (tags non-matching variants) or includes (keeps only matching variants). See Filters for details.
Output¶
By default, all variants are kept with filter tags in the FILTER column. Add --only-pass to remove tagged variants, keeping only those marked PASS.
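The tagging order and the effect of --only-pass can be sketched on plain dictionaries. This is a simplified model rather than the real implementation: the `HardFilter` type, the `fails` predicate, and the LowVAF and LowDP filters are hypothetical, and real filters are bcftools expressions rather than Python functions.

```python
from typing import Callable, Dict, List, NamedTuple


class HardFilter(NamedTuple):
    name: str                      # tag written to the FILTER column
    fails: Callable[[Dict], bool]  # True when the variant should be tagged


def apply_filters(variant: Dict, region: List[HardFilter],
                  inline: List[HardFilter], from_file: List[HardFilter]) -> Dict:
    """Tag a variant with every failing filter, in region -> inline -> file order."""
    tags = [f.name for group in (region, inline, from_file) for f in group if f.fails(variant)]
    return {**variant, "FILTER": ";".join(tags) if tags else "PASS"}


def only_pass(variants: List[Dict]) -> List[Dict]:
    """Mimic --only-pass: drop every variant whose FILTER is not PASS."""
    return [v for v in variants if v["FILTER"] == "PASS"]


region = [
    HardFilter("NOT_IN_INCLUDE_REGION", lambda v: v.get("INCLUDE_REGION") != 1),
    HardFilter("IN_EXCLUDE_REGION", lambda v: v.get("EXCLUDE_REGION") == 1),
]
inline = [HardFilter("LowVAF", lambda v: v["VAF"] < 0.05)]  # hypothetical inline filter
from_file = [HardFilter("LowDP", lambda v: v["DP"] < 10)]   # hypothetical file filter

good = apply_filters({"INCLUDE_REGION": 1, "VAF": 0.4, "DP": 50}, region, inline, from_file)
bad = apply_filters({"VAF": 0.01, "DP": 5}, region, inline, from_file)
print(good["FILTER"])  # PASS
print(bad["FILTER"])   # NOT_IN_INCLUDE_REGION;LowVAF;LowDP
print(len(only_pass([good, bad])))  # 1
```

Keeping tagged variants by default means you can still inspect why a call was rejected, while --only-pass gives a clean file for downstream tools.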
Step 7: Stats and Plots (optional)¶
When --generate-stats is set:
- Runs bcftools stats on the output VCF
- Saves the stats file as <output>.stats.txt
When --plot-stats is also set (requires --plot-output-dir):
- Runs plot-vcfstats to generate visual summaries
- Plot failures are non-fatal: the pipeline continues even if plotting fails (e.g., missing LaTeX dependencies)
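The non-fatal plotting behavior amounts to treating the two commands differently: a failure in bcftools stats propagates, while a failure in plot-vcfstats is caught and reported. The sketch below illustrates that pattern under assumptions; the function name, the injectable `run` parameter, and the warning text are hypothetical.

```python
import subprocess
from typing import Callable, Optional


def run_stats(vcf: str, stats_path: str, plot_dir: Optional[str] = None,
              run: Callable = subprocess.run) -> bool:
    """Write bcftools stats output, then optionally plot. Returns True if plots were made."""
    with open(stats_path, "w") as out:
        # A failure here is fatal: check=True raises CalledProcessError
        run(["bcftools", "stats", vcf], stdout=out, check=True)

    if plot_dir is None:
        return False
    try:
        run(["plot-vcfstats", "-p", plot_dir, stats_path], check=True)
    except (subprocess.CalledProcessError, FileNotFoundError) as err:
        # Plotting is best-effort: warn and carry on
        print(f"warning: plotting failed, continuing without plots ({err})")
        return False
    return True
```

The stats text file is still written when plotting fails, so the numeric summary remains available.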