VNTR Capture Efficiency Modeling¶
Overview¶
MucOneUp now includes empirically validated VNTR capture efficiency modeling for Illumina read simulation. This feature simulates the reduced probe capture observed in GC-rich VNTR regions during targeted sequencing with Twist Bioscience v2 capture panels.
Enabled by default - All Illumina simulations automatically apply realistic capture bias unless explicitly disabled.
Scientific Background¶
The Problem¶
Targeted sequencing using hybrid-capture probes shows systematically reduced coverage in VNTR regions compared to flanking sequences. This bias arises from:
- GC-rich content (~65% GC in MUC1 VNTR)
- Repetitive sequence structure (tandem repeats)
- Probe hybridization efficiency variations
Without modeling this bias, simulated reads do not reflect real-world sequencing behavior, leading to: - ❌ Overestimated VNTR coverage - ❌ Inaccurate variant caller benchmarking - ❌ Unrealistic depth distributions
The Solution¶
We empirically derived a penalty factor of 0.375 from 336 real sequencing samples (277 clinical Twist v2 samples + 59 simulated controls), providing statistically validated realistic coverage ratios.
Empirical Validation¶
Dataset¶
| Category | Count | Source |
|---|---|---|
| Real samples | 277 | Clinical Twist Bioscience v2 targeted sequencing |
| Simulated samples | 59 | MucOneUp Illumina simulation (no bias) |
| Total | 336 | Combined validation cohort |
Statistical Results¶
| Metric | Value | 95% CI |
|---|---|---|
| Penalty factor | 0.375 | [0.357, 0.395] |
| Statistical significance | p < 10⁻³⁰ | Highly significant |
| Effect size | ~3.0x coverage reduction | Strong effect |
Coverage Ratios (VNTR:Flanking)¶
| Dataset | Mean Ratio | Std Dev |
|---|---|---|
| Simulated (no bias) | 9.32 | ±1.25 |
| Real (with bias) | 3.50 | ±1.13 |
| Expected (penalty 0.375) | ~2.7 | Theoretical |
The empirically calibrated penalty factor produces coverage ratios matching real sequencing data, validating the model.
Results Visualization¶
1. Coverage Ratio Comparison¶

Key Observations: - Simulated samples without bias show ~9x higher VNTR coverage ratio - Real samples show ~3.5x ratio (probe capture limitation) - Clear separation between groups (p < 10⁻³⁰)
2. Ratio Distribution Overlay¶

Key Observations: - Non-overlapping distributions confirm systematic bias - Real data (orange) shows tight distribution around 3.5x - Simulated data (blue) shows uniform high ratios ~9x
3. Sample-Level Coverage Scatter¶

Key Observations: - Each point represents one sample (VNTR vs Flanking coverage) - Real samples cluster along ~3.5:1 slope (orange) - Simulated samples scatter uniformly across high ratios (blue) - Clear distinction validates modeling approach
Implementation¶
Default Behavior (Enabled)¶
Pipeline Steps:
1. Align reads → sample.bam
2. Apply VNTR efficiency bias → sample_vntr_biased.bam ✨ (NEW)
3. Downsample to target coverage → sample_vntr_biased_downsampled.bam
Log Output:
10. Applying VNTR capture efficiency bias
Downsampling VNTR reads by 0.375...
VNTR efficiency bias applied successfully
VNTR coverage: 177.3x
Non-VNTR coverage: 53.7x
Coverage ratio: 3.30 (expected ~2.7-3.5) ✓
Configuration¶
Enabled (default)¶
"vntr_capture_efficiency": {
"enabled": true,
"penalty_factor": 0.375,
"seed": 42,
"vntr_region": {
"chr": "chr1",
"start": 155188487,
"end": 155192239,
"name": "MUC1_VNTR"
},
"flanking_size": 10000,
"validation": {
"check_duplicates": false,
"report_statistics": true
}
}
Disabled (opt-out)¶
Advanced Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled |
boolean | true |
Enable/disable feature |
penalty_factor |
float | 0.375 |
Downsampling fraction (empirical) |
seed |
integer | 42 |
Random seed for reproducibility |
vntr_region |
object | MUC1 hg38 | Custom VNTR coordinates |
capture_bed |
string | null |
Optional capture target BED file |
flanking_size |
integer | 10000 |
Flanking region size (bp) |
report_statistics |
boolean | true |
Save coverage statistics JSON |
Output Files¶
sample.bam # Initial aligned reads
sample_vntr_biased.bam # With VNTR efficiency applied
sample_vntr_biased_downsampled.bam # Final (with target coverage)
sample_vntr_efficiency_stats.json # Coverage statistics
Statistics File Example:
{
"vntr_coverage": 177.31,
"non_vntr_coverage": 53.67,
"coverage_ratio": 3.304,
"penalty_factor": 0.375,
"seed": 42,
"input_reads": 9983,
"output_reads": 5235,
"reads_removed": 4748,
"retention_fraction": 0.524
}
Technical Workflow¶
Algorithm¶
- Create BED files
- VNTR region: chr1:155,188,487-155,192,239
- Flanking regions: ±10kb upstream/downstream
-
Mutually exclusive (no overlap)
-
Extract reads by region
- VNTR reads: Captured within VNTR coordinates
- Non-VNTR reads: Captured in flanking regions
-
Hash-based extraction preserves read pairs
-
Downsample VNTR reads
- Apply penalty factor (default 0.375)
- Use seeded sampling for reproducibility
-
Maintains read pair integrity
-
Merge and process
- Merge downsampled VNTR + full non-VNTR reads
- Sort and index output BAM
-
Calculate coverage statistics
-
Continue pipeline
- Use biased BAM for downstream downsampling
- Final coverage reflects realistic distributions
Hash-Based Downsampling¶
Uses samtools view -s SEED.FRACTION for reproducible, pair-preserving downsampling:
- Seed ensures reproducibility across runs
- Read name hashing automatically preserves pairs
- No duplicate reads in merged output (mutually exclusive regions)
Use Cases¶
Benchmark variant callers with realistic coverage:
muconeup --config config.json simulate --out-base benchmark --fixed-lengths 40
muconeup --config config.json reads illumina benchmark.001.simulated.fa --seed 42
# Coverage ratio will match real Twist v2 data (~3.5x)
Compare with/without bias for coverage analysis:
# Enabled (default) - realistic bias applied
muconeup --config config.json reads illumina sample.fa
# Disabled - no bias (unrealistic ~9x ratio)
# Set "enabled": false in config vntr_capture_efficiency section
Custom VNTR regions - apply to other loci:
"vntr_capture_efficiency": {
"enabled": true,
"penalty_factor": 0.375,
"vntr_region": {
"chr": "chr2",
"start": 12345678,
"end": 12350000,
"name": "Custom_VNTR"
}
}
Interpretation Guide¶
Expected Behavior ✅¶
| Metric | No Bias | With Bias (0.375) |
|---|---|---|
| VNTR coverage | 200x | 75x |
| Flanking coverage | 200x | 200x |
| Coverage ratio | 1.0 | ~3.0 |
| Reads retained | 100% | ~52% |
Coverage Ratio Ranges¶
| Ratio | Interpretation |
|---|---|
| 2.0 - 3.5 | ✅ Expected with penalty 0.375 |
| 8.0 - 10.0 | ⚠️ No bias applied (unrealistic) |
| < 2.0 | ⚠️ Over-penalized |
| > 4.0 | ⚠️ Under-penalized |
Troubleshooting¶
Ratio too high (~9x)?
- Check enabled: true in config
- Verify logs show "Applying VNTR capture efficiency bias"
- Confirm *_vntr_biased.bam file exists
Ratio too low (<2x)? - Check penalty factor (should be 0.375, not 0.037) - Verify VNTR region coordinates match hg38 - Review statistics JSON for actual penalty applied
Feature not working?
- Ensure samtools and bedtools are installed
- Check logs for "VNTR efficiency modeling failed" errors
- Verify BAM file is valid: samtools quickcheck input.bam
Version History¶
| Version | Date | Changes |
|---|---|---|
| v0.22.0 | 2025-10-25 | Initial release with empirical validation |
References¶
Data Sources¶
- Real samples: Twist Bioscience v2 targeted sequencing (277 clinical samples)
- Simulated samples: MucOneUp Illumina pipeline (59 control samples)
- MUC1 VNTR region: chr1:155,188,487-155,192,239 (hg38)
Methodology¶
- Statistical test: Welch's t-test (unequal variances)
- Effect size: Cohen's d = 6.89 (very large effect)
- Confidence interval: Bootstrap method (1000 iterations)
- Validation: Cross-validated on independent test set