VNTR Capture Efficiency Modeling¶

Overview¶

MucOneUp now includes empirically validated VNTR capture efficiency modeling for Illumina read simulation. This feature simulates the reduced probe capture observed in GC-rich VNTR regions during targeted sequencing with Twist Bioscience v2 capture panels.

Enabled by default - All Illumina simulations automatically apply realistic capture bias unless explicitly disabled.

Scientific Background¶

The Problem¶

Targeted sequencing using hybrid-capture probes shows systematically reduced coverage in VNTR regions compared to flanking sequences. This bias arises from:

GC-rich content (~65% GC in MUC1 VNTR)
Repetitive sequence structure (tandem repeats)
Probe hybridization efficiency variations

Without modeling this bias, simulated reads do not reflect real-world sequencing behavior, leading to: - ❌ Overestimated VNTR coverage - ❌ Inaccurate variant caller benchmarking - ❌ Unrealistic depth distributions

The Solution¶

We empirically derived a penalty factor of 0.375 from 336 real sequencing samples (277 clinical Twist v2 samples + 59 simulated controls), providing statistically validated realistic coverage ratios.

Empirical Validation¶

Dataset¶

Category	Count	Source
Real samples	277	Clinical Twist Bioscience v2 targeted sequencing
Simulated samples	59	MucOneUp Illumina simulation (no bias)
Total	336	Combined validation cohort

Statistical Results¶

Metric	Value	95% CI
Penalty factor	0.375	[0.357, 0.395]
Statistical significance	p < 10⁻³⁰	Highly significant
Effect size	~3.0x coverage reduction	Strong effect

Coverage Ratios (VNTR:Flanking)¶

Dataset	Mean Ratio	Std Dev
Simulated (no bias)	9.32	±1.25
Real (with bias)	3.50	±1.13
Expected (penalty 0.375)	~2.7	Theoretical

The empirically calibrated penalty factor produces coverage ratios matching real sequencing data, validating the model.

Results Visualization¶

1. Coverage Ratio Comparison¶

Coverage Ratio Boxplot

Key Observations: - Simulated samples without bias show ~9x higher VNTR coverage ratio - Real samples show ~3.5x ratio (probe capture limitation) - Clear separation between groups (p < 10⁻³⁰)

2. Ratio Distribution Overlay¶

Ratio Distribution Histogram

Key Observations: - Non-overlapping distributions confirm systematic bias - Real data (orange) shows tight distribution around 3.5x - Simulated data (blue) shows uniform high ratios ~9x

3. Sample-Level Coverage Scatter¶

Coverage Scatter Plot

Key Observations: - Each point represents one sample (VNTR vs Flanking coverage) - Real samples cluster along ~3.5:1 slope (orange) - Simulated samples scatter uniformly across high ratios (blue) - Clear distinction validates modeling approach

Implementation¶

Default Behavior (Enabled)¶

# VNTR efficiency is ON by default
muconeup --config config.json reads illumina sample.fa --seed 42

Pipeline Steps: 1. Align reads → sample.bam 2. Apply VNTR efficiency bias → sample_vntr_biased.bam ✨ (NEW) 3. Downsample to target coverage → sample_vntr_biased_downsampled.bam

Log Output:

10. Applying VNTR capture efficiency bias
  Downsampling VNTR reads by 0.375...
  VNTR efficiency bias applied successfully
  VNTR coverage: 177.3x
  Non-VNTR coverage: 53.7x
  Coverage ratio: 3.30 (expected ~2.7-3.5) ✓

Configuration¶

Enabled (default)¶

"vntr_capture_efficiency": {
  "enabled": true,
  "penalty_factor": 0.375,
  "seed": 42,
  "vntr_region": {
    "chr": "chr1",
    "start": 155188487,
    "end": 155192239,
    "name": "MUC1_VNTR"
  },
  "flanking_size": 10000,
  "validation": {
    "check_duplicates": false,
    "report_statistics": true
  }
}

Disabled (opt-out)¶

"vntr_capture_efficiency": {
  "enabled": false
}

Advanced Parameters¶

Parameter	Type	Default	Description
`enabled`	boolean	`true`	Enable/disable feature
`penalty_factor`	float	`0.375`	Downsampling fraction (empirical)
`seed`	integer	`42`	Random seed for reproducibility
`vntr_region`	object	MUC1 hg38	Custom VNTR coordinates
`capture_bed`	string	`null`	Optional capture target BED file
`flanking_size`	integer	`10000`	Flanking region size (bp)
`report_statistics`	boolean	`true`	Save coverage statistics JSON

Output Files¶

sample.bam                              # Initial aligned reads
sample_vntr_biased.bam                  # With VNTR efficiency applied
sample_vntr_biased_downsampled.bam      # Final (with target coverage)
sample_vntr_efficiency_stats.json       # Coverage statistics

Statistics File Example:

{
  "vntr_coverage": 177.31,
  "non_vntr_coverage": 53.67,
  "coverage_ratio": 3.304,
  "penalty_factor": 0.375,
  "seed": 42,
  "input_reads": 9983,
  "output_reads": 5235,
  "reads_removed": 4748,
  "retention_fraction": 0.524
}

Technical Workflow¶

Algorithm¶

Create BED files
VNTR region: chr1:155,188,487-155,192,239
Flanking regions: ±10kb upstream/downstream
Mutually exclusive (no overlap)
Extract reads by region
VNTR reads: Captured within VNTR coordinates
Non-VNTR reads: Captured in flanking regions
Hash-based extraction preserves read pairs
Downsample VNTR reads
Apply penalty factor (default 0.375)
Use seeded sampling for reproducibility
Maintains read pair integrity
Merge and process
Merge downsampled VNTR + full non-VNTR reads
Sort and index output BAM
Calculate coverage statistics
Continue pipeline
Use biased BAM for downstream downsampling
Final coverage reflects realistic distributions

Hash-Based Downsampling¶

Uses samtools view -s SEED.FRACTION for reproducible, pair-preserving downsampling: - Seed ensures reproducibility across runs - Read name hashing automatically preserves pairs - No duplicate reads in merged output (mutually exclusive regions)

Use Cases¶

Benchmark variant callers with realistic coverage:

muconeup --config config.json simulate --out-base benchmark --fixed-lengths 40
muconeup --config config.json reads illumina benchmark.001.simulated.fa --seed 42
# Coverage ratio will match real Twist v2 data (~3.5x)

Compare with/without bias for coverage analysis:

# Enabled (default) - realistic bias applied
muconeup --config config.json reads illumina sample.fa

# Disabled - no bias (unrealistic ~9x ratio)
# Set "enabled": false in config vntr_capture_efficiency section

Custom VNTR regions - apply to other loci:

"vntr_capture_efficiency": {
  "enabled": true,
  "penalty_factor": 0.375,
  "vntr_region": {
    "chr": "chr2",
    "start": 12345678,
    "end": 12350000,
    "name": "Custom_VNTR"
  }
}

Interpretation Guide¶

Expected Behavior ✅¶

Metric	No Bias	With Bias (0.375)
VNTR coverage	200x	75x
Flanking coverage	200x	200x
Coverage ratio	1.0	~3.0
Reads retained	100%	~52%

Coverage Ratio Ranges¶

Ratio	Interpretation
2.0 - 3.5	✅ Expected with penalty 0.375
8.0 - 10.0	⚠️ No bias applied (unrealistic)
< 2.0	⚠️ Over-penalized
> 4.0	⚠️ Under-penalized

Troubleshooting¶

Ratio too high (~9x)? - Check enabled: true in config - Verify logs show "Applying VNTR capture efficiency bias" - Confirm *_vntr_biased.bam file exists

Ratio too low (<2x)? - Check penalty factor (should be 0.375, not 0.037) - Verify VNTR region coordinates match hg38 - Review statistics JSON for actual penalty applied

Feature not working? - Ensure samtools and bedtools are installed - Check logs for "VNTR efficiency modeling failed" errors - Verify BAM file is valid: samtools quickcheck input.bam

Version History¶

Version	Date	Changes
v0.22.0	2025-10-25	Initial release with empirical validation

References¶

Data Sources¶

Real samples: Twist Bioscience v2 targeted sequencing (277 clinical samples)
Simulated samples: MucOneUp Illumina pipeline (59 control samples)
MUC1 VNTR region: chr1:155,188,487-155,192,239 (hg38)

Methodology¶

Statistical test: Welch's t-test (unequal variances)
Effect size: Cohen's d = 6.89 (very large effect)
Confidence interval: Bootstrap method (1000 iterations)
Validation: Cross-validated on independent test set