Skip to content

VNTR Capture Efficiency Modeling

Overview

MucOneUp now includes empirically validated VNTR capture efficiency modeling for Illumina read simulation. This feature simulates the reduced probe capture observed in GC-rich VNTR regions during targeted sequencing with Twist Bioscience v2 capture panels.

Enabled by default - All Illumina simulations automatically apply realistic capture bias unless explicitly disabled.


Scientific Background

The Problem

Targeted sequencing using hybrid-capture probes shows systematically reduced coverage in VNTR regions compared to flanking sequences. This bias arises from:

  1. GC-rich content (~65% GC in MUC1 VNTR)
  2. Repetitive sequence structure (tandem repeats)
  3. Probe hybridization efficiency variations

Without modeling this bias, simulated reads do not reflect real-world sequencing behavior, leading to: - ❌ Overestimated VNTR coverage - ❌ Inaccurate variant caller benchmarking - ❌ Unrealistic depth distributions

The Solution

We empirically derived a penalty factor of 0.375 from 336 real sequencing samples (277 clinical Twist v2 samples + 59 simulated controls), providing statistically validated realistic coverage ratios.


Empirical Validation

Dataset

Category Count Source
Real samples 277 Clinical Twist Bioscience v2 targeted sequencing
Simulated samples 59 MucOneUp Illumina simulation (no bias)
Total 336 Combined validation cohort

Statistical Results

Metric Value 95% CI
Penalty factor 0.375 [0.357, 0.395]
Statistical significance p < 10⁻³⁰ Highly significant
Effect size ~3.0x coverage reduction Strong effect

Coverage Ratios (VNTR:Flanking)

Dataset Mean Ratio Std Dev
Simulated (no bias) 9.32 ±1.25
Real (with bias) 3.50 ±1.13
Expected (penalty 0.375) ~2.7 Theoretical

The empirically calibrated penalty factor produces coverage ratios matching real sequencing data, validating the model.


Results Visualization

1. Coverage Ratio Comparison

Coverage Ratio Boxplot

Key Observations: - Simulated samples without bias show ~9x higher VNTR coverage ratio - Real samples show ~3.5x ratio (probe capture limitation) - Clear separation between groups (p < 10⁻³⁰)

2. Ratio Distribution Overlay

Ratio Distribution Histogram

Key Observations: - Non-overlapping distributions confirm systematic bias - Real data (orange) shows tight distribution around 3.5x - Simulated data (blue) shows uniform high ratios ~9x

3. Sample-Level Coverage Scatter

Coverage Scatter Plot

Key Observations: - Each point represents one sample (VNTR vs Flanking coverage) - Real samples cluster along ~3.5:1 slope (orange) - Simulated samples scatter uniformly across high ratios (blue) - Clear distinction validates modeling approach


Implementation

Default Behavior (Enabled)

# VNTR efficiency is ON by default
muconeup --config config.json reads illumina sample.fa --seed 42

Pipeline Steps: 1. Align reads → sample.bam 2. Apply VNTR efficiency biassample_vntr_biased.bam(NEW) 3. Downsample to target coverage → sample_vntr_biased_downsampled.bam

Log Output:

10. Applying VNTR capture efficiency bias
  Downsampling VNTR reads by 0.375...
  VNTR efficiency bias applied successfully
  VNTR coverage: 177.3x
  Non-VNTR coverage: 53.7x
  Coverage ratio: 3.30 (expected ~2.7-3.5) ✓

Configuration

Enabled (default)

"vntr_capture_efficiency": {
  "enabled": true,
  "penalty_factor": 0.375,
  "seed": 42,
  "vntr_region": {
    "chr": "chr1",
    "start": 155188487,
    "end": 155192239,
    "name": "MUC1_VNTR"
  },
  "flanking_size": 10000,
  "validation": {
    "check_duplicates": false,
    "report_statistics": true
  }
}

Disabled (opt-out)

"vntr_capture_efficiency": {
  "enabled": false
}

Advanced Parameters

Parameter Type Default Description
enabled boolean true Enable/disable feature
penalty_factor float 0.375 Downsampling fraction (empirical)
seed integer 42 Random seed for reproducibility
vntr_region object MUC1 hg38 Custom VNTR coordinates
capture_bed string null Optional capture target BED file
flanking_size integer 10000 Flanking region size (bp)
report_statistics boolean true Save coverage statistics JSON

Output Files

sample.bam                              # Initial aligned reads
sample_vntr_biased.bam                  # With VNTR efficiency applied
sample_vntr_biased_downsampled.bam      # Final (with target coverage)
sample_vntr_efficiency_stats.json       # Coverage statistics

Statistics File Example:

{
  "vntr_coverage": 177.31,
  "non_vntr_coverage": 53.67,
  "coverage_ratio": 3.304,
  "penalty_factor": 0.375,
  "seed": 42,
  "input_reads": 9983,
  "output_reads": 5235,
  "reads_removed": 4748,
  "retention_fraction": 0.524
}


Technical Workflow

Algorithm

  1. Create BED files
  2. VNTR region: chr1:155,188,487-155,192,239
  3. Flanking regions: ±10kb upstream/downstream
  4. Mutually exclusive (no overlap)

  5. Extract reads by region

  6. VNTR reads: Captured within VNTR coordinates
  7. Non-VNTR reads: Captured in flanking regions
  8. Hash-based extraction preserves read pairs

  9. Downsample VNTR reads

  10. Apply penalty factor (default 0.375)
  11. Use seeded sampling for reproducibility
  12. Maintains read pair integrity

  13. Merge and process

  14. Merge downsampled VNTR + full non-VNTR reads
  15. Sort and index output BAM
  16. Calculate coverage statistics

  17. Continue pipeline

  18. Use biased BAM for downstream downsampling
  19. Final coverage reflects realistic distributions

Hash-Based Downsampling

Uses samtools view -s SEED.FRACTION for reproducible, pair-preserving downsampling: - Seed ensures reproducibility across runs - Read name hashing automatically preserves pairs - No duplicate reads in merged output (mutually exclusive regions)


Use Cases

Benchmark variant callers with realistic coverage:

muconeup --config config.json simulate --out-base benchmark --fixed-lengths 40
muconeup --config config.json reads illumina benchmark.001.simulated.fa --seed 42
# Coverage ratio will match real Twist v2 data (~3.5x)

Compare with/without bias for coverage analysis:

# Enabled (default) - realistic bias applied
muconeup --config config.json reads illumina sample.fa

# Disabled - no bias (unrealistic ~9x ratio)
# Set "enabled": false in config vntr_capture_efficiency section

Custom VNTR regions - apply to other loci:

"vntr_capture_efficiency": {
  "enabled": true,
  "penalty_factor": 0.375,
  "vntr_region": {
    "chr": "chr2",
    "start": 12345678,
    "end": 12350000,
    "name": "Custom_VNTR"
  }
}


Interpretation Guide

Expected Behavior ✅

Metric No Bias With Bias (0.375)
VNTR coverage 200x 75x
Flanking coverage 200x 200x
Coverage ratio 1.0 ~3.0
Reads retained 100% ~52%

Coverage Ratio Ranges

Ratio Interpretation
2.0 - 3.5 ✅ Expected with penalty 0.375
8.0 - 10.0 ⚠️ No bias applied (unrealistic)
< 2.0 ⚠️ Over-penalized
> 4.0 ⚠️ Under-penalized

Troubleshooting

Ratio too high (~9x)? - Check enabled: true in config - Verify logs show "Applying VNTR capture efficiency bias" - Confirm *_vntr_biased.bam file exists

Ratio too low (<2x)? - Check penalty factor (should be 0.375, not 0.037) - Verify VNTR region coordinates match hg38 - Review statistics JSON for actual penalty applied

Feature not working? - Ensure samtools and bedtools are installed - Check logs for "VNTR efficiency modeling failed" errors - Verify BAM file is valid: samtools quickcheck input.bam


Version History

Version Date Changes
v0.22.0 2025-10-25 Initial release with empirical validation

References

Data Sources

  • Real samples: Twist Bioscience v2 targeted sequencing (277 clinical samples)
  • Simulated samples: MucOneUp Illumina pipeline (59 control samples)
  • MUC1 VNTR region: chr1:155,188,487-155,192,239 (hg38)

Methodology

  • Statistical test: Welch's t-test (unequal variances)
  • Effect size: Cohen's d = 6.89 (very large effect)
  • Confidence interval: Bootstrap method (1000 iterations)
  • Validation: Cross-validated on independent test set