VNTR Simulation Guide¶

Comprehensive guide to generating MUC1 VNTR diploid haplotypes with MucOneUp.

Overview¶

The simulate command generates diploid haplotype sequences with customizable VNTR structures. It is the core functionality of MucOneUp and the starting point for all workflows.

What it does:

Generates two haplotype sequences (diploid reference)
Uses probability-based repeat transitions
Enforces canonical terminal blocks (6/6p → 7 → 8 → 9)
Supports fixed or random VNTR lengths
Applies mutations (optional)
Integrates SNPs (optional)
Outputs FASTA, structure files, and JSON statistics

What it does NOT do:

Simulate sequencing reads (use the muconeup reads command)
Analyze sequences (use the muconeup analyze command)

Following Unix philosophy: one command, one purpose.

Basic Usage¶

Minimal Command¶

muconeup --config config.json simulate --out-base output/sample

Result: A diploid FASTA file with random VNTR lengths sampled from the distribution in config.json.

Output:

output/
└── sample.001.simulated.fa

With Structure File¶

muconeup --config config.json simulate \
  --output-structure \
  --out-base output/sample

Output:

output/
├── sample.001.simulated.fa          # FASTA sequences
└── sample.001.vntr_structure.txt    # Repeat chain structure

Structure file example:

# Generated: 2025-10-20 15:30:45
# Configuration: config.json
# VNTR Lengths: haplotype_1=65 haplotype_2=58
haplotype_1 1-2-3-4-5-C-X-A-B-X-X-A-B-A-6-7-8-9
haplotype_2 1-2-3-4-5-C-X-B-X-A-X-B-A-6p-7-8-9

With Comprehensive Statistics¶

muconeup --config config.json simulate \
  --output-structure \
  --output-stats \
  --out-base output/sample

Output:

output/
├── sample.001.simulated.fa
├── sample.001.vntr_structure.txt
└── sample.001.simulation_stats.json

Statistics include:

Haplotype lengths, repeat counts, GC content
Repeat chain structures
Mutation details (if applied)
SNP integration summary
Runtime metrics
Configuration snapshot

VNTR Length Control¶

Fixed Lengths¶

Generate haplotypes with exactly N repeats:

# Both haplotypes: 60 repeats
muconeup --config config.json simulate \
  --fixed-lengths 60 \
  --out-base output/fixed_60

Use cases:

Controlled experiments
Benchmarking at specific lengths
Reproducible test data

Random Lengths (Distribution Sampling)¶

Sample from the distribution defined in config.json:

muconeup --config config.json simulate \
  --out-base output/random_length

Distribution configured in config.json:

{
  "length_model": {
    "distribution_type": "normal",
    "mean_repeats": 63.3,
    "median_repeats": 70,
    "min_repeats": 42,
    "max_repeats": 85
  }
}
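The config above gives a mean and bounds but no spread, so the following minimal sketch assumes a hypothetical standard deviation (`sd_repeats`) and uses rejection sampling to stay within `min_repeats` and `max_repeats`. It illustrates truncated-normal sampling in general and is not taken from MucOneUp's implementation.

```python
import random

def sample_repeat_count(model, rng, sd_repeats=10.0):
    """Draw an integer repeat count from a normal distribution truncated to [min, max].

    sd_repeats is a hypothetical parameter; the config shown above has no spread field.
    """
    lo, hi = model["min_repeats"], model["max_repeats"]
    while True:
        value = round(rng.gauss(model["mean_repeats"], sd_repeats))
        if lo <= value <= hi:  # reject draws outside the configured bounds
            return value

length_model = {"mean_repeats": 63.3, "min_repeats": 42, "max_repeats": 85}
rng = random.Random(42)
lengths = [sample_repeat_count(length_model, rng) for _ in range(2)]  # one per haplotype
print(lengths)
```

Rejection sampling keeps the shape of the distribution inside the bounds, whereas clamping out-of-range draws would pile extra probability onto `min_repeats` and `max_repeats`.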

Use cases:

Population-level variation
Diverse training datasets
Realistic heterogeneity

Series Generation (Parameter Sweeps)¶

Generate multiple simulations across a length range:

# Generate 5 samples: 40, 50, 60, 70, 80 repeats
muconeup --config config.json simulate \
  --fixed-lengths 40-80 \
  --simulate-series 5 \
  --out-base output/series

Output:

output/
├── series.001.simulated.fa    # 40 repeats
├── series.002.simulated.fa    # 50 repeats
├── series.003.simulated.fa    # 60 repeats
├── series.004.simulated.fa    # 70 repeats
└── series.005.simulated.fa    # 80 repeats
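The listing above suggests the lengths are spaced evenly across the range. A minimal sketch of that arithmetic, assuming even spacing with rounding to the nearest integer (the function name is hypothetical, and MucOneUp may round differently for ranges that do not divide evenly):

```python
def series_lengths(start, end, count):
    """Return `count` evenly spaced integer lengths from start to end inclusive."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if count == 1:
        return [start]
    step = (end - start) / (count - 1)
    return [round(start + i * step) for i in range(count)]

print(series_lengths(40, 80, 5))  # [40, 50, 60, 70, 80]
```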

Progress tracking:

Simulating 5 iterations  [################---]  80%  00:01:23

Use cases:

Length-dependent benchmarking
Coverage optimization studies
Algorithm parameter tuning

Reference Assembly Selection¶

Choose human genome assembly (hg19 or hg38):

# Use hg38 (default: hg19)
muconeup --config config.json simulate \
  --reference-assembly hg38 \
  --out-base output/hg38_sample

Assembly differences:

Assembly	MUC1 Coordinates	Flanking Regions
hg19	chr1:155,158,000-155,165,000	Left/right constants for GRCh37
hg38	chr1:155,185,824-155,192,916	Left/right constants for GRCh38

Important: Ensure your reference genome matches the assembly used for simulation.

Structure File Input/Output¶

Generate from Predefined Structure¶

Use existing repeat chains instead of probability-based generation:

# Create structure file
cat > custom_structure.txt << EOF
haplotype_1 1-2-3-X-A-B-X-A-6-7-8-9
haplotype_2 1-2-X-A-X-B-A-X-6p-7-8-9
EOF

# Generate sequences from structure
muconeup --config config.json simulate \
  --input-structure custom_structure.txt \
  --out-base output/from_structure

Use cases:

Reproduce published VNTR structures
Test specific repeat compositions
Validate mutation application on known structures

Structure File Format¶

Header:

# Optional comments
# Generated: timestamp
# Configuration: config.json

Body:

haplotype_1 <repeat1>-<repeat2>-<repeat3>-...-<terminal_block>
haplotype_2 <repeat1>-<repeat2>-<repeat3>-...-<terminal_block>

Rules:

Dash-separated repeat symbols
Must end with terminal block (6 or 6p, then 7, 8, 9)
Symbols must be defined in config.json
Mutation markers (m suffix) are added automatically when mutations are applied
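The format rules above can be checked before passing a file to --input-structure. The following is a minimal sketch with hypothetical helper names, not part of MucOneUp. It parses the body lines and tests only the terminal-block rule, leaving the check against symbols defined in config.json aside.

```python
TERMINAL_TAIL = ("7", "8", "9")

def parse_structure(text):
    """Parse structure-file text into {haplotype_name: [symbols]}, skipping # comments."""
    chains = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, chain = line.split(maxsplit=1)
        chains[name] = chain.split("-")
    return chains

def has_terminal_block(symbols):
    """True if the chain ends with 6 or 6p followed by 7, 8, 9."""
    return (
        len(symbols) >= 4
        and symbols[-4] in ("6", "6p")
        and tuple(symbols[-3:]) == TERMINAL_TAIL
    )

example = """# Configuration: config.json
haplotype_1 1-2-3-X-A-B-X-A-6-7-8-9
haplotype_2 1-2-X-A-X-B-A-X-6p-7-8-9
"""
for name, symbols in parse_structure(example).items():
    print(name, len(symbols), has_terminal_block(symbols))
```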

Probability-Based Generation¶

How It Works¶

Start with initial repeat (e.g., 1)
Sample next repeat from probability distribution
Append to chain
Repeat until target length reached
Enforce terminal block
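The steps above amount to a first-order Markov chain with a fixed ending. The sketch below illustrates that idea under several assumptions: the transition table is hypothetical, the target length counts the terminal block, and 6 versus 6p is chosen uniformly. None of these details are confirmed by MucOneUp itself.

```python
import random

# Hypothetical, closed transition table (every reachable symbol has its own row).
PROBABILITIES = {
    "1": {"2": 0.5, "X": 0.5},
    "2": {"X": 0.6, "A": 0.4},
    "X": {"A": 0.4, "B": 0.4, "X": 0.2},
    "A": {"B": 0.5, "X": 0.3, "A": 0.2},
    "B": {"A": 0.4, "X": 0.4, "B": 0.2},
}

def generate_chain(target_length, rng, start="1"):
    """Build a repeat chain of target_length symbols, terminal block included."""
    terminal = [rng.choice(["6", "6p"]), "7", "8", "9"]  # assumed uniform 6/6p choice
    body_length = target_length - len(terminal)
    if body_length < 1:
        raise ValueError("target_length too short for a terminal block")
    chain = [start]
    while len(chain) < body_length:
        row = PROBABILITIES[chain[-1]]  # distribution over the next repeat
        symbols, weights = zip(*row.items())
        chain.append(rng.choices(symbols, weights=weights, k=1)[0])
    return chain + terminal

print("-".join(generate_chain(20, random.Random(42))))
```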

Probability Matrix¶

Defined in config.json:

{
  "probabilities": {
    "1": {"2": 0.3, "3": 0.2, "X": 0.5},
    "2": {"1": 0.2, "3": 0.3, "A": 0.5},
    "X": {"A": 0.4, "B": 0.4, "X": 0.2},
    "A": {"B": 0.5, "X": 0.3, "A": 0.2},
    "B": {"A": 0.4, "X": 0.4, "B": 0.2}
  }
}

Example transition:

From repeat 1, the next repeat is sampled as:

2 with 30% probability
3 with 20% probability
X with 50% probability
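Before editing probabilities by hand, it can help to confirm that each row sums to 1 and that every target symbol has a row of its own. The checker below is a hypothetical helper, not a MucOneUp feature. Whether a missing row is an error depends on how the tool treats such symbols, so the sketch only reports it.

```python
import math

def check_probabilities(probabilities):
    """Return a list of problems: rows not summing to 1, or targets with no row of their own."""
    problems = []
    for source, row in probabilities.items():
        total = sum(row.values())
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            problems.append(f"row {source!r} sums to {total:.3f}")
        for target in row:
            if target not in probabilities:
                problems.append(f"{source!r} -> {target!r}: no outgoing row for {target!r}")
    return problems

# The matrix excerpt shown above, copied verbatim.
matrix = {
    "1": {"2": 0.3, "3": 0.2, "X": 0.5},
    "2": {"1": 0.2, "3": 0.3, "A": 0.5},
    "X": {"A": 0.4, "B": 0.4, "X": 0.2},
    "A": {"B": 0.5, "X": 0.3, "A": 0.2},
    "B": {"A": 0.4, "X": 0.4, "B": 0.2},
}
for problem in check_probabilities(matrix):
    print(problem)
```

On the excerpt, all rows sum to 1, and the only findings concern repeat 3, which appears as a target but has no row in the portion shown.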

Customizing Probabilities¶

Update config.json to match real-world VNTR structures:

# Analyze existing VNTR database
muconeup --config config.json analyze vntr-stats \
  data/examples/vntr_database.tsv \
  --header \
  -o observed_probs.json

# Extract transition probabilities
jq '.transition_probabilities' observed_probs.json

# Update config.json with observed probabilities

Output Options¶

Output Directory¶

Specify where files are saved:

muconeup --config config.json simulate \
  --out-dir /path/to/output/ \
  --out-base sample

Default: Current working directory

Output Base Name¶

Prefix for all output files:

muconeup --config config.json simulate \
  --out-base my_simulation

Output (the structure and statistics files appear only with --output-structure and --output-stats):

my_simulation.001.simulated.fa
my_simulation.001.vntr_structure.txt
my_simulation.001.simulation_stats.json

Verbose Logging¶

Enable detailed logging:

muconeup --verbose --config config.json simulate \
  --out-base output/sample

Alternative:

muconeup --log-level DEBUG --config config.json simulate \
  --out-base output/sample

Mutation Application¶

Dual Simulation Mode¶

Generate normal and mutated pairs:

muconeup --config config.json simulate \
  --mutation-name normal,dupC \
  --mutation-targets 1,25 \
  --out-base output/dual

Output:

output/
├── dual.001.normal.fa                    # Normal diploid
├── dual.001.normal.vntr_structure.txt
├── dual.001.normal.simulation_stats.json
├── dual.001.mut.fa                       # Mutated diploid
├── dual.001.mut.vntr_structure.txt
└── dual.001.mut.simulation_stats.json

Use case: Benchmarking variant callers (known ground truth).

Targeted Mutations¶

Specify exact positions:

muconeup --config config.json simulate \
  --mutation-name dupC \
  --mutation-targets 1,25 2,30 \
  --output-structure \
  --out-base output/targeted

Mutation applied:

Haplotype 1, repeat position 25
Haplotype 2, repeat position 30

Structure file shows markers:

# Mutation Applied: dupC (Targets: [(1, 25), (2, 30)])
haplotype_1 1-2-3-...-Xm-...-6-7-8-9
haplotype_2 1-2-3-...-Am-...-6p-7-8-9

The m suffix indicates mutated positions.
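To recover the mutated positions from a structure line, one can scan for the marker. This is a minimal sketch with a hypothetical function name. It assumes positions are counted from 1, as the --mutation-targets examples suggest, and that no unmutated repeat symbol in the config ends in "m".

```python
def mutated_positions(chain):
    """Return 1-based positions of symbols carrying the trailing 'm' mutation marker."""
    symbols = chain.split("-")
    return [i for i, s in enumerate(symbols, start=1) if s.endswith("m")]

print(mutated_positions("1-2-3-Xm-A-B-6-7-8-9"))  # [4]
```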

Random Mutations¶

Let MucOneUp select positions:

muconeup --config config.json simulate \
  --mutation-name dupC \
  --random-mutation-targets 3 \
  --out-base output/random_mut

Behavior:

Selects 3 random positions
Respects allowed_repeats constraint (only mutates valid repeat types)
Records positions in simulation_stats.json

SNP Integration¶

Random SNPs¶

Generate random single nucleotide polymorphisms:

muconeup --config config.json simulate \
  --random-snps \
  --random-snp-density 1.0 \
  --out-base output/with_snps

Density: SNPs per 1000 bp (1.0 = ~1 SNP per kb)

File-Based SNPs¶

Apply SNPs from TSV file:

# Create SNP file (TSV format)
cat > snps.tsv << EOF
haplotype   position    ref alt
1   150 A   G
1   350 C   T
2   200 G   A
EOF

# Apply SNPs
muconeup --config config.json simulate \
  --snp-input-file snps.tsv \
  --out-base output/snp_file

File format:

haplotype: 1 or 2 (1-indexed)
position: 0-indexed position in final sequence
ref: Reference base (validated before application)
alt: Alternate base
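The field semantics above (1-indexed haplotype, 0-indexed position, reference base checked before substitution) can be illustrated as follows. This is a minimal sketch with a hypothetical function name. It assumes tab-delimited input with the header row shown and is not MucOneUp's own loader.

```python
import csv
import io

def apply_snps(haplotypes, tsv_text):
    """Apply SNPs from TSV text to {haplotype_number: sequence}; positions are 0-indexed."""
    sequences = {hap: list(seq) for hap, seq in haplotypes.items()}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        hap, pos = int(row["haplotype"]), int(row["position"])
        seq = sequences[hap]
        if seq[pos] != row["ref"]:  # validate the reference base first
            raise ValueError(
                f"haplotype {hap} position {pos}: expected {row['ref']}, found {seq[pos]}"
            )
        seq[pos] = row["alt"]
    return {hap: "".join(seq) for hap, seq in sequences.items()}

tsv = "haplotype\tposition\tref\talt\n1\t2\tA\tG\n2\t0\tC\tT\n"
print(apply_snps({1: "CCAT", 2: "CGGA"}, tsv))  # {1: 'CCGT', 2: 'TGGA'}
```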

Reproducibility¶

Seed-Based Generation¶

Ensure reproducible outputs:

# Run 1
muconeup --config config.json simulate \
  --seed 42 \
  --out-base run1

# Run 2 (identical to run 1)
muconeup --config config.json simulate \
  --seed 42 \
  --out-base run2

# Verify
diff run1.001.simulated.fa run2.001.simulated.fa
# Files are identical

Guarantees:

Same seed → identical repeat chains
Same seed → identical random SNPs
Platform-independent (Linux/macOS/Windows)

Requirements:

Identical config.json
Same MucOneUp version
Same Python version (3.10+)
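Seeding works because every random draw comes from one deterministic generator. The sketch below shows the general pattern with Python's `random.Random`. The function and symbol set are hypothetical, and MucOneUp's own random number handling may differ.

```python
import random

def draw_chain(seed, length=10, symbols=("X", "A", "B")):
    """Draw a toy repeat chain from a private generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, unaffected by global random state
    return [rng.choice(symbols) for _ in range(length)]

same = draw_chain(42) == draw_chain(42)
print(same)  # True
```

Using a dedicated generator instance rather than the module-level functions keeps results stable even when other code consumes random numbers in between.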

Provenance Metadata¶

When using --output-stats, MucOneUp automatically records provenance metadata in *.simulation_stats.json:

{
  "provenance": {
    "software_version": "0.27.0",
    "config_fingerprint": "sha256:aed09353ff7...",
    "seed": 42,
    "start_time": "2025-11-03T11:48:44.249499+00:00",
    "end_time": "2025-11-03T11:48:44.256557+00:00",
    "duration_seconds": 0.007058,
    "command_line": "muconeup --config config.json simulate --seed 42"
  }
}

Config Fingerprints:

Configuration fingerprints use RFC 8785 JSON canonicalization, so identical configs produce identical SHA-256 hashes across platforms. System paths (human_reference, tool paths) are excluded to ensure portability.
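The idea can be sketched with the standard library. Note the assumptions: `json.dumps` with sorted keys and compact separators only approximates RFC 8785 (which also fixes number formatting and key ordering rules), and the excluded key names here are hypothetical stand-ins for the path settings mentioned above.

```python
import hashlib
import json

EXCLUDED_KEYS = {"human_reference", "tools"}  # hypothetical names for path-like settings

def config_fingerprint(config):
    """Hash a config dict after dropping path-like keys and serializing deterministically."""
    portable = {k: v for k, v in config.items() if k not in EXCLUDED_KEYS}
    canonical = json.dumps(
        portable, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"length_model": {"min_repeats": 42, "max_repeats": 85}, "human_reference": "/data/hg38.fa"}
b = {"human_reference": "/mnt/ref/hg38.fa", "length_model": {"max_repeats": 85, "min_repeats": 42}}
print(config_fingerprint(a) == config_fingerprint(b))  # True
```

Key order and machine-specific paths do not affect the hash, while any change to a simulation parameter does.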

Command-Line Sanitization:

Secret patterns (--api-key, --password, --token, etc.) are automatically redacted as ***REDACTED*** to prevent credential leakage in published datasets.
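A redaction pass of this kind might look like the sketch below. The flag list contains only the three patterns named above, and the function is a hypothetical illustration rather than MucOneUp's actual sanitizer.

```python
SECRET_FLAGS = ("--api-key", "--password", "--token")

def sanitize_command(argv):
    """Redact values of secret flags, for both '--flag value' and '--flag=value' forms."""
    cleaned, redact_next = [], False
    for arg in argv:
        if redact_next:
            cleaned.append("***REDACTED***")
            redact_next = False
        elif arg in SECRET_FLAGS:
            cleaned.append(arg)
            redact_next = True  # the following argument is the secret value
        elif arg.startswith(tuple(flag + "=" for flag in SECRET_FLAGS)):
            cleaned.append(arg.split("=", 1)[0] + "=***REDACTED***")
        else:
            cleaned.append(arg)
    return " ".join(cleaned)

print(sanitize_command(["muconeup", "--token", "abc123", "simulate", "--seed", "42"]))
# muconeup --token ***REDACTED*** simulate --seed 42
```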

Disable Provenance:

export MUCONEUP_ENABLE_PROVENANCE=false

Advanced Options¶

Custom Configuration File¶

Use non-default configuration:

muconeup --config /path/to/custom_config.json simulate \
  --out-base output/sample

Logging Levels¶

Control verbosity:

# Minimal output
muconeup --log-level ERROR --config config.json simulate \
  --out-base output/sample

# Detailed debug output
muconeup --log-level DEBUG --config config.json simulate \
  --out-base output/sample

# No logging
muconeup --log-level NONE --config config.json simulate \
  --out-base output/sample

Levels: DEBUG, INFO, WARNING, ERROR, CRITICAL, NONE

Common Workflows¶

Benchmark Variant Caller¶

# Generate ground truth
muconeup --config config.json simulate \
  --mutation-name normal,dupC \
  --mutation-targets 1,25 \
  --fixed-lengths 60 \
  --output-structure \
  --out-base benchmark

# Simulate reads (separate command)
muconeup --config config.json reads illumina \
  benchmark.001.mut.fa --coverage 100

Generate Training Dataset¶

# Generate 100 samples with varying lengths
for i in {1..100}; do
  muconeup --config config.json simulate \
    --seed ${i} \
    --out-base training/sample_${i}
done

Test Mutation Detection¶

# Apply targeted mutation
muconeup --config config.json simulate \
  --mutation-name dupC \
  --mutation-targets 1,25 2,30 \
  --output-structure \
  --output-stats \
  --out-base mutation_test

# Check ground truth
jq '.mutations' mutation_test.001.simulation_stats.json

Troubleshooting¶

Issue: Invalid Repeat Symbol¶

Error:

ValueError: Invalid repeat symbol 'Z' not found in config

Solution:

Ensure all repeat symbols in structure file or mutations are defined in config.json:

{
  "repeats": {
    "1": "AAGGAGACTTCGGCTACCCAGAGAAGTTCAGTGCCCAGCTCTACTGAGAAGAATGCTGTG",
    "2": "AGTATGACCAGCAGCGTACTCTCCAGCCACAGCCCCGGTTCAGGCTCCTCCACCACTCAG",
    "Z": "SEQUENCE_FOR_Z"
  }
}

Issue: Terminal Block Missing¶

Error:

ValueError: Structure missing terminal block (6/6p → 7 → 8 → 9)

Solution:

When using --input-structure, ensure chains end with terminal block:

# Correct
haplotype_1 1-2-3-X-A-B-6-7-8-9

# Incorrect (missing terminal block)
haplotype_1 1-2-3-X-A-B

Issue: Mutation Target Invalid¶

Error:

MutationError: Mutation target (1, 150) exceeds haplotype length (60)

Solution:

Mutation positions must be within VNTR length:

# Haplotype has 60 repeats, so position must be ≤60
muconeup --config config.json simulate \
  --fixed-lengths 60 \
  --mutation-name dupC \
  --mutation-targets 1,25 \
  --out-base output/sample  # Valid target (25 ≤ 60)
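The same bounds check can be run before launching a simulation. This is a minimal sketch with hypothetical helper names. It assumes targets use the "hap,pos" format shown in this guide and that positions run from 1 to the haplotype's repeat count.

```python
def parse_targets(spec):
    """Parse 'hap,pos hap,pos' into a list of (haplotype, position) integer pairs."""
    return [tuple(int(part) for part in item.split(",")) for item in spec.split()]

def validate_targets(targets, lengths):
    """Raise ValueError for an unknown haplotype or a position outside 1..length."""
    for hap, pos in targets:
        if hap not in lengths:
            raise ValueError(f"unknown haplotype {hap}")
        if not 1 <= pos <= lengths[hap]:
            raise ValueError(
                f"target ({hap}, {pos}) exceeds haplotype length ({lengths[hap]})"
            )

targets = parse_targets("1,25 2,30")
validate_targets(targets, {1: 60, 2: 60})  # passes silently
print(targets)  # [(1, 25), (2, 30)]
```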

Next Steps¶

Learn More:

mutations guide (coming soon) - Apply and validate mutations
snps guide (coming soon) - Advanced SNP workflows
Configuration Reference - Customize repeat definitions

Try Workflows:

Workflows (coming soon)

Command Reference¶

muconeup --config CONFIG simulate [OPTIONS]

Required:
  --out-base TEXT              Output filename base

VNTR Length:
  --fixed-lengths INT|RANGE    Fixed repeat count (e.g., 60 or 40-80)
  --simulate-series INT        Number of iterations for series

Mutations:
  --mutation-name TEXT         Mutation name (or "normal,name" for dual)
  --mutation-targets TEXT      Targets as "hap,pos" (e.g., "1,25 2,30")
  --random-mutation-targets INT  Random target count

SNPs:
  --random-snps                Generate random SNPs
  --random-snp-density FLOAT   SNPs per 1000 bp (default: 1.0)
  --snp-input-file PATH        TSV file with SNPs

Structure:
  --input-structure PATH       Input structure file
  --output-structure           Output structure file

Assembly:
  --reference-assembly TEXT    hg19 or hg38 (default: hg19)

Output:
  --out-dir PATH               Output directory (default: .)
  --output-stats               Output simulation statistics JSON

Reproducibility:
  --seed INT                   Random seed for reproducibility

Logging:
  --log-level LEVEL            DEBUG|INFO|WARNING|ERROR|CRITICAL|NONE
  --verbose, -v                Enable verbose output
