Skip to content

MucOneUp

MUC1 VNTR simulation and analysis toolkit for genomics research


What is MucOneUp?

MucOneUp generates realistic MUC1 Variable Number Tandem Repeat (VNTR) sequences with customizable mutations and simulates sequencing reads across multiple platforms. Designed for genomics researchers studying MUC1 gene variation, it provides reproducible benchmarking, pipeline validation, and synthetic data generation for computational biology workflows.

Why MucOneUp?

Benchmark variant callers with known ground truth mutations Test clinical pipelines before diagnostic deployment Generate training data for machine learning models Explore VNTR diversity in population genetics studies Validate workflows with reproducible synthetic datasets


Scientific Context

The MUC1 gene (chr1:155,185,824-155,192,916, hg38) encodes a transmembrane glycoprotein containing a Variable Number Tandem Repeat (VNTR) region. This polymorphic region exhibits significant copy number variation (20-125 repeats) and structural complexity across human populations.

Clinical Significance:

  • Cancer susceptibility - Associated with gastric, breast, and ovarian cancer risk
  • Immunological function - Modulates innate and adaptive immune responses
  • Disease stratification - VNTR length and mutation status correlate with disease outcomes
  • Diagnostic challenges - Complex structure complicates accurate variant detection

Understanding MUC1 VNTR variation requires robust computational tools that generate realistic test data for algorithm development and validation.


Key Features

Realistic VNTR Simulation

Probability-based repeat transitions following biological constraints:

  • Configurable repeat distributions (normal/uniform sampling)
  • Canonical terminal block enforcement (6/6p → 7 → 8 → 9)
  • Diploid haplotype generation with independent alleles
  • Structure file support for reproducible mutations

Flexible Mutation Engine

Insert, delete, replace, or delete-insert operations:

  • Targeted or random mutation placement
  • Strict validation mode (enforce allowed repeat contexts)
  • Dual simulation (generate normal + mutated pairs)
  • Ground truth tracking in JSON statistics

Multi-Platform Read Simulation

Generate sequencing reads with platform-specific error profiles:

  • Illumina - w-Wessim2 integration for paired-end reads
  • Oxford Nanopore - NanoSim with diploid split-simulation
  • PacBio HiFi - pbsim3 with CCS consensus accuracy
  • Seed-based reproducibility across all platforms

Comprehensive Analysis

Built-in tools for downstream analysis:

  • ORF prediction - Identify open reading frames with orfipy
  • Toxic protein detection - Quantitative algorithm for ADTKD-MUC1 frameshift analysis
  • VNTR statistics - Analyze real-world repeat databases
  • SNaPshot validation - In silico PCR → digest → extension assay simulation

Batch Processing

Unix-style composable commands:

# Each command does one thing well
muconeup simulate --out-base sample
muconeup reads illumina sample.001.simulated.fa
muconeup analyze orfs sample.001.simulated.fa

Pipeline multiple files efficiently:

ls *.fa | parallel muconeup reads ont {} --coverage 50

Quick Start

Installation

# Clone repository
git clone https://github.com/berntpopp/MucOneUp.git
cd MucOneUp

# Install
make install

# Verify installation
muconeup --version
# Pull pre-built image (includes all simulators)
docker pull ghcr.io/berntpopp/muconeup/muconeup:latest

# Run simulation
docker run --rm \
  -v $(pwd)/data:/data \
  ghcr.io/berntpopp/muconeup/muconeup:latest \
  --config /app/config.json \
  simulate --out-base /data/sample
# Modern Python tooling (uv, ruff, mypy)
make init

# Run tests
make test

# Verify code quality
make check

Complete Workflow Example

Real-world pipeline for benchmarking variant callers with dupC mutation:

# Step 1: Generate normal + mutated diploid pair with SNPs
muconeup --config config.json simulate \
  --out-base dupC \
  --out-dir output \
  --output-structure \
  --mutation-name normal,dupC \
  --fixed-lengths 60 \
  --random-snps \
  --random-snp-density 0.5 \
  --random-snp-output-file output/dupC.tsv

# Step 2: Simulate Illumina reads at 200x coverage
muconeup --config config.json reads illumina \
  output/dupC.*.simulated.fa \
  --out-dir output \
  --coverage 200

# Step 3: Analyze ORFs and detect toxic proteins
muconeup --config config.json analyze orfs \
  output/dupC.*.simulated.fa \
  --out-dir output \
  --orf-aa-prefix MTSSV

Output files:

output/
├── dupC.001.normal.simulated.fa          # Normal diploid reference
├── dupC.001.normal.vntr_structure.txt    # Repeat structure
├── dupC.001.normal.simulation_stats.json # Ground truth metadata
├── dupC.001.mut.simulated.fa             # dupC mutated reference
├── dupC.001.mut.vntr_structure.txt       # Mutated structure
├── dupC.001.mut.simulation_stats.json    # Mutation coordinates
├── dupC.tsv                              # SNP positions (both haplotypes)
├── dupC.*.illumina.bam                   # Aligned reads (200x coverage)
└── dupC.*.pep.fa                         # Predicted ORFs with toxic detection

What this demonstrates:

  • Dual simulation (normal + mutated pair for controlled benchmarking)
  • Fixed VNTR length (60 repeats, eliminates length confounding)
  • SNP integration (0.5 per kb, realistic population variation)
  • Complete pipeline (haplotypes → reads → analysis)
  • Ground truth tracking (JSON files document exact mutation coordinates)

See Quick Start


Research Applications

1. Benchmarking Variant Callers

Evaluate variant caller accuracy with known ground truth:

# Generate test data with known mutation
muconeup --config config.json simulate \
  --mutation-name normal,dupC \
  --fixed-lengths 60 \
  --out-base benchmark

# Simulate reads
muconeup --config config.json reads illumina \
  benchmark.001.mut.fa --coverage 100

# Run your variant caller
gatk HaplotypeCaller -R reference.fa \
  -I benchmark_reads.bam -O calls.vcf

# Compare to ground truth in simulation_stats.json
jq '.mutations' benchmark.001.mut.simulation_stats.json

Use case: Validate clinical pipelines before diagnostic deployment.


2. Testing Mutation Detection Pipelines

Verify your pipeline detects specific mutations:

# Apply targeted mutation
muconeup --config config.json simulate \
  --mutation-name dupC \
  --mutation-targets 1,25 2,30 \
  --output-structure \
  --out-base mutation_test

# Check if your pipeline detects the dupC mutation
# Ground truth: mutation_test.001.simulation_stats.json

Dataset: Mutation positions, affected sequences, and haplotype assignments provided in JSON.


3. Generating Synthetic Training Data

Create large-scale datasets with controlled variation:

# Generate 100 samples with varying VNTR lengths
for i in {1..100}; do
  muconeup --config config.json simulate \
    --out-base training/sample_${i} \
    --seed ${i}
done

# Simulate reads at multiple coverage levels
for cov in 30 50 100; do
  muconeup --config config.json reads illumina \
    training/*.simulated.fa --coverage ${cov}
done

Use case: Train ML models for VNTR length prediction or mutation classification.


4. Exploring Population Diversity

Analyze VNTR structures from published research:

# Analyze real-world VNTR database (44 alleles)
muconeup --config config.json analyze vntr-stats \
  data/examples/vntr_database.tsv \
  --header \
  --structure-column vntr \
  -o population_stats.json

# Extract summary statistics
jq '.mean_repeats' population_stats.json  # 63.3
jq '.median_repeats' population_stats.json  # 70
jq '.transition_probabilities' population_stats.json

Dataset: Example VNTR database with 36 unique structures (42-85 repeats).


5. Reproducible Research

Ensure reproducibility with seed-based generation:

# Same seed → identical output (deterministic)
muconeup --config config.json simulate \
  --seed 42 \
  --out-base reproducible

muconeup --config config.json reads illumina \
  reproducible.001.simulated.fa \
  --seed 42 \
  --out-base reads

# Share config + seeds → fully reproducible datasets

Publication-ready: Enables method comparisons and collaborative research.


Example Output

MucOneUp generates comprehensive outputs for each simulation:

output/
├── sample.001.simulated.fa          # FASTA sequences (diploid haplotypes)
├── sample.001.vntr_structure.txt    # Repeat chain (1-2-3-X-A-B-6-7-8-9)
├── sample.001.simulation_stats.json # Metrics and ground truth
├── sample.pep.fa                     # Predicted ORF peptides
├── sample.orf_stats.txt              # Toxic protein scores
└── sample.illumina.bam               # Simulated reads (aligned)

Statistics include:

  • Haplotype lengths, repeat counts, GC content
  • Mutation positions and affected sequences
  • SNP integration summary with validation results
  • Runtime metrics and configuration snapshot

Documentation

  • Getting Started Installation, quick start tutorial, and core concepts

  • User Guides Detailed documentation for simulation, mutations, SNPs, and analysis

  • CLI Reference Complete command-line interface documentation

  • API Reference (coming soon) Python API documentation auto-generated from source

  • About Citation guide, license, and changelog


Citation

If you use MucOneUp in your research, please cite:

@software{muconeup2025,
  author = {Popp, Bernt},
  title = {MucOneUp: MUC1 VNTR Simulation and Analysis Toolkit},
  year = {2025},
  url = {https://github.com/berntpopp/MucOneUp},
  note = {Software version available at https://github.com/berntpopp/MucOneUp/releases}
}

A manuscript describing MucOneUp is in preparation.

See Citation Guide


Community

GitHub Repository: berntpopp/MucOneUp Docker Images: GitHub Container Registry Issue Tracker: Report bugs or request features Discussions: Questions and community support

GitHub Stars Tests Docker License: MIT


Development Status: Active | License: MIT | Maintained by: Bernt Popp