Skip to content

Amplicon Read Simulation Guide

Simulate PacBio or ONT amplicon reads with realistic PCR length bias from MucOneUp diploid references.


Overview

The reads amplicon command produces full-length amplicon reads that mimic targeted long-range PCR products spanning the MUC1 VNTR region. This enables benchmarking of amplicon-based analysis pipelines such as VNTRPipeline (Madritsch et al. 2026) and PacMUCI (Vrbacka et al. 2025).

Key features:

  • Primer-based amplicon extraction from simulated haplotypes
  • PCR length bias modeling (shorter alleles amplify preferentially)
  • PBSIM3 template mode for full-length reads with realistic error profiles
  • PacBio (multi-pass CCS) and ONT (single-pass) platform support
  • Diploid and haploid input support

How it differs from reads pacbio:

Feature reads pacbio (WGS) reads amplicon
Strategy Random genome sampling Full-length templates
Read length Variable (length distribution) Fixed (amplicon length)
Coverage model Uniform across reference PCR bias (length-dependent)
Use case WGS benchmarking Amplicon pipeline benchmarking

Quick Start

# 1. Generate diploid reference with different VNTR lengths
muconeup --config config.json simulate \
  --fixed-lengths 40 --fixed-lengths 70 \
  --seed 42 --out-base output/sample

# 2a. PacBio amplicon reads (default)
muconeup --config config.json reads amplicon \
  output/sample.001.simulated.fa \
  --model-file /path/to/QSHMM-RSII.model \
  --coverage 500 --seed 42

# 2b. ONT amplicon reads
muconeup --config config.json reads amplicon --platform ont \
  output/sample.001.simulated.fa \
  --model-file /path/to/QSHMM-ONT-HQ.model \
  --coverage 500 --seed 42

Output: Aligned BAM with ~500 reads, biased toward the shorter (40-repeat) allele.


Pipeline Stages

The amplicon pipeline runs 8 stages:

Input FASTA (diploid)
  |
  v
1. Haplotype extraction ──> hap1.fa, hap2.fa
  |
  v
2. Amplicon extraction ──> amplicon_hap1.fa, amplicon_hap2.fa
   (primer binding site detection)
  |
  v
3. PCR bias computation ──> e.g., 350 reads for hap1, 150 for hap2
   (shorter allele gets more reads)
  |
  v
4. Template FASTA generation ──> 350 copies of hap1, 150 copies of hap2
  |
  v
5. PBSIM3 template mode ──> CLR reads (one per template copy)
  |
  v
6. CCS consensus ──> HiFi reads
  |
  v
7. BAM merge ──> single BAM with reads from both alleles
  |
  v
8. Alignment (minimap2) ──> final aligned BAM

For haploid input (single sequence), stages 1 and 3 are skipped.


Platform Selection

Use --platform to choose between PacBio and ONT amplicon simulation:

# PacBio HiFi (default)
muconeup --config config.json reads amplicon sample.fa \
  --model-file /path/to/QSHMM-SEQUEL.model

# Oxford Nanopore
muconeup --config config.json reads amplicon --platform ont sample.fa \
  --model-file /path/to/QSHMM-ONT-HQ.model

Stages 1-4 (extraction, PCR bias, template generation) are shared. The platforms differ in read generation and alignment:

Stage PacBio (--platform pacbio) ONT (--platform ont)
Config section pacbio_params ont_amplicon_params
pbsim3 pass_num 3+ (multi-pass CLR) 1 (single-pass)
Consensus CCS (multi-pass -> HiFi) None (skip)
minimap2 preset map-hifi map-ont
Error model QSHMM-SEQUEL.model etc. QSHMM-ONT-HQ.model etc.
Output suffix *_amplicon_hifi.bam *_amplicon_ont.bam
Tool dependencies pbsim3, ccs, samtools, minimap2 pbsim3, samtools, minimap2

ONT does not require CCS

The ONT pipeline skips CCS consensus entirely. Each pbsim3 template produces one single-pass read, which is converted directly to FASTQ and aligned. This means --coverage maps directly to the number of output reads (before alignment filtering).

ONT Configuration

The ONT amplicon pipeline reads from the ont_amplicon_params config section (separate from pacbio_params):

"ont_amplicon_params": {
    "model_type": "qshmm",
    "model_file": "reference/pbsim3/QSHMM-ONT-HQ.model",
    "threads": 8,
    "accuracy_mean": 0.95
}
Parameter Type Default Description
model_type string "qshmm" pbsim3 model type (errhmm or qshmm)
model_file string QSHMM-ONT-HQ.model pbsim3 ONT error model file
threads int 8 Number of threads
accuracy_mean float 0.95 Mean read accuracy

Available ONT models in reference/pbsim3/:

  • QSHMM-ONT-HQ.model -- standard ONT error profile
  • QSHMM-ONT.model -- quality score ONT model
  • QSHMM-ONT-HQ.model -- high-quality ONT model (e.g., Kit14/Dorado)

The --model-file CLI flag overrides ont_amplicon_params.model_file for ONT, and pacbio_params.model_file for PacBio.


PCR Length Bias Model

In long-range PCR, shorter amplicons amplify more efficiently than longer ones. This is modeled as exponential decay of per-cycle efficiency:

E(L) = E_max * exp(-alpha * L)

For two competing alleles, the yield ratio after c PCR cycles:

ratio = ((1 + E1) / (1 + E2)) ^ c

Example

With the default preset (calibrated to Madritsch et al. 2026):

Allele lengths Shorter allele fraction Longer allele fraction
2000 vs 2000 bp 50% 50%
2000 vs 3500 bp 70% 30%
1800 vs 4800 bp 84% 16%

Presets

Preset Description Use case
default KOD HS, 25 cycles, calibrated to ~70% bias at 1.5kb difference Realistic amplicon benchmarking
no_bias Equal 50/50 split regardless of length Control experiments
# Use default preset (realistic PCR bias)
muconeup --config config.json reads amplicon sample.fa \
  --model-file model.model

# No bias (equal allele coverage)
muconeup --config config.json reads amplicon sample.fa \
  --model-file model.model --pcr-preset no_bias

Stochastic Mode

By default, the bias is deterministic (same input = same split). Enable stochastic mode for realistic run-to-run variance:

muconeup --config config.json reads amplicon sample.fa \
  --model-file model.model --stochastic-pcr --seed 42

This uses a Galton-Watson branching process where each PCR cycle is modeled as independent binomial trials per molecule.

Custom Parameters

Override individual PCR parameters in config.json:

"amplicon_params": {
    "forward_primer": "GGAGAAAAGGAGACTTCGGCTACCCAG",
    "reverse_primer": "GCCGTTGTGCACCAGAGTAGAAGCTGA",
    "pcr_bias": {
        "preset": "default",
        "cycles": 30,
        "stochastic": true
    }
}

Or specify all parameters explicitly (no preset):

"pcr_bias": {
    "e_max": 0.90,
    "alpha": 0.00005,
    "cycles": 20,
    "denaturation_time": 15.0,
    "stochastic": false
}
Parameter Type Default Description
e_max float 0.95 Max per-cycle efficiency (short amplicons)
alpha float 0.00005 Length decay rate (bp^-1)
cycles int 25 Number of PCR cycles
denaturation_time float 10.0 Denaturation step duration (seconds)
stochastic bool false Enable Galton-Watson branching

Primer Configuration

Amplicon boundaries are defined by primer binding sites in config.json:

"amplicon_params": {
    "forward_primer": "GGAGAAAAGGAGACTTCGGCTACCCAG",
    "reverse_primer": "GCCGTTGTGCACCAGAGTAGAAGCTGA",
    "primer_source": "Wenzel et al. 2018 (PMID: 29520014)",
    "expected_product_range": [500, 15000]
}

The primers are the core sequences from Wenzel et al. 2018 (PS2/PS3), used across all published MUC1 VNTR PCR protocols (Wenzel 2018, Madritsch 2026, Vrbacka 2025).

How extraction works

  1. Forward primer is searched on the forward strand (exact match, case-insensitive)
  2. Reverse primer is reverse-complemented and searched on the forward strand
  3. The region from forward primer start to reverse primer end (inclusive) is extracted
  4. If either primer is not found or found multiple times, the pipeline raises an error

Size validation

expected_product_range is optional. When set, the extracted amplicon length is validated:

"expected_product_range": [500, 15000]

This means 500 <= amplicon_length <= 15000 bp. Amplicons outside this range raise an error.


Coverage Semantics

--coverage specifies the total number of template molecules. For PacBio, the actual HiFi read count may be lower due to CCS quality filtering (min_rq, min_passes). For ONT, each template produces one output read (no CCS filtering).

For diploid inputs, the total is split between alleles by the PCR bias model:

# 500 total template molecules, split by PCR bias
muconeup --config config.json reads amplicon sample.fa \
  --model-file model.model --coverage 500

With a 70%/30% bias, this produces ~350 templates for the shorter allele and ~150 for the longer allele.


CLI Reference

muconeup --config CONFIG reads amplicon [OPTIONS] INPUT_FASTAS...

Required:
  INPUT_FASTAS              One or more FASTA files

Options:
  --platform [pacbio|ont]    Sequencing platform (default: pacbio)
  --model-file PATH         pbsim3 model file (overrides config)
  --model-type [qshmm|errhmm]  pbsim3 model type (overrides config)
  --pcr-preset [default|no_bias]  PCR bias preset
  --stochastic-pcr          Enable stochastic PCR bias
  --coverage INT             Total template molecules (default: 30)
  --seed INT                 Random seed for reproducibility
  --out-dir PATH             Output directory (default: .)
  --out-base TEXT             Output base name
  --track-read-source         Accepted but raises an error (not supported in amplicon mode)

Read Source Tracking

The --track-read-source option is accepted by the CLI parser but raises an error at runtime. Read source tracking is not supported in amplicon mode.


Limitations

  • Read source tracking (--track-read-source) is not supported in amplicon mode
  • Primer matching is exact (no mismatch tolerance)
  • Single amplicon region per run

References