Skip to content

Read Source Tracking

Overview

The --track-read-source flag annotates every simulated read with its ground-truth origin: which haplotype it came from, where on the reference it was sampled, and whether it overlaps a mutation, SNP, or specific VNTR repeat unit. This produces a compressed TSV manifest alongside the standard output files.

Use cases: benchmarking variant callers, measuring aligner accuracy, quantifying allelic balance, and evaluating VNTR-aware pipelines against known ground truth.


Quick Start

# Step 1: simulate haplotypes
muconeup --config config.json simulate \
  --out-base sample --fixed-lengths 30 \
  --mutation-name dupC --mutation-targets 1,25

# Step 2: simulate Illumina reads with source tracking
muconeup --config config.json reads illumina sample.001.simulated.fa \
  --track-read-source

Works identically with reads ont and reads pacbio.

Amplicon pipelines

Read source tracking is not supported for reads amplicon (PacBio or ONT). Passing --track-read-source with amplicon mode raises an error.


Output Files

When --track-read-source is enabled, two additional files are produced:

File Description
{base}_repeat_coordinates.tsv Repeat unit coordinate map per haplotype
{base}_read_manifest.tsv.gz Gzip-compressed per-read annotations

Read Manifest Schema

Column Type Description
read_id str Read name from FASTQ/BAM
haplotype int Source haplotype (1 or 2)
ref_start int 0-based start on simulated reference
ref_end int 0-based end on simulated reference
strand str + or -
overlaps_vntr bool Read overlaps the VNTR region
repeat_units str Comma-separated repeat indices (e.g. 3,4,5) or .
overlaps_mutation bool Read overlaps a mutated repeat
mutation_name str Mutation name (e.g. dupC) or .
overlaps_snp bool Read overlaps an applied SNP

Booleans are serialized as lowercase true/false.

Repeat Coordinates Schema

Column Type Description
haplotype int Haplotype number
index int 1-based repeat unit index
repeat_type str Repeat symbol (e.g. X, Xm)
start int 0-based start coordinate
end int 0-based end coordinate (exclusive)
is_mutated bool Whether this repeat carries a mutation
mutation_name str Mutation name or .
snp_count int Number of SNPs in this repeat
snp_positions str Comma-separated SNP positions or .

Querying the Manifest

Which reads carry the mutation?

import csv, gzip

with gzip.open("sample.001_read_manifest.tsv.gz", "rt") as f:
    reader = csv.DictReader(f, delimiter="\t")
    mutation_reads = [r for r in reader if r["overlaps_mutation"] == "true"]

print(f"Reads carrying mutation: {len(mutation_reads)}")
for r in mutation_reads:
    print(f"  {r['read_id']}  hap={r['haplotype']}  "
          f"pos={r['ref_start']}-{r['ref_end']}  repeats={r['repeat_units']}")

Allelic balance

import csv, gzip

with gzip.open("sample.001_read_manifest.tsv.gz", "rt") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

vntr_reads = [r for r in rows if r["overlaps_vntr"] == "true"]
mut_reads = [r for r in vntr_reads if r["overlaps_mutation"] == "true"]
print(f"VAF: {len(mut_reads)}/{len(vntr_reads)} = {len(mut_reads)/len(vntr_reads):.1%}")

Command-line filtering

# Count mutation-carrying reads
zcat sample.001_read_manifest.tsv.gz | awk -F'\t' '$8=="true"' | wc -l

# Extract haplotype 1 reads only
zcat sample.001_read_manifest.tsv.gz | awk -F'\t' '$2==1'

Platform-Specific Details

Read origin extraction varies by platform but produces the same manifest format:

Platform Source of read positions
Illumina Fragment coordinates captured during Wessim2-style fragment simulation
ONT NanoSim encodes haplotype and position in read names
PacBio pbsim3 MAF alignment files parsed before cleanup

Standalone reads command

When calling reads directly (not via simulate), the tracker reconstructs simulation metadata from the companion simulation_stats.json file. If the companion file is missing, a warning is emitted and a partial manifest (positions only, no VNTR/mutation annotations) is produced.


Performance

Tracking adds negligible overhead. The manifest is ~100 bytes per read; a 30x simulation of a 10 kb region (~15,000 reads) produces ~1.5 MB uncompressed. The gzip-compressed file is typically under 200 KB.

When --track-read-source is omitted (the default), zero tracking code executes.