Read Source Tracking¶

Overview¶

The --track-read-source flag annotates every simulated read with its ground-truth origin: which haplotype it came from, where on the reference it was sampled, and whether it overlaps a mutation, SNP, or specific VNTR repeat unit. This produces a compressed TSV manifest alongside the standard output files.

Use cases: benchmarking variant callers, measuring aligner accuracy, quantifying allelic balance, and evaluating VNTR-aware pipelines against known ground truth.

Quick Start¶

# Step 1: simulate haplotypes
muconeup --config config.json simulate \
  --out-base sample --fixed-lengths 30 \
  --mutation-name dupC --mutation-targets 1,25

# Step 2: simulate Illumina reads with source tracking
muconeup --config config.json reads illumina sample.001.simulated.fa \
  --track-read-source

Works identically with reads ont and reads pacbio.

Amplicon pipelines

Read source tracking is not supported for reads amplicon (PacBio or ONT). Passing --track-read-source with amplicon mode raises an error.

Output Files¶

When --track-read-source is enabled, two additional files are produced:

File	Description
`{base}_repeat_coordinates.tsv`	Repeat unit coordinate map per haplotype
`{base}_read_manifest.tsv.gz`	Gzip-compressed per-read annotations

Read Manifest Schema¶

Column	Type	Description
`read_id`	str	Read name from FASTQ/BAM
`haplotype`	int	Source haplotype (1 or 2)
`ref_start`	int	0-based start on simulated reference
`ref_end`	int	0-based end on simulated reference
`strand`	str	`+` or `-`
`overlaps_vntr`	bool	Read overlaps the VNTR region
`repeat_units`	str	Comma-separated repeat indices (e.g. `3,4,5`) or `.`
`overlaps_mutation`	bool	Read overlaps a mutated repeat
`mutation_name`	str	Mutation name (e.g. `dupC`) or `.`
`overlaps_snp`	bool	Read overlaps an applied SNP

Booleans are serialized as lowercase true/false.

Repeat Coordinates Schema¶

Column	Type	Description
`haplotype`	int	Haplotype number
`index`	int	1-based repeat unit index
`repeat_type`	str	Repeat symbol (e.g. `X`, `Xm`)
`start`	int	0-based start coordinate
`end`	int	0-based end coordinate (exclusive)
`is_mutated`	bool	Whether this repeat carries a mutation
`mutation_name`	str	Mutation name or `.`
`snp_count`	int	Number of SNPs in this repeat
`snp_positions`	str	Comma-separated SNP positions or `.`

Querying the Manifest¶

Which reads carry the mutation?¶

import csv, gzip

with gzip.open("sample.001_read_manifest.tsv.gz", "rt") as f:
    reader = csv.DictReader(f, delimiter="\t")
    mutation_reads = [r for r in reader if r["overlaps_mutation"] == "true"]

print(f"Reads carrying mutation: {len(mutation_reads)}")
for r in mutation_reads:
    print(f"  {r['read_id']}  hap={r['haplotype']}  "
          f"pos={r['ref_start']}-{r['ref_end']}  repeats={r['repeat_units']}")

Allelic balance¶

import csv, gzip

with gzip.open("sample.001_read_manifest.tsv.gz", "rt") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

vntr_reads = [r for r in rows if r["overlaps_vntr"] == "true"]
mut_reads = [r for r in vntr_reads if r["overlaps_mutation"] == "true"]
print(f"VAF: {len(mut_reads)}/{len(vntr_reads)} = {len(mut_reads)/len(vntr_reads):.1%}")

Command-line filtering¶

# Count mutation-carrying reads
zcat sample.001_read_manifest.tsv.gz | awk -F'\t' '$8=="true"' | wc -l

# Extract haplotype 1 reads only
zcat sample.001_read_manifest.tsv.gz | awk -F'\t' '$2==1'

Platform-Specific Details¶

Read origin extraction varies by platform but produces the same manifest format:

Platform	Source of read positions
Illumina	Fragment coordinates captured during Wessim2-style fragment simulation
ONT	NanoSim encodes haplotype and position in read names
PacBio	pbsim3 MAF alignment files parsed before cleanup

Standalone `reads` command¶

When calling reads directly (not via simulate), the tracker reconstructs simulation metadata from the companion simulation_stats.json file. If the companion file is missing, a warning is emitted and a partial manifest (positions only, no VNTR/mutation annotations) is produced.

Performance¶

Tracking adds negligible overhead. The manifest is ~100 bytes per read; a 30x simulation of a 10 kb region (~15,000 reads) produces ~1.5 MB uncompressed. The gzip-compressed file is typically under 200 KB.

When --track-read-source is omitted (the default), zero tracking code executes.