Configuration Reference¶

Complete reference for config.json structure and customization.

Overview¶

MucOneUp requires a JSON configuration file specifying:

Repeat definitions - DNA sequences for each repeat symbol
Flanking constants - Left/right regions surrounding VNTR
Probability transitions - State transitions for repeat selection
Length model - Distribution parameters for VNTR lengths
Mutation definitions - Named mutations with operations
Tool paths - External tools for read simulation
Simulation parameters - Platform-specific read simulation settings

Location: The repository root contains an example configuration, config.json.

File Structure¶

{
  "repeats": { ... },
  "constants": { ... },
  "probabilities": { ... },
  "length_model": { ... },
  "mutations": { ... },
  "tools": { ... },
  "read_simulation": { ... },
  "nanosim_params": { ... },
  "pacbio_params": { ... }
}
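As a quick sanity check before running anything, the top-level layout above can be compared against a parsed file. The sketch below is illustrative only: `load_config` and `missing_sections` are hypothetical helper names, not part of MucOneUp, and this page does not state which sections are strictly required for every workflow.

```python
import json

# Section names copied from the file structure shown above.
EXPECTED_SECTIONS = [
    "repeats", "constants", "probabilities", "length_model", "mutations",
    "tools", "read_simulation", "nanosim_params", "pacbio_params",
]


def load_config(path: str) -> dict:
    """Parse a JSON configuration file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def missing_sections(config: dict) -> list[str]:
    """Return the expected top-level sections that are absent (hypothetical check)."""
    return [name for name in EXPECTED_SECTIONS if name not in config]


if __name__ == "__main__":
    partial = json.loads('{"repeats": {}, "constants": {}}')
    print(missing_sections(partial))
```

A section reported as missing is not necessarily an error; for example, a run that never simulates PacBio reads may not need pacbio_params.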

Repeats Section¶

Purpose¶

Defines DNA sequences for each repeat symbol used in VNTR chains.

Structure¶

{
  "repeats": {
    "1": "GCCCCACCCCTCCTCCCGCCGCGCCG",
    "2": "GCTCCACCCCTTCTCCCACCGCGCCG",
    "3": "GCCCCACCCCTTCTCCCACCGCGCCG",
    "X": "GCCCCACCCCTCCTCCCGCCGCGCCG",
    "A": "GCTCCACCCCTTCTCCCACCGCGCCG",
    "B": "GCCCCACCCCTTCTCCCACCGCGCCG",
    "C": "GCCCCACCCCTCTTCCCGCCGCGCCG",
    "6": "GCCCCACCCCTCCTCCCGCCGCGCCG",
    "6p": "GCTCCACCCCTTCTCCCACCGCGCCG",
    "7": "GCCCCACCCCTTCTCCCACCGCGCCG",
    "8": "GCCCCACCCCTCCTCCCGCCGCGCCG",
    "9": "GCTCCACCCCTTCTCCCACCGCGCCG"
  }
}

Rules¶

Key: Repeat symbol (string, typically 1-2 characters)
Value: DNA sequence (uppercase ACGT)
Terminal block required: Must define 6 or 6p, plus 7, 8, and 9
Validation: Sequences validated on load (must be valid DNA)
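The rules above can be sketched as a small pre-flight check. This is a minimal illustration of the stated rules, not MucOneUp's own validator; `check_repeats` is a hypothetical name.

```python
# Terminal symbols that must always be present; "6" and "6p" are alternatives.
REQUIRED_TERMINALS = ("7", "8", "9")


def check_repeats(repeats: dict[str, str]) -> list[str]:
    """Return human-readable problems found in a "repeats" section."""
    problems = []
    for symbol, sequence in repeats.items():
        # A valid sequence is non-empty and uses only uppercase A, C, G, T.
        if not sequence or set(sequence) - set("ACGT"):
            problems.append(f"repeat {symbol!r} is not an uppercase ACGT sequence")
    if "6" not in repeats and "6p" not in repeats:
        problems.append("terminal block needs '6' or '6p'")
    for symbol in REQUIRED_TERMINALS:
        if symbol not in repeats:
            problems.append(f"terminal block needs {symbol!r}")
    return problems


if __name__ == "__main__":
    print(check_repeats({"6p": "ACGT", "7": "acgt", "8": "ACGT"}))
```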

Example: Custom Repeat¶

{
  "repeats": {
    "CUSTOM": "ATCGATCGATCGATCGATCGATCGAT"
  },
  "probabilities": {
    "1": {"CUSTOM": 0.2, "2": 0.8}
  }
}

Constants Section¶

Purpose¶

Defines the assembly-specific left and right flanking regions surrounding the VNTR.

Structure (Nested Format)¶

{
  "constants": {
    "hg19": {
      "left": "AGCAGGCAGTGCGGGGCCGCTGCTGCTG...",
      "right": "TGCTGCTGCTGCGGGGCCGCTGCTGCTG...",
      "vntr_start": 1000,
      "vntr_end": 8000
    },
    "hg38": {
      "left": "AGCAGGCAGTGCGGGGCCGCTGCTGCTG...",
      "right": "TGCTGCTGCTGCGGGGCCGCTGCTGCTG...",
      "vntr_start": 1500,
      "vntr_end": 8500
    }
  }
}

Structure (Flat Format, Auto-Converted)¶

{
  "constants": {
    "left": "AGCAGGCAGTGCGGGGCCGCTGCTGCTG...",
    "right": "TGCTGCTGCTGCGGGGCCGCTGCTGCTG...",
    "vntr_start": 1000,
    "vntr_end": 8000
  }
}

Note: The flat format is assumed to describe hg19 and is auto-converted to the nested format.
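The conversion can be pictured as wrapping the flat block under an hg19 key. The sketch below is an assumption about how such a conversion might look; `normalize_constants` is a hypothetical name, and the rule used to recognize the flat format (top-level left and right strings) is a guess rather than documented behavior.

```python
def normalize_constants(constants: dict, default_assembly: str = "hg19") -> dict:
    """Return constants in nested form, wrapping a flat block under one assembly key."""
    # Assumed detection rule: a flat block holds "left"/"right" strings at the top level.
    is_flat = isinstance(constants.get("left"), str) and isinstance(
        constants.get("right"), str
    )
    if is_flat:
        return {default_assembly: dict(constants)}
    return constants


if __name__ == "__main__":
    flat = {"left": "ACGT", "right": "TTGA", "vntr_start": 1000, "vntr_end": 8000}
    print(normalize_constants(flat))
```

If your flanks actually come from hg38, use the nested format explicitly rather than relying on the hg19 default.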

Fields¶

Field	Type	Description
`left`	string	DNA sequence upstream of VNTR
`right`	string	DNA sequence downstream of VNTR
`vntr_start`	integer	VNTR start position (0-indexed)
`vntr_end`	integer	VNTR end position (0-indexed)

Assembly-Specific Constants¶

MUC1 genomic coordinates differ between assemblies:

Assembly	Chromosome	Start	End
hg19	chr1	155,158,000	155,165,000
hg38	chr1	155,185,824	155,192,916

Ensure left and right constants match your chosen assembly.

Probabilities Section¶

Purpose¶

State transition probabilities for repeat selection during chain generation.

Structure¶

{
  "probabilities": {
    "1": {"2": 0.3, "3": 0.2, "X": 0.5},
    "2": {"1": 0.2, "3": 0.3, "A": 0.5},
    "3": {"1": 0.1, "2": 0.4, "X": 0.5},
    "X": {"A": 0.4, "B": 0.4, "X": 0.2},
    "A": {"B": 0.5, "X": 0.3, "A": 0.2},
    "B": {"A": 0.4, "X": 0.4, "B": 0.2},
    "C": {"X": 0.6, "A": 0.4}
  }
}

Rules¶

Outer key: Source repeat symbol
Inner key: Destination repeat symbol
Value: Transition probability (0.0 to 1.0)
Sum: Inner values should sum to 1.0 (normalized automatically if not)
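To illustrate the normalization rule, the following sketch rescales one row of transitions so that it sums to 1.0. It is a minimal example of the idea; `normalize_row` is a hypothetical helper and MucOneUp's internal handling may differ in detail.

```python
def normalize_row(row: dict[str, float]) -> dict[str, float]:
    """Rescale one source symbol's transition probabilities to sum to 1.0."""
    for destination, value in row.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"probability for {destination!r} is outside [0.0, 1.0]")
    total = sum(row.values())
    if total <= 0:
        raise ValueError("row has no positive probability mass")
    return {destination: value / total for destination, value in row.items()}


if __name__ == "__main__":
    # A row summing to 0.8 is rescaled so each entry becomes 0.5.
    print(normalize_row({"X": 0.4, "A": 0.4}))
```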

Example¶

From repeat 1, transitions:

To 2 with 30% probability
To 3 with 20% probability
To X with 50% probability
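The same idea extends to a whole chain: each step draws the next symbol from the row of the current one. The sketch below shows this as a plain first-order Markov walk using the probabilities from the Structure example. It is not MucOneUp's generator; `next_repeat` and `sample_chain` are hypothetical names, and the handling of the terminal block is not modeled because it is not described here.

```python
import random

PROBABILITIES = {
    "1": {"2": 0.3, "3": 0.2, "X": 0.5},
    "2": {"1": 0.2, "3": 0.3, "A": 0.5},
    "3": {"1": 0.1, "2": 0.4, "X": 0.5},
    "X": {"A": 0.4, "B": 0.4, "X": 0.2},
    "A": {"B": 0.5, "X": 0.3, "A": 0.2},
    "B": {"A": 0.4, "X": 0.4, "B": 0.2},
    "C": {"X": 0.6, "A": 0.4},
}


def next_repeat(current: str, probabilities: dict, rng: random.Random) -> str:
    """Draw the next symbol using the current symbol's transition row as weights."""
    row = probabilities[current]
    return rng.choices(list(row), weights=list(row.values()), k=1)[0]


def sample_chain(start: str, length: int, probabilities: dict, rng: random.Random) -> list[str]:
    """Walk the transition table until the chain reaches the requested length."""
    chain = [start]
    # Stop early if a symbol has no outgoing transitions defined.
    while len(chain) < length and chain[-1] in probabilities:
        chain.append(next_repeat(chain[-1], probabilities, rng))
    return chain


if __name__ == "__main__":
    print("-".join(sample_chain("1", 12, PROBABILITIES, random.Random(0))))
```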

Customizing Probabilities¶

From Real Data:

# Analyze VNTR database
muconeup --config config.json analyze vntr-stats \
  data/examples/vntr_database.tsv \
  --header \
  -o observed_probs.json

# Extract transitions
jq '.transition_probabilities' observed_probs.json

# Update config.json: copy the extracted transitions into its "probabilities" section

Manual Specification (here X repeats frequently, returning to itself 80% of the time):

{
  "probabilities": {
    "X": {"X": 0.8, "A": 0.2}
  }
}

Length Model Section¶

Purpose¶

Distribution parameters for sampling VNTR repeat counts.

Normal Distribution¶

{
  "length_model": {
    "distribution_type": "normal",
    "mean_repeats": 63.3,
    "median_repeats": 70,
    "min_repeats": 42,
    "max_repeats": 85
  }
}

Behavior:

Sample from normal distribution (μ=63.3, σ derived from median)
Clip to [42, 85] range

Uniform Distribution¶

{
  "length_model": {
    "distribution_type": "uniform",
    "min_repeats": 40,
    "max_repeats": 100
  }
}

Behavior:

Sample uniformly from [40, 100]

Fields¶

Field	Type	Description	Required
`distribution_type`	string	"normal" or "uniform"	Yes
`mean_repeats`	float	Mean for normal distribution	If normal
`median_repeats`	float	Median for normal distribution	If normal
`min_repeats`	integer	Minimum repeat count	Yes
`max_repeats`	integer	Maximum repeat count	Yes
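The two behaviors can be sketched as follows. This is a rough illustration, not the tool's sampler: `sample_repeat_count` is a hypothetical name, and because the way σ is derived from the median is not spelled out here, the sketch takes `sigma` as an explicit, assumed parameter.

```python
import random


def sample_repeat_count(model: dict, rng: random.Random, sigma: float = 10.0) -> int:
    """Draw a repeat count from a length_model block (sigma is an assumed value)."""
    low, high = model["min_repeats"], model["max_repeats"]
    kind = model["distribution_type"]
    if kind == "uniform":
        # randint includes both endpoints, matching the closed range [min, max].
        return rng.randint(low, high)
    if kind == "normal":
        draw = round(rng.gauss(model["mean_repeats"], sigma))
        # Clip to the configured range.
        return max(low, min(high, draw))
    raise ValueError(f"unknown distribution_type: {kind!r}")


if __name__ == "__main__":
    normal = {"distribution_type": "normal", "mean_repeats": 63.3,
              "median_repeats": 70, "min_repeats": 42, "max_repeats": 85}
    rng = random.Random(42)
    print([sample_repeat_count(normal, rng) for _ in range(5)])
```

Clipping piles probability mass onto the boundaries when the range is narrow relative to σ, which is worth keeping in mind when choosing min_repeats and max_repeats.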

Mutations Section¶

Purpose¶

Defines named mutations composed of operations (insert, delete, replace, delete_insert).

Structure¶

{
  "mutations": {
    "dupC": {
      "allowed_repeats": ["X", "A", "B"],
      "strict_mode": false,
      "changes": [
        {
          "operation": "insert",
          "sequence": "GCCCACGGTGTCACCTCGGCCCCGGACACCAGGCCGGCCCCGGGCTCCACCGCCCCCCCA",
          "position_offset": 0
        }
      ]
    },
    "deletion_example": {
      "allowed_repeats": ["X"],
      "strict_mode": true,
      "changes": [
        {
          "operation": "delete"
        }
      ]
    }
  }
}

Fields¶

Field	Type	Description
`allowed_repeats`	array	Valid repeat symbols for this mutation
`strict_mode`	boolean	Enforce allowed_repeats (error if violated)
`changes`	array	List of mutation operations

Mutation Operations¶

Insert:

{
  "operation": "insert",
  "sequence": "ATCGATCGATCG",
  "position_offset": 0
}

Inserts sequence at target position.

Delete:

{
  "operation": "delete"
}

Removes repeat at target position.

Replace:

{
  "operation": "replace",
  "sequence": "ATCGATCGATCG"
}

Substitutes repeat at target position with new sequence.

Delete-Insert:

{
  "operation": "delete_insert",
  "sequence": "ATCGATCGATCG",
  "position_offset": 0
}

Deletes repeat, then inserts sequence.

Strict Mode¶

strict_mode: false (Permissive):

If target repeat not in allowed_repeats, auto-convert to nearest allowed repeat
Emit warning
Simulation continues

strict_mode: true (Strict):

If target repeat not in allowed_repeats, raise error
Simulation fails
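The two modes can be sketched as a single decision function. This is only an illustration: `resolve_target` and `sequence_distance` are hypothetical names, and "nearest allowed repeat" is interpreted here as the allowed symbol whose sequence differs least from the target's, which is an assumption since the distance measure is not specified.

```python
import warnings


def sequence_distance(a: str, b: str) -> int:
    """Count mismatching positions, plus any difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def resolve_target(symbol: str, mutation: dict, repeats: dict[str, str]) -> str:
    """Return the repeat symbol the mutation will be applied to."""
    allowed = mutation["allowed_repeats"]
    if symbol in allowed:
        return symbol
    if mutation["strict_mode"]:
        # Strict: refuse to mutate a repeat that is not explicitly allowed.
        raise ValueError(f"repeat {symbol!r} is not in allowed_repeats {allowed}")
    # Permissive: fall back to the closest allowed repeat (assumed metric) and warn.
    nearest = min(allowed, key=lambda s: sequence_distance(repeats[s], repeats[symbol]))
    warnings.warn(f"repeat {symbol!r} converted to {nearest!r} for mutation")
    return nearest


if __name__ == "__main__":
    repeats = {"X": "ACGT", "A": "ACGA", "B": "TTTT"}
    permissive = {"allowed_repeats": ["X", "B"], "strict_mode": False}
    print(resolve_target("A", permissive, repeats))
```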

Example:

{
  "mutations": {
    "dupC": {
      "allowed_repeats": ["X"],
      "strict_mode": true,
      "changes": [...]
    }
  }
}

# This will fail if position 25 is not an "X" repeat
muconeup --config config.json simulate \
  --mutation-name dupC \
  --mutation-targets 1,25

Tools Section¶

Purpose¶

Paths to external tools for read simulation.

Structure¶

{
  "tools": {
    "reseq": "/path/to/reseq",
    "bwa": "/usr/bin/bwa",
    "samtools": "/usr/bin/samtools",
    "faToTwoBit": "/usr/bin/faToTwoBit",
    "pblat": "/usr/bin/pblat",
    "minimap2": "/usr/bin/minimap2",
    "pbsim3": "/usr/bin/pbsim3",
    "ccs": "/usr/bin/ccs"
  }
}

Auto-Detection¶

If paths are not specified, MucOneUp searches the system PATH. An empty tools section auto-detects all tools:

{
  "tools": {}
}
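To preview what a PATH search would find on a given machine, Python's standard `shutil.which` performs the same kind of lookup. The sketch below is not MucOneUp code; `resolve_tools` is a hypothetical helper that prefers a configured path and otherwise falls back to PATH.

```python
import shutil
from typing import Optional

# Tool names taken from the Structure example above.
TOOL_NAMES = ["reseq", "bwa", "samtools", "faToTwoBit", "pblat", "minimap2", "pbsim3", "ccs"]


def resolve_tools(configured: dict[str, str], names: list[str]) -> dict[str, Optional[str]]:
    """Map each tool name to its configured path, or to the PATH lookup result."""
    resolved = {}
    for name in names:
        # shutil.which returns None when the executable is not on PATH.
        resolved[name] = configured.get(name) or shutil.which(name)
    return resolved


if __name__ == "__main__":
    for name, path in resolve_tools({}, TOOL_NAMES).items():
        print(f"{name}: {path or 'not found'}")
```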

Conda Environments¶

When using conda environments, specify full paths:

# Find tool path in conda env
conda activate wessim
which reseq
# /home/user/miniconda3/envs/wessim/bin/reseq

# Update config.json
{
  "tools": {
    "reseq": "/home/user/miniconda3/envs/wessim/bin/reseq"
  }
}

Read Simulation Section¶

Purpose¶

Parameters for Illumina read simulation (w-Wessim2 pipeline).

Structure¶

{
  "read_simulation": {
    "simulator": "illumina",
    "read_length": 150,
    "fragment_size": 350,
    "fragment_sd": 50,
    "coverage": 100,
    "threads": 4,
    "reference_genome": "/path/to/hg38.fa",
    "error_model": "reseq_illumina",
    "seed": null
  }
}

Fields¶

Field	Type	Description	Default
`simulator`	string	"illumina", "ont", or "pacbio"	"illumina"
`read_length`	integer	Read length (bp)	150
`fragment_size`	integer	Mean insert size (bp)	350
`fragment_sd`	integer	Insert size std dev (bp)	50
`coverage`	integer	Target coverage depth	100
`threads`	integer	Parallel threads	4
`reference_genome`	string	Path to reference FASTA	Required
`error_model`	string	Error model name	"reseq_illumina"
`seed`	integer	Random seed (null = random)	null

NanoSim Parameters Section¶

Purpose¶

Parameters for Oxford Nanopore read simulation.

Structure¶

{
  "nanosim_params": {
    "training_data_path": "/path/to/nanosim/training",
    "coverage": 50,
    "min_read_length": 1000,
    "max_read_length": 10000,
    "correction_factor": 0.325,
    "enable_split_simulation": true,
    "seed": null
  }
}

Fields¶

Field	Type	Description	Default
`training_data_path`	string	NanoSim pre-trained model path	Required
`coverage`	integer	Target coverage depth	50
`min_read_length`	integer	Minimum read length (bp)	1000
`max_read_length`	integer	Maximum read length (bp)	10000
`correction_factor`	float	Coverage adjustment factor	0.325
`enable_split_simulation`	boolean	Diploid split-simulation mode	true
`seed`	integer	Random seed (null = random)	null

Diploid Split-Simulation¶

When enable_split_simulation is true and the reference has two sequences:

Split diploid reference into haplotype1.fa and haplotype2.fa
Simulate each independently (coverage/2 each)
Merge reads from both haplotypes
Align merged reads to diploid reference

Result: Balanced allelic coverage (eliminates length-proportional bias).
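The first two steps can be sketched as a planning function that splits a two-record FASTA and assigns half the coverage to each haplotype. This is an illustrative outline only: `split_fasta` and `plan_split_simulation` are hypothetical names, and the sketch does not call NanoSim or perform alignment.

```python
def split_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def plan_split_simulation(fasta_text: str, coverage: float):
    """Return one job per haplotype at coverage/2, or None if not a two-sequence reference."""
    records = split_fasta(fasta_text)
    if len(records) != 2:
        return None
    return [
        {"output": f"haplotype{index}.fa", "name": name,
         "length": len(sequence), "coverage": coverage / 2}
        for index, (name, sequence) in enumerate(records, start=1)
    ]


if __name__ == "__main__":
    diploid = ">hap1\nACGTAC\n>hap2\nACGTACGTAC\n"
    for job in plan_split_simulation(diploid, 50):
        print(job)
```

Because each haplotype receives the same coverage regardless of its length, the longer allele no longer attracts proportionally more reads.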

PacBio Parameters Section¶

Purpose¶

Parameters for PacBio HiFi read simulation.

Structure¶

{
  "pacbio_params": {
    "model_type": "QSHMM",
    "model_file": "/path/to/pbsim3/models/QSHMM-RSII.model",
    "coverage": 30,
    "min_pass": 3,
    "max_pass": 15,
    "seed": null
  }
}

Fields¶

Field	Type	Description	Default
`model_type`	string	"QSHMM" or "ERRHMM"	"QSHMM"
`model_file`	string	pbsim3 model file path	Required
`coverage`	integer	Target coverage depth	30
`min_pass`	integer	Minimum CCS passes	3
`max_pass`	integer	Maximum CCS passes	15
`seed`	integer	Random seed (null = random)	null

Complete Example¶

{
  "repeats": {
    "1": "AAGGAGACTTCGGCTACCCAGAGAAGTTCAGTGCCCAGCTCTACTGAGAAGAATGCTGTG",
    "2": "AGTATGACCAGCAGCGTACTCTCCAGCCACAGCCCCGGTTCAGGCTCCTCCACCACTCAG",
    "X": "GCCCACGGTGTCACCTCGGCCCCGGACACCAGGCCGGCCCCGGGCTCCACCGCCCCCCCA",
    "A": "GCCCACGGTGTCACCTCGGCCCCGGAGAGCAGGCCGGCCCCGGGCTCCACCGCGCCCGCA",
    "B": "GCCCACGGTGTCACCTCGGCCCCGGAGAGCAGGCCGGCCCCGGGCTCCACCGCCCCCCCA",
    "6": "GCCCACGGTGTCACCTCGGCCCCGGACACCAGGCGGGCCCCGGGCTCCACCCCGGCCCCG",
    "6p": "GCCCACGGTGTCACCTCGGCCCCGGACACCAGGCCGGCCCCGGGCTCCACCCCGGCCCCG",
    "7": "GGCTCCACCGCCCCCCCAGCCCACGGTGTCACCTCGGCCCCGGACACCAGGCCGGCCCCG",
    "8": "GGCTCCACCGCCCCCCCAGCCCATGGTGTCACCTCGGCCCCGGACAACAGGCCCGCCTTG",
    "9": "GGCTCCACCGCCCCTCCAGTCCACAATGTCACCTCGGCCTCAGGCTCTGCATCAGGCTCA"
  },

  "constants": {
    "hg38": {
      "left": "AGCAGGCAGTGCGGGGCCGCTGCTGCTG...",
      "right": "TGCTGCTGCTGCGGGGCCGCTGCTGCTG...",
      "vntr_start": 1500,
      "vntr_end": 8500
    }
  },

  "probabilities": {
    "1": {"2": 0.3, "X": 0.7},
    "2": {"1": 0.2, "A": 0.8},
    "X": {"A": 0.5, "B": 0.5},
    "A": {"B": 0.6, "X": 0.4},
    "B": {"A": 0.5, "X": 0.5}
  },

  "length_model": {
    "distribution_type": "normal",
    "mean_repeats": 63.3,
    "median_repeats": 70,
    "min_repeats": 42,
    "max_repeats": 85
  },

  "mutations": {
    "dupC": {
      "allowed_repeats": ["X", "A", "B"],
      "strict_mode": false,
      "changes": [
        {
          "operation": "insert",
          "sequence": "GCCCACGGTGTCACCTCGGCCCCGGACACCAGGCCGGCCCCGGGCTCCACCGCCCCCCCA",
          "position_offset": 0
        }
      ]
    }
  },

  "tools": {
    "reseq": "/usr/bin/reseq",
    "bwa": "/usr/bin/bwa",
    "samtools": "/usr/bin/samtools"
  },

  "read_simulation": {
    "simulator": "illumina",
    "read_length": 150,
    "fragment_size": 350,
    "fragment_sd": 50,
    "coverage": 100,
    "threads": 4,
    "reference_genome": "/path/to/hg38.fa",
    "seed": 42
  },

  "nanosim_params": {
    "training_data_path": "/path/to/nanosim_training",
    "coverage": 50,
    "min_read_length": 1500,
    "max_read_length": 5000,
    "correction_factor": 0.325,
    "enable_split_simulation": true,
    "seed": 42
  },

  "pacbio_params": {
    "model_type": "QSHMM",
    "model_file": "/path/to/pbsim3/QSHMM-RSII.model",
    "coverage": 30,
    "min_pass": 3,
    "max_pass": 15,
    "seed": 42
  }
}

Validation¶

MucOneUp validates configuration on load:

Checked:

All repeat symbols referenced in probabilities exist in repeats
Terminal block symbols (6/6p, 7, 8, 9) defined
Probability values are valid (0.0 to 1.0)
Length model parameters are positive numbers (min_repeats and max_repeats are integers)
Mutation sequences contain valid DNA (ACGT)
Tool paths exist (if specified)

Errors cause simulation to fail immediately.
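The first check in the list can be reproduced with a few lines of Python, which is handy when editing probabilities by hand. The sketch below is a stand-alone illustration; `undefined_symbols` is a hypothetical name and not MucOneUp's validator.

```python
def undefined_symbols(config: dict) -> list[str]:
    """Return symbols used in "probabilities" that are missing from "repeats"."""
    defined = set(config["repeats"])
    referenced = set()
    for source, row in config["probabilities"].items():
        referenced.add(source)   # outer key: source symbol
        referenced.update(row)   # inner keys: destination symbols
    return sorted(referenced - defined)


if __name__ == "__main__":
    config = {
        "repeats": {"1": "ACGT", "X": "ACGA"},
        "probabilities": {"1": {"X": 0.5, "Z": 0.5}},
    }
    print(undefined_symbols(config))
```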

Best Practices¶

Version Control Configuration

Commit config.json to version control with your simulation scripts for reproducibility.

Comment Your Mutations

JSON doesn't support comments, but you can use descriptive mutation names:

{
  "mutations": {
    "dupC_gastric_cancer_pmid12345": { ... }
  }
}

Test Configuration

Validate configuration before large-scale simulations:

muconeup --config config.json simulate --fixed-lengths 20 --out-base test

Platform-Specific Paths

Tool paths differ across systems. Use environment variables or separate configs for different machines:

# Linux
muconeup --config config_linux.json simulate ...

# macOS
muconeup --config config_macos.json simulate ...
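If a wrapper script launches the runs, the choice of config file can be automated. The sketch below is one possible approach, assuming the two file names used above; `config_for_system` is a hypothetical helper and relies only on the standard library's `platform.system()`.

```python
import platform
from typing import Optional

# File names follow the examples above; adjust to your own layout.
CONFIG_BY_SYSTEM = {
    "Linux": "config_linux.json",
    "Darwin": "config_macos.json",  # platform.system() reports macOS as "Darwin"
}


def config_for_system(system: Optional[str] = None) -> str:
    """Return the config file name registered for the given (or current) OS."""
    system = system or platform.system()
    if system not in CONFIG_BY_SYSTEM:
        raise ValueError(f"no config registered for {system!r}")
    return CONFIG_BY_SYSTEM[system]


if __name__ == "__main__":
    print(config_for_system("Linux"))
```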

Next Steps¶

Simulation Guide - Use your configuration for simulations
Mutations Guide (coming soon) - Define custom mutations
Workflows (coming soon) - Analyze real data to inform probabilities