CLI Reference

Complete command-line interface documentation for MucOneUp.

muconeup

MucOneUp - MUC1 VNTR diploid reference simulator.

Commands:

  simulate - Generate haplotypes
  reads - Simulate reads from FASTA
  analyze - Analyze FASTA (ORFs, stats, VNTR structure)

Usage:

muconeup [OPTIONS] COMMAND [ARGS]...

Options:

  -V, --version                   Show the version and exit.
  --config FILE                   Path to JSON configuration file.
  --log-level [DEBUG|INFO|WARNING|ERROR|CRITICAL|NONE]
                                  Set logging level.  [default: INFO]
  -v, --verbose                   Enable verbose output (sets log level to
                                  DEBUG).
  -h, --help                      Show this message and exit.

muconeup analyze

Analysis utilities.

Analyze ANY FASTA file. Works with MucOneUp outputs or external sequences. Requires --config at the top level.

Usage:

muconeup analyze [OPTIONS] COMMAND [ARGS]...

Options:

  -h, --help  Show this message and exit.
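
Because --config is a top-level option, it has to appear before the command group rather than after the subcommand. The following is a minimal Python sketch of assembling such an invocation. The helper name build_command and the file names are hypothetical, and only flags documented on this page are used.

```python
import shlex


def build_command(config, group, command, inputs, **options):
    """Assemble a muconeup argument list (hypothetical helper, not part of MucOneUp).

    Top-level options such as --config go before the command group;
    command options go after the subcommand name and its inputs.
    """
    argv = ["muconeup", "--config", str(config), group, command]
    argv.extend(str(path) for path in inputs)
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)  # boolean flag such as --header
        elif value is not None and value is not False:
            argv.extend([flag, str(value)])
    return argv


argv = build_command(
    "config.json", "analyze", "orfs",
    ["sample.001.fa", "sample.002.fa"],
    orf_min_aa=100, out_dir="results",
)
print(shlex.join(argv))
```

With MucOneUp installed, the resulting list could be passed to subprocess.run(argv, check=True).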

muconeup analyze orfs

Predict ORFs and detect toxic protein features from one or more FASTA files.

Supports batch processing following Unix philosophy:

  - Single file: muconeup analyze orfs file.fa --out-base analysis
  - Multiple files: muconeup analyze orfs file1.fa file2.fa file3.fa
  - Glob pattern: muconeup analyze orfs *.simulated.fa

When processing multiple files, --out-base is auto-generated from input filenames unless explicitly provided (which applies to all files).

Examples:

  # Single file with custom output name
  muconeup --config X analyze orfs sample.001.fa --out-base my_analysis

  # Multiple files (auto-generated output names)
  muconeup --config X analyze orfs sample.001.fa sample.002.fa

  # Glob pattern (shell expands)
  muconeup --config X analyze orfs sample.*.simulated.fa

Usage:

muconeup analyze orfs [OPTIONS] INPUT_FASTAS...

Options:

  --out-dir DIRECTORY   Output folder.  [default: .]
  --out-base TEXT       Base name for output files (auto-generated if
                        processing multiple files).
  --orf-min-aa INTEGER  Minimum ORF length in amino acids.  [default: 100]
  --orf-aa-prefix TEXT  Filter ORFs by prefix (e.g., MTSSV).
  -h, --help            Show this message and exit.
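
The effect of --orf-min-aa and --orf-aa-prefix can be pictured as a filter over translated ORFs. This is a rough sketch only: the function filter_orfs is hypothetical, and details such as whether the threshold is inclusive or whether the stop codon counts toward the length are assumptions, not confirmed MucOneUp behavior.

```python
def filter_orfs(proteins, min_aa=100, aa_prefix=None):
    """Keep proteins with at least min_aa residues and, optionally, a given prefix.

    Hypothetical illustration; the inclusive threshold is an assumption.
    """
    kept = []
    for protein in proteins:
        if len(protein) < min_aa:
            continue  # too short
        if aa_prefix and not protein.startswith(aa_prefix):
            continue  # does not start with the requested prefix
        kept.append(protein)
    return kept


candidates = [
    "MTSSV" + "A" * 120,  # long enough, matching prefix
    "MKL" + "A" * 150,    # long enough, different prefix
    "MTSSV" + "A" * 20,   # matching prefix, too short
]
selected = filter_orfs(candidates, min_aa=100, aa_prefix="MTSSV")
print(len(selected))  # 1
```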

muconeup analyze snapshot-validate

Validate SNaPshot assay for MUC1 VNTR mutations.

Simulates complete SNaPshot workflow: PCR -> MwoI digest -> extension -> detection.

Examples:

  # Validate dupC mutation in a sample
  muconeup --config config.json analyze snapshot-validate sample.fa --mutation dupC

  # Save results to JSON
  muconeup --config config.json analyze snapshot-validate sample.fa --mutation dupC --output results.json

Usage:

muconeup analyze snapshot-validate [OPTIONS] INPUT_FASTA

Options:

  --mutation TEXT  Mutation name to validate (e.g., 'dupC').  [required]
  --output PATH    Output JSON file for validation results (prints to stdout
                   if not specified).
  -h, --help       Show this message and exit.
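
To make the MwoI digest step of the workflow more concrete, here is a small sketch that scans a sequence for MwoI recognition sites. MwoI's recognition sequence is commonly given as GCNNNNN^NNGC. The function name and the test sequence are made up for illustration, and this is not how MucOneUp itself is known to implement the digest.

```python
import re

# MwoI recognition pattern GCNNNNNNNGC (N = any base).
# The lookahead lets overlapping sites be reported as well.
MWOI_SITE = re.compile(r"(?=(GC[ACGT]{7}GC))")


def find_mwoi_sites(sequence):
    """Return 0-based start positions of MwoI recognition sites in sequence."""
    return [match.start() for match in MWOI_SITE.finditer(sequence.upper())]


amplicon = "TTGCAAAAAAAGCTT"  # made-up sequence with a single site
print(find_mwoi_sites(amplicon))  # [2]
```

The pattern reads the same on the reverse complement, so scanning one strand is enough to locate sites on both.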

muconeup analyze stats

Generate basic sequence statistics from one or more FASTA files.

Supports batch processing following Unix philosophy:

  - Single file: muconeup analyze stats file.fa --out-base stats
  - Multiple files: muconeup analyze stats file1.fa file2.fa file3.fa
  - Glob pattern: muconeup analyze stats *.simulated.fa

When processing multiple files, --out-base is auto-generated from input filenames unless explicitly provided (which applies to all files).

Examples:

  # Single file with custom output name
  muconeup --config X analyze stats sample.001.fa --out-base my_stats

  # Multiple files (auto-generated output names)
  muconeup --config X analyze stats sample.001.fa sample.002.fa

  # Glob pattern (shell expands)
  muconeup --config X analyze stats sample.*.simulated.fa

Usage:

muconeup analyze stats [OPTIONS] INPUT_FASTAS...

Options:

  --out-dir DIRECTORY  Output folder.  [default: .]
  --out-base TEXT      Base name for output files (auto-generated if
                       processing multiple files).
  -h, --help           Show this message and exit.

muconeup analyze vntr-stats

Analyze VNTR structures and compute transition probabilities.

Processes a CSV/TSV file containing VNTR structures, calculates statistics (min/max/mean/median repeat units), and builds a transition probability matrix showing the likelihood of each repeat unit following another.

The analysis removes duplicate VNTR structures and includes an "END" state representing sequence termination. Unknown repeat tokens (not in config) trigger warnings but don't cause failure.

Examples:

  # Analyze example VNTR database
  muconeup --config X analyze vntr-stats data/examples/vntr_database.tsv --header

  # Use custom column and save to file
  muconeup --config X analyze vntr-stats data.csv \
    --delimiter "," --structure-column "sequence" --output stats.json

  # Column index without header
  muconeup --config X analyze vntr-stats data.tsv --structure-column 3

  # Pipe to jq for filtering
  muconeup --config X analyze vntr-stats data/examples/vntr_database.tsv \
    --header | jq '.mean_repeats'

Output JSON contains:

  - Statistics: min/max/mean/median repeat counts
  - Probabilities: State transition matrix (including END state)
  - Repeats: Known repeat dictionary from config
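
The kind of computation described above (deduplication, repeat-count statistics, and a transition matrix with an END state) can be sketched in a few lines of Python. This is an illustration under assumptions, not the MucOneUp implementation: the "-" separator between repeat tokens, the function name, and all dictionary keys other than mean_repeats (which appears in the jq example) are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean, median


def vntr_stats(structures, separator="-"):
    """Summarize repeat counts and first-order transition probabilities.

    The separator and the output keys are assumptions made for this sketch.
    """
    unique = sorted(set(structures))  # drop duplicate structures
    chains = [structure.split(separator) for structure in unique]
    lengths = [len(chain) for chain in chains]

    # Count how often each repeat unit is followed by another unit or by END.
    counts = defaultdict(Counter)
    for chain in chains:
        for current, following in zip(chain, chain[1:] + ["END"]):
            counts[current][following] += 1

    # Normalize each row of counts into probabilities.
    probabilities = {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }
    return {
        "min_repeats": min(lengths),
        "max_repeats": max(lengths),
        "mean_repeats": mean(lengths),
        "median_repeats": median(lengths),
        "probabilities": probabilities,
    }


result = vntr_stats(["1-2-X-9", "1-2-X-X-9", "1-2-X-9"])
print(result["mean_repeats"])        # 4.5
print(result["probabilities"]["9"])  # {'END': 1.0}
```

In this toy input the duplicate structure is counted once, so "X" is followed by "9" in two of three observed transitions and by "X" in the remaining one.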

Usage:

muconeup analyze vntr-stats [OPTIONS] INPUT_FILE

Options:

  --structure-column TEXT  Column name (if header) or 0-based index containing
                           VNTR structure.  [default: vntr]
  --delimiter TEXT         Field delimiter for input file.  [default: tab]
  --header                 Specify if input file has header row.
  -o, --output PATH        Output JSON file (default: stdout).
  -h, --help               Show this message and exit.

muconeup reads

Read simulation utilities.

Simulate reads from ANY FASTA file. Works with MucOneUp outputs or external sequences. Requires --config at the top level.

Usage:

muconeup reads [OPTIONS] COMMAND [ARGS]...

Options:

  -h, --help  Show this message and exit.

muconeup reads amplicon

Simulate amplicon reads from one or more FASTA files.

Supports PacBio (default) and ONT platforms. PacBio uses multi-pass CCS consensus; ONT uses single-pass pbsim3 with map-ont alignment.

Note: --coverage specifies the total number of template molecules (before CCS filtering for PacBio). Final HiFi read count may be lower due to CCS quality filtering (min-rq, min-passes). For diploid inputs the total is split between alleles by the PCR bias model.

Pipeline (PacBio):

  1. Extract amplicon region per haplotype (primer-based)
  2. Apply PCR length bias to determine per-allele coverage
  3. Simulate full-length reads (pbsim3 --strategy templ)
  4. Generate HiFi consensus (CCS)
  5. Align to reference (minimap2 map-hifi preset)

Pipeline (ONT):

  1-3. Same extraction and PCR bias stages
  4. Simulate single-pass reads (pbsim3 --strategy templ, pass_num=1)
  5. BAM to FASTQ (no CCS)
  6. Align to reference (minimap2 map-ont preset)
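
For intuition about the PCR bias stage, the sketch below simulates a simple Galton-Watson branching process and splits a read budget between two alleles according to the simulated yields. Everything specific here is an assumption for illustration: the per-cycle efficiencies, the cycle count, the starting molecule count, and the function names are hypothetical and do not describe MucOneUp's actual bias model.

```python
import random


def galton_watson_pcr(start_molecules, efficiency, cycles, rng):
    """Simulate PCR where each molecule is copied with probability `efficiency` per cycle."""
    population = start_molecules
    for _ in range(cycles):
        copies = sum(1 for _ in range(population) if rng.random() < efficiency)
        population += copies
    return population


def split_coverage(total_reads, yields):
    """Divide a read budget between alleles in proportion to their PCR yield.

    Rounding means the parts may differ from total_reads by a read or so.
    """
    total_yield = sum(yields)
    return [round(total_reads * y / total_yield) for y in yields]


rng = random.Random(42)  # fixed seed for a reproducible run
# Hypothetical efficiencies: the shorter allele amplifies slightly better.
short_allele = galton_watson_pcr(10, efficiency=0.90, cycles=10, rng=rng)
long_allele = galton_watson_pcr(10, efficiency=0.80, cycles=10, rng=rng)
print(split_coverage(1000, [short_allele, long_allele]))
```

Because small differences in efficiency compound over cycles, the allele with the lower efficiency typically ends up with a noticeably smaller share of the reads.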

Examples:

  # Basic PacBio amplicon simulation
  muconeup --config X reads amplicon sample.fa \
    --model-file /models/QSHMM-SEQUEL.model

  # ONT amplicon simulation
  muconeup --config X reads amplicon --platform ont sample.fa \
    --model-file /models/QSHMM-ONT-HQ.model

  # High coverage with stochastic PCR bias
  muconeup --config X reads amplicon sample.fa \
    --model-file /models/QSHMM-SEQUEL.model \
    --coverage 1000 --stochastic-pcr --seed 42

  # No PCR bias (equal coverage per allele)
  muconeup --config X reads amplicon sample.fa \
    --model-file /models/QSHMM-SEQUEL.model \
    --pcr-preset no_bias

Usage:

muconeup reads amplicon [OPTIONS] INPUT_FASTAS...

Options:

  --model-file FILE               Path to pbsim3 model file (overrides config
                                  if provided).
  --model-type [qshmm|errhmm]     pbsim3 model type (overrides config if
                                  provided).
  --pcr-preset [default|no_bias]  PCR bias preset profile (default: from
                                  config or 'default').
  --stochastic-pcr                Enable stochastic PCR bias (Galton-Watson
                                  branching process).
  --platform [pacbio|ont]         Sequencing platform for amplicon simulation.
                                  [default: pacbio]
  --out-dir DIRECTORY             Output folder.  [default: .]
  --out-base TEXT                 Base name for output files (auto-generated
                                  if processing multiple files).
  --coverage INTEGER              Target sequencing coverage (overrides config
                                  if provided, defaults to config value or
                                  30x).
  --seed INTEGER                  Random seed for reproducibility (same seed =
                                  identical reads).
  --track-read-source             Generate read source tracking manifest and
                                  coordinate map alongside simulated reads.
  -h, --help                      Show this message and exit.

muconeup reads illumina

Simulate Illumina short reads from one or more FASTA files.

The --read-number flag controls how many fragment pairs are generated before error modeling and alignment. This is the upper bound on output reads (2x read-number for paired-end). The --coverage flag then downsamples the aligned reads to the target depth. If --coverage exceeds what read-number can deliver, all reads are kept.
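
The interplay between --read-number and --coverage can be approximated with simple arithmetic. The sketch below is a back-of-the-envelope estimate only: the read length and region length are hypothetical inputs, every read is assumed to align, and the function names are not part of MucOneUp.

```python
def achievable_coverage(read_number, read_length, region_length):
    """Upper bound on mean depth; paired-end yields two reads per fragment."""
    return 2 * read_number * read_length / region_length


def downsample_fraction(target_coverage, read_number, read_length, region_length):
    """Fraction of aligned reads to keep; 1.0 means all reads are kept."""
    available = achievable_coverage(read_number, read_length, region_length)
    return min(1.0, target_coverage / available)


# Hypothetical numbers: 150 bp reads over a 10 kb region.
print(achievable_coverage(100_000, 150, 10_000))        # 3000.0
print(downsample_fraction(30, 100_000, 150, 10_000))    # 0.01
print(downsample_fraction(5000, 100_000, 150, 10_000))  # 1.0
```

The last call corresponds to the case where --coverage exceeds what the fragments can deliver, so nothing is discarded.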

Examples:

  # Standard simulation (100k fragments from config default)
  muconeup --config X reads illumina sample.fa

  # High-coverage simulation (500k fragments)
  muconeup --config X reads illumina sample.fa \
    --read-number 500000 --coverage 2000

  # Batch processing
  muconeup --config X reads illumina sample.*.simulated.fa

Usage:

muconeup reads illumina [OPTIONS] INPUT_FASTAS...

Options:

  --threads INTEGER      Number of threads.  [default: 8]
  --read-number INTEGER  Number of fragment pairs to generate (overrides
                         config read_number). More fragments allow higher
                         coverage. Default: 100000 from config.
  --out-dir DIRECTORY    Output folder.  [default: .]
  --out-base TEXT        Base name for output files (auto-generated if
                         processing multiple files).
  --coverage INTEGER     Target sequencing coverage (overrides config if
                         provided, defaults to config value or 30x).
  --seed INTEGER         Random seed for reproducibility (same seed =
                         identical reads).
  --track-read-source    Generate read source tracking manifest and coordinate
                         map alongside simulated reads.
  -h, --help             Show this message and exit.

muconeup reads ont

Simulate Oxford Nanopore long reads from one or more FASTA files.

Supports batch processing following Unix philosophy:

  - Single file: muconeup reads ont file.fa --out-base reads
  - Multiple files: muconeup reads ont file1.fa file2.fa file3.fa
  - Glob pattern: muconeup reads ont *.simulated.fa

When processing multiple files, --out-base is auto-generated from input filenames unless explicitly provided (which applies to all files).

Examples:

  # Single file with custom output name
  muconeup --config X reads ont sample.001.fa --out-base my_reads

  # Multiple files (auto-generated output names)
  muconeup --config X reads ont sample.001.fa sample.002.fa

  # Glob pattern (shell expands)
  muconeup --config X reads ont sample.*.simulated.fa

Usage:

muconeup reads ont [OPTIONS] INPUT_FASTAS...

Options:

  --min-read-length INTEGER  Minimum read length.  [default: 100]
  --out-dir DIRECTORY        Output folder.  [default: .]
  --out-base TEXT            Base name for output files (auto-generated if
                             processing multiple files).
  --coverage INTEGER         Target sequencing coverage (overrides config if
                             provided, defaults to config value or 30x).
  --seed INTEGER             Random seed for reproducibility (same seed =
                             identical reads).
  --track-read-source        Generate read source tracking manifest and
                             coordinate map alongside simulated reads.
  -h, --help                 Show this message and exit.

muconeup reads pacbio

Simulate PacBio HiFi reads from one or more FASTA files.

Supports batch processing following Unix philosophy:

  - Single file: muconeup reads pacbio file.fa --model-file X.model --out-base reads
  - Multiple files: muconeup reads pacbio file1.fa file2.fa --model-file X.model
  - Glob pattern: muconeup reads pacbio *.simulated.fa --model-file X.model

When processing multiple files, --out-base is auto-generated from input filenames unless explicitly provided (which applies to all files).

Workflow:

  1. Multi-pass CLR simulation (pbsim3)
  2. HiFi consensus generation (CCS)
  3. Read alignment (minimap2 with map-hifi preset)

Examples:

  # Single file with standard HiFi settings (Q20)
  muconeup --config X reads pacbio sample.001.fa \
    --model-file /models/QSHMM-SEQUEL.model \
    --out-base my_hifi

  # Multiple files with high-accuracy HiFi (Q30)
  muconeup --config X reads pacbio sample.*.fa \
    --model-file /models/QSHMM-SEQUEL.model \
    --min-rq 0.999 --min-passes 5

  # Ultra-deep coverage simulation
  muconeup --config X reads pacbio sample.fa \
    --model-file /models/QSHMM-SEQUEL.model \
    --coverage 100 --pass-num 5

Model Files:

  Download from: https://github.com/yukiteruono/pbsim3/tree/master/data

  - QSHMM-SEQUEL.model: Sequel II chemistry
  - QSHMM-RSII.model: RS II chemistry
  - ERRHMM-SEQUEL.model: Alternative error model

Quality Control:

  - pass_num >=2 required for multi-pass (>=3 recommended)
  - min_passes controls CCS stringency (higher = better quality, lower yield)
  - min_rq=0.99 is Q20 (standard HiFi threshold)
  - min_rq=0.999 is Q30 (ultra-high accuracy)
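
The correspondence between min_rq values and Q scores follows from the standard Phred relation Q = -10 * log10(1 - accuracy). A short sketch, with hypothetical helper names, shows the conversion in both directions:

```python
import math


def rq_to_phred(rq):
    """Convert predicted read accuracy (between 0 and 1) to a Phred-scaled quality."""
    return -10 * math.log10(1 - rq)


def phred_to_rq(q):
    """Inverse conversion: Phred quality to predicted accuracy."""
    return 1 - 10 ** (-q / 10)


for rq in (0.99, 0.999):
    print(f"min_rq={rq} -> Q{rq_to_phred(rq):.0f}")
```

This prints Q20 for 0.99 and Q30 for 0.999, matching the thresholds listed above.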

Usage:

muconeup reads pacbio [OPTIONS] INPUT_FASTAS...

Options:

  --threads INTEGER            Number of threads.  [default: 4]
  --model-file FILE            Path to pbsim3 model file (overrides config if
                               provided).
  --model-type [qshmm|errhmm]  pbsim3 model type (overrides config if
                               provided).
  --min-rq FLOAT               Minimum predicted read quality (RQ) score
                               (0.0-1.0). 0.99=Q20 (standard HiFi), 0.999=Q30.
  --min-passes INTEGER         Minimum passes required for CCS HiFi consensus
                               (>=1, overrides config if provided).
  --pass-num INTEGER           Number of passes per molecule for multi-pass
                               CLR simulation (>=2, overrides config if
                               provided).
  --out-dir DIRECTORY          Output folder.  [default: .]
  --out-base TEXT              Base name for output files (auto-generated if
                               processing multiple files).
  --coverage INTEGER           Target sequencing coverage (overrides config if
                               provided, defaults to config value or 30x).
  --seed INTEGER               Random seed for reproducibility (same seed =
                               identical reads).
  --track-read-source          Generate read source tracking manifest and
                               coordinate map alongside simulated reads.
  -h, --help                   Show this message and exit.

muconeup simulate

Generate MUC1 VNTR diploid haplotypes.

Generates haplotype FASTA files only. For read simulation, pass the output FASTA files to a 'reads' command (e.g., reads illumina).

Output:

  - {out_base}.{iteration}.simulated.fa (haplotype sequences)
  - {out_base}.{iteration}.vntr_structure.txt (if --output-structure)
  - {out_base}.{iteration}.simulation_stats.json (statistics)

Example:

  muconeup --config config.json simulate --out-base output
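
Downstream scripts often need to collect the files written by simulate. The sketch below locates haplotype FASTA files by the naming pattern listed above and derives the matching statistics file name. The helper names are hypothetical, and the three-digit iteration numbers in the demo files are an assumption based on names such as sample.001.fa used elsewhere on this page.

```python
import tempfile
from pathlib import Path


def find_simulated_fastas(out_dir, out_base):
    """List haplotype FASTA files matching {out_base}.{iteration}.simulated.fa."""
    return sorted(Path(out_dir).glob(f"{out_base}.*.simulated.fa"))


def companion_stats(fasta_path):
    """Derive the statistics file name sharing the FASTA's base name and iteration."""
    prefix = fasta_path.name[: -len(".simulated.fa")]
    return fasta_path.with_name(prefix + ".simulation_stats.json")


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "output.001.simulated.fa",
        "output.002.simulated.fa",
        "output.001.simulation_stats.json",
    ):
        (Path(tmp) / name).touch()
    found = [path.name for path in find_simulated_fastas(tmp, "output")]

print(found)
```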

Usage:

muconeup simulate [OPTIONS]

Options:

  --out-base TEXT                 Base name for output files.  [default:
                                  muc1_simulated]
  --out-dir DIRECTORY             Output folder.  [default: .]
  --num-haplotypes INTEGER        Number of haplotypes to simulate.  [default:
                                  2]
  --seed INTEGER                  Random seed for reproducibility.
  --reference-assembly [hg19|hg38]
                                  Reference assembly (overrides config).
  --output-structure              Write VNTR structure file.
  --fixed-lengths TEXT            Fixed VNTR lengths or ranges (e.g., '60' or
                                  '20-40').
  --input-structure FILE          Predefined VNTR structure file.
  --simulate-series INTEGER       Series step size for fixed-length ranges.
  --mutation-name TEXT            Mutation name (e.g., 'dupC'). Use
                                  'normal,dupC' for dual output (normal +
                                  mutated).
  --mutation-targets TEXT         Mutation targets as 'hap_idx,rep_idx' pairs
                                  (1-based).
  --snp-input-file FILE           TSV file with predefined SNPs.
  --random-snps                   Enable random SNP generation.
  --random-snp-density FLOAT      SNP density per 1000 bp.
  --random-snp-output-file TEXT   Output file for random SNPs.
  --random-snp-region [all|constants_only|vntr_only]
                                  Region for random SNPs.  [default:
                                  constants_only]
  --random-snp-haplotypes [all|1|2]
                                  Haplotypes for random SNPs.  [default: all]
  --track-read-source             Generate read source tracking manifest and
                                  coordinate map alongside simulated reads.
  -h, --help                      Show this message and exit.

Examples

See Quick Start for complete usage examples.