Test Datasets
Standardized test datasets for validating MucOneUp workflows and benchmarking performance.
Download
Latest Release: testdata_40-70.tar.gz
The _latest directory always contains the most recent test dataset with a constant filename for persistent download links.
Extract with: tar -xzf testdata_40-70.tar.gz
Check version: cat VERSION.json (included in _latest/ directory)
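For scripted setups, the extract and version-check steps above can also be done from Python. The following is a minimal sketch, not part of MucOneUp: `extract_dataset` and `read_version` are hypothetical helper names, and the layout of `VERSION.json` is not documented here, so the file is simply loaded as generic JSON.

```python
import json
import tarfile
from pathlib import Path


def extract_dataset(tarball: Path, dest: Path) -> list[str]:
    """Extract a .tar.gz archive into dest and return the member names."""
    dest.mkdir(parents=True, exist_ok=True)
    dest = dest.resolve()
    with tarfile.open(tarball, "r:gz") as archive:
        members = archive.getmembers()
        # Refuse archives whose members would land outside dest.
        for member in members:
            target = (dest / member.name).resolve()
            if not target.is_relative_to(dest):
                raise ValueError(f"unsafe path in archive: {member.name}")
        # Use the stricter "data" extraction filter where the interpreter has it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(dest, **kwargs)
    return [member.name for member in members]


def read_version(version_file: Path) -> dict:
    """Load VERSION.json as plain JSON (its schema is an assumption-free dict)."""
    return json.loads(version_file.read_text())
```

A call such as `extract_dataset(Path("testdata_40-70.tar.gz"), Path("testdata"))` mirrors the `tar -xzf` command; `read_version(Path("VERSION.json"))` mirrors `cat VERSION.json`.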
Dataset Characteristics
| Property | Value |
|---|---|
| VNTR Structure | Asymmetric diploid (40/70 repeats) |
| Variants | Normal + dupC mutation (position 20) |
| Platforms | Illumina (50×), ONT (30×), PacBio HiFi (30×) |
| Total Files | ~55 files (~17 MB) |
Directory Structure
testdata_40-70_{version}/
├── references/
│   ├── normal/
│   │   ├── testdata_40-70.001.normal.simulated.fa
│   │   ├── testdata_40-70.001.normal.simulated.structure.tsv
│   │   └── testdata_40-70.001.normal.simulated_toxic.orf_stats.json
│   └── dupC/
│       ├── testdata_40-70.001.mut.simulated.fa
│       ├── testdata_40-70.001.mut.simulated.structure.tsv
│       └── testdata_40-70.001.mut.simulated_toxic.orf_stats.json
├── illumina/
│   ├── normal/
│   │   ├── reads_R1.fastq.gz
│   │   ├── reads_R2.fastq.gz
│   │   ├── aligned.bam
│   │   ├── aligned.bam.bai
│   │   └── metadata.tsv
│   └── dupC/
│       └── ...
├── ont/
│   ├── normal/
│   │   ├── reads.fastq          # Uncompressed FASTQ
│   │   ├── aligned.bam
│   │   └── metadata.tsv
│   └── dupC/
│       └── ...
├── pacbio/
│   ├── normal/
│   │   ├── reads_hifi_0001.fastq.gz
│   │   ├── reads_hifi_0002.fastq.gz
│   │   ├── aligned.bam
│   │   └── vntr_efficiency_stats.json
│   └── dupC/
│       └── ...
├── metadata/
│   ├── dataset_metadata.json
│   ├── tool_versions.txt
│   └── generation.log
└── README.txt
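After extraction, it can be useful to confirm that the files named in the tree above are actually present. The sketch below is one way to do that; `missing_files` is a hypothetical helper, and the list covers only the files shown explicitly in the tree (the elided `dupC/` contents are deliberately left out).

```python
from pathlib import Path

# Files listed explicitly in the directory tree; elided entries ("...") are omitted.
EXPECTED_FILES = [
    "references/normal/testdata_40-70.001.normal.simulated.fa",
    "references/normal/testdata_40-70.001.normal.simulated.structure.tsv",
    "references/normal/testdata_40-70.001.normal.simulated_toxic.orf_stats.json",
    "references/dupC/testdata_40-70.001.mut.simulated.fa",
    "references/dupC/testdata_40-70.001.mut.simulated.structure.tsv",
    "references/dupC/testdata_40-70.001.mut.simulated_toxic.orf_stats.json",
    "illumina/normal/reads_R1.fastq.gz",
    "illumina/normal/reads_R2.fastq.gz",
    "illumina/normal/aligned.bam",
    "illumina/normal/aligned.bam.bai",
    "illumina/normal/metadata.tsv",
    "ont/normal/reads.fastq",
    "ont/normal/aligned.bam",
    "ont/normal/metadata.tsv",
    "pacbio/normal/reads_hifi_0001.fastq.gz",
    "pacbio/normal/reads_hifi_0002.fastq.gz",
    "pacbio/normal/aligned.bam",
    "pacbio/normal/vntr_efficiency_stats.json",
    "metadata/dataset_metadata.json",
    "metadata/tool_versions.txt",
    "metadata/generation.log",
    "README.txt",
]


def missing_files(root: Path, expected: list[str] = EXPECTED_FILES) -> list[str]:
    """Return the expected relative paths that do not exist under root."""
    return [rel for rel in expected if not (root / rel).is_file()]
```

Here `root` is the extracted `testdata_40-70_{version}/` directory; an empty return value means every listed file was found.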
Generation
Regenerate test data using:
python helpers/generate_test_data.py \
--version v{VERSION} \
--config config.json \
--platforms all \
--threads 4 \
--verbose
Requirements: All platform-specific tools (BWA, NanoSim, pbsim3) must be installed.
Output: Tarball in data/testdata_releases/{date}_{version}/.
The script automatically updates the _latest symlink to point to the newest release, so download links in the documentation remain valid.
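Because generation depends on external tools, a quick pre-flight check can catch a missing install before a long run. This is a sketch only: `missing_tools` is a hypothetical helper, and the executable names in `ASSUMED_EXECUTABLES` are assumptions about how BWA, NanoSim, and pbsim3 are typically exposed on `PATH`; adjust them to match the local installation.

```python
import shutil

# Assumed executable names for BWA, NanoSim, and pbsim3 (not confirmed by this page).
ASSUMED_EXECUTABLES = ["bwa", "simulator.py", "pbsim"]


def missing_tools(executables: list[str]) -> list[str]:
    """Return the executables that cannot be found on PATH."""
    return [name for name in executables if shutil.which(name) is None]
```

Calling `missing_tools(ASSUMED_EXECUTABLES)` before running `helpers/generate_test_data.py` returns an empty list when everything resolves.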
Use Cases
- Workflow validation: Test complete pipeline from reference generation to read simulation
- Platform comparison: Compare alignment characteristics across Illumina, ONT, and PacBio
- Mutation detection: Validate SNaPshot analysis on dupC variant
- ORF analysis: Test toxic protein detection on VNTR references
- Benchmarking: Standardized dataset for performance testing
Validation
Each platform directory includes metadata.tsv with read statistics:
- Read counts and length distributions
- Coverage uniformity across VNTR region
- Alignment metrics (MAPQ, error rates)
For Illumina and ONT, SNaPshot validation confirms that the dupC mutation at position 20 is detected.
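The read statistics in metadata.tsv can be loaded for inspection with the standard library. The sketch below assumes the file is tab-separated with a header row; the column names are not documented on this page, so the hypothetical `read_metadata` helper makes no assumption about them and returns each row as a dictionary keyed by whatever headers the file contains.

```python
import csv


def read_metadata(path: str) -> list[dict]:
    """Read a tab-separated file with a header row into a list of row dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, `read_metadata("illumina/normal/metadata.tsv")` returns one dictionary per data row, with all values as strings.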