Architecture¶
ReSeq2 learns error and quality profiles from real Illumina paired-end sequencing data, then uses those profiles to simulate realistic reads. This page describes the internal structure of the codebase.
Build Targets¶
The CMake build produces one static library and two executables:
```mermaid
graph LR
    LIB["reseq2_lib<br/><i>static library</i>"]
    CLI["reseq2<br/><i>CLI executable</i>"]
    TEST["reseq2_test<br/><i>test executable</i>"]
    GT["GoogleTest"]
    LIB --> CLI
    LIB --> TEST
    GT --> TEST
```
- reseq2_lib: Static library containing all production source code.
- reseq2: Thin CLI executable that links reseq2_lib and dispatches to commands.
- reseq2_test: Test executable linking reseq2_lib and GoogleTest.
Commands¶
The entry point is reseq/main.cpp, which parses the sub-command and delegates:
| Command | Purpose |
|---|---|
| illuminaPE | Full pipeline: collect stats from BAM, estimate probabilities via IPF, simulate reads |
| seqToIllumina | Apply Illumina error/quality model to input sequences (no coverage model) |
| queryProfile | Extract summary info from .reseq stats files |
| replaceN | Replace ambiguous bases (N) in reference sequences |
| convertProfile | Convert stats/probability files between binary and text formats |
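
The parse-and-delegate step can be pictured as a lookup table from sub-command name to handler. The following is a minimal sketch only: `CommandHandler`, `Dispatch`, and `MakeCommandTable` are hypothetical names, the handlers are placeholders, and the exit-code convention is an assumption. The actual structure of reseq/main.cpp may differ.

```cpp
#include <functional>
#include <initializer_list>
#include <map>
#include <string>
#include <vector>

// Hypothetical handler signature: receives the arguments that follow the
// sub-command and returns a process exit code.
using CommandHandler = std::function<int(const std::vector<std::string>&)>;

// Looks up the sub-command in the table and delegates to its handler.
// Returns 1 for an unknown sub-command (exit-code convention assumed here).
int Dispatch(const std::map<std::string, CommandHandler>& commands,
             const std::string& name,
             const std::vector<std::string>& args) {
    const auto it = commands.find(name);
    if (it == commands.end()) {
        return 1;
    }
    return it->second(args);
}

// Builds a table with placeholder handlers for the documented sub-commands.
// A real handler would run the corresponding pipeline instead of returning 0.
std::map<std::string, CommandHandler> MakeCommandTable() {
    std::map<std::string, CommandHandler> commands;
    for (const char* name : {"illuminaPE", "seqToIllumina", "queryProfile",
                             "replaceN", "convertProfile"}) {
        commands[name] = [](const std::vector<std::string>&) { return 0; };
    }
    return commands;
}
```

Keeping the table separate from the lookup means a new sub-command needs only one new entry, and unknown names are rejected in a single place.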
Core Components¶
All components live in reseq/ and are compiled into reseq2_lib.
Statistics Collection¶
```mermaid
graph TD
    DS["DataStats<br/><i>top-level aggregator</i>"]
    AS["AdapterStats"]
    CS["CoverageStats"]
    ES["ErrorStats"]
    FDS["FragmentDistributionStats"]
    FDUPS["FragmentDuplicationStats"]
    QS["QualityStats"]
    TS["TileStats"]
    DS --> AS
    DS --> CS
    DS --> ES
    DS --> FDS
    DS --> FDUPS
    DS --> QS
    DS --> TS
```
| Component | Responsibility |
|---|---|
| BamIngestionEngine | BAM file reading and record processing pipeline |
| DataStats | Top-level aggregator that orchestrates all sub-stats during data collection |
| AdapterStats | Adapter detection and trimming statistics |
| CoverageStats | Per-position coverage and GC-bias modeling |
| ErrorStats | Substitution and InDel error pattern collection |
| FragmentDistributionStats | Fragment length distribution modeling |
| FragmentDuplicationStats | PCR duplicate rate estimation |
| QualityStats | Base quality score distribution modeling |
| ReadSequenceStats | Read-level sequence statistics collection |
| TileStats | Flowcell tile-level variation tracking |
Reference and Context¶
| Component | Responsibility |
|---|---|
| Reference | Reference genome loading, surrounding context extraction, excluded regions |
| Surrounding / SurroundingBase | Sequence context modeling for position-dependent error patterns |
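
One common way to use sequence context as a key into statistics tables is to pack the surrounding bases into a single integer. The sketch below illustrates that general technique. It is not taken from the Surrounding class, and `BaseCode` and `EncodeContext` are hypothetical names. It assumes an A/C/G/T alphabet and treats any other character, such as N, as unencodable.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Maps A/C/G/T to 0..3. Any other character (for example N) has no code.
std::optional<std::uint32_t> BaseCode(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return std::nullopt;
    }
}

// Encodes a context string as a base-4 integer, first base most significant.
// Returns nullopt if the context contains a non-ACGT character or is longer
// than 16 bases (which would not fit into 32 bits).
std::optional<std::uint32_t> EncodeContext(std::string_view context) {
    if (context.size() > 16) {
        return std::nullopt;
    }
    std::uint32_t index = 0;
    for (const char base : context) {
        const auto code = BaseCode(base);
        if (!code) {
            return std::nullopt;
        }
        index = index * 4 + *code;
    }
    return index;
}
```

With this encoding a context of length k maps to an index in the range 0 to 4^k - 1, so per-context counts can live in a flat array instead of a string-keyed map.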
Estimation and Simulation¶
| Component | Responsibility |
|---|---|
| ProbabilityEstimates | Iterative Proportional Fitting (IPF) for multi-dimensional probability tables |
| Simulator | Block-based read simulation engine with threading support |
Infrastructure¶
| Component | Responsibility |
|---|---|
| Vect | Offset vector (indexed from non-zero starting position), used pervasively |
| types.hpp | Type aliases, VectorAtomic wrapper, SeqAn compatibility types (extracted from utilities.hpp during the Phase 8 refactoring) |
| utilities.hpp | Helper functions and utility classes (includes types.hpp) |
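
To illustrate the offset-vector idea, the sketch below stores only the occupied index range and remembers where that range starts. `OffsetVector` and its member names are hypothetical, and the real Vect interface may look quite different. The sketch only shows why such a container suits histograms whose first non-zero bin sits far from index 0.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal offset vector: element `pos` is stored at
// data_[pos - from_], so leading empty positions cost no memory.
template <typename T>
class OffsetVector {
public:
    // Index of the first stored element.
    std::size_t from() const { return from_; }
    // One past the index of the last stored element.
    std::size_t to() const { return from_ + data_.size(); }
    // Number of elements physically stored.
    std::size_t stored() const { return data_.size(); }

    // Mutable access that grows the storage (value-initialized) at the
    // front or back when `pos` lies outside the current range.
    T& operator[](std::size_t pos) {
        if (data_.empty()) {
            from_ = pos;
            data_.resize(1);
        } else if (pos < from_) {
            data_.insert(data_.begin(), from_ - pos, T());
            from_ = pos;
        } else if (pos >= to()) {
            data_.resize(pos - from_ + 1);
        }
        return data_[pos - from_];
    }

    // Read-only access that yields a default value outside the stored range.
    T at_or_default(std::size_t pos) const {
        if (pos < from_ || pos >= to()) {
            return T();
        }
        return data_[pos - from_];
    }

private:
    std::size_t from_ = 0;
    std::vector<T> data_;
};
```

Counting fragment lengths 150, 150, and 152 with `++counts[length]` on an `OffsetVector<int>` stores three elements covering indices 150 through 152, rather than 153 elements starting at index 0.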
Dependency Resolution¶
External dependencies are resolved through a two-tier strategy defined in cmake/ReSeqDependencies.cmake:
1. find_package() first: CMake checks whether the dependency is already installed on the system.
2. FetchContent fallback: if the dependency is not found locally, CMake downloads and builds it at configure time.

SeqAn and NLopt use this two-tier strategy. GoogleTest is always fetched via FetchContent, skipping the find_package() lookup, to ensure a consistent test framework version.
| Dependency | License | Resolution |
|---|---|---|
| SeqAn 2.5.2 | BSD 3-Clause | find_package() / FetchContent |
| GoogleTest | BSD 3-Clause | FetchContent only |
| NLopt | MIT | find_package() / FetchContent |
| skewer | MIT | Vendored (local modifications, see skewer/MODIFICATIONS.md) |
Why skewer is vendored
ReSeq2 uses only the core bit-masked k-difference adapter matching algorithm from skewer. The vendored copy carries namespace wrapping, unused code removal, and C++20 compatibility fixes that are not suitable for upstreaming. See skewer/MODIFICATIONS.md for the full list of changes.
System dependencies (Boost, ZLIB, BZip2) must be installed before building --- they are not fetched automatically.
Test Structure¶
Each core component has a corresponding *Test.cpp / *Test.h pair:
```
reseq/
    AdapterStatsTest.cpp     AdapterStatsTest.h
    CoverageStatsTest.cpp    CoverageStatsTest.h
    ErrorStatsTest.cpp       ErrorStatsTest.h
    SimulatorTest.cpp        SimulatorTest.h
    ...
```
All test classes inherit from the base fixture class defined in BasicTestClass.hpp, which extends ::testing::Test and provides shared test fixtures.
Test data lives in test/ and includes E. coli and Drosophila reference genomes, BAM files, and adapter sequences. The test/download_test_data.sh script fetches large test files that are not checked into the repository.
Tests are compiled into the single reseq2_test binary and run via CTest. See Building for details on running and filtering tests.
Versioning¶
The single source of truth for the version number is the VERSION file in the repository root (currently 1.1.0). CMake reads this file at configure time and generates version macros via CMakeConfig.h.in, including RESEQ_GIT_VERSION from git describe.