Initial Setup¶
Before using Phentrieve, you must prepare the HPO data and build vector indexes.
Configuration¶
Phentrieve uses a phentrieve.yaml file for configuration.
1. Copy the template: cp phentrieve.yaml.template phentrieve.yaml
2. Edit phentrieve.yaml to customize model selection or data paths.
1. Data Preparation (SQLite)¶
Phentrieve stores HPO terms and graph metadata in a local SQLite database (hpo_data.db) for fast retrieval. Create it by running:
phentrieve data prepare
This command:
1. Downloads the official hp.json from the HPO ontology repository
2. Parses 19,534+ HPO terms with labels, definitions, synonyms, and comments
3. Pre-computes ontology hierarchy (ancestor graphs and term depths)
4. Stores structured data in data/hpo_data.db using an optimized SQLite schema
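Step 3 above (pre-computing ancestor sets and term depths) can be sketched as follows. This is a minimal illustration and not Phentrieve's actual implementation: the `PARENTS` map is a hand-written, simplified slice of the ontology, the function names are hypothetical, and depth is assumed to mean the shortest distance from the root.

```python
from collections import deque

# Hypothetical, simplified child -> direct-parents map (illustration only).
PARENTS = {
    "HP:0000001": [],              # root
    "HP:0000118": ["HP:0000001"],
    "HP:0000707": ["HP:0000118"],
    "HP:0012638": ["HP:0000707"],
    "HP:0001250": ["HP:0012638"],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def depths(parents, root):
    """Breadth-first search from the root; depth = shortest distance to root."""
    children = {}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children.setdefault(parent, []).append(child)
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in depth:  # first visit is the shortest path
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


term_ancestors = ancestors("HP:0001250", PARENTS)
term_depths = depths(PARENTS, "HP:0000001")
```

Computing these once at preparation time means a query only has to read a stored ancestor set instead of walking the graph again.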
What Gets Created¶
The data preparation process generates a compact, high-performance SQLite database:
- Database Size: ~12 MB (compared to 60 MB with previous file-based storage)
- Performance: 10-15x faster loading (0.87s vs 10-15s)
- Schema Optimizations:
  - Write-Ahead Logging (WAL) mode for concurrent reads
  - WITHOUT ROWID optimization for 20% storage savings
  - Memory-mapped I/O for faster access
  - Indexed columns for common queries
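To make the schema optimizations listed above concrete, here is a small sketch using Python's standard `sqlite3` module. The table name `demo_terms`, the index name, and the 256 MiB mmap size are assumptions chosen for illustration; they are not taken from Phentrieve's real schema.

```python
import os
import sqlite3
import tempfile


def open_demo_db(path):
    """Create a demo database applying WAL, mmap, WITHOUT ROWID and an index."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer is active; the pragma returns the mode in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Ask SQLite to memory-map up to 256 MiB of the file (assumed value).
    conn.execute("PRAGMA mmap_size=268435456")
    # WITHOUT ROWID stores rows directly in the primary-key B-tree.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS demo_terms (
            id TEXT PRIMARY KEY,
            label TEXT NOT NULL
        ) WITHOUT ROWID
        """
    )
    # Secondary index for lookups by label.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_demo_label ON demo_terms(label)")
    conn.commit()
    return conn, mode


# WAL needs a file-backed database, so use a temporary directory rather than :memory:.
with tempfile.TemporaryDirectory() as tmp_dir:
    conn, journal_mode = open_demo_db(os.path.join(tmp_dir, "demo.db"))
    conn.close()
```

WAL mode is persistent once set on a database file, so later connections inherit it without re-issuing the pragma.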
Database Schema¶
The SQLite database contains three main tables:
- hpo_terms: Core HPO term data
  - HPO ID (e.g., HP:0000123)
  - Label/name
  - Definition
  - Synonyms (JSON array)
  - Comments (JSON array)
- hpo_graph_metadata: Pre-computed graph structure
  - Term depth (distance from root HP:0000001)
  - Ancestor set (JSON array)
- generation_metadata: Tracking and versioning
  - Schema version
  - Data source information
  - Generation timestamps
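Because synonyms and comments are stored as JSON arrays, a reader has to decode them after fetching a row. The sketch below shows that pattern against an in-memory stand-in for hpo_terms. The column names (`id`, `label`, `definition`, `synonyms`, `comments`) and the sample row are assumptions for illustration; inspect the real database for the actual column names.

```python
import json
import sqlite3

# In-memory stand-in for the hpo_terms table; column names are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hpo_terms ("
    "id TEXT PRIMARY KEY, label TEXT, definition TEXT, synonyms TEXT, comments TEXT)"
)
conn.execute(
    "INSERT INTO hpo_terms VALUES (?, ?, ?, ?, ?)",
    (
        "HP:0001250",
        "Seizure",
        "Illustrative definition text.",
        json.dumps(["Seizures", "Epileptic seizure"]),
        json.dumps([]),
    ),
)


def load_term(conn, hpo_id):
    """Fetch one term and decode its JSON-array columns into Python lists."""
    row = conn.execute(
        "SELECT id, label, definition, synonyms, comments "
        "FROM hpo_terms WHERE id = ?",
        (hpo_id,),
    ).fetchone()
    if row is None:
        return None
    term_id, label, definition, synonyms, comments = row
    return {
        "id": term_id,
        "label": label,
        "definition": definition,
        "synonyms": json.loads(synonyms),
        "comments": json.loads(comments),
    }


term = load_term(conn, "HP:0001250")
```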
2. Building Vector Indexes¶
Build the ChromaDB vector index for your chosen embedding model.
# Build index for the default model (BioLORD)
phentrieve index build
# Or specify a model explicitly
phentrieve index build --model-name "FremyCompany/BioLORD-2023-M"
# Build indexes for all supported models (for benchmarking)
phentrieve index build --all-models
Index Building Process¶
The index builder:
1. Loads all HPO terms from the SQLite database
2. Creates rich document representations combining labels, definitions, and synonyms
3. Generates embeddings using the specified model
4. Stores vectors in ChromaDB persistent storage
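The second step, combining label, definition, and synonyms into one text per term, might look roughly like the sketch below. The function name `build_document`, the newline-separated layout, and the "Synonyms:" prefix are all assumptions; the text format Phentrieve actually embeds may differ.

```python
def build_document(label, definition=None, synonyms=()):
    """Join the available fields of a term into a single text for embedding."""
    parts = [label.strip()]
    if definition and definition.strip():
        parts.append(definition.strip())
    # Drop empty or whitespace-only synonyms before joining.
    cleaned = [s.strip() for s in synonyms if s and s.strip()]
    if cleaned:
        parts.append("Synonyms: " + "; ".join(cleaned))
    return "\n".join(parts)


doc = build_document(
    "Microcephaly",
    "Illustrative definition text.",
    ["Small head", "  ", "Reduced head circumference"],
)
```

Including synonyms in the embedded text gives lay phrasing such as "small head" a chance to match the formal term.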
Time Estimates:
- First model: 5-10 minutes (downloads model weights)
- Subsequent models: 2-5 minutes (cached weights)
- With GPU: 1-3 minutes
3. Supported Embedding Models¶
Phentrieve supports several multilingual embedding models optimized for different use cases:
Domain-Specific Models (Recommended)¶
- FremyCompany/BioLORD-2023-M (Default)
  - Biomedical domain specialization
  - Excellent performance on clinical terminology
  - Multilingual support
Language-Specific Models¶
- jinaai/jina-embeddings-v2-base-de
  - German language specialization
  - High precision for German clinical text
General Multilingual Models¶
- sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  - 50+ languages supported
  - Good general-purpose performance
- BAAI/bge-m3
  - State-of-the-art multilingual embeddings
  - Excellent for cross-lingual retrieval
- Alibaba-NLP/gte-multilingual-base
  - Optimized for retrieval tasks
  - Fast inference
Model Selection
The BioLORD model is recommended for most use cases as it provides excellent performance specifically tuned for biomedical terminology. For non-English text, consider using language-specific models or cross-lingual models like BGE-M3.
4. Language Resources (Text Processing)¶
If using text processing features, ensure language resources (spaCy models) are installed. If you used make install-dev or the Docker image, these are already included.
Otherwise, you can manually install them:
python -m spacy download en_core_web_sm # English
python -m spacy download de_core_news_sm # German
python -m spacy download es_core_news_sm # Spanish
python -m spacy download fr_core_news_sm # French
python -m spacy download nl_core_news_sm # Dutch
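Before running text processing, it can help to check which of the models above are already installed. spaCy models install as regular Python packages, so a small sketch with `importlib` can report the missing ones. The helper name `missing_models` and the language-code keys are assumptions for illustration.

```python
import importlib.util

# Model package names taken from the download commands above.
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
    "nl": "nl_core_news_sm",
}


def missing_models(models=SPACY_MODELS):
    """Return the model packages that cannot be found in this environment."""
    return [
        name for name in models.values()
        if importlib.util.find_spec(name) is None
    ]


for name in missing_models():
    print(f"Missing: install with 'python -m spacy download {name}'")
```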
Data Storage Locations¶
By default, Phentrieve stores its data in these locations:
data/
├── hpo_data.db # SQLite database (HPO terms and graph)
├── hp.json # Source HPO JSON file
├── indexes/ # ChromaDB persistent storage
│ └── {model_name}/ # Per-model vector stores
├── results/ # Benchmark results
├── hf_cache/ # HuggingFace model cache
└── hpo_translations/ # Translation files (if used)
You can override these locations in phentrieve.yaml or via environment variables:
export PHENTRIEVE_DATA_ROOT_DIR=/path/to/data
export PHENTRIEVE_INDEX_DIR=/path/to/indexes
export PHENTRIEVE_RESULTS_DIR=/path/to/results
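The sketch below shows one way such environment overrides could be resolved against the default layout. It is not Phentrieve's own configuration code: the function name `resolve_dirs` is hypothetical, and the fallback of deriving the index and results directories from the data root is an assumption based on the directory tree above.

```python
import os
from pathlib import Path


def resolve_dirs(env=None):
    """Resolve storage directories from environment variables with defaults."""
    env = os.environ if env is None else env
    data_root = Path(env.get("PHENTRIEVE_DATA_ROOT_DIR", "data"))
    return {
        "data_root": data_root,
        # Assumed fallback: sub-directories live under the data root.
        "index_dir": Path(env.get("PHENTRIEVE_INDEX_DIR", data_root / "indexes")),
        "results_dir": Path(env.get("PHENTRIEVE_RESULTS_DIR", data_root / "results")),
    }


dirs = resolve_dirs()
for name, path in dirs.items():
    print(f"{name}: {path}")
```

Passing a plain dictionary as `env` makes the resolution easy to check without touching the real environment.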
Verification¶
Verify your setup is complete:
# Check database exists and has content
ls -lh data/hpo_data.db
# Test interactive query mode
phentrieve query --interactive
# Try a simple query
phentrieve query --text "seizures and small head"
If everything is working, you should see HPO term suggestions like:
- HP:0001250 (Seizure)
- HP:0000252 (Microcephaly)
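The file listing only confirms that the database exists. To check that it actually holds terms, a short script can count the rows in hpo_terms. This is a sketch: the helper name `count_terms` is hypothetical, and it assumes the default path data/hpo_data.db and the hpo_terms table named in the schema section.

```python
import sqlite3
from pathlib import Path


def count_terms(db_path="data/hpo_data.db"):
    """Return the number of rows in hpo_terms, or None if the file is absent."""
    path = Path(db_path)
    if not path.is_file():
        return None
    # Open read-only so the check can never modify the database.
    conn = sqlite3.connect(path.resolve().as_uri() + "?mode=ro", uri=True)
    try:
        return conn.execute("SELECT COUNT(*) FROM hpo_terms").fetchone()[0]
    finally:
        conn.close()


total = count_terms()
if total is None:
    print("Database not found; run 'phentrieve data prepare' first.")
else:
    print(f"hpo_terms contains {total} rows")
```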
Next Steps¶
Once you've completed the initial setup:
- Try Interactive Querying to test your setup
- Explore the Text Processing Guide to learn how to process clinical text
- Check out the Benchmarking Guide to evaluate model performance
Troubleshooting¶
Database Not Found¶
Solution: Run phentrieve data prepare to generate the database.
Slow Index Building¶
Solution: Use GPU acceleration if available. Check with python -c "import torch; print(torch.cuda.is_available())".
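The one-line check above can be expanded into a small helper that also copes with PyTorch not being installed. The function name `pick_device` is hypothetical, and this sketch only reports a device string; it does not show how Phentrieve itself selects a device.

```python
def pick_device():
    """Return "cuda" when PyTorch sees a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Embedding device: {pick_device()}")
```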
Out of Memory¶
Solution: Reduce batch size in configuration or use a smaller embedding model.
Data Migration
If upgrading from a version prior to 0.2.0, the old file-based storage (hpo_core_data/) is obsolete. Run phentrieve data prepare to regenerate the SQLite database. Old pickle files can be safely deleted after migration.