Data Management¶
This page explains how Phentrieve manages data in different deployment scenarios and how to configure data directories.
Data Directory Structure¶
Phentrieve organizes its data in a structured directory hierarchy:
/your/data/dir/
├── hpo_core_data/ # HPO source files (hp.json, etc.)
├── indexes/ # ChromaDB persistent storage
│ ├── model_name_1/ # Vector indexes for specific models
│ ├── model_name_2/
│ └── ...
├── results/ # Benchmark results
│ ├── summaries/ # JSON summaries per run/model
│ ├── visualizations/ # Plot images
│ └── detailed/ # Detailed CSV results per run
└── hpo_translations/ # Translation files (if used)
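To confirm that a data directory follows this layout, a small shell check along the following lines can help. It is only a sketch: check_phentrieve_layout is a hypothetical helper and not part of the Phentrieve CLI, and the subdirectory names are taken from the tree above. hpo_translations is treated as optional and is not checked.

```shell
# check_phentrieve_layout is a hypothetical helper, not a Phentrieve command.
# It prints one line per expected subdirectory and returns non-zero
# if at least one of them is missing.
check_phentrieve_layout() {
  base_dir="$1"
  missing=0
  # Names taken from the directory tree above; hpo_translations is optional.
  for sub in hpo_core_data indexes results; do
    if [ -d "$base_dir/$sub" ]; then
      echo "ok       $sub"
    else
      echo "missing  $sub"
      missing=1
    fi
  done
  return "$missing"
}
```

Call it as check_phentrieve_layout /your/data/dir. A directory reported as missing may simply mean that the corresponding step (data preparation, index build, benchmark run) has not been run yet.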
Configuration Options¶
Environment Variables¶
You can configure data directories using environment variables:
- PHENTRIEVE_DATA_DIR: Base directory for all Phentrieve data
- PHENTRIEVE_HPO_DATA_DIR: Directory for HPO data files
- PHENTRIEVE_INDEX_DIR: Directory for vector indexes
- PHENTRIEVE_RESULTS_DIR: Directory for benchmark results
- PHENTRIEVE_TRANSLATIONS_DIR: Directory for translation files (if used)
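As an illustration, the variables can be exported in the shell before running Phentrieve. The path /your/data/dir is the placeholder from the directory tree, and the per-directory values simply mirror that tree. This page does not state how the specific variables interact with the base directory, so treat setting all five at once as an assumption made for the example.

```shell
# Base directory for all Phentrieve data (placeholder path; adjust to your system).
export PHENTRIEVE_DATA_DIR="/your/data/dir"

# Per-directory settings. The values mirror the layout in the directory tree;
# they are example values, not documented defaults.
export PHENTRIEVE_HPO_DATA_DIR="$PHENTRIEVE_DATA_DIR/hpo_core_data"
export PHENTRIEVE_INDEX_DIR="$PHENTRIEVE_DATA_DIR/indexes"
export PHENTRIEVE_RESULTS_DIR="$PHENTRIEVE_DATA_DIR/results"
export PHENTRIEVE_TRANSLATIONS_DIR="$PHENTRIEVE_DATA_DIR/hpo_translations"
```

Pointing a single variable elsewhere, for example PHENTRIEVE_INDEX_DIR at a larger disk, is the typical reason to set the specific variables at all.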
Docker Environment¶
For Docker deployments, specify the host data directory in your .env.docker file. This directory is mounted into the containers at /app/data.
Data Management Commands¶
Preparing HPO Data¶
The data preparation command:
1. Downloads the official HPO data in JSON format
2. Processes it into the format required by Phentrieve
3. Stores the processed data in the configured data directory
Building Indexes¶
# Build index for a specific model
phentrieve index build --model-name "FremyCompany/BioLORD-2023-M"
# Build indexes for all supported models
phentrieve index build --all-models
These commands create vector indexes in the configured index directory.
Data Cleanup¶
# Clean HPO data (removes downloaded and processed files)
phentrieve data clean
# Clean indexes (removes all vector stores)
phentrieve index clean
Data Loss
These commands permanently delete data. Make sure you have backups if needed.
Persistent Storage Considerations¶
Local Deployments¶
For local deployments, data is stored in the configured directories on your filesystem. By default, these are subdirectories of the location specified by PHENTRIEVE_DATA_DIR or, when that variable is not set, of the current working directory.
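The fallback described above can be pictured with shell parameter expansion. This is a sketch of the described behavior only: resolve_phentrieve_dir is a hypothetical helper, it is not Phentrieve's actual resolution code, and it assumes that an unset or empty PHENTRIEVE_DATA_DIR means the current working directory is used.

```shell
# resolve_phentrieve_dir is a hypothetical helper that mimics the fallback
# described in the text; it is not taken from Phentrieve itself.
resolve_phentrieve_dir() {
  subdir="$1"
  # Use PHENTRIEVE_DATA_DIR when it is set and non-empty,
  # otherwise fall back to the current working directory.
  base="${PHENTRIEVE_DATA_DIR:-$PWD}"
  echo "$base/$subdir"
}
```

With PHENTRIEVE_DATA_DIR set to /your/data/dir, resolve_phentrieve_dir indexes prints /your/data/dir/indexes. Without it, the path is relative to wherever the command is run, which is why setting the variable explicitly is the more predictable choice.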
Docker Deployments¶
For Docker deployments, data is persisted through volume mounts, so it remains available between container restarts and updates.
Backup Recommendations¶
Regular backups of the data directories are recommended, especially:
- HPO core data after processing
- Vector indexes after building (these can be time-consuming to rebuild)
- Benchmark results
You can use standard filesystem backup tools for this purpose.
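One possible approach with standard tools is a timestamped tar archive of the three directories listed above. The sketch below assumes the subdirectory names from the directory structure section. backup_phentrieve_data and the archive naming scheme are hypothetical and not part of Phentrieve.

```shell
# backup_phentrieve_data is a hypothetical helper, not a Phentrieve command.
# Usage: backup_phentrieve_data <data_dir> <backup_dir>
# Prints the path of the archive it wrote; returns non-zero on failure.
backup_phentrieve_data() {
  data_dir="$1"
  backup_dir="$2"
  archive="$backup_dir/phentrieve-data-$(date +%Y%m%d-%H%M%S).tar.gz"

  mkdir -p "$backup_dir" || return 1

  # Collect only the subdirectories that exist (names from the layout section).
  set --
  for sub in hpo_core_data indexes results; do
    if [ -d "$data_dir/$sub" ]; then
      set -- "$@" "$sub"
    fi
  done

  if [ "$#" -eq 0 ]; then
    echo "Nothing to back up in $data_dir" >&2
    return 1
  fi

  # -C keeps the paths inside the archive relative to the data directory.
  tar -czf "$archive" -C "$data_dir" "$@" || return 1
  echo "$archive"
}
```

For example, backup_phentrieve_data /your/data/dir /your/backup/dir writes one archive per call. Restoring is the reverse: extract the archive with tar -xzf into the data directory. Because indexes can be large, it may be worth archiving them only after a rebuild rather than on every run.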