# Benchmarking Guide
This page provides a guide to using Phentrieve's benchmarking capabilities for evaluating model performance.
## Introduction
Benchmarking is essential for evaluating the performance of different embedding models and configurations. Phentrieve includes a comprehensive benchmarking framework that allows you to compare model performance using standardized metrics.
## Running Benchmarks
```bash
# Run a benchmark with default settings
phentrieve benchmark run

# Run a benchmark with a specific model
phentrieve benchmark run --model-name "FremyCompany/BioLORD-2023-M"

# Run a benchmark with re-ranking enabled
phentrieve benchmark run --enable-reranker
```
## Benchmark Metrics
The benchmarking framework calculates several information retrieval metrics (a computation sketch follows this list):
- Mean Reciprocal Rank (MRR): Mean of the reciprocal rank of the first relevant result across queries
- Hit Rate at K (HR@K): Proportion of queries with a relevant result in the top K positions
- Recall: Proportion of relevant items that are retrieved
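As a concrete reference, here is a minimal, self-contained sketch of how these metrics are computed from per-query ranks; it is independent of Phentrieve's internal implementation:

```python
# Minimal sketch of the retrieval metrics above, independent of
# Phentrieve's internals. Each entry in `ranks` is the 1-based rank
# of the first relevant HPO term for one query (None if none found).

def mean_reciprocal_rank(ranks):
    """MRR: mean of 1/rank of the first relevant result per query."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

def hit_rate_at_k(ranks, k):
    """HR@K: fraction of queries with a relevant result in the top K."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

ranks = [1, 3, None, 2, 1]  # toy example: 5 queries
print(f"MRR   = {mean_reciprocal_rank(ranks):.3f}")  # 0.567
print(f"HR@1  = {hit_rate_at_k(ranks, 1):.2f}")      # 0.40
print(f"HR@10 = {hit_rate_at_k(ranks, 10):.2f}")     # 0.80
```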
## Interpreting Results
Benchmark results are stored in the `results/` directory:

- `summaries/`: JSON summaries for each model
- `visualizations/`: Charts and plots comparing model performance
- `detailed/`: Detailed CSV results
## Extraction Benchmarking
Evaluate document-level HPO extraction against gold-standard annotations:
```bash
# Run on PhenoBERT test data (306 documents)
phentrieve benchmark extraction run tests/data/en/phenobert/

# Run on a specific dataset (GeneReviews: 10 docs, good for quick tests)
phentrieve benchmark extraction run tests/data/en/phenobert/ --dataset GeneReviews

# High-precision mode (fewer false positives)
phentrieve benchmark extraction run tests/data/en/phenobert/ --top-term-only

# Custom thresholds
phentrieve benchmark extraction run tests/data/en/phenobert/ \
    --chunk-threshold 0.6 --min-confidence 0.6 --num-results 2
```
### Key Parameters
| Parameter | Default | Description |
|---|---|---|
| `--dataset` | `all` | PhenoBERT subset: `all`, `GSC_plus`, `ID_68`, `GeneReviews` |
| `--num-results` | 3 | HPO candidates per chunk |
| `--chunk-threshold` | 0.5 | Minimum similarity for chunk matching |
| `--min-confidence` | 0.5 | Minimum confidence for final results |
| `--top-term-only` | false | Keep only the best match per chunk |
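To make the interplay between `--chunk-threshold`, `--min-confidence`, `--num-results`, and `--top-term-only` concrete, here is an illustrative sketch of the kind of filtering these parameters describe; this is an assumed model of the behavior, not Phentrieve's actual code:

```python
# Illustrative sketch of how the parameters above could interact.
# This is an assumed model of the filtering, not Phentrieve's actual
# implementation. Each candidate is an (hpo_id, similarity) pair.

def filter_candidates(candidates, chunk_threshold=0.5, min_confidence=0.5,
                      num_results=3, top_term_only=False):
    # Keep chunk-level matches at or above the similarity threshold.
    kept = [c for c in candidates if c[1] >= chunk_threshold]
    # Sort by similarity and cap at --num-results candidates per chunk.
    kept = sorted(kept, key=lambda c: c[1], reverse=True)[:num_results]
    # --top-term-only keeps just the single best match per chunk.
    if top_term_only:
        kept = kept[:1]
    # Final results must also clear --min-confidence.
    return [c for c in kept if c[1] >= min_confidence]

candidates = [("HP:0001250", 0.82), ("HP:0002069", 0.61), ("HP:0007359", 0.48)]
print(filter_candidates(candidates, chunk_threshold=0.6, min_confidence=0.6))
# [('HP:0001250', 0.82), ('HP:0002069', 0.61)]
```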
### Extraction Metrics
- Precision: Proportion of predicted terms that are correct
- Recall: Proportion of gold terms that were found
- F1 Score: Harmonic mean of precision and recall
- Bootstrap CI: 95% confidence intervals via bootstrap sampling
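For reference, here is a minimal sketch of these computations on predicted vs. gold HPO term sets, with a percentile bootstrap over per-document F1 scores (the resampling details shown are a standard textbook approach, not necessarily Phentrieve's exact procedure):

```python
import random

def prf1(predicted, gold):
    """Set-based precision, recall, and F1 for one document."""
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=42):
    """95% percentile confidence interval of the mean via bootstrap."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy example: predicted vs. gold HPO term sets for two documents.
docs = [
    ({"HP:0001250", "HP:0002069"}, {"HP:0001250"}),  # one spurious term
    ({"HP:0007359"}, {"HP:0007359"}),                # exact match
]
f1s = [prf1(pred, gold)[2] for pred, gold in docs]
print("per-doc F1:", f1s)
print("95% CI of mean F1:", bootstrap_ci(f1s))
```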
### Comparing Results
```bash
# Compare two benchmark runs
phentrieve benchmark extraction compare results/run1/extraction_results.json \
    results/run2/extraction_results.json

# Generate a report from multiple runs
phentrieve benchmark extraction report results/
```
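If you prefer to inspect runs programmatically instead of through the CLI, a small script like the sketch below can diff two result files. The top-level metric keys (`precision`, `recall`, `f1`) are assumptions about the layout of `extraction_results.json`, so adjust them to the actual schema:

```python
import json

# Sketch for diffing two benchmark runs outside the CLI. The metric
# keys ("precision", "recall", "f1") are assumed, not taken from a
# documented extraction_results.json schema; adapt as needed.

def load_metrics(path, keys=("precision", "recall", "f1")):
    with open(path) as fh:
        data = json.load(fh)
    return {k: data.get(k) for k in keys}

run1 = load_metrics("results/run1/extraction_results.json")
run2 = load_metrics("results/run2/extraction_results.json")
for key in run1:
    if run1[key] is not None and run2[key] is not None:
        print(f"{key}: {run1[key]:.3f} -> {run2[key]:.3f} "
              f"(delta {run2[key] - run1[key]:+.3f})")
```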
## Multi-Vector vs Single-Vector Comparison
Compare the performance of single-vector embeddings against multi-vector embeddings with different aggregation strategies:
```bash
# Compare with default strategies
phentrieve benchmark compare-vectors

# Compare specific strategies on a dataset
phentrieve benchmark compare-vectors \
    --test-file german/200cases_gemini_v1.json \
    --strategies "label_synonyms_max,all_max,label_only"

# Skip single-vector (only compare multi-vector strategies)
phentrieve benchmark compare-vectors --no-single \
    --strategies "label_synonyms_max,all_max,all_weighted"
```
### Aggregation Strategies
| Strategy | Description | Best For |
|---|---|---|
| `label_synonyms_max` | Best match between label and synonyms | Recommended default |
| `label_only` | Match only against label vectors | High precision |
| `all_max` | Best match across all components | Balanced |
| `all_weighted` | Weighted combination of all components | Custom tuning |
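To illustrate what these strategies compute, here is a sketch of aggregating per-component similarities between a query embedding and a term's label/synonym vectors. The component names and the uniform default weights for `all_weighted` are illustrative assumptions, not Phentrieve's exact formulas:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_term(query_vec, term_vectors, strategy="label_synonyms_max",
               weights=None):
    """Aggregate query-to-term similarity over multi-vector components.

    term_vectors maps component names (e.g. "label", "synonyms") to lists
    of embedding vectors. Names and weighting are illustrative assumptions.
    """
    # Best similarity per component (max over that component's vectors).
    sims = {name: max(cosine(query_vec, v) for v in vecs)
            for name, vecs in term_vectors.items() if vecs}
    if strategy == "label_only":
        return sims["label"]
    if strategy == "label_synonyms_max":
        return max(sims[n] for n in ("label", "synonyms") if n in sims)
    if strategy == "all_max":
        return max(sims.values())
    if strategy == "all_weighted":
        weights = weights or {n: 1 / len(sims) for n in sims}
        return sum(weights[n] * sims[n] for n in sims)
    raise ValueError(f"unknown strategy: {strategy}")

rng = np.random.default_rng(0)
query = rng.normal(size=8)
term = {"label": [rng.normal(size=8)], "synonyms": [rng.normal(size=8)]}
print(score_term(query, term))                      # label_synonyms_max
print(score_term(query, term, strategy="all_max"))
```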
### Example Results
Results from the 200-case German benchmark dataset:
| Mode | Strategy | MRR | Hit@1 | Hit@10 |
|---|---|---|---|---|
| single-vector | - | 0.824 | 74.0% | 95.0% |
| multi-vector | label_synonyms_max | 0.937 | 91.0% | 98.0% |
| multi-vector | label_only | 0.943 | 92.0% | 97.5% |
| multi-vector | all_max | 0.934 | 90.5% | 98.5% |
Multi-vector embeddings consistently outperform the single-vector baseline on this dataset, lifting MRR from 0.824 to 0.934-0.943 (roughly a 13-14% relative gain).
## Further Reading
For more advanced benchmarking information, see the Benchmarking Framework page in the Advanced Topics section.