Benchmarking Framework¶
Phentrieve includes a comprehensive benchmarking framework for evaluating and comparing different embedding models and configurations. This page explains how to use the framework and interpret the results.
Overview¶
The benchmarking framework tests how well different models map clinical text to the correct HPO terms. It uses a set of test cases with ground truth HPO terms and measures various metrics to evaluate performance.
Running Benchmarks¶
Basic Usage¶
# Run a benchmark with default settings
phentrieve benchmark run
# Run a benchmark with a specific model
phentrieve benchmark run --model-name "FremyCompany/BioLORD-2023-M"
# Run a benchmark with re-ranking enabled
phentrieve benchmark run --enable-reranker
Advanced Options¶
# Specify a custom test file
phentrieve benchmark run --test-file path/to/test_cases.json
# Set a custom output directory for results
phentrieve benchmark run --output-dir path/to/results
# Run benchmark with GPU acceleration
phentrieve benchmark run --gpu
Metrics¶
The benchmarking framework calculates several standard information retrieval metrics:
Mean Reciprocal Rank (MRR)¶
Measures the average reciprocal rank of the first relevant result across queries. Higher is better.
$$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$$
Where $rank_i$ is the position of the first relevant result for query $i$.
Hit Rate at K (HR@K)¶
The proportion of queries where a relevant result appears in the top K positions. Higher is better.
$$HR@K = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbb{1}(rank_i \leq K)$$
Where $\mathbb{1}(rank_i \leq K)$ is 1 if the rank is less than or equal to K, and 0 otherwise.
Normalized Discounted Cumulative Gain (NDCG@K)¶
A ranking-aware metric that rewards relevant results at higher positions. Higher is better.
$$DCG@K = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i+1)}$$ $$NDCG@K = \frac{DCG@K}{IDCG@K}$$
Where $rel_i$ is the relevance score of the item at position $i$, and $IDCG@K$ is the ideal DCG.
Recall@K¶
The proportion of relevant items that appear in the top K results.
$$Recall@K = \frac{|Retrieved@K \cap Relevant|}{|Relevant|}$$
Precision@K¶
The proportion of top K results that are relevant.
$$Precision@K = \frac{|Retrieved@K \cap Relevant|}{K}$$
Mean Average Precision (MAP@K)¶
The mean of average precision scores for each query, considering only the first K results.
$$AP@K = \frac{1}{\min(K, |Relevant|)} \sum_{k=1}^{K} Precision@k \times \mathbb{1}(\text{item at rank } k \text{ is relevant})$$ $$MAP@K = \frac{1}{|Q|} \sum_{i=1}^{|Q|} AP@K_i$$
Where $\mathbb{1}(\text{item at rank } k \text{ is relevant})$ is 1 if the item at position $k$ is relevant, and 0 otherwise.
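As a worked illustration of the formulas above, here is a minimal, framework-independent Python sketch that computes the per-query values which MRR, HR@K, Precision@K, Recall@K, NDCG@K, and MAP@K average over (binary relevance assumed; this is not Phentrieve's own implementation):
import math

def query_metrics(retrieved, relevant, k):
    # retrieved: ranked list of HPO IDs; relevant: set of ground-truth HPO IDs
    hits = [1 if term in relevant else 0 for term in retrieved[:k]]
    first = next((i + 1 for i, term in enumerate(retrieved) if term in relevant), None)
    rr = 1.0 / first if first else 0.0                      # reciprocal rank
    hr = 1.0 if any(hits) else 0.0                          # HR@K
    precision = sum(hits) / k                               # Precision@K
    recall = sum(hits) / len(relevant)                      # Recall@K
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    idcg = sum(1 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    ndcg = dcg / idcg if idcg else 0.0                      # NDCG@K
    ap = sum((sum(hits[:i + 1]) / (i + 1)) * h
             for i, h in enumerate(hits)) / min(k, len(relevant))  # AP@K
    return {"RR": rr, "HR": hr, "P": precision, "R": recall, "NDCG": ndcg, "AP": ap}

# Two relevant terms, one retrieved at rank 2: RR = 0.5, Recall@3 = 0.5, AP@3 = 0.25
print(query_metrics(["HP:0001263", "HP:0000252", "HP:0004322"],
                    {"HP:0000252", "HP:0001250"}, k=3))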
Maximum Ontology Similarity (MaxOntSim@K)¶
A domain-specific metric that measures the maximum semantic similarity between any expected HPO term and any retrieved term in the top K results, using ontology-based similarity measures.
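MaxOntSim@K itself reduces to a few lines, assuming an ontology-based similarity function sim(term_a, term_b) returning a value in [0, 1] is available (a hypothetical helper here; Phentrieve derives this from the HPO graph):
# sim is a hypothetical callable, e.g. a Lin/Resnik-style similarity over the HPO graph
def max_ont_sim_at_k(retrieved, expected, k, sim):
    return max(sim(e, r) for e in expected for r in retrieved[:k])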
Benchmark Results¶
The framework now includes comprehensive metrics aligned with industry standards (MTEB/BEIR). Example results from recent benchmarking:
BioLORD-2023-M (Domain-specific Model)¶
- MRR: 0.5361
- HR@1: 0.3333, HR@3: 0.6667, HR@5: 0.7778, HR@10: 1.0
- NDCG@1: 0.3333, NDCG@3: 0.6111, NDCG@5: 0.7222, NDCG@10: 0.8611
- Recall@1: 0.1667, Recall@3: 0.5000, Recall@5: 0.6667, Recall@10: 1.0
- Precision@1: 1.0000, Precision@3: 1.0000, Precision@5: 1.0000, Precision@10: 1.0000
- MAP@1: 0.3333, MAP@3: 0.6111, MAP@5: 0.7222, MAP@10: 0.8611
- MaxOntSim@1: 0.8, MaxOntSim@3: 0.9, MaxOntSim@5: 0.95, MaxOntSim@10: 1.0
Jina-v2-base-de (German-specific Model)¶
- MRR: 0.3708
- HR@1: 0.2222, HR@3: 0.4444, HR@5: 0.5556, HR@10: 0.7778
- NDCG@1: 0.2222, NDCG@3: 0.4074, NDCG@5: 0.4815, NDCG@10: 0.6296
- Recall@1: 0.1111, Recall@3: 0.3333, Recall@5: 0.4444, Recall@10: 0.7778
- Precision@1: 1.0000, Precision@3: 1.0000, Precision@5: 1.0000, Precision@10: 0.8889
- MAP@1: 0.2222, MAP@3: 0.4074, MAP@5: 0.4815, MAP@10: 0.6296
- MaxOntSim@1: 0.7, MaxOntSim@3: 0.8, MaxOntSim@5: 0.85, MaxOntSim@10: 0.95
These results illustrate that domain-specific models such as BioLORD outperform language-specific models for medical terminology retrieval. The expanded metric set also provides a more nuanced view of ranking quality and retrieval effectiveness.
Visualizing Results¶
The benchmarking framework generates visualizations to help interpret the results:
- Bar charts comparing metrics across models
- Precision-recall curves
- Hit rate curves
These visualizations are saved in the results/visualizations/ directory.
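If you want to plot your own comparisons from the reported numbers, a minimal matplotlib sketch looks like this (illustrative only; the framework writes its own charts to results/visualizations/):
import matplotlib.pyplot as plt

# Example MRR values from the two model runs shown above
models = ["BioLORD-2023-M", "Jina-v2-base-de"]
mrr = [0.5361, 0.3708]

plt.bar(models, mrr)
plt.ylabel("MRR")
plt.title("Model comparison (example values)")
plt.savefig("mrr_comparison.png")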
Customizing Benchmarks¶
Custom Test Cases¶
You can create custom test cases for benchmarking. The test file should be in JSON format:
[
{
"text": "The patient exhibits microcephaly and seizures.",
"hpo_ids": ["HP:0000252", "HP:0001250"],
"language": "en"
},
{
"text": "Der Patient zeigt Mikrozephalie und Krampfanfälle.",
"hpo_ids": ["HP:0000252", "HP:0001250"],
"language": "de"
}
]
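Before running a benchmark against a custom file, a quick sanity check like the following can catch malformed entries (a minimal sketch; the CLI performs its own validation):
import json

with open("path/to/test_cases.json", encoding="utf-8") as fh:
    cases = json.load(fh)

for case in cases:
    # Every test case needs text, ground-truth HPO IDs, and a language code
    assert {"text", "hpo_ids", "language"} <= case.keys()
    assert all(hpo_id.startswith("HP:") for hpo_id in case["hpo_ids"])

print(f"{len(cases)} test cases look well-formed")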
Custom Metrics¶
You can implement custom metrics by extending the benchmarking framework:
from phentrieve.evaluation.metrics import BaseMetric

class CustomMetric(BaseMetric):
    def __init__(self):
        super().__init__(name="custom_metric")

    def calculate(self, results):
        # Compute your metric from the benchmark results passed in by the
        # framework; the exact structure of `results` depends on your
        # Phentrieve version.
        value = 0.0  # replace with your calculation
        return value
Extraction Benchmarking¶
The extraction benchmark evaluates document-level HPO extraction from clinical text against gold-standard annotations.
Overview¶
Unlike retrieval benchmarks (which test single-term lookups), extraction benchmarks evaluate the full pipeline (sketched in code after this list):
- Text chunking: Split documents into semantic units
- HPO retrieval: Find candidate terms for each chunk
- Aggregation: Deduplicate and rank results across chunks
- Assertion detection: Identify present vs. absent phenotypes
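Conceptually, the four stages fit together as follows; the retrieval and assertion steps are passed in as callables because this is a simplified stand-in, not the real Phentrieve pipeline:
def extract_document(text, retrieve_hpo, detect_assertion):
    # 1. Text chunking: naive sentence-like splitting (Phentrieve's chunkers are smarter)
    chunks = [c.strip() for c in text.split(".") if c.strip()]
    # 2. HPO retrieval: candidate (hpo_id, score) pairs per chunk
    candidates = [retrieve_hpo(chunk) for chunk in chunks]
    # 3. Aggregation: keep the best-scoring evidence chunk per HPO ID
    best = {}
    for chunk, cands in zip(chunks, candidates):
        for hpo_id, score in cands:
            if score > best.get(hpo_id, (0.0, None))[0]:
                best[hpo_id] = (score, chunk)
    # 4. Assertion detection: present vs. absent, judged on the evidence chunk
    return {hpo_id: detect_assertion(chunk) for hpo_id, (score, chunk) in best.items()}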
Test Data Format¶
Extraction benchmarks use the PhenoBERT directory format:
tests/data/en/phenobert/
├── GSC_plus/annotations/ # 228 documents
├── ID_68/annotations/ # 68 documents
├── GeneReviews/annotations/ # 10 documents
└── conversion_report.json
Each annotation file:
{
"doc_id": "GSC+_1003450",
"full_text": "Clinical description...",
"annotations": [
{
"hpo_id": "HP:0001250",
"assertion_status": "affirmed",
"evidence_spans": [{"start_char": 14, "end_char": 27}]
}
]
}
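Loading the gold standard from such a file reduces to collecting the affirmed HPO IDs; a minimal sketch:
import json
from pathlib import Path

def load_gold_terms(annotation_file):
    doc = json.loads(Path(annotation_file).read_text(encoding="utf-8"))
    # Keep only phenotypes asserted as present
    return {a["hpo_id"] for a in doc["annotations"]
            if a["assertion_status"] == "affirmed"}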
Extraction Metrics¶
| Metric | Formula | Description |
|---|---|---|
| Precision | TP / (TP + FP) | Fraction of predictions that are correct |
| Recall | TP / (TP + FN) | Fraction of gold terms that were found |
| F1 | 2 × P × R / (P + R) | Harmonic mean of precision and recall |
Averaging strategies (see the sketch after this list):
- Micro: Aggregate TP/FP/FN across all documents, then calculate
- Macro: Calculate per-document, then average
- Weighted: Weight by document size (gold term count)
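The micro and macro variants can be sketched as follows, treating each document as a pair of predicted and gold HPO ID sets (an independent illustration, not the framework's code):
def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def evaluate(docs):
    # docs: list of (predicted_hpo_ids, gold_hpo_ids) set pairs, one per document
    per_doc, totals = [], [0, 0, 0]
    for pred, gold in docs:
        tp, fp, fn = len(pred & gold), len(pred - gold), len(gold - pred)
        per_doc.append(prf(tp, fp, fn))
        totals = [totals[0] + tp, totals[1] + fp, totals[2] + fn]
    micro = prf(*totals)                                         # aggregate counts first
    macro = tuple(sum(x) / len(per_doc) for x in zip(*per_doc))  # average per-document scores
    return {"micro": micro, "macro": macro}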
Tuning Parameters¶
The extraction benchmark exposes key pipeline parameters:
| Parameter | Effect on Precision | Effect on Recall |
|---|---|---|
| --num-results ↓ | ↑ Higher | ↓ Lower |
| --chunk-threshold ↑ | ↑ Higher | ↓ Lower |
| --min-confidence ↑ | ↑ Higher | ↓ Lower |
| --top-term-only | ↑ Much higher | ↓ Much lower |
Example Results¶
GeneReviews dataset (10 documents, 237 annotations):
| Configuration | Precision | Recall | F1 |
|---|---|---|---|
| Default | 0.088 | 0.422 | 0.145 |
| --top-term-only | 0.143 | 0.262 | 0.185 |
Running Extraction Benchmarks¶
# Quick test (10 documents)
phentrieve benchmark extraction run tests/data/en/phenobert/ \
--dataset GeneReviews --no-bootstrap-ci
# Full evaluation (306 documents)
phentrieve benchmark extraction run tests/data/en/phenobert/ \
--model "FremyCompany/BioLORD-2023-M"
# Compare configurations
phentrieve benchmark extraction compare \
results/default/extraction_results.json \
results/top_term/extraction_results.json
Multi-Vector Embeddings¶
Phentrieve supports multi-vector embeddings where each HPO term component (label, synonyms, definition) is stored as separate vectors instead of a single concatenated vector.
Concept¶
Single-vector approach (traditional): the label, synonyms, and definition are concatenated into a single text and embedded as one vector per term.
Multi-vector approach:
HP:0001250 "Seizure":
├── label: embed("Seizure")
├── synonym_0: embed("Fits")
├── synonym_1: embed("Convulsions")
└── definition: embed("A seizure is a transient occurrence...")
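To illustrate the difference (this is not Phentrieve's internal storage code), the components can be embedded separately with a sentence-transformers model such as the one used in the benchmarks above:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("FremyCompany/BioLORD-2023-M")
components = {
    "label": "Seizure",
    "synonym_0": "Fits",
    "synonym_1": "Convulsions",
    "definition": "A seizure is a transient occurrence...",
}

# Single-vector: one embedding of the concatenated text per term
single_vector = model.encode(" ".join(components.values()))
# Multi-vector: one embedding per component
multi_vectors = {name: model.encode(text) for name, text in components.items()}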
Building Multi-Vector Indexes¶
# Build multi-vector index
phentrieve index build --multi-vector
# Both indexes can coexist
data/indexes/
├── phentrieve_biolord_2023_m/ # Single-vector
└── phentrieve_biolord_2023_m_multi/ # Multi-vector
Aggregation Strategies¶
When querying multi-vector indexes, scores from different components must be aggregated. Available strategies (a Python sketch follows the table):
| Strategy | Formula | Use Case |
|---|---|---|
| label_only | score(label) | Fast, label-focused |
| label_synonyms_min | min(label, min(synonyms)) | Best single match |
| label_synonyms_max | max(label, max(synonyms)) | Recommended |
| all_max | max(label, synonyms, def) | Any strong match |
| all_weighted | w₁·label + w₂·max(syns) + w₃·def | Custom weights |
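Assuming higher scores are better and the per-component similarity scores have already been computed, the strategies in the table reduce to a few lines (an illustration, not the framework's implementation; the default weights are made up):
def aggregate(scores, strategy="label_synonyms_max", weights=(0.5, 0.3, 0.2)):
    # scores: {"label": float, "synonyms": [float, ...], "definition": float}
    # weights: illustrative defaults for the all_weighted strategy
    label = scores["label"]
    syns = scores.get("synonyms", [])
    definition = scores.get("definition", 0.0)
    if strategy == "label_only":
        return label
    if strategy == "label_synonyms_min":
        return min([label] + syns)
    if strategy == "label_synonyms_max":
        return max([label] + syns)
    if strategy == "all_max":
        return max([label, definition] + syns)
    if strategy == "all_weighted":
        w1, w2, w3 = weights
        return w1 * label + w2 * (max(syns) if syns else 0.0) + w3 * definition
    raise ValueError(f"unknown strategy: {strategy}")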
Querying Multi-Vector Indexes¶
# Query with multi-vector
phentrieve query "seizures" --multi-vector
# Specify aggregation strategy
phentrieve query "seizures" --multi-vector \
--aggregation-strategy label_synonyms_max
Benchmarking Multi-Vector Performance¶
Compare single-vector vs multi-vector with different strategies:
# Full comparison
phentrieve benchmark compare-vectors \
--test-file german/200cases_gemini_v1.json \
--strategies "label_synonyms_max,all_max,label_only,all_weighted"
# Quick comparison (tiny dataset)
phentrieve benchmark compare-vectors
# Compare only multi-vector strategies
phentrieve benchmark compare-vectors --no-single \
--strategies "label_synonyms_max,label_only"
Multi-Vector Benchmark Results¶
Comprehensive benchmarking shows multi-vector consistently outperforms single-vector:
70-case German dataset:
| Mode | Strategy | MRR | Hit@1 | Hit@10 |
|---|---|---|---|---|
| single-vector | - | 0.805 | 71.4% | 98.6% |
| multi-vector | label_synonyms_max | 0.976 | 95.7% | 100% |
200-case German dataset:
| Mode | Strategy | MRR | Hit@1 | Hit@10 |
|---|---|---|---|---|
| single-vector | - | 0.824 | 74.0% | 95.0% |
| multi-vector | label_synonyms_max | 0.937 | 91.0% | 98.0% |
| multi-vector | label_only | 0.943 | 92.0% | 97.5% |
| multi-vector | all_max | 0.934 | 90.5% | 98.5% |
| multi-vector | all_weighted | 0.894 | 84.5% | 96.0% |
Key findings:
- Multi-vector improves MRR by +13-21% over single-vector
- label_synonyms_max and label_only are the best strategies
- all_weighted underperforms (definition dilutes signal)
- Multi-vector achieves near-perfect Hit@10 on most datasets
Trade-offs¶
| Aspect | Single-Vector | Multi-Vector |
|---|---|---|
| Index size | ~50-100 MB | ~250-500 MB |
| Build time | ~2-5 min | ~10-30 min |
| Query latency | ~30ms | ~40-50ms |
| Retrieval quality | Good | Excellent |
Multi-vector is recommended for production use cases where retrieval quality is critical.