Algorithms and Packages Reference¶
This document provides a comprehensive reference of the algorithms, packages, methods, and functions used throughout Phentrieve's implementation.
Overview¶
Phentrieve implements a multi-stage pipeline combining natural language processing, vector embeddings, and information retrieval to map clinical text to HPO terms. This reference documents the technical implementation details.
Core Algorithms¶
1. Semantic Chunking¶
Algorithm: Sliding Window Semantic Splitting¶
Mathematical Foundation:
Given a text segment with tokens $T = [t_1, t_2, ..., t_n]$:
- Window Creation: Create overlapping windows of size $w$ with step $s$: $$W_i = [t_i, t_{i+1}, ..., t_{i+w-1}] \quad \text{where } i \in \{1, s+1, 2s+1, ...\}$$
- Embedding: Embed each window using the embedding function $E$: $$v_i = E(W_i), \quad v_i \in \mathbb{R}^d$$
- Similarity Computation: Calculate cosine similarity between adjacent windows: $$\text{sim}(v_i, v_{i+1}) = \frac{v_i \cdot v_{i+1}}{||v_i|| \cdot ||v_{i+1}||}$$
- Split Detection: Mark a split point when $$\text{sim}(v_i, v_{i+1}) < \theta$$ where $\theta$ is the splitting threshold (default: 0.5).
Implementation:
- Class: SlidingWindowSemanticSplitter
- Module: phentrieve.text_processing.chunkers
- Key Methods:
- chunk(text_segments: list[str]) -> list[str]
- _split_one_segment_by_sliding_window(text: str) -> list[str]
Parameters:
- window_size_tokens: Size of sliding window (default: 7)
- step_size_tokens: Step between windows (default: 1)
- splitting_threshold: Similarity threshold for splitting (default: 0.5)
- min_split_segment_length_words: Minimum segment length (default: 3)
Dependencies:
- sentence-transformers - For window embedding
- numpy - For cosine similarity computation
- torch - For tensor operations
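A minimal sketch of the algorithm above, not the SlidingWindowSemanticSplitter implementation itself: windows are built over whitespace tokens, embedded in one batch, and a split is emitted wherever adjacent-window similarity drops below the threshold. The tokenization, split-point placement, and helper name are simplifications for illustration.

# Illustrative sliding-window splitting; the real chunker handles tokenization,
# minimum segment lengths, and edge cases more carefully.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("FremyCompany/BioLORD-2023-M")

def sliding_window_split(text: str, window: int = 7, step: int = 1,
                         threshold: float = 0.5) -> list[str]:
    tokens = text.split()  # simplification: whitespace tokens
    if len(tokens) <= window:
        return [text]
    # Overlapping windows W_i
    windows = [" ".join(tokens[i:i + window])
               for i in range(0, len(tokens) - window + 1, step)]
    # Embed all windows in one batch; normalized so dot product == cosine similarity
    embs = model.encode(windows, normalize_embeddings=True)
    # Mark a split after window i when sim(v_i, v_{i+1}) < threshold
    split_points = [i + window for i in range(len(embs) - 1)
                    if float(np.dot(embs[i], embs[i + 1])) < threshold]
    segments, start = [], 0
    for point in split_points:
        segments.append(" ".join(tokens[start:point]))
        start = point
    segments.append(" ".join(tokens[start:]))
    return segments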
2. Dense Retrieval¶
Algorithm: Bi-Encoder Semantic Search¶
Mathematical Foundation:
Given a query $q$ and HPO term documents $D = \{d_1, d_2, ..., d_m\}$:
- Query Embedding: $$q_v = E_\text{query}(q)$$
- Document Embedding (pre-computed): $$d_{v_i} = E_\text{doc}(d_i)$$
- Similarity Scoring: $$\text{score}(q, d_i) = \frac{q_v \cdot d_{v_i}}{||q_v|| \cdot ||d_{v_i}||}$$
- Top-K Retrieval: $$\text{results} = \text{argsort}_k\{\text{score}(q, d_i) : i \in [1, m]\}$$
Implementation:
- Class: DenseRetriever
- Module: phentrieve.retrieval.dense_retriever
- Key Methods:
- query(text: str, k: int) -> list[dict]
- query_batch(texts: list[str], k: int) -> list[list[dict]]
- _embed_queries(texts: list[str]) -> np.ndarray
Batch Processing Optimization:
- Processes multiple queries in a single forward pass
- Uses matrix multiplication for parallel similarity computation
- 10-20x faster than sequential processing (see the sketch after the dependency list below)
Dependencies:
- sentence-transformers.SentenceTransformer - Bi-encoder models
- chromadb - Vector database for storage and retrieval
- torch - GPU acceleration for embedding generation
- numpy - Matrix operations
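A minimal sketch of the batched scoring path described above, not Phentrieve's DenseRetriever itself: the model name is the project default, but the in-memory document matrix, example texts, and helper function are assumptions (the real retriever queries ChromaDB).

# Illustrative batched bi-encoder retrieval against an in-memory document matrix.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("FremyCompany/BioLORD-2023-M")

# Pre-computed, normalized document embeddings: shape (m, d)
doc_texts = ["Seizure", "Muscular hypotonia", "Global developmental delay"]
doc_embs = model.encode(doc_texts, normalize_embeddings=True)

def query_batch(texts: list[str], k: int = 2) -> list[list[tuple[str, float]]]:
    """Embed all queries in one forward pass, score them with one matmul."""
    query_embs = model.encode(texts, normalize_embeddings=True)  # (n, d)
    scores = query_embs @ doc_embs.T                             # cosine sims, (n, m)
    results = []
    for row in scores:
        top_idx = np.argsort(row)[::-1][:k]                      # top-k per query
        results.append([(doc_texts[i], float(row[i])) for i in top_idx])
    return results

print(query_batch(["infant with low muscle tone", "recurrent seizures"]))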
3. Cross-Encoder Re-Ranking¶
Algorithm: Cross-Attention Scoring¶
Mathematical Foundation:
Given query $q$ and candidate documents $C = \{c_1, c_2, ..., c_k\}$ from dense retrieval:
- Pair Construction: $$\text{pairs} = \{(q, c_1), (q, c_2), ..., (q, c_k)\}$$
- Cross-Attention Scoring: $$\text{score}_\text{cross}(q, c_i) = \text{CrossEncoder}([q; c_i])$$ where $[q; c_i]$ denotes concatenation with a separator token
- Re-Ranking: $$\text{results}_\text{reranked} = \text{argsort}_k\{\text{score}_\text{cross}(q, c_i) : c_i \in C\}$$
Implementation:
- Class: CrossEncoder
- Module: phentrieve.retrieval.reranker
- Key Methods:
- protected_dense_rerank(query, candidates, cross_encoder, trust_threshold) - Protected two-stage retrieval
- rerank_candidates(query, candidates, cross_encoder) - Standard reranking
Protected Two-Stage Retrieval:
Phentrieve implements protected reranking to prevent cross-encoders from demoting correct cross-lingual matches:
- Protection: Dense matches with similarity ≥0.7 are preserved at top positions
- Refinement: Lower-confidence candidates are reranked by cross-encoder
- Merge: Protected results stay on top, reranked results fill below
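A minimal sketch of this protect-then-rerank merge. The threshold and intent mirror the documented behavior, but the candidate dictionary layout and function name are assumptions for illustration; the actual protected_dense_rerank lives in phentrieve.retrieval.reranker.

# Illustrative protected two-stage reranking: candidates above the trust
# threshold keep their dense ranking; the rest are reordered by the cross-encoder.
from sentence_transformers import CrossEncoder

def protected_rerank(query: str, candidates: list[dict],
                     cross_encoder: CrossEncoder,
                     trust_threshold: float = 0.7) -> list[dict]:
    # Assumed candidate layout: {"label": ..., "similarity": <dense cosine score>}
    protected = [c for c in candidates if c["similarity"] >= trust_threshold]
    to_rerank = [c for c in candidates if c["similarity"] < trust_threshold]

    if to_rerank:
        scores = cross_encoder.predict([(query, c["label"]) for c in to_rerank])
        for c, s in zip(to_rerank, scores):
            c["rerank_score"] = float(s)
        to_rerank.sort(key=lambda c: c["rerank_score"], reverse=True)

    # Protected matches stay on top, reranked candidates fill in below
    return protected + to_rerank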
Dependencies:
- sentence-transformers.CrossEncoder - Cross-encoder models
- torch - GPU acceleration
4. Assertion Detection¶
Algorithm: Dependency-Based Negation Detection¶
Rule-Based Logic:
- Dependency Parsing: Parse the sentence into a dependency tree using spaCy
- Negation Detection: Identify negation markers (neg, no, without)
- Scope Determination: Find all tokens within the negation scope via dependency traversal
- Classification:
  - NEGATED if the chunk falls within a negation scope
  - UNCERTAIN if uncertainty markers are present
  - NORMAL if normal markers are present
  - AFFIRMED otherwise
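A minimal sketch of this classification priority, assuming simple keyword and scope checks; the CombinedAssertionDetector combines dependency-based and keyword-based strategies and reads its markers from per-language resource files, so the marker sets below are illustrative only.

# Illustrative classification order only; real detection uses spaCy dependency
# parsing plus language-specific keyword resources.
from enum import Enum

class AssertionType(Enum):
    AFFIRMED = "affirmed"
    NEGATED = "negated"
    UNCERTAIN = "uncertain"
    NORMAL = "normal"

UNCERTAINTY_MARKERS = {"possible", "suspected", "probable"}   # assumed examples
NORMAL_MARKERS = {"normal", "unremarkable"}                   # assumed examples

def classify(chunk: str, in_negation_scope: bool) -> AssertionType:
    words = set(chunk.lower().split())
    if in_negation_scope:
        return AssertionType.NEGATED
    if words & UNCERTAINTY_MARKERS:
        return AssertionType.UNCERTAIN
    if words & NORMAL_MARKERS:
        return AssertionType.NORMAL
    return AssertionType.AFFIRMED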
Implementation:
- Class: CombinedAssertionDetector
- Module: phentrieve.text_processing.assertion_detection
- Key Methods:
- detect_assertion(text: str, chunk: str) -> AssertionType
- _dependency_based_detection(doc, chunk_text: str) -> AssertionType
- _keyword_based_detection(text: str, chunk: str) -> AssertionType
Dependencies:
- spacy - Dependency parsing and NLP pipeline
- Custom language resources (JSON files with negation keywords)
Key Packages and Their Roles¶
Embedding and Retrieval¶
sentence-transformers (v2.2.2+)¶
Purpose: State-of-the-art embedding models for semantic similarity
Key Classes:
- SentenceTransformer: Bi-encoder for dense retrieval
- Methods: encode(), encode_multi_process()
- CrossEncoder: Cross-attention re-ranking
- Methods: predict(), rank()
Models Used:
- FremyCompany/BioLORD-2023-M - Biomedical domain (default)
- jinaai/jina-embeddings-v2-base-de - German specialized
- BAAI/bge-m3 - Multilingual general purpose
- BAAI/bge-reranker-v2-m3 - Multilingual re-ranker
Usage in Phentrieve:
# Dense retrieval
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("FremyCompany/BioLORD-2023-M")
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)
# Re-ranking
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
scores = reranker.predict([(query, doc) for doc in candidates])
ChromaDB (v0.4.0+)¶
Purpose: Vector database for efficient similarity search
Key Classes:
- chromadb.Client: Database client
- chromadb.Collection: Vector collection management
Key Methods:
- create_collection() - Initialize vector store
- add() - Batch insert embeddings
- query() - Similarity search with filtering
- get() - Retrieve by ID
- update() - Update embeddings
Algorithm: HNSW (Hierarchical Navigable Small World)
- Time Complexity: O(log N) for queries
- Space Complexity: O(N × M × d), where N = docs, M = neighbors, d = dimensions
Usage in Phentrieve:
import chromadb

client = chromadb.PersistentClient(path="./data/indexes")
collection = client.get_or_create_collection(
    name="hpo_embeddings",
    metadata={"hnsw:space": "cosine"},
)
results = collection.query(
    query_embeddings=query_vectors,
    n_results=20,
    include=["documents", "distances", "metadatas"],
)
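The example above covers querying; indexing uses the same collection with add(). A minimal sketch of batch insertion, with placeholder IDs, vectors, and metadata rather than Phentrieve's actual indexing code:

# Illustrative batch insert into the same collection (placeholder data).
collection.add(
    ids=["HP:0001250", "HP:0001252"],                          # HPO term IDs
    embeddings=[[0.1] * 768, [0.2] * 768],                     # placeholder vectors
    documents=["Seizure", "Hypotonia"],                        # indexed text
    metadatas=[{"label": "Seizure"}, {"label": "Hypotonia"}],
)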
PyTorch (v2.0+)¶
Purpose: Deep learning framework for model inference
Key Functions:
- torch.cuda.is_available() - GPU detection
- torch.device() - Device management
- torch.nn.functional.cosine_similarity() - Similarity computation
- model.to(device) - Device placement
Automatic Device Selection:
import torch

if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"  # Apple Silicon
else:
    device = "cpu"
Text Processing¶
spaCy (v3.5+)¶
Purpose: Industrial-strength NLP for text processing
Models Used:
- en_core_web_sm - English
- de_core_news_sm - German
- es_core_news_sm - Spanish
- fr_core_news_sm - French
- nl_core_news_sm - Dutch
Key Components:
- Tokenizer: Word and sentence tokenization
- POS Tagger: Part-of-speech tagging
- Dependency Parser: Grammatical dependency trees
- NER: Named entity recognition (optional)
Methods Used:
import spacy

nlp = spacy.load("en_core_web_sm")

# Sentence splitting
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]

# Dependency parsing for negation detection
for token in doc:
    if token.dep_ == "neg":
        # Find negation scope via dependency tree
        scope = [child for child in token.head.subtree]
NumPy (v1.24+)¶
Purpose: Numerical computing for vector operations
Key Functions:
- np.dot() - Dot product for similarity
- np.linalg.norm() - Vector magnitude
- np.argsort() - Sorting for top-k retrieval
- np.concatenate() - Batch concatenation
Cosine Similarity Implementation:
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
Data Processing¶
SQLite3 (Python stdlib)¶
Purpose: Lightweight database for HPO metadata storage
Key Functions:
- sqlite3.connect() - Database connection
- cursor.execute() - SQL execution
- cursor.executemany() - Batch inserts
- cursor.executescript() - Schema creation
Optimizations Applied:
-- Write-Ahead Logging for concurrent reads
PRAGMA journal_mode = WAL;
-- WITHOUT ROWID for 20% space savings
CREATE TABLE hpo_terms (...) WITHOUT ROWID;
-- Indexes for fast queries
CREATE INDEX idx_hpo_terms_label ON hpo_terms(label);
Methods in Phentrieve:
from phentrieve.data_processing.hpo_database import HPODatabase
db = HPODatabase("data/hpo_data.db")
db.initialize_schema()
terms = db.load_all_terms() # Returns list[dict]
ancestors, depths = db.load_graph_data() # Returns tuple[dict, dict]
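A minimal sketch of how batch insertion with executemany and the WAL pragma fit together, using a deliberately simplified schema; the HPODatabase class wraps this with its own schema and JSON serialization.

# Illustrative only: simplified two-column schema, not Phentrieve's table layout.
import sqlite3

conn = sqlite3.connect("data/hpo_data.db")
conn.execute("PRAGMA journal_mode = WAL")            # allow concurrent readers
conn.execute(
    "CREATE TABLE IF NOT EXISTS hpo_terms ("
    "  id TEXT PRIMARY KEY,"
    "  label TEXT"
    ") WITHOUT ROWID"
)
terms = [("HP:0001250", "Seizure"), ("HP:0001252", "Hypotonia")]
conn.executemany("INSERT OR REPLACE INTO hpo_terms VALUES (?, ?)", terms)
conn.commit()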
API and Frontend¶
FastAPI (v0.104+)¶
Purpose: High-performance async web framework
Key Decorators:
- @app.get(), @app.post() - Route handlers
- @lru_cache - Response caching
- Dependency injection via Depends()
Features Used:
- Automatic OpenAPI documentation
- Pydantic validation
- Async request handling
- CORS middleware
- Background tasks for long-running operations
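A minimal sketch of a route handler combining these pieces; the path, function name, and retriever dependency are assumptions for illustration, not Phentrieve's actual API surface.

# Illustrative endpoint only; the real routes live in the Phentrieve API package.
from fastapi import Depends, FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    text: str
    num_results: int = 5

def get_retriever():
    # Placeholder dependency; in practice this returns a cached DenseRetriever.
    return None

@app.post("/api/v1/query")
async def query_hpo(request: QueryRequest, retriever=Depends(get_retriever)):
    # Pydantic validates the request body before this handler runs.
    return {"query": request.text, "results": []}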
Pydantic (v2.0+)¶
Purpose: Data validation and serialization
Usage:
from pydantic import BaseModel, Field

class HPOQueryRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=10000)
    num_results: int = Field(default=5, ge=1, le=100)
    enable_reranker: bool = Field(default=False)
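Hypothetical usage of the model above: invalid payloads raise a ValidationError before any retrieval work happens.

from pydantic import ValidationError

request = HPOQueryRequest(text="Patient shows muscular hypotonia", num_results=10)

try:
    HPOQueryRequest(text="", num_results=10)   # violates min_length=1
except ValidationError as exc:
    print(exc.errors()[0]["msg"])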
Vue 3 + Vuetify¶
Purpose: Reactive frontend framework
Key Features:
- Composition API for component logic
- Pinia for state management
- Vue Router for navigation
- Vue i18n for internationalization
Performance Characteristics¶
Algorithmic Complexity¶
| Algorithm | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Semantic Chunking | O(n × w × d) | O(n × d) | n=tokens, w=window, d=dims |
| Dense Retrieval | O(k × d) | O(m × d) | k=queries, m=docs, d=dims |
| HNSW Search | O(log m) | O(m × M × d) | M=neighbors per layer |
| Cross-Encoder | O(k × L²) | O(k × L) | L=sequence length |
| Dependency Parsing | O(n²) | O(n) | n=sentence tokens |
Optimization Techniques¶
Batch Processing:
# Instead of: for text in texts: model.encode(text)
# Use batch processing:
embeddings = model.encode(texts, batch_size=32) # 10-20x faster
Caching:
from functools import lru_cache
from sentence_transformers import SentenceTransformer

@lru_cache(maxsize=128)
def get_model(model_name: str):
    """Cache loaded models to avoid reloading."""
    return SentenceTransformer(model_name)
Lazy Loading:
# Don't load models at import time; load only when first used
_model_registry: dict[str, SentenceTransformer] = {}

def get_or_load_model(name: str):
    if name not in _model_registry:
        _model_registry[name] = SentenceTransformer(name)
    return _model_registry[name]
Implementation Modules¶
Phentrieve Package Structure¶
phentrieve/
├── embeddings.py # Model loading and caching
├── retrieval/
│ ├── dense_retriever.py # DenseRetriever class
│ └── reranker.py # CrossEncoder re-ranking
├── text_processing/
│ ├── chunkers.py # All chunking classes
│ ├── assertion_detection.py # Assertion detection
│ └── pipeline.py # TextProcessingPipeline
├── data_processing/
│ ├── hpo_database.py # HPODatabase class
│ └── hpo_parser.py # HPO JSON parsing
└── evaluation/
└── metrics.py # Benchmark metrics
Key Classes and Methods¶
DenseRetriever¶
File: phentrieve/retrieval/dense_retriever.py
Methods:
class DenseRetriever:
    def query(self, text: str, k: int = 5) -> list[dict]:
        """Single query retrieval."""

    def query_batch(self, texts: list[str], k: int = 5) -> list[list[dict]]:
        """Batch query retrieval (10-20x faster)."""

    def _embed_queries(self, texts: list[str]) -> np.ndarray:
        """Generate embeddings for queries."""

    def _compute_similarities(self, query_emb, doc_embs) -> np.ndarray:
        """Compute cosine similarities."""
SlidingWindowSemanticSplitter¶
File: phentrieve/text_processing/chunkers.py
Methods:
class SlidingWindowSemanticSplitter(TextChunker):
    def chunk(self, text_segments: list[str]) -> list[str]:
        """Process and split text segments."""

    def _split_one_segment_by_sliding_window(self, text: str) -> list[str]:
        """Split single segment using sliding window."""

    def _create_sliding_windows(self, tokens: list[str]) -> list[str]:
        """Create overlapping windows."""

    def _compute_window_embeddings(self, windows: list[str]) -> np.ndarray:
        """Embed all windows in batch."""
HPODatabase¶
File: phentrieve/data_processing/hpo_database.py
Methods:
class HPODatabase:
    def initialize_schema(self) -> None:
        """Create database schema with optimizations."""

    def bulk_insert_terms(self, terms: list[dict]) -> int:
        """Batch insert HPO terms."""

    def load_all_terms(self) -> list[dict]:
        """Load terms with JSON deserialization."""

    def load_graph_data(self) -> tuple[dict, dict]:
        """Load ancestors and depths."""
Configuration¶
Model Configuration (phentrieve.yaml)¶
embeddings:
  model_name: "FremyCompany/BioLORD-2023-M"
  device: "auto"  # auto, cuda, mps, cpu
  batch_size: 32
  normalize_embeddings: true

retrieval:
  num_results: 20
  enable_reranker: true
  reranker_model: "BAAI/bge-reranker-v2-m3"
  rerank_candidate_count: 50
  dense_trust_threshold: 0.7  # Protect high-confidence dense matches

chunking_pipeline:
  - type: paragraph
  - type: sentence
  - type: fine_grained_punctuation
  - type: conjunction
  - type: sliding_window
    config:
      window_size_tokens: 7
      step_size_tokens: 1
      splitting_threshold: 0.5
  - type: final_chunk_cleaner
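A minimal sketch of consuming such a file, assuming PyYAML and a local path; Phentrieve's own configuration loading may differ.

# Illustrative config loading; assumes PyYAML is installed and the file exists.
import yaml

with open("phentrieve.yaml") as fh:
    config = yaml.safe_load(fh)

model_name = config["embeddings"]["model_name"]
trust_threshold = config["retrieval"]["dense_trust_threshold"]
print(model_name, trust_threshold)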
References¶
Academic Papers¶
- Sentence-BERT: Reimers, N., & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP 2019.
- Dense Passage Retrieval: Karpukhin, V., et al. (2020). "Dense Passage Retrieval for Open-Domain Question Answering." EMNLP 2020.
- Cross-Encoders: Nogueira, R., & Cho, K. (2019). "Passage Re-ranking with BERT." arXiv:1901.04085.
- HNSW Algorithm: Malkov, Y. A., & Yashunin, D. A. (2018). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." IEEE TPAMI.
Package Documentation¶
- sentence-transformers: https://www.sbert.net/
- ChromaDB: https://docs.trychroma.com/
- spaCy: https://spacy.io/api
- FastAPI: https://fastapi.tiangolo.com/
- PyTorch: https://pytorch.org/docs/
Version Compatibility¶
| Package | Minimum Version | Tested Version | Notes |
|---|---|---|---|
| sentence-transformers | 2.2.2 | 2.7.0 | Core functionality |
| chromadb | 0.4.0 | 0.5.0 | Vector storage |
| spacy | 3.5.0 | 3.7.0 | NLP pipeline |
| torch | 2.0.0 | 2.3.0 | GPU acceleration |
| fastapi | 0.104.0 | 0.110.0 | API framework |
| pydantic | 2.0.0 | 2.6.0 | Validation |
| numpy | 1.24.0 | 1.26.0 | Numerical ops |
| sqlite3 | 3.35.0 | 3.45.0 | Database (stdlib) |
GPU Acceleration
All embedding and retrieval operations automatically utilize GPU when available via PyTorch's device management. CPU fallback is automatic and seamless.
Performance Optimization
For production deployments:
- Use batch processing wherever possible (10-20x speedup)
- Enable GPU acceleration for embedding generation (3-5x speedup)
- Use model caching to avoid reloading weights
- Configure appropriate batch sizes based on available memory