# Text Processing Pipeline
Phentrieve utilizes a sophisticated, configurable pipeline to transform raw clinical text into analyzable semantic units. This document explains the architecture, implementation details, and customization options.
## Architecture Overview

The pipeline operates in sequential stages defined in your configuration. The default strategy (`sliding_window_punct_conj_cleaned`) follows this flow:
1. **Normalization**: Line endings and whitespace are standardized
2. **Paragraph/Sentence Splitting**: Text is broken down structurally
3. **Fine-Grained Splitting**: Punctuation- and conjunction-based splitting
4. **Semantic Windowing**: AI-driven semantic boundary detection
5. **Cleaning**: Removal of non-semantic artifacts
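To make the flow concrete, here is how a short note might move through the stages. The chunk boundaries shown are illustrative only; actual output depends on the model and configuration, and the semantic windowing stage is omitted for brevity:

```text
Input:        "Patient has seizures and microcephaly, but no developmental delay."
Sentences:    ["Patient has seizures and microcephaly, but no developmental delay."]
Punctuation:  ["Patient has seizures and microcephaly", "but no developmental delay"]
Conjunctions: ["Patient has seizures", "and microcephaly", "but no developmental delay"]
Cleaning:     ["Patient has seizures", "microcephaly", "no developmental delay"]
```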
## Chunking Strategies
Phentrieve implements multiple chunking strategies, each building on the previous one to create increasingly granular semantic units.
### The Sliding Window Semantic Splitter

Class: `SlidingWindowSemanticSplitter`
This is the core innovation in Phentrieve's processing. Instead of arbitrary splits, it uses vector embeddings to detect semantic shifts in the text.
#### How It Works
1. **Tokenization**: The text segment is tokenized using simple whitespace splitting
2. **Windowing**: A sliding window (configurable size; default 7 tokens, step 1 token) moves across the text
3. **Embedding**: Each window is embedded using a fast SBERT model
4. **Coherence Check**: Cosine similarity is calculated between adjacent windows
5. **Splitting**: If similarity drops below the `splitting_threshold` (default 0.5), a split point is marked
6. **Negation Merging**: A heuristic pass re-merges splits that accidentally separated a negation term (e.g., "no") from its subject, using language-specific resources
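As a minimal sketch of steps 1-5, using `sentence-transformers` directly (the model name, function name, and boundary-placement heuristic are illustrative, not Phentrieve's internals):

```python
from sentence_transformers import SentenceTransformer, util

def sliding_window_split(text: str, model, window_size: int = 7,
                         step: int = 1, threshold: float = 0.5) -> list[str]:
    tokens = text.split()  # step 1: simple whitespace tokenization
    if len(tokens) <= window_size:
        return [text]

    # Step 2: overlapping windows of `window_size` tokens
    windows = [" ".join(tokens[i:i + window_size])
               for i in range(0, len(tokens) - window_size + 1, step)]

    # Step 3: embed all windows in one batch
    embeddings = model.encode(windows, convert_to_tensor=True)

    # Steps 4-5: mark a split where adjacent windows fall below the threshold
    split_points = []
    for i in range(len(embeddings) - 1):
        if util.cos_sim(embeddings[i], embeddings[i + 1]).item() < threshold:
            split_points.append(i + window_size)  # heuristic boundary placement

    # Reassemble token spans into segments (the real splitter also merges
    # nearby split points and enforces min_split_segment_length_words)
    segments, start = [], 0
    for point in split_points:
        segments.append(" ".join(tokens[start:point]))
        start = point
    segments.append(" ".join(tokens[start:]))
    return [s for s in segments if s]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any fast SBERT model
print(sliding_window_split("Seizures began at age two. Gait is normal today.", model))
```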
#### Configuration Parameters
```yaml
chunking_pipeline:
  - type: sliding_window
    config:
      window_size_tokens: 7              # Size of each embedding window
      step_size_tokens: 1                # Step size (1 = maximum overlap)
      splitting_threshold: 0.5           # Cosine similarity threshold
      min_split_segment_length_words: 3  # Minimum words per segment
```
**Performance Characteristics:**

- **Accuracy**: Highest semantic precision
- **Speed**: Slower due to embedding computation (~100-500ms per segment with GPU)
- **Memory**: Moderate (the embedding model must be loaded)
**When to Use:**

- Complex sentences with multiple phenotypic mentions
- Text without clear sentence boundaries
- When semantic precision is paramount
### Structural Chunkers

#### `FineGrainedPunctuationChunker`
**Purpose:** Splits on fine-grained punctuation (periods, commas, colons, semicolons, question and exclamation marks) while preserving special cases.
**Intelligent Handling:**

- **Decimal Numbers**: Preserves "1.5", "98.6", etc.
- **Abbreviations**: Preserves "Dr.", "vs.", "Ph.D.", "ie.", "eg.", etc.
- **Initials**: Preserves "A.B."-style initials
**Implementation:**

```python
# Splits on: . , : ; ? !
# But preserves:
abbreviations = [r"\bDr\.", r"\bMs\.", r"\bMr\.", r"\bMrs\.", r"\bPh\.D\.",
                 r"\bed\.", r"\bp\.", r"\bie\.", r"\beg\.", r"\bcf\.",
                 r"\bvs\.", r"\bSt\.", r"\bJr\.", r"\bSr\.", r"[A-Z]\.[A-Z]\."]
```
**Example:**

```text
Input:  "Patient has arachnodactyly, i.e., long fingers; heart rate is 98.6 bpm."
Output: ["Patient has arachnodactyly", "i.e., long fingers", "heart rate is 98.6 bpm"]
```
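A runnable sketch of one way to implement this protect-then-split behavior. The placeholder-masking approach and helper names are assumptions for illustration, not the actual implementation:

```python
import re

# Spans that must survive the split: decimals, abbreviations, initials.
# "i.e."/"e.g." keep an optional trailing comma so they stay attached
# to the clause that follows, matching the example above.
PROTECTED = re.compile(
    r"\d+\.\d+"
    r"|\b(?:i\.e\.|e\.g\.),?"
    r"|\b(?:Dr|Ms|Mr|Mrs|vs|St|Jr|Sr|ie|eg|cf|ed|p)\."
    r"|\bPh\.D\."
    r"|[A-Z]\.[A-Z]\."
)

def punctuation_split(text: str) -> list[str]:
    saved: list[str] = []

    def shelve(m: re.Match) -> str:
        saved.append(m.group(0))
        return f"\x00{len(saved) - 1}\x00"  # opaque placeholder

    masked = PROTECTED.sub(shelve, text)
    parts = re.split(r"[.,:;?!]\s*", masked)  # split on: . , : ; ? !

    def restore(s: str) -> str:
        return re.sub(r"\x00(\d+)\x00", lambda m: saved[int(m.group(1))], s)

    return [restore(p).strip() for p in parts if p.strip()]

print(punctuation_split(
    "Patient has arachnodactyly, i.e., long fingers; heart rate is 98.6 bpm."))
# ['Patient has arachnodactyly', 'i.e., long fingers', 'heart rate is 98.6 bpm']
```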
#### `ConjunctionChunker`

**Purpose:** Splits before coordinating conjunctions while keeping the conjunction with the following chunk.
**Language Support:**

- English: "and", "but", "or", "nor", "for", "yet", "so"
- German: "und", "aber", "oder", "denn", "sondern"
- Other languages: Loaded from `coordinating_conjunctions.json`
**Splitting Logic:**

```python
# Pattern: split before " conjunction " (case-insensitive, word boundaries).
# A non-capturing group keeps re.split from also returning the matched
# conjunction as a separate list element.
split_pattern = r"\s+(?=\b(?:and|but|or|nor|for|yet|so)\b\s+)"
```
**Example:**

```text
Input:  "Patient has seizures but no developmental delay"
Output: ["Patient has seizures", "but no developmental delay"]
```
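The same rule as a runnable snippet, using the English conjunction list above (not Phentrieve's internal code):

```python
import re

CONJUNCTIONS = ["and", "but", "or", "nor", "for", "yet", "so"]
# Split at whitespace immediately followed by a conjunction, so the
# conjunction stays attached to the following chunk
PATTERN = re.compile(
    r"\s+(?=\b(?:" + "|".join(CONJUNCTIONS) + r")\b\s+)", re.IGNORECASE
)

print(PATTERN.split("Patient has seizures but no developmental delay"))
# ['Patient has seizures', 'but no developmental delay']
```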
### Final Chunk Cleaner

Class: `FinalChunkCleaner`

Post-processes chunks to remove "low semantic value" content.
#### Cleaning Operations
- **Leading Stopword Removal**: Strips "the", "a", "an", "with", etc. from the beginning
- **Trailing Stopword Removal**: Strips conjunctions and articles from the end
- **Low-Value Chunk Filtering**: Removes chunks consisting entirely of stopwords
- **Length Filtering**: Removes chunks shorter than `min_cleaned_chunk_length_chars` (default: 1)
#### Multi-Pass Cleaning

The cleaner performs up to `max_cleanup_passes` passes (default: 3) to handle nested cases:
```text
Pass 1: "the and the patient" → "and the patient"
Pass 2: "and the patient"     → "the patient"
Pass 3: "the patient"         → "patient" (final)
```
#### Language Resources

Stopwords are loaded from per-language JSON resources:

- `leading_cleanup_words.json`: Words to remove from the start
- `trailing_cleanup_words.json`: Words to remove from the end
- `low_value_words.json`: Words indicating low semantic content
**Example:**

```text
Input:  ["the patient", "with seizures", "and", "a small head"]
Output: ["patient", "seizures", "small head"]
```
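A simplified sketch of these operations, with small inline stopword sets standing in for the JSON resources:

```python
LEADING = {"the", "a", "an", "with", "and", "but", "or"}
TRAILING = {"the", "a", "an", "and", "but", "or", "with"}
LOW_VALUE = {"the", "a", "an", "and", "or", "but", "with"}

def clean_chunks(chunks: list[str], max_passes: int = 3,
                 min_length_chars: int = 1) -> list[str]:
    cleaned = []
    for chunk in chunks:
        words = chunk.split()
        for _ in range(max_passes):
            changed = False
            if words and words[0].lower() in LEADING:
                words, changed = words[1:], True   # strip leading stopword
            if words and words[-1].lower() in TRAILING:
                words, changed = words[:-1], True  # strip trailing stopword
            if not changed:
                break
        # Drop chunks made entirely of low-value words
        if words and all(w.lower() in LOW_VALUE for w in words):
            continue
        result = " ".join(words)
        if len(result) >= min_length_chars:  # length filtering
            cleaned.append(result)
    return cleaned

print(clean_chunks(["the patient", "with seizures", "and", "a small head"]))
# ['patient', 'seizures', 'small head']
```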
## Assertion Detection

The `CombinedAssertionDetector` determines whether a phenotype mention is:
- Affirmed: "Patient has seizures"
- Negated: "No sign of seizures"
- Normal: "Heart sounds are normal"
- Uncertain: "Possible evidence of delay"
Phentrieve implements the ConText algorithm (from medspaCy) with direction-aware scope detection and TERMINATE boundaries for precise negation detection.
**Detailed Documentation:** For comprehensive technical details about the ConText implementation, see Clinical Negation Detection with ConText.
### Quick Overview

#### Two-Tier Detection Strategy
```text
┌─────────────────────────────────────┐
│ 1. DependencyAssertionDetector      │ ← Primary (highest accuracy)
│    Uses: spaCy dependency parsing   │
│    Handles: Complex grammar         │
└─────────────────────────────────────┘
                 ↓ (fallback if inconclusive)
┌─────────────────────────────────────┐
│ 2. KeywordAssertionDetector         │ ← Fallback (fast, accurate)
│    Uses: ConText rules              │
│    Handles: Direction + boundaries  │
└─────────────────────────────────────┘
```
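In outline, the combined detector's control flow looks like this (a sketch; the "inconclusive" sentinel and method signatures are assumptions, not the actual interfaces):

```python
def detect_assertion(text, dependency_detector, keyword_detector):
    # Try the dependency-parse detector first (highest accuracy)
    status, details = dependency_detector.detect(text)
    if status is not None:
        return status, details
    # Fall back to the fast ConText keyword rules when inconclusive
    return keyword_detector.detect(text)
```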
#### ConText Features

**Direction-Aware Detection:**

- **FORWARD**: "No fever" → negates text AFTER the trigger
- **BACKWARD**: "Fever ruled out" → negates text BEFORE the trigger
- **BIDIRECTIONAL**: French "ne...pas" → negates both sides
**TERMINATE Boundaries:** Terminator terms such as "but" end a trigger's scope, so in "no fever but persistent cough" only "fever" is negated.

**PSEUDO Prevention:** Pseudo-trigger phrases (e.g., "no increase") are matched first so they do not fire as genuine negation triggers.
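A condensed sketch of FORWARD scope handling with TERMINATE and PSEUDO checks. The trigger words are illustrative examples, not the shipped rule set; BACKWARD and BIDIRECTIONAL work analogously in the opposite or both directions:

```python
TRIGGERS = {"no": "FORWARD"}   # e.g., "no" negates what follows
TERMINATE = {"but"}            # scope stops at these tokens
PSEUDO = {"no increase"}       # phrases that must not fire as triggers

def negated_token_indices(tokens: list[str]) -> set[int]:
    negated: set[int] = set()
    for i, tok in enumerate(tokens):
        # PSEUDO check: skip triggers embedded in a pseudo phrase
        if " ".join(tokens[i:i + 2]).lower() in PSEUDO:
            continue
        if TRIGGERS.get(tok.lower()) == "FORWARD":
            # Scope extends forward until a TERMINATE token or segment end
            for j in range(i + 1, len(tokens)):
                if tokens[j].lower() in TERMINATE:
                    break
                negated.add(j)
    return negated

tokens = "No fever but persistent cough".split()
print([tokens[i] for i in sorted(negated_token_indices(tokens))])  # ['fever']
```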
#### Multilingual Support

- 122 ConText rules across 5 languages (EN, DE, ES, FR, NL)
- Language-specific rules live in `phentrieve/text_processing/default_lang_resources/`
- Automatic fallback to English if language rules are unavailable
#### Configuration

```python
from phentrieve.text_processing.assertion_detection import KeywordAssertionDetector

# Language-specific detection
detector = KeywordAssertionDetector(language="de")
status, details = detector.detect("Ausschluss von Krampfanfällen")
# Returns: AssertionStatus.NEGATED
```
**ConText Rule Files:**

- `context_rules_en.json` - English (26 rules)
- `context_rules_de.json` - German (26 rules, resolves issue #79)
- `context_rules_es.json` - Spanish (24 rules)
- `context_rules_fr.json` - French (24 rules)
- `context_rules_nl.json` - Dutch (22 rules)
- `normality_cues.json` - Phentrieve-specific normalcy detection
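For orientation only: a rule entry conceptually carries a trigger literal, a category, and a direction. The field names below are hypothetical placeholders, not the real schema:

```json
{
  "_comment": "hypothetical field names - see the Negation Detection docs for the real schema",
  "literal": "ausschluss von",
  "category": "NEGATED_EXISTENCE",
  "direction": "FORWARD"
}
```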
For detailed information about rule format, detection logic, TERMINATE handling, and examples, see the Negation Detection documentation.
## Complete Pipeline Configuration

### Default Strategy: `sliding_window_punct_conj_cleaned`
```yaml
chunking_pipeline:
  # 1. Normalize whitespace and line endings
  - type: paragraph                  # Split on double newlines
  # 2. Sentence boundaries
  - type: sentence                   # Split on sentence boundaries (spaCy-based)
  # 3. Fine-grained structural splitting
  - type: fine_grained_punctuation   # Split on commas, semicolons, etc.
  - type: conjunction                # Split before coordinating conjunctions
  # 4. Semantic splitting
  - type: sliding_window
    config:
      window_size_tokens: 7
      step_size_tokens: 1
      splitting_threshold: 0.5
      min_split_segment_length_words: 3
  # 5. Clean up non-semantic elements
  - type: final_chunk_cleaner
    config:
      min_cleaned_chunk_length_chars: 1
      filter_short_low_value_chunks_max_words: 2
      max_cleanup_passes: 3
```
### Alternative Strategies

#### `simple` - Basic Structural Splitting

**Use Case:** Well-structured clinical notes with clear sentence boundaries
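No config block is given here for this strategy; a plausible minimal pipeline, assuming it consists only of the structural splitters, would be:

```yaml
chunking_pipeline:
  - type: paragraph   # split on double newlines
  - type: sentence    # spaCy-based sentence boundaries
```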
#### `sliding_window` - Pure Semantic

```yaml
chunking_pipeline:
  - type: paragraph
  - type: sliding_window
    config:
      window_size_tokens: 10     # Larger windows for less aggressive splitting
      splitting_threshold: 0.4   # Lower threshold = fewer splits
```
## CLI Override Parameters
You can override pipeline configuration via CLI:
```bash
# Override sliding window parameters
phentrieve text process "..." \
  --strategy sliding_window_punct_conj_cleaned \
  --window-size 10 \
  --step-size 2 \
  --threshold 0.4 \
  --min-segment 5

# Use a different strategy
phentrieve text process "..." --strategy simple
```
**Important:** CLI overrides apply to ALL pipeline stages that use those parameters, not just the named strategy.
## Performance Considerations

### Speed Comparison (per 1000-word document)
| Strategy | CPU Time | GPU Time | Memory |
|---|---|---|---|
| `simple` | ~50ms | ~50ms | Low |
| `fine_grained_punctuation` | ~100ms | ~100ms | Low |
| `conjunction` | ~150ms | ~150ms | Low |
| `sliding_window` | ~2000ms | ~300ms | High |
| `sliding_window_punct_conj_cleaned` | ~2500ms | ~400ms | High |
### Optimization Tips

- **Use a GPU**: Provides a 5-10x speedup for semantic splitting
- **Increase the Window Step**: A larger `step_size_tokens` is faster but less precise
- **Lower the Threshold**: A lower `splitting_threshold` yields fewer splits and less downstream work
- **Batch Processing**: Process multiple documents in parallel, as sketched below
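For the batch-processing tip, a thread pool that shares one pipeline instance (and therefore one loaded model, as the Model Loading note at the end recommends) is a reasonable starting point; `pipeline` here is built as in the next section:

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(pipeline, documents: list[str]) -> list[list[str]]:
    # One shared pipeline means the embedding model is loaded only once
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(pipeline.chunk_text, documents))
```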
## Custom Pipeline Creation
For advanced use cases, create custom pipelines programmatically:
```python
from phentrieve.text_processing.chunkers import (
    ParagraphChunker,
    SentenceChunker,
    SlidingWindowSemanticSplitter,
    FinalChunkCleaner,
)
from phentrieve.text_processing.pipeline import TextProcessingPipeline
from phentrieve.embeddings import get_model

# Load the embedding model used for semantic splitting
model = get_model("FremyCompany/BioLORD-2023-M")

# Create a custom chunker chain
chunkers = [
    ParagraphChunker(language="en"),
    SentenceChunker(language="en"),
    SlidingWindowSemanticSplitter(
        language="en",
        model=model,
        window_size_tokens=10,
        splitting_threshold=0.3,  # split only on sharp semantic shifts
    ),
    FinalChunkCleaner(
        language="en",
        min_cleaned_chunk_length_chars=5,
    ),
]

# Create the pipeline
pipeline = TextProcessingPipeline(chunkers=chunkers, language="en")

# Process text
chunks = pipeline.chunk_text("Your clinical text here...")
```
## Language Support
The pipeline adapts to different languages via:
- spaCy Models: Language-specific sentence splitting and dependency parsing
- Resource Files: Language-specific stopwords, conjunctions, negation keywords
- Model Selection: Use language-specific or multilingual embedding models
**Supported Languages:**

- English (`en`)
- German (`de`)
- Spanish (`es`)
- French (`fr`)
- Dutch (`nl`)
To add a new language:

1. Install a spaCy model: `python -m spacy download {lang}_core_news_sm` (English models use the `_core_web_sm` suffix)
2. Add language resources to `phentrieve/text_processing/resources/`
3. Configure the language in `phentrieve.yaml`
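For step 2, the resource files are plain JSON word lists; as a hypothetical illustration of adding Italian conjunctions (the actual schema may differ, so mirror an existing shipped file):

```json
{
  "_comment": "hypothetical shape - copy the structure of an existing language file",
  "conjunctions": ["e", "ma", "o", "quindi"]
}
```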
**GPU Acceleration:** The semantic sliding window chunker benefits significantly from GPU acceleration. On a modern GPU, processing time drops from ~2.5s to ~0.4s per 1000-word document.

**Model Loading:** The `SlidingWindowSemanticSplitter` requires a SentenceTransformer model to be loaded into memory (~400MB). For processing large batches, reuse the same model instance across documents.