Text Processing Pipeline

Phentrieve utilizes a sophisticated, configurable pipeline to transform raw clinical text into analyzable semantic units. This document explains the architecture, implementation details, and customization options.

Architecture Overview

The pipeline operates in sequential stages defined in your configuration. The default strategy (sliding_window_punct_conj_cleaned) follows this flow:

  1. Normalization: Line endings and whitespace are standardized
  2. Paragraph/Sentence Splitting: Text is broken down structurally
  3. Fine-Grained Splitting: Text is split on punctuation and before coordinating conjunctions
  4. Semantic Windowing: AI-driven semantic boundary detection
  5. Cleaning: Removal of non-semantic artifacts

Chunking Strategies

Phentrieve implements multiple chunking strategies, each building on the previous one to create increasingly granular semantic units.

The Sliding Window Semantic Splitter

Class: SlidingWindowSemanticSplitter

This is the core innovation in Phentrieve's processing. Instead of arbitrary splits, it uses vector embeddings to detect semantic shifts in the text.

How It Works

  1. Tokenization: The text segment is tokenized using simple whitespace splitting
  2. Windowing: A sliding window (configurable size, default=7 tokens, step=1 token) moves across the text
  3. Embedding: Each window is embedded using a fast SBERT model
  4. Coherence Check: Cosine similarity is calculated between adjacent windows
  5. Splitting: If similarity drops below the splitting_threshold (default 0.5), a split point is marked
  6. Negation Merging: A heuristic pass re-merges splits that accidentally separated a negation term (e.g., "no") from its subject using language-specific resources
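
A minimal sketch of steps 2-5, using the sentence-transformers library directly. This illustrates the mechanism rather than Phentrieve's actual implementation; the model name and the token-index bookkeeping are assumptions:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

def find_split_points(text, model, window_size=7, step=1, threshold=0.5):
    tokens = text.split()  # step 1: simple whitespace tokenization
    starts = range(0, max(len(tokens) - window_size + 1, 1), step)
    windows = [" ".join(tokens[i:i + window_size]) for i in starts]  # step 2
    embeddings = model.encode(windows)  # step 3: embed every window
    splits = []
    for i in range(len(embeddings) - 1):
        similarity = float(cos_sim(embeddings[i], embeddings[i + 1]))  # step 4
        if similarity < threshold:  # step 5: semantic shift detected
            splits.append(i + window_size)  # approximate token index of the shift
    return splits

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
text = "Patient presents with tall stature and arachnodactyly. Family history is unremarkable."
print(find_split_points(text, model))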

Configuration Parameters

chunking_pipeline:
  - type: sliding_window
    config:
      window_size_tokens: 7        # Size of each embedding window
      step_size_tokens: 1          # Step size (1 = maximum overlap)
      splitting_threshold: 0.5     # Cosine similarity threshold
      min_split_segment_length_words: 3  # Minimum words per segment

Performance Characteristics:

  • Accuracy: Highest semantic precision
  • Speed: Slower due to embedding computation (~100-500ms per segment with GPU)
  • Memory: Moderate (model must be loaded)

When to Use:

  • Complex sentences with multiple phenotypic mentions
  • Text without clear sentence boundaries
  • When semantic precision is paramount

Structural Chunkers

FineGrainedPunctuationChunker

Purpose: Splits on periods, commas, semicolons, colons, question marks, and exclamation marks while preserving special cases.

Intelligent Handling:

  • Decimal Numbers: Preserves "1.5", "98.6", etc.
  • Abbreviations: Preserves "Dr.", "vs.", "Ph.D.", "ie.", "eg.", etc.
  • Initials: Preserves "A.B." style initials

Implementation:

# Splits on: . , : ; ? !
# But preserves:
abbreviations = [r"\bDr\.", r"\bMs\.", r"\bMr\.", r"\bMrs\.", r"\bPh\.D\.",
                 r"\bed\.", r"\bp\.", r"\bie\.", r"\beg\.", r"\bcf\.",
                 r"\bvs\.", r"\bSt\.", r"\bJr\.", r"\bSr\.", r"[A-Z]\.[A-Z]\."]

Example:

Input:  "Patient has arachnodactyly, i.e., long fingers; heart rate is 98.6 bpm."
Output: ["Patient has arachnodactyly", "i.e., long fingers", "heart rate is 98.6 bpm"]

ConjunctionChunker

Purpose: Splits before coordinating conjunctions while keeping the conjunction with the following chunk.

Language Support:

  • English: "and", "but", "or", "nor", "for", "yet", "so"
  • German: "und", "aber", "oder", "denn", "sondern"
  • Other languages: Loaded from coordinating_conjunctions.json

Splitting Logic:

# Pattern: Split before " conjunction " (case-insensitive, word boundaries)
split_pattern = r"\s+(?=(\b(?:and|but|or|...)\b\s+))"

Example:

Input:  "Patient has seizures but no developmental delay"
Output: ["Patient has seizures", "but no developmental delay"]

Final Chunk Cleaner

Class: FinalChunkCleaner

Post-processes chunks to remove "low semantic value" content.

Cleaning Operations

  1. Leading Stopword Removal: Strips "the", "a", "an", "with", etc. from the beginning
  2. Trailing Stopword Removal: Strips conjunctions and articles from the end
  3. Low-Value Chunk Filtering: Removes chunks consisting entirely of stop words
  4. Length Filtering: Removes chunks shorter than min_cleaned_chunk_length_chars (default: 1)

Multi-Pass Cleaning

The cleaner makes up to max_cleanup_passes passes (default: 3) to handle nested cases:

Pass 1: "the and the patient" → "and the patient"
Pass 2: "and the patient" → "the patient"
Pass 3: "the patient" → "patient" (final)

Language Resources

Stopwords are loaded from JSON resources per language:

  • leading_cleanup_words.json: Words to remove from the start
  • trailing_cleanup_words.json: Words to remove from the end
  • low_value_words.json: Words indicating low semantic content

Example:

Input:  ["the patient", "with seizures", "and", "a small head"]
Output: ["patient", "seizures", "small head"]

Assertion Detection

The CombinedAssertionDetector determines if a phenotype is:

  • Affirmed: "Patient has seizures"
  • Negated: "No sign of seizures"
  • Normal: "Heart sounds are normal"
  • Uncertain: "Possible evidence of delay"
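
A hedged usage sketch: the class and module names appear on this page, but the constructor signature and return shape are assumptions mirrored from the KeywordAssertionDetector example further below:

from phentrieve.text_processing.assertion_detection import CombinedAssertionDetector

detector = CombinedAssertionDetector(language="en")
for text in ["Patient has seizures", "No sign of seizures",
             "Heart sounds are normal", "Possible evidence of delay"]:
    status, details = detector.detect(text)
    print(f"{text!r} -> {status}")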

Phentrieve implements the ConText algorithm (from medspaCy) with direction-aware scope detection and TERMINATE boundaries for precise negation detection.

Detailed Documentation

For comprehensive technical details about the ConText implementation, see Clinical Negation Detection with ConText.

Quick Overview

Two-Tier Detection Strategy

┌─────────────────────────────────────┐
│  1. DependencyAssertionDetector     │ ← Primary (highest accuracy)
│     Uses: spaCy dependency parsing  │
│     Handles: Complex grammar        │
└─────────────────────────────────────┘
              ↓ (fallback if inconclusive)
┌─────────────────────────────────────┐
│  2. KeywordAssertionDetector        │ ← Fallback (fast, accurate)
│     Uses: ConText rules             │
│     Handles: Direction + boundaries │
└─────────────────────────────────────┘

ConText Features

Direction-Aware Detection:

  • FORWARD: "No fever" → negates text AFTER trigger
  • BACKWARD: "Fever ruled out" → negates text BEFORE trigger
  • BIDIRECTIONAL: French "ne...pas" → negates both sides

TERMINATE Boundaries:

"No fever but has cough"
    ↓     ↓
 negated  affirmed (scope stops at "but")

PSEUDO Prevention:

"Not only fever" → "not only" is PSEUDO, prevents false negation

Multilingual Support

  • 122 ConText rules across 5 languages (EN, DE, ES, FR, NL)
  • Language-specific rules in phentrieve/text_processing/default_lang_resources/
  • Automatic fallback to English if language rules unavailable
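
The fallback amounts to a path check before loading. A minimal sketch, in which only the directory and file names come from this page and the loader function itself is illustrative:

import json
from pathlib import Path

RESOURCE_DIR = Path("phentrieve/text_processing/default_lang_resources")

def load_context_rules(language: str) -> dict:
    path = RESOURCE_DIR / f"context_rules_{language}.json"
    if not path.exists():
        path = RESOURCE_DIR / "context_rules_en.json"  # fall back to English
    return json.loads(path.read_text(encoding="utf-8"))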

Configuration

from phentrieve.text_processing.assertion_detection import KeywordAssertionDetector

# Language-specific detection
detector = KeywordAssertionDetector(language="de")
status, details = detector.detect("Ausschluss von Krampfanfällen")
# status == AssertionStatus.NEGATED ("Ausschluss von Krampfanfällen" = "exclusion of seizures")

ConText Rule Files:

  • context_rules_en.json - English (26 rules)
  • context_rules_de.json - German (26 rules, resolves issue #79)
  • context_rules_es.json - Spanish (24 rules)
  • context_rules_fr.json - French (24 rules)
  • context_rules_nl.json - Dutch (22 rules)
  • normality_cues.json - Phentrieve-specific normalcy detection

For detailed information about rule format, detection logic, TERMINATE handling, and examples, see the Negation Detection documentation.

Complete Pipeline Configuration

Default Strategy: sliding_window_punct_conj_cleaned

chunking_pipeline:
  # 1. Paragraph boundaries (whitespace and line endings are normalized beforehand)
  - type: paragraph        # Split on double newlines

  # 2. Sentence boundaries
  - type: sentence         # Split on sentence boundaries (spaCy-based)

  # 3. Fine-grained structural splitting
  - type: fine_grained_punctuation  # Split on commas, semicolons, etc.
  - type: conjunction                # Split before coordinating conjunctions

  # 4. Semantic splitting
  - type: sliding_window
    config:
      window_size_tokens: 7
      step_size_tokens: 1
      splitting_threshold: 0.5
      min_split_segment_length_words: 3

  # 5. Cleanup non-semantic elements
  - type: final_chunk_cleaner
    config:
      min_cleaned_chunk_length_chars: 1
      filter_short_low_value_chunks_max_words: 2
      max_cleanup_passes: 3

Alternative Strategies

simple - Basic Structural Splitting

chunking_pipeline:
  - type: paragraph
  - type: sentence

Use Case: Well-structured clinical notes with clear sentence boundaries

sliding_window - Pure Semantic

chunking_pipeline:
  - type: paragraph
  - type: sliding_window
    config:
      window_size_tokens: 10  # Larger windows for less aggressive splitting
      splitting_threshold: 0.4  # Lower threshold = fewer splits

Use Case: Text without punctuation, voice transcriptions

CLI Override Parameters

You can override pipeline configuration via CLI:

# Override sliding window parameters
phentrieve text process "..." \
  --strategy sliding_window_punct_conj_cleaned \
  --window-size 10 \
  --step-size 2 \
  --threshold 0.4 \
  --min-segment 5

# Use different strategy
phentrieve text process "..." --strategy simple

Important: CLI overrides apply to ALL stages in the pipeline that use those parameters, not just the named strategy.

Performance Considerations

Speed Comparison (per 1000-word document)

Strategy                            CPU Time   GPU Time   Memory
simple                              ~50ms      ~50ms      Low
fine_grained_punctuation            ~100ms     ~100ms     Low
conjunction                         ~150ms     ~150ms     Low
sliding_window                      ~2000ms    ~300ms     High
sliding_window_punct_conj_cleaned   ~2500ms    ~400ms     High

Optimization Tips

  1. Use GPU: Provides 5-10x speedup for semantic splitting
  2. Increase Window Step: Larger step_size_tokens = faster but less precise
  3. Decrease the Threshold: A lower splitting_threshold yields fewer splits (a split requires similarity to drop below it), reducing downstream work
  4. Batch Processing: Process multiple documents in parallel

Custom Pipeline Creation

For advanced use cases, create custom pipelines programmatically:

from phentrieve.text_processing.chunkers import (
    ParagraphChunker,
    SentenceChunker,
    SlidingWindowSemanticSplitter,
    FinalChunkCleaner
)
from phentrieve.text_processing.pipeline import TextProcessingPipeline
from phentrieve.embeddings import get_model

# Load embedding model for semantic splitting
model = get_model("FremyCompany/BioLORD-2023-M")

# Create custom chunker chain
chunkers = [
    ParagraphChunker(language="en"),
    SentenceChunker(language="en"),
    SlidingWindowSemanticSplitter(
        language="en",
        model=model,
        window_size_tokens=10,
        splitting_threshold=0.3  # Split only at pronounced semantic shifts
    ),
    FinalChunkCleaner(
        language="en",
        min_cleaned_chunk_length_chars=5
    )
]

# Create pipeline
pipeline = TextProcessingPipeline(
    chunkers=chunkers,
    language="en"
)

# Process text
chunks = pipeline.chunk_text("Your clinical text here...")

Language Support

The pipeline adapts to different languages via:

  1. spaCy Models: Language-specific sentence splitting and dependency parsing
  2. Resource Files: Language-specific stopwords, conjunctions, negation keywords
  3. Model Selection: Use language-specific or multilingual embedding models

Supported Languages:

  • English (en)
  • German (de)
  • Spanish (es)
  • French (fr)
  • Dutch (nl)

To add a new language:

  1. Install the spaCy model: python -m spacy download {lang}_core_web_sm
  2. Add language resources to phentrieve/text_processing/resources/
  3. Configure in phentrieve.yaml

GPU Acceleration

The semantic sliding window chunker benefits significantly from GPU acceleration. On a modern GPU, processing time can be reduced from ~2.5s to ~0.4s per 1000-word document.

Model Loading

The SlidingWindowSemanticSplitter requires a SentenceTransformer model to be loaded into memory (~400MB). For processing large batches, consider reusing the same model instance across documents.
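
For example, the custom-pipeline API shown above can be driven batch-style with a single shared model instance (the two-document list is illustrative, and the constructor defaults shown earlier are assumed to apply):

from phentrieve.embeddings import get_model
from phentrieve.text_processing.chunkers import (
    ParagraphChunker,
    SlidingWindowSemanticSplitter,
)
from phentrieve.text_processing.pipeline import TextProcessingPipeline

# Load the embedding model once (~400MB) and share it across documents
model = get_model("FremyCompany/BioLORD-2023-M")

pipeline = TextProcessingPipeline(
    chunkers=[
        ParagraphChunker(language="en"),
        SlidingWindowSemanticSplitter(language="en", model=model),
    ],
    language="en",
)

documents = ["First clinical note ...", "Second clinical note ..."]
all_chunks = [pipeline.chunk_text(doc) for doc in documents]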