Text Processing Guide
Phentrieve includes robust text processing capabilities for extracting HPO terms from clinical text. This guide explains how to use these features and customize the text processing pipeline.
Text Processing Overview
The text processing pipeline in Phentrieve follows these steps:
- Text Chunking: Divides the input text into manageable chunks
- Embedding Generation: Converts each chunk into a vector representation
- HPO Term Retrieval: Finds relevant HPO terms for each chunk
- Assertion Detection: Determines the status of each term (affirmed, negated, etc.)
- Evidence Aggregation: Combines evidence from multiple chunks for the same HPO term
- Result Filtering: Filters results based on confidence thresholds
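The last two steps can be pictured with a small sketch. This is not Phentrieve's implementation: the function name `aggregate_evidence`, the use of the maximum similarity as the aggregated confidence, and the dictionary field names (borrowed from the example output later in this guide) are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_evidence(chunks, min_confidence=0.0):
    """Group per-chunk HPO matches by term ID and keep the best score.

    Assumption: the aggregated confidence is the maximum similarity seen
    for a term. Phentrieve's real aggregation rule may differ.
    """
    grouped = defaultdict(list)
    for chunk in chunks:
        for term in chunk["hpo_terms"]:
            grouped[term["id"]].append(term)

    results = []
    for term_id, matches in grouped.items():
        best = max(matches, key=lambda m: m["similarity"])
        if best["similarity"] < min_confidence:
            continue  # result filtering: drop low-confidence terms
        results.append({
            "id": term_id,
            "name": best["name"],
            "confidence": best["similarity"],
            "evidence_count": len(matches),
            "assertion": best["assertion"],
        })
    # Highest-confidence terms first
    return sorted(results, key=lambda r: r["confidence"], reverse=True)


chunks = [
    {"text": "The patient exhibits microcephaly",
     "hpo_terms": [{"id": "HP:0000252", "name": "Microcephaly",
                    "similarity": 0.85, "assertion": "affirmed"}]},
    {"text": "frequent seizures",
     "hpo_terms": [{"id": "HP:0001250", "name": "Seizures",
                    "similarity": 0.78, "assertion": "affirmed"}]},
]
aggregated = aggregate_evidence(chunks, min_confidence=0.8)
print(aggregated)
```

With a threshold of 0.8, only the microcephaly match survives; lowering the threshold to 0.0 keeps both terms.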
Chunking Strategies
Phentrieve provides multiple text chunking strategies that can be combined in a pipeline:
Simple Chunking
Divides text into paragraphs, then sentences.
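As a rough illustration of paragraph-then-sentence splitting, the sketch below uses blank lines as paragraph boundaries and a naive punctuation rule for sentences. The function name `simple_chunks` and both splitting rules are assumptions; Phentrieve's own chunkers may use language-aware sentence segmentation instead.

```python
import re


def simple_chunks(text):
    """Split text into paragraphs (blank-line separated), then sentences."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        # Naive sentence boundary: '.', '!' or '?' followed by whitespace
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            if sentence:
                chunks.append(sentence)
    return chunks


note = "Patient has microcephaly. Seizures began at age 2.\n\nNo hearing loss."
print(simple_chunks(note))
```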
Semantic Chunking
More advanced chunking that divides text into semantic units:
- Divides text into paragraphs
- Splits paragraphs into sentences
- Uses semantic similarity to further split sentences into meaningful chunks
Detailed Chunking
Even more fine-grained chunking:
- Divides text into paragraphs
- Splits paragraphs into sentences
- Uses punctuation to create fine-grained segments
- Applies semantic splitting to those segments
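The punctuation step might look roughly like the following. The helper name `punctuation_segments` and the choice of commas, semicolons, and colons as split points are assumptions for illustration; the semantic splitting that follows is not shown.

```python
import re


def punctuation_segments(sentence):
    """Split a sentence into segments at commas, semicolons, and colons."""
    parts = re.split(r"\s*[,;:]\s*", sentence)
    return [part.strip() for part in parts if part.strip()]


print(punctuation_segments("Microcephaly, frequent seizures; no hearing loss."))
```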
Sliding Window Chunking
Customizable semantic sliding window approach:
phentrieve text process --strategy sliding_window --window-size 128 --step-size 64 "Patient text here..."
Note on command-line parameters: --window-size, --step-size, --threshold, and --min-segment override the configuration for every strategy, not just "sliding_window".
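To make the window and step sizes concrete, here is a minimal sketch of overlapping windows over a token list. It assumes sizes are counted in tokens and leaves out the semantic part (the similarity threshold and minimum segment length); the function name `sliding_windows` is hypothetical and not part of Phentrieve.

```python
def sliding_windows(tokens, window_size=128, step_size=64):
    """Return overlapping token windows; the last window may be shorter."""
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    windows = []
    for start in range(0, len(tokens), step_size):
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # the text is fully covered
    return windows


tokens = [f"t{i}" for i in range(10)]
for window in sliding_windows(tokens, window_size=4, step_size=2):
    print(window)
```

With a step size of half the window size, as in the command above, each token away from the edges appears in two windows, so a phrase cut by one window boundary is seen whole in the neighbouring window.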
Assertion Detection
Phentrieve can detect the assertion status of each identified HPO term:
- Affirmed: The phenotype is positively mentioned (default)
- Negated: The phenotype is explicitly negated (e.g., "no microcephaly", "denies seizures")
- Normal: The finding is described as normal or within normal limits
- Uncertain: The phenotype is mentioned with uncertainty
Assertion detection uses both keyword-based and dependency-based approaches with a priority-based logic:
- Dependency-based negation has highest priority
- Context-specific keywords have second priority
- General negation/uncertainty keywords have lowest priority
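The priority order above can be sketched as a simple fall-through. The names `Assertion` and `resolve_assertion` are hypothetical, and the sketch assumes each detector returns either a status or `None` when it finds nothing; the actual detectors and their outputs are not shown here.

```python
from enum import Enum


class Assertion(str, Enum):
    AFFIRMED = "affirmed"
    NEGATED = "negated"
    NORMAL = "normal"
    UNCERTAIN = "uncertain"


def resolve_assertion(dependency=None, context_keyword=None, general_keyword=None):
    """Return the highest-priority signal that fired, else the default."""
    # Order encodes priority: dependency > context keywords > general keywords
    for signal in (dependency, context_keyword, general_keyword):
        if signal is not None:
            return signal
    return Assertion.AFFIRMED  # default when no detector fires


status = resolve_assertion(
    dependency=Assertion.NEGATED,
    general_keyword=Assertion.UNCERTAIN,
)
print(status.value)
```

Here the dependency-based result wins over the lower-priority keyword match, so the term is reported as negated.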
Processing Clinical Text
Basic Usage
# Process text directly
phentrieve text process "The patient exhibits microcephaly and frequent seizures."
# Process a text file
phentrieve text process --input-file clinical_notes.txt --output-file results.json
Filtering Options
# Set minimum confidence threshold
phentrieve text process --min-confidence 0.4 "Patient text here..."
# Return only the highest-scoring HPO term for each chunk
phentrieve text process --top-term-per-chunk "Patient text here..."
# Specify language for better chunking and assertion detection
phentrieve text process --language de "Der Patient zeigt Mikrozephalie."
Output Formats
# Output as JSON (default)
phentrieve text process --output-format json "Patient text here..."
# Output as CSV
phentrieve text process --output-format csv "Patient text here..."
Example Output
{
  "input_text": "The patient exhibits microcephaly and frequent seizures.",
  "processed_chunks": [
    {
      "text": "The patient exhibits microcephaly",
      "hpo_terms": [
        {
          "id": "HP:0000252",
          "name": "Microcephaly",
          "similarity": 0.85,
          "assertion": "affirmed"
        }
      ]
    },
    {
      "text": "frequent seizures",
      "hpo_terms": [
        {
          "id": "HP:0001250",
          "name": "Seizures",
          "similarity": 0.78,
          "assertion": "affirmed"
        }
      ]
    }
  ],
  "aggregated_results": [
    {
      "id": "HP:0000252",
      "name": "Microcephaly",
      "confidence": 0.85,
      "evidence_count": 1,
      "assertion": "affirmed"
    },
    {
      "id": "HP:0001250",
      "name": "Seizures",
      "confidence": 0.78,
      "evidence_count": 1,
      "assertion": "affirmed"
    }
  ]
}
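Downstream code can read this JSON with the standard library. The sketch below assumes the field names shown in the example output above and uses a hypothetical helper, `terms_with_status`, to pick out term IDs by assertion status and confidence.

```python
import json

# A trimmed copy of the example output; a real run would load the file
# written with --output-file instead.
raw = """
{
  "aggregated_results": [
    {"id": "HP:0000252", "name": "Microcephaly", "confidence": 0.85,
     "evidence_count": 1, "assertion": "affirmed"},
    {"id": "HP:0001250", "name": "Seizures", "confidence": 0.78,
     "evidence_count": 1, "assertion": "affirmed"}
  ]
}
"""
results = json.loads(raw)


def terms_with_status(results, status="affirmed", min_confidence=0.0):
    """Return IDs of aggregated terms with the given assertion status."""
    return [
        term["id"]
        for term in results["aggregated_results"]
        if term["assertion"] == status and term["confidence"] >= min_confidence
    ]


print(terms_with_status(results))
print(terms_with_status(results, min_confidence=0.8))
```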