Custom Annotation with Local Files
Variant-Linker supports annotating variants with custom genomic features from your own files. This powerful feature allows you to overlay variants with region-based annotations (BED files), gene lists, or structured gene data (JSON files) to identify clinically relevant overlaps.
Overview
The custom annotation feature enables you to:
- Annotate with genomic regions from BED files (promoters, enhancers, regulatory elements)
- Filter by gene lists (cancer genes, disease panels, custom gene sets)
- Add structured gene metadata from JSON files (panel information, pathogenicity scores, classifications)
- Combine multiple file types for comprehensive annotation
- Include results in all output formats (JSON, CSV, TSV, VCF)
All custom annotations appear in the user_feature_overlap
field (JSON) or UserFeatureOverlap
column (CSV/TSV).
Quick Start
# Annotate with a BED file containing regulatory regions
variant-linker --variant "rs6025" --bed-file regulatory_regions.bed --output CSV
# Filter variants by a cancer gene list
variant-linker --vcf-input variants.vcf --gene-list cancer_genes.txt --output JSON
# Add structured gene panel information
variant-linker --variants-file batch.txt \
--json-genes gene_panels.json \
--json-gene-mapping '{"identifier":"gene_symbol","dataFields":["panel","classification"]}' \
--output TSV
File Formats
BED Files (--bed-file
/ -bf
)
BED (Browser Extensible Data) files define genomic regions. Variant-Linker supports standard BED formats:
3-Column BED (Minimal)
chr1 1000 2000
chr1 5000 6000
chrX 10000 11000
4-Column BED (With Names)
chr1 1000 2000 promoter_region_1
chr1 5000 6000 enhancer_region_1
chrX 10000 11000 regulatory_element_1
6-Column BED (Full Format)
chr1 1000 2000 promoter_BRCA1 800 +
chr1 5000 6000 enhancer_TP53 600 -
chrX 10000 11000 regulatory_AR 900 +
Columns:
- Chromosome (required):
chr1
,1
,chrX
,X
(chr prefix optional) - Start (required): 0-based start position
- End (required): 1-based end position
- Name (optional): Region identifier/description
- Score (optional): Numeric score (0-1000)
- Strand (optional):
+
,-
, or.
Features:
- Header lines (
#
,track
,browser
) are automatically skipped - Empty lines and comments are ignored
- Invalid coordinates are skipped with warnings
- Chromosome names are normalized (chr prefix removed)
Gene List Files (--gene-list
/ -gl
)
Simple text files with one gene identifier per line:
BRCA1
BRCA2
TP53
ATM
CHEK2
PALB2
# This is a comment line
MLH1
MSH2
Supported Identifiers:
- Gene symbols:
BRCA1
,TP53
,MYC
- Ensembl gene IDs:
ENSG00000012048
,ENSG00000141510
Features:
- One gene per line
- Comment lines starting with
#
are ignored - Empty lines are skipped
- Case-sensitive matching
- Multiple files can be specified
JSON Gene Files (--json-genes
/ -jg
)
Structured JSON files containing gene information with flexible field mapping:
Array Format
[
{
"gene_symbol": "BRCA1",
"panel": "Hereditary Cancer",
"classification": "High Penetrance",
"inheritance": "Autosomal Dominant",
"diseases": ["Breast Cancer", "Ovarian Cancer"]
},
{
"gene_symbol": "BRCA2",
"panel": "Hereditary Cancer",
"classification": "High Penetrance",
"inheritance": "Autosomal Dominant",
"diseases": ["Breast Cancer", "Ovarian Cancer", "Pancreatic Cancer"]
}
]
Object Format
{
"BRCA1": {
"symbol": "BRCA1",
"panel_name": "Breast_Cancer_Panel",
"pathogenicity_score": 0.95,
"clinical_significance": "Pathogenic"
},
"TP53": {
"symbol": "TP53",
"panel_name": "Tumor_Suppressor_Panel",
"pathogenicity_score": 0.98,
"clinical_significance": "Pathogenic"
}
}
Required Parameter: --json-gene-mapping
The mapping parameter defines how to extract gene identifiers and additional data:
# Basic mapping (identifier only)
--json-gene-mapping '{"identifier":"gene_symbol"}'
# Full mapping (identifier + metadata fields)
--json-gene-mapping '{"identifier":"symbol","dataFields":["panel_name","pathogenicity_score","clinical_significance"]}'
Mapping Fields:
identifier
(required): Field containing the gene identifierdataFields
(optional): Array of additional fields to include in output
CLI Options
Core Options
Option | Alias | Type | Description |
---|---|---|---|
--bed-file | -bf | Array | Path to BED file(s) with genomic regions |
--gene-list | -gl | Array | Path to gene list file(s) |
--json-genes | -jg | Array | Path to JSON gene file(s) |
--json-gene-mapping | String | JSON mapping for JSON gene files |
Usage Notes
- Multiple Files: Each option accepts multiple files:
--bed-file file1.bed file2.bed
- File Combinations: Mix and match different file types in a single command
- Required Mapping:
--json-gene-mapping
is required when using--json-genes
- Path Resolution: Supports absolute and relative file paths
Output Formats
JSON Output
Custom annotations appear in the user_feature_overlap
array:
{
"annotationData": [
{
"seq_region_name": "17",
"start": 43094692,
"end": 43094692,
"user_feature_overlap": [
{
"type": "region",
"name": "BRCA1_promoter",
"source": "regulatory_regions.bed",
"chrom": "17",
"region_start": 43090000,
"region_end": 43100000,
"score": 850,
"strand": "+"
},
{
"type": "gene",
"identifier": "BRCA1",
"source": "cancer_genes.txt",
"gene_source_type": "gene_list"
}
]
}
]
}
CSV/TSV Output
Custom annotations appear in the UserFeatureOverlap
column:
OriginalInput,Location,GeneSymbol,UserFeatureOverlap
rs80357906,17:43094692-43094692(1),BRCA1,"region:BRCA1_promoter(regulatory_regions.bed);gene:BRCA1(cancer_genes.txt)"
Format Specification:
- Regions:
region:name(filename)
- Genes:
gene:identifier(filename)
- Multiple: Separated by semicolons (
;
) - Missing Names:
unknown
placeholder used
VCF Output
Custom annotations are included in the VL_CSQ
INFO field following the same format as CSV output.
Advanced Usage
Multiple File Types
Combine different annotation sources for comprehensive analysis:
variant-linker --vcf-input sample.vcf \
--bed-file enhancers.bed \
--bed-file promoters.bed \
--gene-list oncogenes.txt \
--gene-list tumor_suppressors.txt \
--json-genes clinical_panels.json \
--json-gene-mapping '{"identifier":"gene","dataFields":["panel","evidence_level"]}' \
--output JSON
Complex JSON Mapping
Extract multiple metadata fields from structured gene data:
# Full clinical annotation
variant-linker --variant "BRCA1:c.68_69delAG" \
--json-genes comprehensive_gene_data.json \
--json-gene-mapping '{
"identifier": "hgnc_symbol",
"dataFields": [
"disease_panel",
"inheritance_pattern",
"clinical_actionability",
"evidence_level",
"last_reviewed"
]
}' \
--output TSV
Batch Processing with Features
Process large datasets with custom annotations:
# Large-scale variant screening
variant-linker --variants-file population_variants.txt \
--bed-file pathogenic_regions.bed \
--gene-list disease_genes.txt \
--scoring_config_path scoring/clinical_score/ \
--calculate-inheritance \
--ped family.ped \
--output CSV \
--save annotated_results.csv
VCF Workflow with Features
Annotate VCF files and preserve formatting:
# Clinical VCF annotation pipeline
variant-linker --vcf-input patient_variants.vcf \
--bed-file clinvar_regions.bed \
--json-genes acmg_genes.json \
--json-gene-mapping '{"identifier":"gene_symbol","dataFields":["acmg_classification","curation_date"]}' \
--output VCF \
--save annotated_patient_variants.vcf
Error Handling
Common Issues and Solutions
File Not Found
Error: Error parsing BED file /path/to/missing.bed: ENOENT: no such file or directory
- Verify file path is correct
- Check file permissions
- Use absolute paths if needed
Invalid BED Format
Warning: Skipping invalid BED line 5: insufficient columns (2)
- Ensure minimum 3 columns (chr, start, end)
- Verify tab-separated format
- Check for header lines
JSON Mapping Error
Error: --json-gene-mapping is required when using --json-genes
- Always provide mapping parameter with JSON files
- Verify JSON syntax in mapping string
Invalid JSON Mapping
Error: Invalid JSON gene mapping: Unexpected token
- Validate JSON syntax:
echo '{"identifier":"gene"}' | jq
- Escape quotes properly in shell
Best Practices
- File Validation: Test files with small datasets first
- Path Management: Use absolute paths for production pipelines
- Memory Considerations: Large BED files are loaded into memory
- Error Logging: Use debug flags (
-d
,-dd
,-ddd
) for troubleshooting - Performance: Combine multiple small files rather than processing separately
Performance Considerations
Memory Usage
- BED Files: Loaded entirely into memory using interval trees
- Gene Lists: Stored in hash maps for O(1) lookup
- JSON Files: Parsed and indexed by gene identifier
Optimization Tips
- Consolidate Files: Merge multiple small BED files
- Filter Early: Use smaller, focused gene lists
- Batch Processing: Process variants in groups
- Assembly Consistency: Ensure coordinate system matches variant data
Scale Guidelines
- BED Regions: Efficiently handles 100K+ regions
- Gene Lists: Optimized for 10K+ genes
- JSON Metadata: Suitable for complex clinical databases
- Concurrent Files: Multiple files processed in parallel
Integration Examples
Research Pipeline
#!/bin/bash
# Research variant annotation pipeline
VARIANTS="research_cohort.vcf"
ENHANCERS="encode_enhancers.bed"
DISEASE_GENES="gwas_catalog_genes.txt"
OUTPUT_DIR="results"
variant-linker --vcf-input $VARIANTS \
--bed-file $ENHANCERS \
--gene-list $DISEASE_GENES \
--calculate-inheritance \
--output CSV \
--save "${OUTPUT_DIR}/annotated_variants.csv"
Clinical Workflow
#!/bin/bash
# Clinical genetics annotation workflow
PATIENT_VCF="patient_exome.vcf"
ACMG_GENES="acmg_incidental_findings.json"
PATHOGENIC_REGIONS="clinvar_pathogenic_regions.bed"
FAMILY_PED="trio.ped"
variant-linker --vcf-input $PATIENT_VCF \
--ped $FAMILY_PED \
--bed-file $PATHOGENIC_REGIONS \
--json-genes $ACMG_GENES \
--json-gene-mapping '{"identifier":"gene_symbol","dataFields":["acmg_version","recommendation"]}' \
--calculate-inheritance \
--scoring_config_path scoring/clinical_score/ \
--output VCF \
--save "patient_annotated.vcf"
API Usage
The custom annotation feature is also available programmatically:
const { analyzeVariant } = require('variant-linker');
const { loadFeatures } = require('variant-linker/src/featureParser');
// Load features from files
const features = await loadFeatures({
bedFile: ['regulatory_regions.bed'],
geneList: ['cancer_genes.txt'],
jsonGenes: ['gene_panels.json'],
jsonGeneMapping: '{"identifier":"gene_symbol","dataFields":["panel","classification"]}'
});
// Analyze variants with custom features
const result = await analyzeVariant({
variants: ['rs6025', '1-12345-A-G'],
recoderOptions: { vcf_string: '1' },
vepOptions: { CADD: '1', hgvs: '1' },
output: 'JSON',
features: features
});
console.log(result.annotationData[0].user_feature_overlap);
Related Documentation
- CLI Usage Guide - Complete CLI reference
- VCF and PED Files - Working with genomic file formats
- Inheritance Analysis - Family-based variant analysis
- Scoring Engine - Custom variant scoring
- API Usage - Programmatic interface