Scoring Engine

Variant-Linker includes a powerful, configurable scoring system that allows users to create custom scoring formulas for variant prioritization. The scoring engine supports both annotation-level and transcript-level scoring with flexible variable assignment.

Overview

The scoring system consists of two main components:

Formula Configuration: Defines mathematical formulas for calculating scores
Variable Assignment: Maps annotation fields to variables used in formulas

This modular approach allows researchers to:

Create domain-specific scoring models
Integrate multiple annotation sources
Apply complex mathematical transformations
Generate reproducible scoring results

Configuration Structure

Directory Organization

Scoring configurations are organized in directories within the scoring/ folder:

scoring/
├── nephro_variant_score/
│   ├── formula_config.json
│   └── variable_assignment_config.json
└── meta_score_example/
    ├── formula_config.json
    └── variable_assignment_config.json

Formula Configuration File

The formula_config.json file defines mathematical formulas for score calculation:

{
  "@context": "https://schema.org/",
  "@type": "Configuration",
  "formulas": {
    "annotationLevel": [
      {
        "score_name": "mathematical_formula_here"
      },
      {
        "gene_symbol": "gene_symbol"
      }
    ],
    "transcriptLevel": [
      {
        "transcript_score": "formula_for_transcript_scoring"
      }
    ]
  }
}

Variable Assignment Configuration

The variable_assignment_config.json file maps annotation data to variables:

{
  "@context": "https://schema.org/",
  "@type": "Configuration",
  "variables": {
    "annotation.field.path": "variable_name|operation:modifier|default:value",
    "transcript_consequences.*.consequence_terms": "unique:consequence_terms|default:[]"
  }
}

Scoring Levels

Annotation-Level Scoring

Applied to the overall variant annotation, incorporating variant-wide properties:

Use Case: Overall variant pathogenicity scores
Data Sources: Population frequencies, conservation scores, overall impact
Output: Single score per variant

Transcript-Level Scoring

Applied to individual transcript consequences:

Use Case: Transcript-specific impact assessment
Data Sources: Protein predictions, transcript biotype, canonical status
Output: Score per transcript consequence

Variable Assignment Operations

Basic Operations

Operation	Description	Example
`max`	Maximum value from array	`max:cadd_scores`
`min`	Minimum value from array	`min:conservation_scores`
`unique`	Unique values from array	`unique:consequence_terms`
`sum`	Sum of numeric values	`sum:allele_counts`
`avg`	Average of numeric values	`avg:quality_scores`

Default Values

Specify default values when data is missing:

"transcript_consequences.*.cadd_phred": "max:cadd_phred_variant|default:7.226"

Path Expressions

Use JSONPath-like expressions to access nested data:

transcript_consequences.* - All transcript consequences
colocated_variants.0.frequencies.* - First colocated variant frequencies
regulatory_feature_consequences.*.impact - All regulatory impacts

Example: Nephrology Variant Score

Variable Assignment

{
  "variables": {
    "transcript_consequences.*.consequence_terms": "unique:consequence_terms_variant|default:0",
    "transcript_consequences.*.cadd_phred": "max:cadd_phred_variant|default:7.226",
    "transcript_consequences.*.impact": "unique:impact_variant|default:LOW",
    "colocated_variants.0.frequencies.*.gnomade": "gnomade_variant|default:1.626e-05",
    "colocated_variants.0.frequencies.*.gnomadg": "gnomadg_variant|default:5.256e-05"
  }
}

Formula Configuration

{
  "formulas": {
    "annotationLevel": [
      {
        "nephro_variant_score": "1 / (1 + Math.exp(-((-36.30796) + ((gnomade_variant - 0.00658) / 0.05959) * (-309.33539) + ((gnomadg_variant - 0.02425) / 0.11003) * (-2.54581) + ... )))"
      }
    ]
  }
}

This example implements a logistic regression model incorporating:

Population frequencies (gnomAD exomes and genomes)
Consequence type indicators
CADD pathogenicity scores
Impact severity levels

Usage Examples

Apply Custom Scoring

# Use nephrology-specific scoring
variant-linker \
  --variants-file variants.txt \
  --scoring_config_path scoring/nephro_variant_score/ \
  --output CSV \
  --save scored_variants.csv

Compare Multiple Scoring Models

# Apply different scoring models
variant-linker --variant "rs6025" --scoring_config_path scoring/nephro_variant_score/ --output JSON
variant-linker --variant "rs6025" --scoring_config_path scoring/meta_score_example/ --output JSON

Batch Scoring

# Score large batch of variants
variant-linker \
  --vcf-input large_cohort.vcf \
  --scoring_config_path scoring/nephro_variant_score/ \
  --output VCF \
  --save scored_cohort.vcf

Creating Custom Scoring Models

Step 1: Create Directory Structure

mkdir scoring/my_custom_score

Step 2: Define Variable Assignments

Create scoring/my_custom_score/variable_assignment_config.json:

{
  "@context": "https://schema.org/",
  "@type": "Configuration",
  "variables": {
    "transcript_consequences.*.sift_prediction": "unique:sift_prediction|default:unknown",
    "transcript_consequences.*.polyphen_prediction": "unique:polyphen_prediction|default:unknown",
    "transcript_consequences.*.cadd_phred": "max:cadd_phred|default:0",
    "colocated_variants.0.frequencies.*.gnomade": "gnomade_freq|default:0.001"
  }
}

Step 3: Create Scoring Formula

Create scoring/my_custom_score/formula_config.json:

{
  "@context": "https://schema.org/",
  "@type": "Configuration",
  "formulas": {
    "annotationLevel": [
      {
        "pathogenicity_score": "cadd_phred * 0.1 + (gnomade_freq < 0.001 ? 10 : 0) + (sift_prediction.includes('deleterious') ? 5 : 0) + (polyphen_prediction.includes('damaging') ? 5 : 0)"
      },
      {
        "gene_symbol": "gene_symbol"
      }
    ]
  }
}

Step 4: Test Your Model

variant-linker \
  --variant "rs6025" \
  --scoring_config_path scoring/my_custom_score/ \
  --output JSON

Advanced Formula Features

Conditional Logic

Use JavaScript conditional operators in formulas:

// Ternary operator
"score": "cadd_phred > 20 ? 1 : 0"

// Multiple conditions
"risk_score": "gnomad_freq < 0.001 ? (cadd_phred > 15 ? 'high' : 'medium') : 'low'"

Array Operations

Work with arrays of values:

// Check if array includes value
"has_missense": "consequence_terms.includes('missense_variant') ? 1 : 0"

// Map and reduce operations
"max_impact": "Math.max(...impact_scores.map(x => x || 0))"

// Array filtering
"severe_consequences": "consequence_terms.filter(x => ['stop_gained', 'frameshift_variant'].includes(x)).length"

Mathematical Functions

Use standard JavaScript Math functions:

// Logarithmic scaling
"log_score": "Math.log10(cadd_phred + 1)"

// Exponential functions
"sigmoid_score": "1 / (1 + Math.exp(-cadd_phred))"

// Power functions
"power_score": "Math.pow(cadd_phred / 10, 2)"

Output Integration

CSV/TSV Output

Scores are included as additional columns in tabular output:

OriginalInput,VariantID,GeneSymbol,nephro_variant_score,pathogenicity_score
rs6025,rs6025,F5,0.89,15.2

JSON Output

Scores are added to the annotation object:

{
  "scores": {
    "nephro_variant_score": 0.89,
    "pathogenicity_score": 15.2
  }
}

VCF Output

Scores are included in the VL_CSQ INFO field:

VL_CSQ=T|missense_variant|MODERATE|F5|...|nephro_variant_score=0.89

Best Practices

Model Development

Start Simple: Begin with basic formulas and add complexity gradually
Validate Logic: Test formulas with known variants
Document Assumptions: Comment complex formulas thoroughly
Version Control: Track scoring model versions

Variable Selection

Relevant Features: Choose variables relevant to your research question
Data Quality: Ensure reliable annotation sources
Missing Data: Handle missing values appropriately with defaults
Scale Consistency: Normalize variables to similar scales

Formula Design

Interpretability: Keep formulas interpretable when possible
Computational Efficiency: Avoid overly complex calculations
Boundary Conditions: Test edge cases and missing data scenarios
Score Interpretation: Define clear score interpretation guidelines

Troubleshooting

Common Issues

Formula Syntax Errors

Check JavaScript syntax in formulas
Verify variable names match assignments
Test formulas with simple examples

Missing Variables

Ensure variable paths exist in annotation data
Use appropriate default values
Check JSONPath expressions

Score Calculation Errors

Validate mathematical operations
Handle division by zero cases
Check for null/undefined values

Debug Mode

Use debug mode to troubleshoot scoring issues:

variant-linker \
  --variant "rs6025" \
  --scoring_config_path scoring/my_custom_score/ \
  --debug 3 \
  --output JSON

Debug output includes:

Variable assignment details
Formula evaluation steps
Error messages and stack traces

CNV Scoring

Variant-Linker includes specialized support for scoring Copy Number Variants (CNVs) with access to structural variant-specific annotations.

CNV-Specific Variables

CNV scoring configurations can access additional variables not available for point mutations:

Variable	Description	Source
`bp_overlap`	Base pairs overlapping with gene features	VEP transcript consequences
`percentage_overlap`	Percentage of feature overlap	VEP transcript consequences
`phenotypes`	Associated disease phenotypes	VEP top-level annotation
`phaplo_score`	Haploinsufficiency score	VEP dosage sensitivity
`ptriplo_score`	Triplosensitivity score	VEP dosage sensitivity

Example CNV Scoring Configuration

The included cnv_score_example demonstrates pathogenicity scoring for structural variants:

Variable Assignment (scoring/cnv_score_example/variable_assignment_config.json):

{
  "aggregates": {
    "consequence_terms": "transcript_consequences.*.consequence_terms:unique",
    "bp_overlap": "transcript_consequences.*.bp_overlap:max",
    "percentage_overlap": "transcript_consequences.*.percentage_overlap:max"
  },
  "transcriptFields": {
    "dosage_gene": "dosage_sensitivity.gene_name",
    "phaplo_score": "dosage_sensitivity.phaplo",
    "ptriplo_score": "dosage_sensitivity.ptriplo",
    "phenotypes": "phenotypes"
  }
}

Scoring Formulas (scoring/cnv_score_example/formula_config.json):

{
  "formulas": {
    "annotationLevel": [{
      "cnv_pathogenicity_score": "const baseScore = consequence_terms.includes('feature_truncation') ? 20 : (consequence_terms.includes('feature_elongation') ? 15 : 0); const overlapScore = Math.min(bp_overlap / 10000, 10); const dosageScore = (phaplo_score || 0) * 5 + (ptriplo_score || 0) * 3; const phenotypeScore = phenotypes && phenotypes.length > 0 ? 5 : 0; return baseScore + overlapScore + dosageScore + phenotypeScore;"
    }],
    "transcriptLevel": [{
      "transcript_cnv_impact": "const hasFeatureTruncation = consequence_terms.includes('feature_truncation'); const hasHighOverlap = (bp_overlap || 0) > 50000; const isHighConfidenceGene = (phaplo_score || 0) >= 3 || (ptriplo_score || 0) >= 3; return hasFeatureTruncation && hasHighOverlap && isHighConfidenceGene ? 'HIGH_IMPACT' : (hasFeatureTruncation || (hasHighOverlap && isHighConfidenceGene) ? 'MODERATE_IMPACT' : 'LOW_IMPACT');"
    }]
  }
}

CNV Scoring Usage

# Score a deletion with CNV-specific model
variant-linker \
  --variant "7:117559600-117559609:DEL" \
  --scoring_config_path scoring/cnv_score_example/ \
  --vep_params "Phenotypes=1,numbers=1" \
  --output JSON

# Batch CNV scoring
variant-linker \
  --variants-file cnv_variants.txt \
  --scoring_config_path scoring/cnv_score_example/ \
  --output CSV

CNV Scoring Strategy

The example CNV scoring model considers:

Functional Impact: Feature truncation/elongation consequences
Overlap Significance: Base pairs and percentage overlap with gene features
Dosage Sensitivity: Haploinsufficiency and triplosensitivity scores
Clinical Relevance: Associated disease phenotypes

This multi-factor approach helps prioritize CNVs likely to have clinical significance.

Integration with Other Features

Inheritance Analysis

Combine scoring with inheritance pattern analysis:

variant-linker \
  --vcf-input family.vcf \
  --ped family.ped \
  --calculate-inheritance \
  --scoring_config_path scoring/nephro_variant_score/ \
  --output CSV

Filtering

Use scores in filtering criteria:

# Filter by score threshold (requires post-processing)
variant-linker \
  --variants-file variants.txt \
  --scoring_config_path scoring/nephro_variant_score/ \
  --output JSON | jq '.[] | select(.scores.nephro_variant_score > 0.8)'

Performance Considerations

Complex Formulas: Very complex formulas may impact processing speed
Array Operations: Large arrays in formulas can be memory-intensive
Batch Processing: Scoring adds computational overhead to batch jobs
Caching: Consider caching for repeated scoring operations

Next Steps

Explore the CLI usage guide for more command examples
Learn about benchmarking to measure scoring performance
Check out inheritance analysis for family-based prioritization

Overview​

Configuration Structure​

Directory Organization​

Formula Configuration File​

Variable Assignment Configuration​

Scoring Levels​

Annotation-Level Scoring​

Transcript-Level Scoring​

Variable Assignment Operations​

Basic Operations​

Default Values​

Path Expressions​

Example: Nephrology Variant Score​

Variable Assignment​

Formula Configuration​

Usage Examples​

Apply Custom Scoring​

Compare Multiple Scoring Models​

Batch Scoring​

Creating Custom Scoring Models​

Step 1: Create Directory Structure​

Step 2: Define Variable Assignments​

Step 3: Create Scoring Formula​

Step 4: Test Your Model​

Advanced Formula Features​

Conditional Logic​

Array Operations​

Mathematical Functions​

Output Integration​

CSV/TSV Output​

JSON Output​

VCF Output​

Best Practices​

Model Development​

Variable Selection​

Formula Design​

Troubleshooting​

Common Issues​

Debug Mode​

CNV Scoring​

CNV-Specific Variables​

Example CNV Scoring Configuration​

CNV Scoring Usage​

CNV Scoring Strategy​

Integration with Other Features​

Inheritance Analysis​

Filtering​

Performance Considerations​

Next Steps​