Variant Effect Prediction with OmniGenBench
This notebook demonstrates how to use various genomic foundation models for variant effect prediction. It reads genomic variants from a BED file, extracts their sequence context from a reference genome, and predicts their functional effects with deep learning models.
Dataset Description: The dataset for this task consists of a BED file (variant_effects_expression.bed) containing human genetic variants associated with diseases, and the human reference genome (hg38). The BED file specifies the chromosome, start/end positions, reference allele, and alternative allele for each variant. The goal is to predict the functional impact of these variants by comparing the model's embeddings of the reference sequence versus the altered sequence. The data is sourced from the yangheng/variant_effect_prediction dataset on Hugging Face.
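For orientation, entries in such a BED file look roughly like the following (illustrative, tab-separated rows, not actual records from the dataset; columns: chromosome, start, end, reference allele, alternative allele):

chr1	1014142	1014143	G	A
chr7	5530169	5530170	C	T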
Estimated Runtime: This notebook performs inference only (no training). The runtime depends on the model size and the number of variants processed. On a single NVIDIA RTX 4090 GPU, running the analysis on the full dataset (approximately 18k variant samples) takes about 15-25 minutes. If you uncomment the line to run on a small sample (100 variants), the process should complete in under 2 minutes.
1. Setup & Installation
This cell installs the packages required to run the notebook. If you have already installed them, you can skip this step; otherwise, uncomment and run the cell.
# !pip install torch transformers pandas autocuda multimolecule biopython scipy scikit-learn tqdm dill findfile requests
2. Import Libraries
Import all the necessary libraries for genomic data processing, model inference, and analysis.
import warnings

import autocuda
import findfile
import importlib.util  # importlib.util must be imported explicitly

# Load the shared helper module (utils.py) that ships alongside this notebook
utils_spec = importlib.util.spec_from_file_location("utils", "utils.py")
utils = importlib.util.module_from_spec(utils_spec)
utils_spec.loader.exec_module(utils)
warnings.filterwarnings('ignore')
print("Libraries imported successfully!")
3. Configuration & Data Download
Set up the analysis parameters, file paths, and model selection here. You can easily change the model_name to test different genomic foundation models.
# Using utils for reusable logic
from utils import download_ncbi_reference_genome, download_vep_dataset
print("Core classes and functions imported from utils.")
local_dir = "vep_prediction_dataset"
download_vep_dataset(local_dir)
download_ncbi_reference_genome()
# --- Main Configuration ---
BED_FILE = findfile.find_cwd_file("variant_effects_expression.bed")
FASTA_FILE = findfile.find_cwd_file("hg38.fa")
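As an optional sanity check, verify that both files were actually located before proceeding (findfile returns a falsy value when it cannot find a file):

# Optional sanity check: fail fast if either input file is missing
for label, path in [("BED", BED_FILE), ("FASTA", FASTA_FILE)]:
    if not path:
        raise FileNotFoundError(f"{label} file not found; re-run the download cell above.")
    print(f"{label} file: {path}")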
4. Model Selection
Choose a model to evaluate. All core processing (data loading, embeddings, scoring) is handled in utils.py.
# --- Available Models for Testing ---
AVAILABLE_MODELS = [
'yangheng/OmniGenome-52M',
'yangheng/OmniGenome-186M',
'yangheng/OmniGenome-v1.5',
]
MODEL_NAME = AVAILABLE_MODELS[0] # Model to use for predictions
print(f"Selected model: {MODEL_NAME}")
5. Main Analysis Pipeline
Run the VEP pipeline using utils.run_vep_analysis for a concise workflow.
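Conceptually, the pipeline embeds the reference sequence and the variant (alternative) sequence and measures how far apart the embeddings are. The sketch below illustrates this idea with cosine distance; the actual implementation lives in utils.run_vep_analysis, so the function below, its signature, and the tokenization assumptions (one token per base, a leading [CLS] token) are illustrative only.

import torch
from scipy.spatial.distance import cosine

def score_variant(model, tokenizer, ref_seq, alt_seq, var_idx, device="cpu"):
    """Illustrative scoring: cosine distances between ref/alt embeddings."""
    with torch.no_grad():
        hiddens = []
        for seq in (ref_seq, alt_seq):
            inputs = tokenizer(seq, return_tensors="pt").to(device)
            hiddens.append(model(**inputs).last_hidden_state.squeeze(0).cpu().numpy())
    ref_h, alt_h = hiddens
    return {
        "cls_dist": cosine(ref_h[0], alt_h[0]),                      # first ([CLS]) token embedding
        "mut_dist": cosine(ref_h[var_idx + 1], alt_h[var_idx + 1]),  # token at the variant (+1 skips [CLS])
    }

# Hypothetical usage (names assumed for illustration):
# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True).to(compute_device).eval()
# score_variant(model, tokenizer, ref_seq, alt_seq, var_idx=200, device=compute_device)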
# Import main pipeline from utils for a concise demo
from utils import run_vep_analysis
print("Main analysis pipeline imported from utils.")
# Setup device
compute_device = autocuda.auto_cuda()
print(f"Starting analysis on device: {compute_device}")
print("=" * 50)
# Run the analysis
results_df = run_vep_analysis(
model_name=MODEL_NAME,
bed_file=BED_FILE,
fasta_file=FASTA_FILE,
context_size=200, # Context size (in base pairs) to include on each side of the variant
batch_size=16, # Batch size for model inference
device=compute_device
)
print("=" * 50)
print("Analysis completed!")
# Save results to CSV
output_filename = f"{MODEL_NAME.split('/')[-1]}_vep_predictions.csv"
results_df.to_csv(output_filename, index=False)
print(f"Results saved to: {output_filename}")
6. Execute & Save Results
The cell above executes the pipeline with the selected configuration and saves the results to a dynamically named CSV file. The filename is derived from the selected model so that runs with different models do not overwrite each other's results.
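To confirm what was written to disk, a quick round-trip read with pandas (already a dependency of this notebook) is enough:

import pandas as pd

# Reload the saved predictions to verify the CSV was written correctly
reloaded = pd.read_csv(output_filename)
print(f"Reloaded {len(reloaded)} rows from {output_filename}")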
7. Results Overview
Preview the results and display basic statistics for the computed distances. This provides insights into the model's sensitivity to the variants.
print("Results Summary:")
print(results_df[['chromosome', 'start', 'end', 'ref', 'alt', 'cls_dist', 'mut_dist']].describe())
print("\nFirst 5 results:")
display(results_df[['chromosome', 'start', 'end', 'ref', 'alt', 'cls_dist', 'mut_dist']].head())
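Beyond summary statistics, ranking variants by embedding distance highlights the ones the model reacts to most. For instance, treating cls_dist as a rough proxy for predicted impact:

# Ten variants with the largest CLS-embedding distance,
# i.e., those the model is most sensitive to
top_hits = results_df.sort_values('cls_dist', ascending=False).head(10)
display(top_hits[['chromosome', 'start', 'end', 'ref', 'alt', 'cls_dist', 'mut_dist']])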
8. Visualization
Visualize the distributions of cls_dist and mut_dist to understand the model's behavior and sensitivity to genomic variants. These plots help identify patterns in the embeddings generated by the model.
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8-whitegrid')
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.histplot(results_df['cls_dist'].dropna(), bins=50, kde=True, ax=axes[0], color='skyblue')
axes[0].set_title('Distribution of CLS Distances')
axes[0].set_xlabel('Cosine Distance (CLS Embedding)')
axes[0].set_ylabel('Frequency')
sns.histplot(results_df['mut_dist'].dropna(), bins=50, kde=True, ax=axes[1], color='salmon')
axes[1].set_title('Distribution of Mutation Position Distances')
axes[1].set_xlabel('Cosine Distance (Mutation Embedding)')
axes[1].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
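To check whether the two distance measures agree, you can also compute their rank correlation (scipy is already installed); a high Spearman coefficient would indicate that the CLS-level and position-level embeddings respond similarly to the variants:

from scipy.stats import spearmanr

# Rank correlation between the two distances, ignoring rows with missing values
paired = results_df[['cls_dist', 'mut_dist']].dropna()
rho, pval = spearmanr(paired['cls_dist'], paired['mut_dist'])
print(f"Spearman correlation between cls_dist and mut_dist: {rho:.3f} (p = {pval:.2e})")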