Python API Usage
With the Python API, you can run both inference and fine-tuning tasks directly in Python.
Model Inference
TF Binding Prediction
Biological Context: Predict binding sites for 919 transcription factors in plant promoter regions—critical for understanding gene regulation and designing synthetic promoters.
Task: Multi-label classification
Model: yangheng/ogb_tfb_finetuned
Single-Sequence Inference
from omnigenbench import ModelHub
# Load the fine-tuned model
model = ModelHub.load("yangheng/ogb_tfb_finetuned")
# Single sequence inference
sequence = "ATCGATCGATCGATCGATCGATCGATCGATCG"
outputs = model.inference(sequence)
print(outputs)
# Output: {'predictions': array([1, 0, 1, 0, ...]), 'probabilities': array([0.92, 0.15, 0.88, ...])}
Output Interpretation
# Access predictions and probabilities
predictions = outputs['predictions'] # Binary predictions (0 or 1) for 919 TFs
probabilities = outputs['probabilities'] # Confidence scores [0-1] for each TF
# Find high-confidence binding sites
high_confidence_sites = [i for i, (pred, prob) in enumerate(zip(predictions, probabilities))
                         if pred == 1 and prob > 0.8]
print(f"High-confidence TF binding sites: {len(high_confidence_sites)}")
print(f"Top 5 TF indices: {high_confidence_sites[:5]}")
# Output: High-confidence TF binding sites: 34
# Top 5 TF indices: [12, 45, 127, 203, 456]
Biological Interpretation: The promoter region shows enriched TF binding (87 predicted binding sites, 34 of them high-confidence), suggesting active regulatory potential. Higher probability scores indicate stronger predicted binding affinity.
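The indices above are positions in the model's output label list. Continuing the example, if you have the ordered list of TF names used during fine-tuning (a hypothetical tf_names.txt below, one name per line in label order), you can map indices back to names:
# Map predicted label indices to TF names.
# Assumes a hypothetical "tf_names.txt" with one TF name per line,
# ordered to match the model's 919 output labels.
with open("tf_names.txt") as f:
    tf_names = [line.strip() for line in f]
for idx in high_confidence_sites[:5]:
    print(f"TF #{idx}: {tf_names[idx]} (p={probabilities[idx]:.2f})")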
Please refer to the tutorial for more information about TF binding prediction.
Translation Efficiency Prediction
Biological Context: Predict whether mRNA 5' UTR sequences lead to high or low translation efficiency—essential for optimizing protein expression in biotechnology.
Task: Binary classification
Model: yangheng/ogb_te_finetuned
Predict Translation Efficiency for Multiple Sequences
from omnigenbench import ModelHub
# Load model
model = ModelHub.load("yangheng/ogb_te_finetuned")
# Predict for multiple sequences
sequences = {
    "optimized": "ATCGATCGATCGATCGATCGATCGATCGATCG",
    "suboptimal": "GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC",
    "wild_type": "TATATATATATATATATATATATATATATATAT",
}
for name, seq in sequences.items():
    outputs = model.inference(seq)
    prediction = outputs["predictions"]  # 0 = Low TE, 1 = High TE
    probabilities = outputs["probabilities"]  # [P(Low), P(High)]
    status = "High TE ⚡" if prediction == 1 else "Low TE 🐌"
    confidence = probabilities[prediction]
    print(f"\n🧬 {name}:")
    print(f"  {status} | Confidence: {confidence:.3f}")
    print(f"  P(Low): {probabilities[0]:.3f} | P(High): {probabilities[1]:.3f}")
Expected Output:
🧬 optimized:
  High TE ⚡ | Confidence: 0.923
  P(Low): 0.077 | P(High): 0.923

🧬 suboptimal:
  Low TE 🐌 | Confidence: 0.847
  P(Low): 0.847 | P(High): 0.153

🧬 wild_type:
  Low TE 🐌 | Confidence: 0.612
  P(Low): 0.612 | P(High): 0.388
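In practice you often want to rank a pool of candidate 5' UTRs rather than inspect them one by one. A minimal sketch reusing the model loaded above (the candidate names and sequences are placeholders):
# Rank candidate 5' UTRs by predicted probability of high translation efficiency
candidates = {
    "utr_a": "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGC",
    "utr_b": "GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC",
}
scores = {name: model.inference(seq)["probabilities"][1]  # P(High TE)
          for name, seq in candidates.items()}
# Print candidates from most to least promising
for name, p_high in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: P(High TE) = {p_high:.3f}")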
Please refer to the tutorial for more information about translation efficiency prediction.
AutoTrain: Programmatic Fine-Tuning
Here’s how to fine-tune a custom model for TF binding prediction using the AutoTrain API.
from omnigenbench import AutoTrain
# Initialize training
trainer = AutoTrain(
    dataset="yangheng/tfb_promoters",
    model_name_or_path="zhihan1996/DNABERT-2-117M",
    output_dir="./my_finetuned_model",
)
# Run training
metrics = trainer.run()
# Results
print("Training completed!")
print(f"Best F1 Score: {metrics['eval_f1']:.4f}")
print(f"Best MCC: {metrics['eval_mcc']:.4f}")
print("Model saved to: ./my_finetuned_model")
Please refer to the tutorial for more details on training models for TF binding prediction.