Python API Usage
With the Python API, you can run both inference and fine-tuning tasks directly in Python.
Model Inference
TF Binding Prediction
Biological Context: Predict binding sites for 919 transcription factors in plant promoter regions—critical for understanding gene regulation and designing synthetic promoters.
Task: Multi-label classification
Model: yangheng/ogb_tfb_finetuned
Single-Sequence Inference
from omnigenbench import ModelHub
# Load the fine-tuned model
model = ModelHub.load("yangheng/ogb_tfb_finetuned")
# Single sequence inference
sequence = "ATCGATCGATCGATCGATCGATCGATCGATCG"
outputs = model.inference(sequence)
print(outputs)
# Output: {'predictions': array([1, 0, 1, 0, ...]), 'probabilities': array([0.92, 0.15, 0.88, ...])}
Output Interpretation
# Access predictions and probabilities
predictions = outputs['predictions'] # Binary predictions (0 or 1) for 919 TFs
probabilities = outputs['probabilities'] # Confidence scores [0-1] for each TF
# Find high-confidence binding sites
high_confidence_sites = [i for i, (pred, prob) in enumerate(zip(predictions, probabilities))
                         if pred == 1 and prob > 0.8]
print(f"High-confidence TF binding sites: {len(high_confidence_sites)}")
print(f"Top 5 TF indices: {high_confidence_sites[:5]}")
# Output: High-confidence TF binding sites: 34
# Top 5 TF indices: [12, 45, 127, 203, 456]
Biological Interpretation: The promoter region shows enriched TF binding (87 predicted binding sites, 34 of them high-confidence), suggesting active regulatory potential. Higher probability scores indicate stronger predicted binding affinity.
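The indices above are positions in the model's output label list. Continuing the example, if you have the ordered list of TF names used during fine-tuning (a hypothetical tf_names.txt below, one name per line in label order), you can map indices back to names:
# Map predicted label indices to TF names.
# Assumes a hypothetical "tf_names.txt" with one TF name per line,
# ordered to match the model's 919 output labels.
with open("tf_names.txt") as f:
    tf_names = [line.strip() for line in f]
for idx in high_confidence_sites[:5]:
    print(f"TF #{idx}: {tf_names[idx]} (p={probabilities[idx]:.2f})")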
Please refer to the tutorial for more information about TF binding prediction.
Translation Efficiency Prediction
Biological Context: Predict whether mRNA 5' UTR sequences lead to high or low translation efficiency—essential for optimizing protein expression in biotechnology.
Task: Binary classification
Model: yangheng/ogb_te_finetuned
Predict Translation Efficiency for Multiple Sequences
from omnigenbench import ModelHub
# Load model
model = ModelHub.load("yangheng/ogb_te_finetuned")
# Predict for multiple sequences
sequences = {
    "optimized": "ATCGATCGATCGATCGATCGATCGATCGATCG",
    "suboptimal": "GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC",
    "wild_type": "TATATATATATATATATATATATATATATATAT",
}
for name, seq in sequences.items():
    outputs = model.inference(seq)
    prediction = outputs["predictions"]  # 0 = Low TE, 1 = High TE
    probabilities = outputs["probabilities"]  # [P(Low), P(High)]
    status = "High TE ⚡" if prediction == 1 else "Low TE 🐌"
    confidence = probabilities[prediction]
    print(f"\n🧬 {name}:")
    print(f"  {status} | Confidence: {confidence:.3f}")
    print(f"  P(Low): {probabilities[0]:.3f} | P(High): {probabilities[1]:.3f}")
Expected Output:
🧬 optimized:
  High TE ⚡ | Confidence: 0.923
  P(Low): 0.077 | P(High): 0.923

🧬 suboptimal:
  Low TE 🐌 | Confidence: 0.847
  P(Low): 0.847 | P(High): 0.153

🧬 wild_type:
  Low TE 🐌 | Confidence: 0.612
  P(Low): 0.612 | P(High): 0.388
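In practice you often want to rank a pool of candidate 5' UTRs rather than inspect them one by one. A minimal sketch reusing the model loaded above (the candidate names and sequences are placeholders):
# Rank candidate 5' UTRs by predicted probability of high translation efficiency
candidates = {
    "utr_a": "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGC",
    "utr_b": "GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC",
}
scores = {name: model.inference(seq)["probabilities"][1]  # P(High TE)
          for name, seq in candidates.items()}
# Print candidates from most to least promising
for name, p_high in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: P(High TE) = {p_high:.3f}")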
Please refer to the tutorial for more information about translation efficiency prediction.
AutoTrain: Programmatic Fine-Tuning
Here’s how to fine-tune a custom model for TF binding prediction using the AutoTrain API.
from omnigenbench import AutoTrain
# Initialize training
trainer = AutoTrain(
    dataset="yangheng/tfb_promoters",
    model_name_or_path="zhihan1996/DNABERT-2-117M",
    output_dir="./my_finetuned_model",
)
# Run training
metrics = trainer.run()
# Results
print("Training completed!")
print(f"Best F1 Score: {metrics['eval_f1']:.4f}")
print(f"Best MCC: {metrics['eval_mcc']:.4f}")
print("Model saved to: ./my_finetuned_model")
Please refer to the tutorial for more details on training models for TF binding prediction.