OmniGenBench Standardized Workflow
OmniGenBench uses a unified, four-step workflow that applies across all genomic prediction tasks, from sequence classification to token-level labeling. This consistent structure keeps code clean, reusable, and easy to understand.
The OmniGenBench workflow can be summarized as:
- Data Preparation
- Model Initialization
- Model Training
- Model Inference
Step 1: Data Preparation
What happens in this step?
- Environment Setup: Install necessary Python packages
- Data Acquisition: Download or load datasets
- Data Preprocessing: Sequence tokenization, label encoding
- Data Loading: Create PyTorch DataLoaders
Example Code
from omnigenbench import OmniTokenizer, OmniDatasetForSequenceClassification

tokenizer = OmniTokenizer.from_pretrained(
    "yangheng/OmniGenome-52M",
    trust_remote_code=True
)

dataset = OmniDatasetForSequenceClassification(
    data_file="train.json",  # Local JSON dataset
    tokenizer=tokenizer,
    max_length=512
)
Step 2: Model Initialization
What happens in this step?
- Tokenizer Loading: Use the same tokenizer as data preparation
- Base Model Loading: Load pre-trained OmniGenome models
- Task Adaptation: Add task-specific output layers
- Model Configuration: Set model parameters
Example Code
from omnigenbench import OmniModelForSequenceClassification

model = OmniModelForSequenceClassification(
    model="yangheng/OmniGenome-52M",
    num_labels=2,  # Adjust depending on classification task
    trust_remote_code=True
)
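Conceptually, "task adaptation" means placing a small classification head on top of the encoder's pooled representation, with one output logit per class. A dependency-free sketch of such a head (the hidden size, weights, and bias values are invented for illustration; the real model uses a learned torch layer):

```python
def linear_head(hidden, weights, bias):
    """Map a pooled hidden vector to one logit per class."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]

# Toy pooled representation (hidden size 4) and a 2-class head.
hidden = [0.5, -1.0, 0.25, 2.0]
weights = [[0.1, 0.2, 0.3, 0.4], [-0.4, 0.3, -0.2, 0.1]]
bias = [0.0, 0.1]

logits = linear_head(hidden, weights, bias)
print(len(logits))  # 2, matching num_labels=2 above
```

This is why num_labels must match the task: it sets the width of the output layer added on top of the pre-trained encoder.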
Step 3: Model Training
What happens in this step?
- Training Configuration: Set learning rate, batch size, and other hyperparameters
- Evaluation Metrics: Choose appropriate performance evaluation metrics
- Training Loop: Execute model fine-tuning
- Model Saving: Save best-performing models
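The "save best-performing models" step reduces to tracking the best validation score seen so far and checkpointing only when it improves. A library-independent sketch (the per-epoch scores are invented for illustration):

```python
# Invented per-epoch validation F1 scores for illustration.
epoch_scores = [0.71, 0.78, 0.75]

best_score = float("-inf")
best_epoch = None
for epoch, score in enumerate(epoch_scores, start=1):
    if score > best_score:
        best_score = score
        best_epoch = epoch
        # In a real run, this is the point where a checkpoint
        # would be written to disk.

print(best_epoch, best_score)  # 2 0.78
```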
Example Code
from omnigenbench import OmniTrainer, ClassificationMetric

trainer = OmniTrainer(
    model=model,
    train_dataset=dataset,
    eval_dataset=valid_dataset,  # a held-out validation split, built the same way as `dataset`
    compute_metrics=[ClassificationMetric().f1_score],
    output_dir="./my_finetuned_model"
)

trainer.train()
trainer.save_model("./my_finetuned_model")
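For binary labels, the F1 score passed to compute_metrics above is the harmonic mean of precision and recall. A self-contained sketch of the computation, with made-up predictions (independent of the library's ClassificationMetric implementation):

```python
def f1_binary(preds, labels, positive=1):
    """Binary F1: harmonic mean of precision and recall."""
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up predictions against made-up ground truth.
print(round(f1_binary([1, 0, 1, 1], [1, 1, 1, 0]), 4))  # 0.6667
```

F1 is a sensible default for binary classification with class imbalance; for multi-class tasks a macro- or weighted-averaged variant is the usual choice.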
Step 4: Model Inference
What happens in this step?
- Model Loading: Load trained models
- Input Preprocessing: Prepare new sequence data
- Prediction Generation: Obtain model outputs
- Result Interpretation: Parse and visualize prediction results
Example Code
from omnigenbench import OmniModelForSequenceClassification

# Load fine-tuned model
inference_model = OmniModelForSequenceClassification(
    model="./my_finetuned_model",
    num_labels=2,
    trust_remote_code=True
)

# Make predictions
prediction = inference_model.predict("ATCGATCGATCG")
print(f"Predicted class: {prediction}")
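Result interpretation for a classifier typically means converting raw logits into class probabilities with a softmax and taking the argmax. The exact return format of predict() is not shown above, so the logits below are invented for illustration:

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Invented 2-class logits, standing in for a model's raw output.
logits = [0.3, 2.1]
probs = softmax(logits)
predicted_class = probs.index(max(probs))
print(predicted_class)  # 1
```

Reporting the probability alongside the predicted class (rather than the class alone) makes it easier to flag low-confidence predictions for manual review.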
Workflow Advantages
This standardized workflow offers several advantages:
- Consistency: Every task follows the same four-step pattern, which lowers the learning curve
- Reproducibility: Standardized steps make results easier to reproduce
- Scalability: The same structure adapts readily to new tasks and datasets
- Best Practices: Proven machine learning practices are built into each step
- Error Reduction: Standardization cuts down on common mistakes