
OmniGenBench Standardized Workflow

OmniGenBench uses a unified, four-step workflow that applies across all genomic prediction tasks—from sequence classification to token labeling. This consistent structure keeps code clean, reusable, and easy to understand.

The OmniGenBench workflow can be summarized as:

  1. Data Preparation
  2. Model Initialization
  3. Model Training
  4. Model Inference

Step 1: Data Preparation

What happens in this step?

  • Environment Setup: Install necessary Python packages
  • Data Acquisition: Download or load datasets
  • Data Preprocessing: Sequence tokenization, label encoding
  • Data Loading: Create PyTorch DataLoaders

Example Code

from omnigenbench import OmniTokenizer, OmniDatasetForSequenceClassification

tokenizer = OmniTokenizer.from_pretrained(
    "yangheng/OmniGenome-52M",
    trust_remote_code=True
)

dataset = OmniDatasetForSequenceClassification(
    data_file="train.json",     # Local JSON dataset
    tokenizer=tokenizer,
    max_length=512
)
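To make the preprocessing bullets concrete: tokenization maps each base to an integer id and pads to a fixed length, while label encoding maps class names to integers. The pure-Python sketch below illustrates only the idea; the vocabulary, `encode` function, and `LABEL_MAP` are made-up illustrations, not OmniTokenizer's actual implementation, which is loaded from the pretrained checkpoint.

```python
# Illustrative only: NOT OmniTokenizer's real vocabulary or logic.

LABEL_MAP = {"non-coding": 0, "coding": 1}   # label encoding: string -> int

# Character-level vocabulary for DNA, plus special tokens
VOCAB = {"<pad>": 0, "<unk>": 1, "A": 2, "C": 3, "G": 4, "T": 5}

def encode(sequence, max_length=8):
    """Map each base to an id, then truncate/pad to max_length."""
    ids = [VOCAB.get(base, VOCAB["<unk>"]) for base in sequence[:max_length]]
    ids += [VOCAB["<pad>"]] * (max_length - len(ids))
    return ids

example = encode("ATCG")
# -> [2, 5, 3, 4, 0, 0, 0, 0]  (two real tokens of padding follow the sequence)
```

The real tokenizer behaves analogously but with a learned vocabulary and special tokens taken from the checkpoint, and `max_length=512` as in the dataset example above.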

Step 2: Model Initialization

What happens in this step?

  • Tokenizer Loading: Use the same tokenizer as data preparation
  • Base Model Loading: Load pre-trained OmniGenome models
  • Task Adaptation: Add task-specific output layers
  • Model Configuration: Set model parameters

Example Code

from omnigenbench import OmniModelForSequenceClassification

model = OmniModelForSequenceClassification(
    model="yangheng/OmniGenome-52M",
    num_labels=2,      # Adjust depending on classification task
    trust_remote_code=True
)
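"Task adaptation" means attaching a small task-specific head on top of the pre-trained encoder. For sequence classification, the head is conceptually a linear map from the pooled sequence embedding to one logit per class. The pure-Python sketch below shows only this idea; all names, dimensions, and numbers are invented for illustration, and OmniModelForSequenceClassification attaches the real head for you.

```python
# Illustrative only: a linear classification head, written out by hand.

def classification_head(pooled_embedding, weights, bias):
    """logits[j] = sum_i embedding[i] * weights[j][i] + bias[j]"""
    return [
        sum(e * w for e, w in zip(pooled_embedding, row)) + b
        for row, b in zip(weights, bias)
    ]

# A toy 4-dim "embedding" projected to 2 class logits (num_labels=2)
embedding = [0.5, -1.0, 0.25, 2.0]
W = [[1.0, 0.0, 0.0, 0.0],    # weights for class 0
     [0.0, 1.0, 0.0, 0.5]]    # weights for class 1
b = [0.1, -0.1]
logits = classification_head(embedding, W, b)
```

In the real model the embedding comes from the pre-trained encoder, and the head's weights are the parameters that fine-tuning adjusts most strongly.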


Step 3: Model Training

What happens in this step?

  • Training Configuration: Set learning rate, batch size, and other hyperparameters
  • Evaluation Metrics: Choose appropriate performance evaluation metrics
  • Training Loop: Execute model fine-tuning
  • Model Saving: Save best-performing models

Example Code

from omnigenbench import OmniTrainer, ClassificationMetric

trainer = OmniTrainer(
    model=model,
    train_dataset=dataset,
    eval_dataset=valid_dataset,   # Built like `dataset` in Step 1, from a validation split
    compute_metrics=[ClassificationMetric().f1_score],
    output_dir="./my_finetuned_model"
)

trainer.train()
trainer.save_model("./my_finetuned_model")
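The loop that `trainer.train()` runs can be pictured as: forward pass, loss computation, parameter update, and keeping the best checkpoint. The toy one-feature logistic model below sketches that structure in pure Python; the data, learning rate, and in-memory "checkpoint" are illustrative inventions, not what OmniTrainer actually does internally.

```python
import math

# Illustrative only: the shape of a fine-tuning loop, on toy data.
data = [(0.0, 0), (1.0, 0), (3.0, 1), (4.0, 1)]   # (feature, label) pairs

w, b = 0.0, 0.0                     # model parameters
lr = 0.5                            # learning rate (a key hyperparameter)
best_loss, best_params = float("inf"), (w, b)

for epoch in range(200):                                # training loop
    total_loss = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))        # forward pass
        p = min(max(p, 1e-12), 1.0 - 1e-12)             # numerical safety
        total_loss += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        w -= lr * (p - y) * x                           # gradient update
        b -= lr * (p - y)
    if total_loss < best_loss:                          # keep best "checkpoint"
        best_loss, best_params = total_loss, (w, b)
```

In OmniGenBench all of this is handled for you; you supply the model, datasets, and metrics, and the trainer manages batching, optimization, evaluation, and checkpointing.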


Step 4: Model Inference

What happens in this step?

  • Model Loading: Load trained models
  • Input Preprocessing: Prepare new sequence data
  • Prediction Generation: Obtain model outputs
  • Result Interpretation: Parse and visualize prediction results

Example Code

from omnigenbench import OmniModelForSequenceClassification

# Load fine-tuned model
inference_model = OmniModelForSequenceClassification(
    model="./my_finetuned_model",
    num_labels=2,
    trust_remote_code=True
)

# Make predictions
prediction = inference_model.predict("ATCGATCGATCG")
print(f"Predicted class: {prediction}")
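The "result interpretation" step typically turns raw logits into a class label plus a confidence score via a softmax. The sketch below shows that computation in pure Python; the logit values and label names are made up for illustration, and in practice the logits come from the fine-tuned model's output.

```python
import math

# Illustrative only: mapping logits to (label, confidence) with a softmax.
def interpret(logits, labels):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]      # subtract max for stability
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Toy logits for a 2-class task (num_labels=2)
label, confidence = interpret([0.3, 2.1], ["non-coding", "coding"])
```

Whether `predict` returns a class index, a label string, or raw logits depends on the model class; check the returned object before interpreting it.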


Workflow Advantages

This standardized workflow offers several advantages:

  • Consistency: Every task follows the same pattern, lowering the learning curve
  • Reproducibility: Standardized steps make results easier to reproduce
  • Scalability: New tasks and datasets slot into the same structure
  • Best Practices: Proven machine learning practices are built in
  • Error Reduction: Standardization removes common sources of mistakes