
OmniGenBench Standardized Workflow

OmniGenBench uses a unified, four-step workflow that applies across all genomic prediction tasks—from sequence classification to token labeling. This consistent structure keeps code clean, reusable, and easy to understand.

The OmniGenBench workflow can be summarized as:

  1. Data Preparation
  2. Model Initialization
  3. Model Training
  4. Model Inference

Step 1: Data Preparation

What happens in this step?

  • Environment Setup: Install necessary Python packages
  • Data Acquisition: Download or load datasets
  • Data Preprocessing: Sequence tokenization, label encoding
  • Data Loading: Create PyTorch DataLoaders

Example Code

from omnigenbench import OmniTokenizer, OmniDatasetForSequenceClassification

tokenizer = OmniTokenizer.from_pretrained(
    "yangheng/OmniGenome-52M",
    trust_remote_code=True
)

dataset = OmniDatasetForSequenceClassification(
    data_file="train.json",     # Local JSON dataset
    tokenizer=tokenizer,
    max_length=512
)
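To make the preprocessing bullets concrete: tokenization maps each base to an integer id and pads to a fixed length, while label encoding maps class names to integers. The pure-Python sketch below illustrates only the idea; the vocabulary, `encode` function, and `LABEL_MAP` are made-up illustrations, not OmniTokenizer's actual implementation, which is loaded from the pretrained checkpoint.

```python
# Illustrative only: NOT OmniTokenizer's real vocabulary or logic.

LABEL_MAP = {"non-coding": 0, "coding": 1}   # label encoding: string -> int

# Character-level vocabulary for DNA, plus special tokens
VOCAB = {"<pad>": 0, "<unk>": 1, "A": 2, "C": 3, "G": 4, "T": 5}

def encode(sequence, max_length=8):
    """Map each base to an id, then truncate/pad to max_length."""
    ids = [VOCAB.get(base, VOCAB["<unk>"]) for base in sequence[:max_length]]
    ids += [VOCAB["<pad>"]] * (max_length - len(ids))
    return ids

example = encode("ATCG")
# -> [2, 5, 3, 4, 0, 0, 0, 0]  (two real tokens of padding follow the sequence)
```

The real tokenizer behaves analogously but with a learned vocabulary and special tokens taken from the checkpoint, and `max_length=512` as in the dataset example above.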

Step 2: Model Initialization

What happens in this step?

  • Tokenizer Loading: Use the same tokenizer as data preparation
  • Base Model Loading: Load pre-trained OmniGenome models
  • Task Adaptation: Add task-specific output layers
  • Model Configuration: Set model parameters

Example Code

from omnigenbench import OmniModelForSequenceClassification

model = OmniModelForSequenceClassification(
    model="yangheng/OmniGenome-52M",
    num_labels=2,      # Adjust depending on classification task
    trust_remote_code=True
)
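"Task adaptation" means attaching a small task-specific head on top of the pre-trained encoder. For sequence classification, the head is conceptually a linear map from the pooled sequence embedding to one logit per class. The pure-Python sketch below shows only this idea; all names, dimensions, and numbers are invented for illustration, and OmniModelForSequenceClassification attaches the real head for you.

```python
# Illustrative only: a linear classification head, written out by hand.

def classification_head(pooled_embedding, weights, bias):
    """logits[j] = sum_i embedding[i] * weights[j][i] + bias[j]"""
    return [
        sum(e * w for e, w in zip(pooled_embedding, row)) + b
        for row, b in zip(weights, bias)
    ]

# A toy 4-dim "embedding" projected to 2 class logits (num_labels=2)
embedding = [0.5, -1.0, 0.25, 2.0]
W = [[1.0, 0.0, 0.0, 0.0],    # weights for class 0
     [0.0, 1.0, 0.0, 0.5]]    # weights for class 1
b = [0.1, -0.1]
logits = classification_head(embedding, W, b)
```

In the real model the embedding comes from the pre-trained encoder, and the head's weights are the parameters that fine-tuning adjusts most strongly.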


Step 3: Model Training

What happens in this step?

  • Training Configuration: Set learning rate, batch size, and other hyperparameters
  • Evaluation Metrics: Choose appropriate performance evaluation metrics
  • Training Loop: Execute model fine-tuning
  • Model Saving: Save best-performing models

Example Code

from omnigenbench import OmniTrainer, ClassificationMetric

trainer = OmniTrainer(
    model=model,
    train_dataset=dataset,
    eval_dataset=valid_dataset,   # Built like `dataset` in Step 1, from a validation split
    compute_metrics=[ClassificationMetric().f1_score],
    output_dir="./my_finetuned_model"
)

trainer.train()
trainer.save_model("./my_finetuned_model")
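The loop that `trainer.train()` runs can be pictured as: forward pass, loss computation, parameter update, and keeping the best checkpoint. The toy one-feature logistic model below sketches that structure in pure Python; the data, learning rate, and in-memory "checkpoint" are illustrative inventions, not what OmniTrainer actually does internally.

```python
import math

# Illustrative only: the shape of a fine-tuning loop, on toy data.
data = [(0.0, 0), (1.0, 0), (3.0, 1), (4.0, 1)]   # (feature, label) pairs

w, b = 0.0, 0.0                     # model parameters
lr = 0.5                            # learning rate (a key hyperparameter)
best_loss, best_params = float("inf"), (w, b)

for epoch in range(200):                                # training loop
    total_loss = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))        # forward pass
        p = min(max(p, 1e-12), 1.0 - 1e-12)             # numerical safety
        total_loss += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        w -= lr * (p - y) * x                           # gradient update
        b -= lr * (p - y)
    if total_loss < best_loss:                          # keep best "checkpoint"
        best_loss, best_params = total_loss, (w, b)
```

In OmniGenBench all of this is handled for you; you supply the model, datasets, and metrics, and the trainer manages batching, optimization, evaluation, and checkpointing.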


Step 4: Model Inference

What happens in this step?

  • Model Loading: Load trained models
  • Input Preprocessing: Prepare new sequence data
  • Prediction Generation: Obtain model outputs
  • Result Interpretation: Parse and visualize prediction results

Example Code

from omnigenbench import OmniModelForSequenceClassification

# Load fine-tuned model
inference_model = OmniModelForSequenceClassification(
    model="./my_finetuned_model",
    num_labels=2,
    trust_remote_code=True
)

# Make predictions
prediction = inference_model.predict("ATCGATCGATCG")
print(f"Predicted class: {prediction}")
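The "result interpretation" step typically turns raw logits into a class label plus a confidence score via a softmax. The sketch below shows that computation in pure Python; the logit values and label names are made up for illustration, and in practice the logits come from the fine-tuned model's output.

```python
import math

# Illustrative only: mapping logits to (label, confidence) with a softmax.
def interpret(logits, labels):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]      # subtract max for stability
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Toy logits for a 2-class task (num_labels=2)
label, confidence = interpret([0.3, 2.1], ["non-coding", "coding"])
```

Whether `predict` returns a class index, a label string, or raw logits depends on the model class; check the returned object before interpreting it.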


Workflow Advantages

This standardized workflow offers several advantages:

  • Consistency: Every task follows the same pattern, lowering the learning curve
  • Reproducibility: Standardized steps make results easier to reproduce
  • Scalability: New tasks and datasets slot into the same structure
  • Best Practices: Proven machine learning practices are built in
  • Error Reduction: Standardization removes common sources of mistakes