Data, Task, and Model Relationships
Building successful genomic deep learning pipelines requires using the correct data format, selecting the appropriate task type, and pairing it with a suitable model architecture. OmniGenBench provides a unified framework for aligning these components.
OmniGenBench Model Family
OmniGenBench includes model variants specialized for different machine learning task types commonly used in genomics.
| Task Type | OmniGenBench Model | Data Format Requirements | Application Scenarios |
|---|---|---|---|
| Sequence Classification | OmniModelForSequenceClassification | {sequence, label} | Promoter identification, functional classification |
| Sequence Regression | OmniModelForSequenceRegression | {sequence, value} | Expression prediction, stability score prediction |
| Token Classification | OmniModelForTokenClassification | {sequence, labels_per_position} | Binding site detection, RNA modifications |
| Multi-label Classification | OmniModelForMultiLabelClassification | {sequence, label_vector} | Multi-functional annotation, localization prediction |
These models share the same underlying OmniGenome pre-trained encoder, while the prediction head varies depending on the downstream task.
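As a rough illustration, the table above can be kept as a small lookup that pairs each task type with its model class name and required fields. The class and field names come from the table; the `TASK_REGISTRY` dict and the `missing_fields` helper are hypothetical and not part of OmniGenBench.

```python
# Hypothetical lookup: task type -> (model class name, required record fields).
# Names are copied from the table; the structure itself is only a sketch.
TASK_REGISTRY = {
    "sequence_classification": (
        "OmniModelForSequenceClassification", ("sequence", "label")),
    "sequence_regression": (
        "OmniModelForSequenceRegression", ("sequence", "value")),
    "token_classification": (
        "OmniModelForTokenClassification", ("sequence", "labels_per_position")),
    "multi_label_classification": (
        "OmniModelForMultiLabelClassification", ("sequence", "label_vector")),
}


def missing_fields(task_type, record):
    """Return the required fields that are absent from `record`."""
    _, required = TASK_REGISTRY[task_type]
    return [name for name in required if name not in record]


# A regression record without its target is flagged before any training starts.
problems = missing_fields("sequence_regression", {"sequence": "ATGCCC"})
```

Checking records against the expected fields up front tends to surface a mismatched task type earlier than a shape error inside the prediction head would.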
Standard Dataset Formats
OmniGenBench datasets follow standardized CSV schemas to ensure compatibility across all models.
Sequence Classification Format
Sequence Regression Format
Token Classification Format
- sequence: DNA/RNA/amino acid sequence
- labels: comma-separated per-position labels
- split: train/valid/test
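The token classification columns listed above could be read with the standard `csv` module, as in the minimal sketch below. It assumes integer labels and exactly one label per sequence character; neither assumption is stated in the schema, and `read_token_classification_csv` is a hypothetical helper rather than an OmniGenBench function.

```python
import csv
import io

VALID_SPLITS = {"train", "valid", "test"}


def read_token_classification_csv(handle):
    """Parse rows that have `sequence`, `labels`, and `split` columns."""
    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        # Assumption: labels are integers separated by commas.
        labels = [int(item) for item in row["labels"].split(",")]
        # Assumption: one label per nucleotide/residue.
        if len(labels) != len(row["sequence"]):
            raise ValueError(
                f"line {line_no}: {len(labels)} labels for "
                f"{len(row['sequence'])} positions")
        if row["split"] not in VALID_SPLITS:
            raise ValueError(f"line {line_no}: unknown split {row['split']!r}")
        rows.append({"sequence": row["sequence"],
                     "labels": labels,
                     "split": row["split"]})
    return rows


# The labels field contains commas, so it must be quoted in the CSV file.
example = io.StringIO('sequence,labels,split\nATGCCC,"0,0,1,1,0,0",train\n')
rows = read_token_classification_csv(example)
```

Because the labels are themselves comma-separated, the field has to be quoted when the file is written; otherwise a CSV reader will split it into extra columns.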
Model Selection Decision Tree
You can determine the correct model type by answering three questions:
- Do I want to classify/predict something about an entire sequence? → Classification or regression model
- Do I want to annotate each nucleotide/residue? → Token classification model
- Do I want to generate a new sequence? → Sequence-to-sequence model (coming soon)
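The questions above can be sketched as a small selection function. `choose_model` and its boolean parameters are hypothetical names introduced for illustration; the returned strings are the class names from the model family table, and the split of sequence-level tasks into regression, multi-label, and single-label classification follows that table rather than any documented API.

```python
def choose_model(per_position=False, generates_sequence=False,
                 continuous_target=False, multi_label=False):
    """Return the model class name suggested by the decision tree."""
    if generates_sequence:
        # The document lists sequence-to-sequence models as "coming soon".
        raise NotImplementedError("sequence-to-sequence models are not available yet")
    if per_position:
        # One label per nucleotide/residue.
        return "OmniModelForTokenClassification"
    # Everything below predicts a property of the whole sequence.
    if continuous_target:
        return "OmniModelForSequenceRegression"
    if multi_label:
        return "OmniModelForMultiLabelClassification"
    return "OmniModelForSequenceClassification"


# Binding site detection labels every position, so it is a token-level task.
binding_site_model = choose_model(per_position=True)
# Expression level is a single continuous value per sequence.
expression_model = choose_model(continuous_target=True)
```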
Practical Application Examples
Here are typical real-world scenarios showing how to match tasks, models, and data formats.
Case 1: Translation Efficiency Prediction
Question: “Is this mRNA sequence’s translation efficiency high or low?”
Task Type: Sequence classification (binary)
Model Choice: OmniModelForSequenceClassification
Data Format Example:
{ "sequence": "AUGCCC...", "label": 1 }
Case 2: Gene Expression Level Prediction
Question: “What expression level will this promoter produce?”
Task Type: Sequence regression
Model Choice: OmniModelForSequenceRegression
Data Format Example:
{ "sequence": "ATGCCC...", "value": 8.5 }
Case 3: Transcription Factor Binding Site Prediction
Question: “Which positions in the sequence bind a specific transcription factor?”
Task Type: Token classification (per-nucleotide labeling)
Model Choice: OmniModelForTokenClassification
Data Format Example:
{ "sequence": "ATGCCC...", "labels": [0, 0, 1, 1, 0, 0, ...] }
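The record above stores labels as a list, while the token classification CSV format stores them as a comma-separated string. One possible way to bridge the two is sketched below using only the standard library. `records_to_csv` is a hypothetical helper, the six-base sequence is a toy stand-in for the truncated example, and the one-label-per-position check is an assumption.

```python
import csv
import io


def records_to_csv(records, split):
    """Serialize token classification records to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer,
                            fieldnames=["sequence", "labels", "split"],
                            lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Assumption: exactly one label per nucleotide/residue.
        if len(record["labels"]) != len(record["sequence"]):
            raise ValueError("need exactly one label per position")
        writer.writerow({
            "sequence": record["sequence"],
            # Join the list into the comma-separated form used in the CSV.
            "labels": ",".join(str(label) for label in record["labels"]),
            "split": split,
        })
    return buffer.getvalue()


# Toy record: complete six-base sequence with one label per base.
toy_records = [{"sequence": "ATGCCC", "labels": [0, 0, 1, 1, 0, 0]}]
csv_text = records_to_csv(toy_records, split="train")
```

The `csv` writer quotes the labels field automatically because it contains the delimiter, so the output can be read back without the labels spilling into extra columns.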
Summary
By aligning:
- Task type (classification, regression, token labeling)
- Dataset format (sequence-level or position-level)
- Model architecture (OmniModel variants)
OmniGenBench ensures consistency, reproducibility, and ease of experimentation across genomic deep learning workflows.