Data, Task, and Model Relationships

Building successful genomic deep learning pipelines requires three choices that fit together: the data format, the task type, and a model architecture suited to that task. OmniGenBench provides a unified framework for aligning these components.

OmniGenBench Model Family

OmniGenBench includes model variants specialized for different machine learning task types commonly used in genomics.

Task Type	OmniGenBench Model	Data Format Requirements	Application Scenarios
Sequence Classification	`OmniModelForSequenceClassification`	`{sequence, label}`	Promoter identification, functional classification
Sequence Regression	`OmniModelForSequenceRegression`	`{sequence, value}`	Expression prediction, stability score prediction
Token Classification	`OmniModelForTokenClassification`	`{sequence, labels_per_position}`	Binding site detection, RNA modifications
Multi-label Classification	`OmniModelForMultiLabelClassification`	`{sequence, label_vector}`	Multi-functional annotation, localization prediction

These models share the same underlying OmniGenome pre-trained encoder, while the prediction head varies depending on the downstream task.
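As a rough illustration, the table above can be written down as a lookup that checks whether a record carries the fields its task needs. This is only a sketch: `TASK_REGISTRY` and `missing_fields` are hypothetical helpers, not part of the OmniGenBench API, and the model names are kept as plain strings so nothing is imported from the library. The token classification field is written as `labels`, matching the CSV schema and the case studies in this document rather than the descriptive `labels_per_position` placeholder in the table.

```python
# Hypothetical lookup transcribed from the task/model table.
# Model names are plain strings; nothing here calls the OmniGenBench library.
TASK_REGISTRY = {
    "sequence_classification": {
        "model": "OmniModelForSequenceClassification",
        "fields": ("sequence", "label"),
    },
    "sequence_regression": {
        "model": "OmniModelForSequenceRegression",
        "fields": ("sequence", "value"),
    },
    "token_classification": {
        "model": "OmniModelForTokenClassification",
        "fields": ("sequence", "labels"),  # assumed column name, as in the CSV schema
    },
    "multi_label_classification": {
        "model": "OmniModelForMultiLabelClassification",
        "fields": ("sequence", "label_vector"),
    },
}


def missing_fields(task_type, record):
    """Return the required fields that are absent from a record."""
    required = TASK_REGISTRY[task_type]["fields"]
    return [name for name in required if name not in record]


record = {"sequence": "ATGCCC", "label": 1}
print(missing_fields("sequence_classification", record))
print(missing_fields("sequence_regression", record))
```

A check like this catches a mismatch between dataset and task (for example, a `label` column passed to a regression task) before any model is loaded.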

Standard Dataset Formats

OmniGenBench datasets follow standardized CSV schemas to ensure compatibility across all models.

Sequence Classification Format

sequence,label,id,split
ATCGATCGATCG,1,seq_001,train
GCGCGCGCGCGC,0,seq_002,valid

Sequence Regression Format

sequence,value,id,split
ATCGATCGATCG,8.5,seq_001,train
GCGCGCGCGCGC,2.3,seq_002,valid
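
The two sequence-level formats above differ only in the target column, so one reader can handle both. The following is a minimal sketch using Python's standard `csv` module rather than any OmniGenBench loader; `load_by_split` is a hypothetical helper, and the CSV text is the regression example from above held in memory.

```python
import csv
import io
from collections import defaultdict

# The regression example from this section, held in memory for illustration.
CSV_TEXT = """sequence,value,id,split
ATCGATCGATCG,8.5,seq_001,train
GCGCGCGCGCGC,2.3,seq_002,valid
"""


def load_by_split(handle, target_column, cast):
    """Group rows by their `split` column and cast the target to a number."""
    splits = defaultdict(list)
    for row in csv.DictReader(handle):
        splits[row["split"]].append(
            {
                "id": row["id"],
                "sequence": row["sequence"],
                target_column: cast(row[target_column]),
            }
        )
    return dict(splits)


# Use ("label", int) for the classification format instead.
splits = load_by_split(io.StringIO(CSV_TEXT), "value", float)
print(sorted(splits))
print(splits["train"][0])
```

For a file on disk, `open(path, newline="")` can be passed in place of the `io.StringIO` object.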

Token Classification Format

sequence,labels,id,split
ATCGATCGATCG,"0,0,1,1,0,0,0,0,0,0,0,0",seq_001,train

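sequence: DNA, RNA, or amino acid sequence

labels: comma-separated per-position labels, one per nucleotide or residue, quoted so that the commas stay inside a single CSV field

id: unique identifier for the sequence

split: dataset partition (train, valid, or test)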
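
The quoted `labels` field has to be split back into one integer per position before training. The sketch below parses the example row with the standard `csv` module and checks that the label count matches the sequence length. `parse_token_row` is a hypothetical helper, and the one-label-per-character alignment is an assumption: a tokenizer that merges several characters into one token would need the labels realigned.

```python
import csv
import io

# The token classification example from this section.
CSV_TEXT = '''sequence,labels,id,split
ATCGATCGATCG,"0,0,1,1,0,0,0,0,0,0,0,0",seq_001,train
'''


def parse_token_row(row):
    """Turn the comma-separated label string into a list of ints and validate it."""
    labels = [int(item) for item in row["labels"].split(",")]
    if len(labels) != len(row["sequence"]):
        raise ValueError(
            f"{row['id']}: {len(labels)} labels for {len(row['sequence'])} positions"
        )
    return {
        "id": row["id"],
        "sequence": row["sequence"],
        "labels": labels,
        "split": row["split"],
    }


rows = [parse_token_row(row) for row in csv.DictReader(io.StringIO(CSV_TEXT))]
print(rows[0]["labels"])
```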

Model Selection Decision Tree

You can determine the correct model type by answering three questions:

Do I want to classify/predict something about an entire sequence? → Classification or regression model
Do I want to annotate each nucleotide/residue? → Token classification model
Do I want to generate a new sequence? → Sequence-to-sequence model (coming soon)
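
The questions above can be sketched as a small function. `choose_model` and its parameter names are hypothetical; the returned strings are the class names from the model table. The `continuous_target` and `multi_label` flags are an added assumption used to separate the sequence-level variants listed in that table.

```python
def choose_model(per_position=False, generate=False,
                 continuous_target=False, multi_label=False):
    """Map answers to the selection questions onto a model class name."""
    if generate:
        # The document lists sequence-to-sequence models as "coming soon".
        raise NotImplementedError("sequence-to-sequence models are not available yet")
    if per_position:
        return "OmniModelForTokenClassification"
    # Everything below is a whole-sequence task.
    if continuous_target:
        return "OmniModelForSequenceRegression"
    if multi_label:
        return "OmniModelForMultiLabelClassification"
    return "OmniModelForSequenceClassification"


print(choose_model())                        # whole-sequence, single categorical label
print(choose_model(continuous_target=True))  # whole-sequence, numeric target
print(choose_model(per_position=True))       # one label per nucleotide/residue
```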

Practical Application Examples

Here are typical real-world scenarios showing how to match tasks, models, and data formats.

Case 1: Translation Efficiency Prediction

Question: “Is this mRNA sequence’s translation efficiency high or low?”

Task Type: Sequence classification (binary)

Model Choice: OmniModelForSequenceClassification

Data Format Example: { "sequence": "AUGCCC...", "label": 1 }

Case 2: Gene Expression Level Prediction

Question: “What expression level will this promoter produce?”

Task Type: Sequence regression

Model Choice: OmniModelForSequenceRegression

Data Format Example: { "sequence": "ATGCCC...", "value": 8.5 }

Case 3: Transcription Factor Binding Site Prediction

Question: “Which positions in the sequence bind a specific transcription factor?”

Task Type: Token classification (per-nucleotide labeling)

Model Choice: OmniModelForTokenClassification

Data Format Example: {"sequence": "ATGCCC...", "labels": [0, 0, 1, 1, 0, 0, ...]}

Summary

By aligning:

- Task type (classification, regression, token labeling)

- Dataset format (sequence-level or position-level)

- Model architecture (OmniModel variants)

OmniGenBench ensures consistency, reproducibility, and ease of experimentation across genomic deep learning workflows.