TFB Prediction Tutorial 1/4: From Biological Questions to Data Pipelines
Welcome to the first tutorial in our four-part series. This guide focuses on the foundational and most critical step of any computational biology project: understanding and preparing your data.
Prerequisite
If you're new to OmniGenBench or foundation models, please read the Fundamental Concepts Tutorial first. It covers language model principles, model-task mapping, genomic data types, and PlantRNA-FM basics.
Before we write any code, we must first understand the landscape of biological data and how to frame biological questions as machine learning tasks.
The Language of Life: DNA and RNA
At its core, genomics is the study of sequences. The primary types are:
- DNA (Deoxyribonucleic acid): The blueprint of life, composed of four bases (A, T, C, G). It carries the genetic instructions for the development, functioning, growth, and reproduction of all known organisms.
- RNA (Ribonucleic acid): Often a messenger, RNA (composed of A, U, C, G) plays a crucial role in translating the instructions from DNA into proteins. It can also have structural and regulatory roles.
These sequences are not random strings; they contain complex patterns, grammar, and syntax. Our goal is to teach a machine to read and understand this "language of life."
Framing Biological Questions as ML Tasks
A biological question must be translated into a well-defined machine learning task. Here are the most common types:
| Task Type | Biological Question | Example |
|---|---|---|
| Sequence Classification | Does this sequence have property X? | Is this sequence a promoter? (Yes/No) |
| Multi-Label Classification | Which properties apply? | Which of 100 TFs bind this sequence? |
| Sequence Regression | Predict a numerical value | Translation efficiency score |
| Token Classification | Label each base | Identify gene boundaries |
| Sequence-to-Sequence | Transform sequence | DNA → protein sequence |
The OmniGenBench Toolbox: Available Models for Every Task
OmniGenBench provides a suite of pre-configured models, each designed for a specific task. This saves you from having to build them from scratch.
| Task | OmniGenBench Model | When to Use |
|---|---|---|
| Sequence Classification | OmniModelForSequenceClassification | One label per sequence |
| Multi-Label Classification | OmniModelForMultiLabelSequenceClassification | Multiple labels per sequence (TFB prediction) |
| Sequence Regression | OmniModelForSequenceRegression | Predict a continuous value |
| Token Classification | OmniModelForTokenClassification | Label each nucleotide |
| Token Regression | OmniModelForTokenRegression | Predict numbers per nucleotide |
| Seq2Seq | OmniModelForSeq2Seq | Generate sequence outputs |
By understanding this mapping, you can quickly select the right tool for your biological problem.
Our Task: Why TFB is Multi-Label Sequence Classification
Now, let's apply this to our tutorial's goal: Transcription Factor Binding (TFB) Prediction.
- It's a Sequence task: Our input is a DNA sequence.
- It's a Classification task: For each transcription factor, we are asking a "Yes/No" question: does it bind?
- It's a Multi-Label task: A single DNA sequence can be a binding site for multiple different TFs simultaneously. We aren't just picking one from a list; we are identifying all potential binders.
Therefore, the correct framing is Multi-Label Sequence Classification, and the right tool from our toolbox is OmniModelForMultiLabelSequenceClassification.
With this clear understanding, we can now proceed to prepare our data specifically for this task.
Step-by-Step Guide: Preparing the DeepSEA Dataset
Now we move from theory to practice. This section will guide you through the hands-on process of preparing the data.
Environment Setup
First, let's install the required Python packages.
Next, we import the libraries we just installed. This gives us the tools for data processing, deep learning, and interacting with the operating system. A key part of this setup is determining the best available hardware for training: our script will automatically prioritize a CUDA-enabled GPU if one is available, as this can accelerate training by 10-100x compared to a CPU.
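Below is a minimal sketch of this setup. It assumes the classes used later in this tutorial can be imported from an omnigenbench package; adjust the import path to your installation.

```python
import torch
# Import path assumed; adjust to your installed OmniGenBench version.
from omnigenbench import OmniTokenizer, OmniDatasetForMultiLabelClassification

# Prefer a CUDA-enabled GPU when available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"🖥️ Using device: {device}")
```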
Understanding OmniGenBench Data Templates
Before diving into data loading, it's crucial to understand how OmniGenBench expects data to be organized. This section explains the standardized data templates and directory structures that make the framework so powerful and consistent.
Standard Directory Structure
OmniGenBench follows a conventional directory structure that enables automatic data discovery and loading:
dataset_directory/
├── train.jsonl # Training data (required)
├── valid.jsonl # Validation data (recommended)
├── test.jsonl # Test data (optional)
├── config.py # Metadata file (optional)
└── README.md # Documentation (recommended)
Note
The dataset files can also be in other formats such as CSV/TSV or Parquet, in which case they would be named train.csv, valid.csv, and test.csv (or the corresponding extension). The config.py file is optional but highly recommended: it lets you define important metadata about your dataset, such as the number of labels, label names, and sequence length. This information helps the framework understand how to process your data correctly.
Key Benefits of This Structure:
- Automatic Discovery: Framework can find and load data without manual path specification
- Consistent Splits: Standard train/valid/test division across all datasets
- Metadata Integration: Additional biological annotations stored separately
- Documentation: Self-documenting datasets for reproducible research
Data File Formats
Each line contains a JSON object with the sequence and its annotations:
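For example, one line of train.jsonl might look like the record below; the sequence and labels field names are illustrative, so check your dataset's README for the exact keys.

{"sequence": "AGCTTAGGCA...TTGACC", "labels": [1, 0, 1, 0, 0]}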
Label Format for Multi-Label Classification
For TFB prediction, the labels field is a multi-hot vector, e.g., [1, 0, 1, 0, ...], that records the binding status of many transcription factors at once. Each position corresponds to a specific transcription factor:
- Position 0: CTCF (1 = binds, 0 = doesn't bind)
- Position 1: p53 (0 = doesn't bind)
- Position 2: FOXA1 (1 = binds)
- And so on...
Configuration Files
A config.py file defining dataset metadata and parameters, for example:
{
"task_type": "multi_label_classification",
"sequence_type": "DNA",
"num_labels": 919,
"label_names": ["CTCF", "p53", "FOXA1", ...],
"sequence_length": 1000,
"total_samples": 440000,
"description": "DeepSEA dataset for TF binding prediction"
}
- task_type: Defines the ML task (classification, regression, etc.)
- sequence_type: DNA, RNA, or Protein
- num_labels: Number of target labels (919 for DeepSEA)
- label_names: Human-readable names for each label
- sequence_length: Fixed length of input sequences
Loading Strategy Options
OmniGenBench provides multiple loading strategies based on your data format:
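Here is a hedged sketch of two such routes; only the hub route appears verbatim later in this tutorial, and the local-directory route assumes from_hub also accepts a path that follows the standard layout above.

```python
# Tokenizer and dataset classes as used later in this tutorial (import path assumed).
from omnigenbench import OmniTokenizer, OmniDatasetForMultiLabelClassification

tokenizer = OmniTokenizer.from_pretrained("yangheng/OmniGenome-52M")

# Route 1: load a prepared dataset from the hub by name.
datasets = OmniDatasetForMultiLabelClassification.from_hub(
    dataset_name_or_path="deepsea_tfb_prediction",
    tokenizer=tokenizer,
    max_length=512,
)

# Route 2 (assumed behavior): point the same call at a local directory that
# follows the train.jsonl / valid.jsonl / test.jsonl layout described above.
local_datasets = OmniDatasetForMultiLabelClassification.from_hub(
    dataset_name_or_path="./dataset_directory",
    tokenizer=tokenizer,
    max_length=512,
)
```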
Best Practices
- Consistent Naming: Use descriptive, consistent file and column names
- Balanced Splits: Ensure train/valid/test splits are representative
- Quality Control: Validate data integrity before training (see the sketch after this list)
- Documentation: Include README with dataset description and usage
- Biological Context: Preserve important biological metadata
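As a minimal quality-control sketch, assuming each JSONL record uses sequence and labels keys (hypothetical field names) and the 919 DeepSEA labels:

```python
import json

def validate_jsonl(path, expected_num_labels=919, alphabet=set("ACGTN")):
    """Check that every record has a DNA-only sequence and the expected label count."""
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            record = json.loads(line)
            # Sequences should contain only DNA characters (N marks unknown bases).
            assert set(record["sequence"].upper()) <= alphabet, \
                f"Unexpected characters in sequence at line {line_number}"
            # Every record should carry one label per transcription factor.
            assert len(record["labels"]) == expected_num_labels, \
                f"Wrong label count at line {line_number}"

# Example usage (path assumed):
# validate_jsonl("dataset_directory/train.jsonl")
```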
With this foundational understanding, let's now proceed to configure our specific dataset parameters.
config_or_model = "yangheng/OmniGenome-52M"
dataset_name = "deepsea_tfb_prediction"
# Load the tokenizer; the datasets are loaded with the OmniDataset framework in the next step
print("🔄 Loading tokenizer...")
tokenizer = OmniTokenizer.from_pretrained(config_or_model)
print(f"✅ Tokenizer loaded: {config_or_model}")
Data Acquisition
With our environment configured, it's time to download the DeepSEA dataset. The loading call below automates this process by:
- Checking if the data already exists.
- Downloading the dataset from the specified URL if needed.
- Extracting the files.
- Cleaning up the temporary zip file.
This ensures we have the train.jsonl, valid.jsonl, and test.jsonl files ready for the next stage.
print("📊 Loading DeepSEA TFB dataset...")
datasets = OmniDatasetForMultiLabelClassification.from_hub(
dataset_name_or_path=dataset_name,
tokenizer=tokenizer,
max_length=512,
    max_examples=1000, # For quick testing; set to None to use the full dataset
    force_padding=False # Sequences already have a fixed length, so no padding of sequences or labels is needed
)
print("📝 Data loading completed! Using OmniDataset framework.")
print(f"📊 Loaded datasets: {list(datasets.keys())}")
for split, dataset in datasets.items():
print(f" - {split}: {len(dataset)} samples")
# Demonstrate dataset functionality - enhanced version
print("🧪 Dataset Analysis and Validation")
print("=" * 40)
# Sample a few examples from the training set
sample_size = 3
train_samples = [datasets['train'][i] for i in range(min(sample_size, len(datasets['train'])))]
for i, sample in enumerate(train_samples):
print(f"\n📋 Sample {i+1}:")
print(f" 🧬 Input shape: {sample['input_ids'].shape}")
print(f" 🏷️ Labels shape: {sample['labels'].shape}")
print(f" 📊 Positive labels: {sample['labels'].sum().item()}/{len(sample['labels'])} TFs")
# Show first few nucleotides
sequence_ids = sample['input_ids'][:20].tolist()
print(f" 🔤 First 20 tokens: {sequence_ids}")
print(f"\n✅ Dataset validation completed!")
print(f" 📊 All samples have consistent shapes")
print(f" 🧬 Ready for multi-label TFB prediction")
print(f" 🎯 {datasets['train'][0]['labels'].shape[0]} TF labels per sequence")
Custom Dataset and Data Loaders
Now that we have the data files, we need a way to load them into our model efficiently. We'll do this in two parts:
For most classification and regression tasks, dataset classes are already integrated into OmniGenBench, e.g.:
| Dataset Class | Task Type | Description |
|---|---|---|
| OmniDatasetForSequenceClassification | Sequence Classification | Tasks for classifying the entire sequence into one category (e.g., promoter vs. non-promoter). |
| OmniDatasetForMultiLabelClassification | Multi-Label Classification | Tasks for predicting multiple labels for a sequence (e.g., TF binding prediction). |
| OmniDatasetForSequenceRegression | Sequence Regression | Tasks for predicting a single continuous value for the entire sequence (e.g., translation efficiency). |
| OmniDatasetForTokenClassification | Token (Base) Classification | Tasks for assigning a label to each token (base) in the sequence (e.g., identifying splice sites). |
| OmniDatasetForTokenRegression | Token (Base) Regression | Tasks for predicting a continuous value for each token (base) in the sequence. |
To demonstrate how to customize a dataset, we define a dataset class for the DeepSEA data below.
The DeepSEADataset Class
This custom class acts as a bridge between our raw .jsonl files and the PyTorch ecosystem. It inherits from OmniDataset and tells our framework how to process each data entry. Specifically, it:
- Processes a DNA sequence and its labels via prepare_input().
- Truncates or pads the sequence to a fixed length (MAX_LENGTH). This is crucial because language models require inputs of a consistent size.
- Selects the specific TF labels we want to train on.
- Tokenizes the sequence, converting the string of "A, C, G, T" into numerical tokens that the model can understand.
- Formats the output as PyTorch tensors.
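Because the exact OmniDataset hooks can vary between versions, the following is a minimal sketch of such a class. It assumes prepare_input() receives one raw record with sequence and labels fields and that the tokenizer follows the Hugging Face calling convention.

```python
import torch
from omnigenbench import OmniDataset  # import path assumed

MAX_LENGTH = 512  # fixed input length expected by the model

class DeepSEADataset(OmniDataset):
    def prepare_input(self, instance, **kwargs):
        sequence = instance["sequence"]  # raw DNA string
        labels = instance["labels"]      # multi-hot TF label vector
        # Optionally select a subset of TF labels to train on here.

        # Tokenize, then truncate or pad to a consistent length.
        tokenized = self.tokenizer(
            sequence,
            max_length=MAX_LENGTH,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )

        return {
            "input_ids": tokenized["input_ids"].squeeze(0),
            "attention_mask": tokenized["attention_mask"].squeeze(0),
            # One float per transcription factor (1 = binds, 0 = doesn't bind).
            "labels": torch.tensor(labels, dtype=torch.float32),
        }
```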
Creating DataLoaders
Once the DeepSEADataset is defined, we use PyTorch's DataLoader to create an efficient pipeline (a minimal example follows this list). The DataLoader is responsible for:
- Batching: Grouping individual samples into batches (BATCH_SIZE).
- Shuffling: Randomly shuffling the training data each epoch to improve generalization.
- Parallelism: Loading data in the background so it's ready for the model when needed.
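A minimal sketch, assuming the datasets dictionary loaded earlier and an illustrative BATCH_SIZE; the enhanced framework described below can also create these loaders for you.

```python
from torch.utils.data import DataLoader

BATCH_SIZE = 32  # illustrative value; tune to your GPU memory

# Shuffle only the training split; keep validation and test order deterministic.
train_loader = DataLoader(datasets["train"], batch_size=BATCH_SIZE, shuffle=True, num_workers=2)
valid_loader = DataLoader(datasets["valid"], batch_size=BATCH_SIZE, shuffle=False, num_workers=2)
test_loader = DataLoader(datasets["test"], batch_size=BATCH_SIZE, shuffle=False, num_workers=2)
```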
print("📝 Data loading completed! Using enhanced OmniDataset framework.")
print(f"📊 Loaded datasets: {list(datasets.keys())}")
for split, dataset in datasets.items():
print(f" - {split}: {len(dataset)} samples")
The datasets are now ready for training! Each dataset contains:
- Tokenized DNA sequences
- Multi-hot encoded labels for 919 TF binding sites
- Proper batching and data loading handled automatically
Enhanced Data Pipeline
With the enhanced OmniGenBench framework, DataLoaders are automatically created and optimized. The framework handles all the complexities:
- Automatic Batching: Optimal batch sizes for your hardware configuration
- Intelligent Shuffling: Proper data shuffling for better training dynamics
- Memory Optimization: Efficient memory usage for large genomic datasets
- Multi-Label Handling: Proper formatting for multi-label classification tasks
- Hardware Acceleration: Automatic GPU optimization when available
This means you can focus on the biological insights rather than the technical implementation details.
print("\n🎉 Enhanced data pipeline ready!")
print("The datasets are now fully prepared and optimized for training.")
print("DataLoaders will be automatically created during the training process.")
# Let's verify our data is properly formatted
sample_data = datasets['train'][0]
print(f"\n📋 Sample data structure:")
print(f" - Input IDs shape: {sample_data['input_ids'].shape}")
print(f" - Labels shape: {sample_data['labels'].shape}")
print(f" - Number of TF binding sites: {sample_data['labels'].sum().item()}/{len(sample_data['labels'])}")
Summary and Next Steps
Congratulations! You have successfully built a complete data preparation pipeline. You've learned not only how to code a data pipeline but also why it's structured the way it is, starting from the biological question itself.
Your data is now ready to be used for training a model.
In the next tutorial, Model Initialization, we will take these Datasets and learn how to set up a pre-trained Genomic Foundation Model for our TFB prediction task.