
TFB Prediction Tutorial 1/4: From Biological Questions to Data Pipelines

Welcome to the first tutorial in our four-part series. This guide focuses on the foundational and most critical step of any computational biology project: understanding and preparing your data.

Prerequisite

If you're new to OmniGenBench or foundation models, please read the Fundamental Concepts Tutorial first. It covers language model principles, model-task mapping, genomic data types, and PlantRNA-FM basics.

Before we write any code, we must first understand the landscape of biological data and how to frame biological questions as machine learning tasks.

The Language of Life: DNA and RNA

At its core, genomics is the study of sequences. The primary types are:

  • DNA (Deoxyribonucleic acid): The blueprint of life, composed of four bases (A, T, C, G). It carries the genetic instructions for the development, functioning, growth, and reproduction of all known organisms.
  • RNA (Ribonucleic acid): Often a messenger, RNA (composed of A, U, C, G) plays a crucial role in translating the instructions from DNA into proteins. It can also have structural and regulatory roles.

These sequences are not random strings; they contain complex patterns, grammar, and syntax. Our goal is to teach a machine to read and understand this "language of life."


Framing Biological Questions as ML Tasks

A biological question must be translated into a well-defined machine learning task. Here are the most common types:

| Task Type | Biological Question | Example |
|---|---|---|
| Sequence Classification | Does this sequence have property X? | Is this sequence a promoter? (Yes/No) |
| Multi-Label Classification | Which properties apply? | Which of 100 TFs bind this sequence? |
| Sequence Regression | Predict a numerical value | Translation efficiency score |
| Token Classification | Label each base | Identify gene boundaries |
| Sequence-to-Sequence | Transform sequence | DNA → protein sequence |

The OmniGenBench Toolbox: Available Models for Every Task

OmniGenBench provides a suite of pre-configured models, each designed for a specific task. This saves you from having to build them from scratch.

| Task | OmniGenBench Model | When to Use |
|---|---|---|
| Sequence Classification | OmniModelForSequenceClassification | One label per sequence |
| Multi-Label Classification | OmniModelForMultiLabelSequenceClassification | Multiple labels per sequence (TFB prediction) |
| Sequence Regression | OmniModelForSequenceRegression | Predict a continuous value |
| Token Classification | OmniModelForTokenClassification | Label each nucleotide |
| Token Regression | OmniModelForTokenRegression | Predict numbers per nucleotide |
| Seq2Seq | OmniModelForSeq2Seq | Generate sequence outputs |

By understanding this mapping, you can quickly select the right tool for your biological problem.

Our Task: Why TFB is Multi-Label Sequence Classification

Now, let's apply this to our tutorial's goal: Transcription Factor Binding (TFB) Prediction.

  • It's a Sequence task: Our input is a DNA sequence.
  • It's a Classification task: For each transcription factor, we are asking a "Yes/No" question: does it bind?
  • It's a Multi-Label task: A single DNA sequence can be a binding site for multiple different TFs simultaneously. We aren't just picking one from a list; we are identifying all potential binders.

Therefore, the correct framing is Multi-Label Sequence Classification, and the right tool from our toolbox is OmniModelForMultiLabelSequenceClassification.

With this clear understanding, we can now proceed to prepare our data specifically for this task.


Step-by-Step Guide: Preparing the DeepSEA Dataset

Now we move from theory to practice. This section will guide you through the hands-on process of preparing the data.


Environment Setup

First, let's install the required Python packages.

pip install -U omnigenbench
Next, we import the classes we need from the package we just installed. These provide the tokenization and dataset-handling tools used throughout this tutorial.

A key part of this setup is determining the best available hardware for training. Our script will automatically prioritize a CUDA-enabled GPU if one is available, as this can accelerate training by 10-100x compared to a CPU.

from omnigenbench import (
    OmniTokenizer,
    OmniDatasetForMultiLabelClassification,
)
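
As a minimal sketch of the hardware check described above (assuming PyTorch, which omnigenbench builds on, is available in your environment), device selection can be done like this:

import torch

# Prefer a CUDA-enabled GPU when one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")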


Understanding OmniGenBench Data Templates

Before diving into data loading, it's crucial to understand how OmniGenBench expects data to be organized. This section explains the standardized data templates and directory structures that make the framework so powerful and consistent.

Standard Directory Structure

OmniGenBench follows a conventional directory structure that enables automatic data discovery and loading:

dataset_directory/
├── train.jsonl              # Training data (required)
├── valid.jsonl              # Validation data (recommended)
├── test.jsonl               # Test data (optional)
├── config.py                # Metadata file (optional)
└── README.md                # Documentation (recommended)

Note

The dataset files can also be provided in other formats such as CSV/TSV or Parquet, in which case they would be named train.csv, valid.csv, and test.csv (or with the corresponding extension). The config.py file is optional but highly recommended. It allows you to define important metadata about your dataset, such as the number of labels, label names, and sequence length. This information helps the framework understand how to process your data correctly.

Key Benefits of This Structure:

  • Automatic Discovery: Framework can find and load data without manual path specification
  • Consistent Splits: Standard train/valid/test division across all datasets
  • Metadata Integration: Additional biological annotations stored separately
  • Documentation: Self-documenting datasets for reproducible research

Data File Formats

JSONL format: each line contains a JSON object with the sequence and its annotations:

{"sequence": "ATCGATCG...", "label": [1,0,1,0,1,...], "id": "seq_001"}
{"sequence": "GCTAGCTA...", "label": [0,1,0,1,0,...], "id": "seq_002"}

CSV/TSV format: a tabular layout with columns for the sequence, labels, and metadata:

id,sequence,label,chromosome,start,end
seq_001,ATCGATCG...,"1,0,1,0,1",chr1,1000,1500
seq_002,GCTAGCTA...,"0,1,0,1,0",chr2,2000,2500
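
To make these formats concrete, here is a small hypothetical sketch that writes two toy records to train.jsonl using only the Python standard library (the sequences, labels, and path are illustrative, not real DeepSEA data):

import json

# Two toy records in the JSONL layout shown above (values are made up).
records = [
    {"sequence": "ATCGATCGATCG", "label": [1, 0, 1, 0, 1], "id": "seq_001"},
    {"sequence": "GCTAGCTAGCTA", "label": [0, 1, 0, 1, 0], "id": "seq_002"},
]

with open("train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")  # one JSON object per line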

Label Format for Multi-Label Classification

For TFB prediction, labels represent the binding status of multiple transcription factors. In JSONL files this is a list of 0/1 values; in CSV files it is the equivalent comma-separated string:

"label": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
"1,0,1,0,1,0,0,1,0,1"

Where each position corresponds to a specific transcription factor:

  • Position 0: CTCF (1 = binds, 0 = doesn't bind)
  • Position 1: p53 (0 = doesn't bind)
  • Position 2: FOXA1 (1 = binds)
  • And so on...
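
As a quick illustration (plain Python, not an OmniGenBench API), converting between the CSV string form and the multi-hot list form is a one-liner in each direction:

# Convert the CSV label string into a multi-hot list, and back again.
label_string = "1,0,1,0,1,0,0,1,0,1"

label_vector = [int(x) for x in label_string.split(",")]  # [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
round_trip = ",".join(str(x) for x in label_vector)       # "1,0,1,0,1,0,0,1,0,1"

assert round_trip == label_string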

Configuration Files

A configuration file defining dataset metadata and parameters. The key fields are shown here as JSON; a config.py sketch follows the field list below:

{
  "task_type": "multi_label_classification",
  "sequence_type": "DNA",
  "num_labels": 919,
  "label_names": ["CTCF", "p53", "FOXA1", ...],
  "sequence_length": 1000,
  "total_samples": 440000,
  "description": "DeepSEA dataset for TF binding prediction"
}
Key Metadata Fields:

  • task_type: Defines the ML task (classification, regression, etc.)
  • sequence_type: DNA, RNA, or Protein
  • num_labels: Number of target labels (919 for DeepSEA)
  • label_names: Human-readable names for each label
  • sequence_length: Fixed length of input sequences
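
Because the directory template names this file config.py, the same metadata can be written as a plain Python dictionary. The sketch below mirrors the JSON above; the variable name and the exact fields the framework reads are assumptions for illustration:

# config.py -- dataset metadata as a Python dictionary (illustrative sketch)
config = {
    "task_type": "multi_label_classification",
    "sequence_type": "DNA",
    "num_labels": 919,
    "label_names": ["CTCF", "p53", "FOXA1"],  # truncated here; the full list has 919 entries
    "sequence_length": 1000,
    "total_samples": 440000,
    "description": "DeepSEA dataset for TF binding prediction",
}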

Loading Strategy Options

OmniGenBench provides multiple loading strategies based on your data format:

# Option 1: load a pre-packaged dataset from the OmniGenBench hub
dataset = OmniDatasetForMultiLabelClassification.from_hub(
    "deepsea_tfb_prediction",
    tokenizer=tokenizer
)

# Option 2: load from a local directory that follows the standard structure
dataset = OmniDatasetForMultiLabelClassification(
    "./my_tfb_dataset/",
    tokenizer=tokenizer
)

# Option 3: load from a single tabular file, specifying the relevant columns
dataset = OmniDatasetForMultiLabelClassification(
    "tfb_data.csv",
    tokenizer=tokenizer,
    sequence_column="sequence",
    label_column="label"
)

Best Practices

  • Consistent Naming: Use descriptive, consistent file and column names
  • Balanced Splits: Ensure train/valid/test splits are representative
  • Quality Control: Validate data integrity before training (see the sketch after this list)
  • Documentation: Include README with dataset description and usage
  • Biological Context: Preserve important biological metadata
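
To illustrate the Quality Control point above, here is a hypothetical integrity check for a JSONL split; the path and the helper itself are examples, not part of OmniGenBench:

import json

def validate_jsonl(path, num_labels=919):
    """Basic integrity checks for one JSONL split (illustrative sketch)."""
    valid_bases = set("ACGTN")
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)
            assert set(record["sequence"].upper()) <= valid_bases, \
                f"Line {line_number}: unexpected characters in sequence"
            assert len(record["label"]) == num_labels, \
                f"Line {line_number}: expected {num_labels} labels, got {len(record['label'])}"

validate_jsonl("./my_tfb_dataset/train.jsonl")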

With this foundational understanding, let's now proceed to configure our specific dataset parameters.

config_or_model = "yangheng/OmniGenome-52M"
dataset_name = "deepsea_tfb_prediction"

# Load the tokenizer for the chosen foundation model; the datasets are loaded below
print("🔄 Loading tokenizer...")
tokenizer = OmniTokenizer.from_pretrained(config_or_model)
print(f"✅ Tokenizer loaded: {config_or_model}")

Data Acquisition

With our environment configured, it's time to download the DeepSEA dataset. The from_hub call below automates this process by:

  1. Checking if the data already exists.
  2. Downloading the dataset from the hub if needed.
  3. Extracting the files.
  4. Cleaning up the temporary zip file.

This ensures we have the train.jsonl, valid.jsonl, and test.jsonl files ready for the next stage.

print("📊 Loading DeepSEA TFB dataset...")
datasets = OmniDatasetForMultiLabelClassification.from_hub(
    dataset_name_or_path=dataset_name,
    tokenizer=tokenizer,
    max_length=512,
    max_examples=1000,  # For quick testing; set to None to use the full dataset
    force_padding=False  # The sequence length is fixed, so sequences and labels need no extra padding
)

print("📝 Data loading completed! Using  OmniDataset framework.")
print(f"📊 Loaded datasets: {list(datasets.keys())}")
for split, dataset in datasets.items():
    print(f"  - {split}: {len(dataset)} samples")

# Demonstrate dataset functionality - enhanced version
print("🧪 Dataset Analysis and Validation")
print("=" * 40)

# Sample a few examples from the training set
sample_size = 3
train_samples = [datasets['train'][i] for i in range(min(sample_size, len(datasets['train'])))]

for i, sample in enumerate(train_samples):
    print(f"\n📋 Sample {i+1}:")
    print(f"   🧬 Input shape: {sample['input_ids'].shape}")
    print(f"   🏷️ Labels shape: {sample['labels'].shape}")
    print(f"   📊 Positive labels: {sample['labels'].sum().item()}/{len(sample['labels'])} TFs")

    # Show first few nucleotides
    sequence_ids = sample['input_ids'][:20].tolist()
    print(f"   🔤 First 20 tokens: {sequence_ids}")

print(f"\n✅ Dataset validation completed!")
print(f"   📊 All samples have consistent shapes")
print(f"   🧬 Ready for multi-label TFB prediction")
print(f"   🎯 {datasets['train'][0]['labels'].shape[0]} TF labels per sequence")

Custom Dataset and Data Loaders

Now that we have the data files, we need a way to load them into our model efficiently. We'll do this in two parts:

For most classification and regression tasks, ready-made dataset classes are already integrated into OmniGenBench, e.g.:

| Dataset Class | Task Type | Description |
|---|---|---|
| OmniDatasetForSequenceClassification | Sequence Classification | Tasks for classifying the entire sequence into one category (e.g., promoter vs. non-promoter). |
| OmniDatasetForMultiLabelClassification | Multi-Label Classification | Tasks for predicting multiple labels for a sequence (e.g., TF binding prediction). |
| OmniDatasetForSequenceRegression | Sequence Regression | Tasks for predicting a single continuous value for the entire sequence (e.g., translation efficiency). |
| OmniDatasetForTokenClassification | Token (Base) Classification | Tasks for assigning a label to each token (base) in the sequence (e.g., identifying splice sites). |
| OmniDatasetForTokenRegression | Token (Base) Regression | Tasks for predicting a continuous value for each token (base) in the sequence. |

To demonstrate how to customize a dataset, we define a custom dataset class for the DeepSEA data in the following cell.

The DeepSEADataset Class

This custom class acts as a bridge between our raw .jsonl files and the PyTorch ecosystem. It inherits from OmniDataset and tells our framework how to process each data entry. Specifically, it:

  1. Processes a DNA sequence and its labels via prepare_input().
  2. Truncates or pads the sequence to a fixed length (MAX_LENGTH). This is crucial because language models require inputs of a consistent size.
  3. Selects the specific TF labels we want to train on.
  4. Tokenizes the sequence, converting the string of "A, C, G, T" into numerical tokens that the model can understand.
  5. Formats the output as PyTorch tensors.
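Below is a minimal sketch of such a class. It assumes that OmniDataset can be imported from omnigenbench and that prepare_input(instance) is the hook the framework calls for each record; check the OmniDataset base class for the authoritative interface.

import torch
from omnigenbench import OmniDataset

MAX_LENGTH = 512  # fixed input length assumed for this sketch


class DeepSEADataset(OmniDataset):
    """Turns one DeepSEA .jsonl record into model-ready tensors (illustrative sketch)."""

    def prepare_input(self, instance, **kwargs):
        sequence = instance["sequence"]
        labels = instance["label"]

        # Tokenize the A/C/G/T string and truncate or pad it to a fixed length.
        tokenized = self.tokenizer(
            sequence,
            max_length=MAX_LENGTH,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        inputs = {k: v.squeeze(0) for k, v in tokenized.items()}

        # Keep the TF labels we train on (here: all of them) as a float tensor,
        # the format expected for multi-label classification.
        inputs["labels"] = torch.tensor(labels, dtype=torch.float32)
        return inputs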

Creating DataLoaders

Once the DeepSEADataset is defined, we use PyTorch's DataLoader to create an efficient pipeline. The DataLoader is responsible for:

  1. Batching: Grouping individual samples into batches (BATCH_SIZE).
  2. Shuffling: Randomly shuffling the training data each epoch to improve generalization.
  3. Parallelism: Loading data in the background so it's ready for the model when needed.
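A minimal sketch, assuming each split returned by from_hub above behaves as a standard PyTorch Dataset and using an illustrative batch size of 8:

from torch.utils.data import DataLoader

BATCH_SIZE = 8  # illustrative; tune to your hardware

# Shuffle only the training split; keep validation and test order fixed.
train_loader = DataLoader(datasets["train"], batch_size=BATCH_SIZE, shuffle=True, num_workers=2)
valid_loader = DataLoader(datasets["valid"], batch_size=BATCH_SIZE, shuffle=False, num_workers=2)
test_loader = DataLoader(datasets["test"], batch_size=BATCH_SIZE, shuffle=False, num_workers=2)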
    
print("📝 Data loading completed! Using enhanced OmniDataset framework.")
print(f"📊 Loaded datasets: {list(datasets.keys())}")
for split, dataset in datasets.items():
    print(f"  - {split}: {len(dataset)} samples")

The datasets are now ready for training! Each dataset contains:

  • Tokenized DNA sequences
  • Multi-hot encoded labels for 919 TF binding sites
  • Proper batching and data loading handled automatically

Enhanced Data Pipeline

With the enhanced OmniGenBench framework, DataLoaders are automatically created and optimized. The framework handles all the complexities:

  1. Automatic Batching: Optimal batch sizes for your hardware configuration
  2. Intelligent Shuffling: Proper data shuffling for better training dynamics
  3. Memory Optimization: Efficient memory usage for large genomic datasets
  4. Multi-Label Handling: Proper formatting for multi-label classification tasks
  5. Hardware Acceleration: Automatic GPU optimization when available

This means you can focus on the biological insights rather than the technical implementation details.

print("\n🎉 Enhanced data pipeline ready!")
print("The datasets are now fully prepared and optimized for training.")
print("DataLoaders will be automatically created during the training process.")

# Let's verify our data is properly formatted
sample_data = datasets['train'][0]
print(f"\n📋 Sample data structure:")
print(f"  - Input IDs shape: {sample_data['input_ids'].shape}")
print(f"  - Labels shape: {sample_data['labels'].shape}")
print(f"  - Number of TF binding sites: {sample_data['labels'].sum().item()}/{len(sample_data['labels'])}")

Summary and Next Steps

Congratulations! You have successfully built a complete data preparation pipeline. You've learned not only how to code a data pipeline but also why it's structured the way it is, starting from the biological question itself.

Your data is now ready to be used for training a model.

In the next tutorial, Model Initialization, we will take these Datasets and learn how to set up a pre-trained Genomic Foundation Model for our TFB prediction task.