TFB Prediction Tutorial 1/4: From Biological Questions to Data Pipelines
Welcome to the first tutorial in our four-part series. This guide focuses on the foundational and most critical step of any computational biology project: understanding and preparing your data.
Prerequisite
If you're new to OmniGenBench or foundation models, please read the Fundamental Concepts Tutorial first. It covers language model principles, model-task mapping, genomic data types, and PlantRNA-FM basics.
Before we write any code, we must first understand the landscape of biological data and how to frame biological questions as machine learning tasks.
The Language of Life: DNA and RNA
At its core, genomics is the study of sequences. The primary types are:
- DNA (Deoxyribonucleic acid): The blueprint of life, composed of four bases (A, T, C, G). It carries the genetic instructions for the development, functioning, growth, and reproduction of all known organisms.
- RNA (Ribonucleic acid): Often a messenger, RNA (composed of A, U, C, G) plays a crucial role in translating the instructions from DNA into proteins. It can also have structural and regulatory roles.
These sequences are not random strings; they contain complex patterns, grammar, and syntax. Our goal is to teach a machine to read and understand this "language of life."
Framing Biological Questions as ML Tasks
A biological question must be translated into a well-defined machine learning task. Here are the most common types:
| Task Type | Biological Question | Example |
|---|---|---|
| Sequence Classification | Does this sequence have property X? | Is this sequence a promoter? (Yes/No) |
| Multi-Label Classification | Which properties apply? | Which of 100 TFs bind this sequence? |
| Sequence Regression | Predict a numerical value | Translation efficiency score |
| Token Classification | Label each base | Identify gene boundaries |
| Sequence-to-Sequence | Transform sequence | DNA → protein sequence |
The OmniGenBench Toolbox: Available Models for Every Task
OmniGenBench provides a suite of pre-configured models, each designed for a specific task. This saves you from having to build them from scratch.
| Task | OmniGenBench Model | When to Use |
|---|---|---|
| Sequence Classification | OmniModelForSequenceClassification | One label per sequence |
| Multi-Label Classification | OmniModelForMultiLabelSequenceClassification | Multiple labels per sequence (TFB prediction) |
| Sequence Regression | OmniModelForSequenceRegression | Predict a continuous value |
| Token Classification | OmniModelForTokenClassification | Label each nucleotide |
| Token Regression | OmniModelForTokenRegression | Predict numbers per nucleotide |
| Seq2Seq | OmniModelForSeq2Seq | Generate sequence outputs |
By understanding this mapping, you can quickly select the right tool for your biological problem.
Our Task: Why TFB is Multi-Label Sequence Classification
Now, let's apply this to our tutorial's goal: Transcription Factor Binding (TFB) Prediction.
- It's a Sequence task: Our input is a DNA sequence.
- It's a Classification task: For each transcription factor, we are asking a "Yes/No" question: does it bind?
- It's a Multi-Label task: A single DNA sequence can be a binding site for multiple different TFs simultaneously. We aren't just picking one from a list; we are identifying all potential binders.
Therefore, the correct framing is Multi-Label Sequence Classification, and the right tool from our toolbox is OmniModelForMultiLabelSequenceClassification.
With this clear understanding, we can now proceed to prepare our data specifically for this task.
Step-by-Step Guide: Preparing the DeepSEA Dataset
Now we move from theory to practice. This section will guide you through the hands-on process of preparing the data.
Environment Setup
First, let's install the required Python packages.
Next, we import the libraries we just installed. This gives us the tools for data processing, deep learning, and interacting with the operating system. A key part of this setup is determining the best available hardware for training: our script will automatically prioritize a CUDA-enabled GPU if one is available, as this can accelerate training by 10-100x compared to a CPU.
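Below is a minimal sketch of this setup. It assumes the classes used later in this tutorial can be imported from an omnigenbench package; adjust the import path to your installation.

```python
import torch
# Import path assumed; adjust to your installed OmniGenBench version.
from omnigenbench import OmniTokenizer, OmniDatasetForMultiLabelClassification

# Prefer a CUDA-enabled GPU when available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"🖥️ Using device: {device}")
```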
Understanding OmniGenBench Data Templates
Before diving into data loading, it's crucial to understand how OmniGenBench expects data to be organized. This section explains the standardized data templates and directory structures that make the framework so powerful and consistent.
Standard Directory Structure
OmniGenBench follows a conventional directory structure that enables automatic data discovery and loading:
dataset_directory/
├── train.jsonl # Training data (required)
├── valid.jsonl # Validation data (recommended)
├── test.jsonl # Test data (optional)
├── config.py # Metadata file (optional)
└── README.md # Documentation (recommended)
Note
The dataset files can also be in other formats such as CSV/TSV or Parquet, in which case they would be named train.csv, valid.csv, and test.csv (or the corresponding extension). The config.py file is optional but highly recommended: it lets you define important metadata about your dataset, such as the number of labels, label names, and sequence length. This information helps the framework understand how to process your data correctly.
Key Benefits of This Structure:
- Automatic Discovery: Framework can find and load data without manual path specification
- Consistent Splits: Standard train/valid/test division across all datasets
- Metadata Integration: Additional biological annotations stored separately
- Documentation: Self-documenting datasets for reproducible research
Data File Formats
Each line contains a JSON object with the sequence and its annotations:
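For example, one line of train.jsonl might look like the record below; the sequence and labels field names are illustrative, so check your dataset's README for the exact keys.

{"sequence": "AGCTTAGGCA...TTGACC", "labels": [1, 0, 1, 0, 0]}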
Label Format for Multi-Label Classification
For TFB prediction, the labels field is a multi-hot vector, e.g., [1, 0, 1, 0, ...], that records the binding status of many transcription factors at once. Each position corresponds to a specific transcription factor:
- Position 0: CTCF (1 = binds, 0 = doesn't bind)
- Position 1: p53 (0 = doesn't bind)
- Position 2: FOXA1 (1 = binds)
- And so on...
Configuration Files
A config.py file defining dataset metadata and parameters, for example:
{
"task_type": "multi_label_classification",
"sequence_type": "DNA",
"num_labels": 919,
"label_names": ["CTCF", "p53", "FOXA1", ...],
"sequence_length": 1000,
"total_samples": 440000,
"description": "DeepSEA dataset for TF binding prediction"
}
- task_type: Defines the ML task (classification, regression, etc.)
- sequence_type: DNA, RNA, or Protein
- num_labels: Number of target labels (919 for DeepSEA)
- label_names: Human-readable names for each label
- sequence_length: Fixed length of input sequences
Loading Strategy Options
OmniGenBench provides multiple loading strategies based on your data format:
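Here is a hedged sketch of two such routes; only the hub route appears verbatim later in this tutorial, and the local-directory route assumes from_hub also accepts a path that follows the standard layout above.

```python
# Tokenizer and dataset classes as used later in this tutorial (import path assumed).
from omnigenbench import OmniTokenizer, OmniDatasetForMultiLabelClassification

tokenizer = OmniTokenizer.from_pretrained("yangheng/OmniGenome-52M")

# Route 1: load a prepared dataset from the hub by name.
datasets = OmniDatasetForMultiLabelClassification.from_hub(
    dataset_name_or_path="deepsea_tfb_prediction",
    tokenizer=tokenizer,
    max_length=512,
)

# Route 2 (assumed behavior): point the same call at a local directory that
# follows the train.jsonl / valid.jsonl / test.jsonl layout described above.
local_datasets = OmniDatasetForMultiLabelClassification.from_hub(
    dataset_name_or_path="./dataset_directory",
    tokenizer=tokenizer,
    max_length=512,
)
```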
Best Practices
- Consistent Naming: Use descriptive, consistent file and column names
- Balanced Splits: Ensure train/valid/test splits are representative
- Quality Control: Validate data integrity before training (see the sketch after this list)
- Documentation: Include README with dataset description and usage
- Biological Context: Preserve important biological metadata
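As a minimal quality-control sketch, assuming each JSONL record uses sequence and labels keys (hypothetical field names) and the 919 DeepSEA labels:

```python
import json

def validate_jsonl(path, expected_num_labels=919, alphabet=set("ACGTN")):
    """Check that every record has a DNA-only sequence and the expected label count."""
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            record = json.loads(line)
            # Sequences should contain only DNA characters (N marks unknown bases).
            assert set(record["sequence"].upper()) <= alphabet, \
                f"Unexpected characters in sequence at line {line_number}"
            # Every record should carry one label per transcription factor.
            assert len(record["labels"]) == expected_num_labels, \
                f"Wrong label count at line {line_number}"

# Example usage (path assumed):
# validate_jsonl("dataset_directory/train.jsonl")
```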
With this foundational understanding, let's now proceed to configure our specific dataset parameters.
config_or_model = "yangheng/OmniGenome-52M"
dataset_name = "deepsea_tfb_prediction"
# Load the tokenizer; the datasets are loaded with the OmniDataset framework in the next step
print("🔄 Loading tokenizer...")
tokenizer = OmniTokenizer.from_pretrained(config_or_model)
print(f"✅ Tokenizer loaded: {config_or_model}")
Data Acquisition
With our environment configured, it's time to download the DeepSEA dataset. The loading call below automates this process by:
- Checking if the data already exists.
- Downloading the dataset from the specified URL if needed.
- Extracting the files.
- Cleaning up the temporary zip file.
This ensures we have the train.jsonl, valid.jsonl, and test.jsonl files ready for the next stage.
print("📊 Loading DeepSEA TFB dataset...")
datasets = OmniDatasetForMultiLabelClassification.from_hub(
dataset_name_or_path=dataset_name,
tokenizer=tokenizer,
max_length=512,
    max_examples=1000, # For quick testing; set to None to use the full dataset
    force_padding=False # Sequences already have a fixed length, so no padding of sequences or labels is needed
)
print("📝 Data loading completed! Using OmniDataset framework.")
print(f"📊 Loaded datasets: {list(datasets.keys())}")
for split, dataset in datasets.items():
print(f" - {split}: {len(dataset)} samples")
# Demonstrate dataset functionality - enhanced version
print("🧪 Dataset Analysis and Validation")
print("=" * 40)
# Sample a few examples from the training set
sample_size = 3
train_samples = [datasets['train'][i] for i in range(min(sample_size, len(datasets['train'])))]
for i, sample in enumerate(train_samples):
print(f"\n📋 Sample {i+1}:")
print(f" 🧬 Input shape: {sample['input_ids'].shape}")
print(f" 🏷️ Labels shape: {sample['labels'].shape}")
print(f" 📊 Positive labels: {sample['labels'].sum().item()}/{len(sample['labels'])} TFs")
# Show first few nucleotides
sequence_ids = sample['input_ids'][:20].tolist()
print(f" 🔤 First 20 tokens: {sequence_ids}")
print(f"\n✅ Dataset validation completed!")
print(f" 📊 All samples have consistent shapes")
print(f" 🧬 Ready for multi-label TFB prediction")
print(f" 🎯 {datasets['train'][0]['labels'].shape[0]} TF labels per sequence")
Custom Dataset and Data Loaders
Now that we have the data files, we need a way to load them into our model efficiently. We'll do this in two parts:
For most classification and regression tasks, dataset classes are already integrated into OmniGenBench, e.g.:
| Dataset Class | Task Type | Description |
|---|---|---|
| OmniDatasetForSequenceClassification | Sequence Classification | Tasks for classifying the entire sequence into one category (e.g., promoter vs. non-promoter). |
| OmniDatasetForMultiLabelClassification | Multi-Label Classification | Tasks for predicting multiple labels for a sequence (e.g., TF binding prediction). |
| OmniDatasetForSequenceRegression | Sequence Regression | Tasks for predicting a single continuous value for the entire sequence (e.g., translation efficiency). |
| OmniDatasetForTokenClassification | Token (Base) Classification | Tasks for assigning a label to each token (base) in the sequence (e.g., identifying splice sites). |
| OmniDatasetForTokenRegression | Token (Base) Regression | Tasks for predicting a continuous value for each token (base) in the sequence. |
To demonstrate how to customize a dataset, we define a dataset class for the DeepSEA data below.
The DeepSEADataset Class
This custom class acts as a bridge between our raw .jsonl files and the PyTorch ecosystem. It inherits from OmniDataset and tells our framework how to process each data entry. Specifically, it:
- Processes a DNA sequence and its labels via prepare_input().
- Truncates or pads the sequence to a fixed length (MAX_LENGTH). This is crucial because language models require inputs of a consistent size.
- Selects the specific TF labels we want to train on.
- Tokenizes the sequence, converting the string of "A, C, G, T" into numerical tokens that the model can understand.
- Formats the output as PyTorch tensors.
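Because the exact OmniDataset hooks can vary between versions, the following is a minimal sketch of such a class. It assumes prepare_input() receives one raw record with sequence and labels fields and that the tokenizer follows the Hugging Face calling convention.

```python
import torch
from omnigenbench import OmniDataset  # import path assumed

MAX_LENGTH = 512  # fixed input length expected by the model

class DeepSEADataset(OmniDataset):
    def prepare_input(self, instance, **kwargs):
        sequence = instance["sequence"]  # raw DNA string
        labels = instance["labels"]      # multi-hot TF label vector
        # Optionally select a subset of TF labels to train on here.

        # Tokenize, then truncate or pad to a consistent length.
        tokenized = self.tokenizer(
            sequence,
            max_length=MAX_LENGTH,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )

        return {
            "input_ids": tokenized["input_ids"].squeeze(0),
            "attention_mask": tokenized["attention_mask"].squeeze(0),
            # One float per transcription factor (1 = binds, 0 = doesn't bind).
            "labels": torch.tensor(labels, dtype=torch.float32),
        }
```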
Creating DataLoaders
Once the DeepSEADataset is defined, we use PyTorch's DataLoader to create an efficient pipeline (a minimal example follows this list). The DataLoader is responsible for:
- Batching: Grouping individual samples into batches (BATCH_SIZE).
- Shuffling: Randomly shuffling the training data each epoch to improve generalization.
- Parallelism: Loading data in the background so it's ready for the model when needed.
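A minimal sketch, assuming the datasets dictionary loaded earlier and an illustrative BATCH_SIZE; the enhanced framework described below can also create these loaders for you.

```python
from torch.utils.data import DataLoader

BATCH_SIZE = 32  # illustrative value; tune to your GPU memory

# Shuffle only the training split; keep validation and test order deterministic.
train_loader = DataLoader(datasets["train"], batch_size=BATCH_SIZE, shuffle=True, num_workers=2)
valid_loader = DataLoader(datasets["valid"], batch_size=BATCH_SIZE, shuffle=False, num_workers=2)
test_loader = DataLoader(datasets["test"], batch_size=BATCH_SIZE, shuffle=False, num_workers=2)
```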
print("📝 Data loading completed! Using enhanced OmniDataset framework.")
print(f"📊 Loaded datasets: {list(datasets.keys())}")
for split, dataset in datasets.items():
print(f" - {split}: {len(dataset)} samples")
The datasets are now ready for training! Each dataset contains:
- Tokenized DNA sequences
- Multi-hot encoded labels for 919 TF binding sites
- Proper batching and data loading handled automatically
Enhanced Data Pipeline
With the enhanced OmniGenBench framework, DataLoaders are automatically created and optimized. The framework handles all the complexities:
- Automatic Batching: Optimal batch sizes for your hardware configuration
- Intelligent Shuffling: Proper data shuffling for better training dynamics
- Memory Optimization: Efficient memory usage for large genomic datasets
- Multi-Label Handling: Proper formatting for multi-label classification tasks
- Hardware Acceleration: Automatic GPU optimization when available
This means you can focus on the biological insights rather than the technical implementation details.
print("\n🎉 Enhanced data pipeline ready!")
print("The datasets are now fully prepared and optimized for training.")
print("DataLoaders will be automatically created during the training process.")
# Let's verify our data is properly formatted
sample_data = datasets['train'][0]
print(f"\n📋 Sample data structure:")
print(f" - Input IDs shape: {sample_data['input_ids'].shape}")
print(f" - Labels shape: {sample_data['labels'].shape}")
print(f" - Number of TF binding sites: {sample_data['labels'].sum().item()}/{len(sample_data['labels'])}")
Summary and Next Steps
Congratulations! You have successfully built a complete data preparation pipeline. You've learned not only how to code a data pipeline but also why it's structured the way it is, starting from the biological question itself.
Your data is now ready to be used for training a model.
In the next tutorial, Model Initialization, we will take these Datasets and learn how to set up a pre-trained Genomic Foundation Model for our TFB prediction task.