Machine Learning Task Classification in Genomics
Understanding how biological problems map to different machine learning (ML) task types is essential for choosing the right models, loss functions, and evaluation metrics. In genomics, four major ML task categories cover most practical applications.
Four Main Task Types
| Task Type | Description | Biological Example | Output Format |
|---|---|---|---|
| Sequence Classification | Assign one or more labels to an entire sequence | Is this DNA sequence a promoter? | Single label or label set |
| Sequence Regression | Predict continuous numerical values for a sequence | Predicting a protein stability score (0.0–1.0) | Continuous values |
| Token Classification | Assign a label to each nucleotide/amino acid in the sequence | Identifying transcription factor binding sites | One label per position |
| Sequence-to-Sequence | Transform an input sequence into a different output sequence | RNA sequence → RNA secondary structure | Output sequence |
These categories mirror classical ML tasks but are adapted to the structured, position-dependent nature of biological sequences.
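As a rough illustration of the Output Format column, the sketch below writes out what a prediction could look like for each task type. The sequences, labels, and values are invented for illustration and do not come from a real model.

```python
# Toy illustration of the four output formats; every value here is made up.
dna = "ATGCGTAC"

# Sequence classification: one label (or a label set) for the whole sequence.
classification_output = "promoter"

# Sequence regression: one continuous value for the whole sequence.
regression_output = 0.73

# Token classification: one label per position, same length as the input.
token_output = ["background", "background", "site", "site",
                "site", "background", "background", "background"]

# Sequence-to-sequence: a new sequence produced from the input.
rna = "GGGAAACCC"
seq2seq_output = "(((...)))"  # dot-bracket secondary structure notation

print(len(dna), len(token_output))  # one label per nucleotide
print(rna, seq2seq_output)
```

The key structural difference is where the output attaches: to the whole sequence (the first two), to each position (token classification), or to a newly generated sequence (sequence-to-sequence).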
Examples of Common Genomic ML Tasks
Sequence Classification Tasks
- Translation efficiency: High vs Low
- Transcription factor binding: Binding vs Non-binding
- Promoter identification: Promoter vs Background
- Subcellular localization: Nucleus vs Cytoplasm vs Mitochondria
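The difference between a single label and a label set shows up in how targets are encoded. The following is a minimal sketch using the localization classes from the list above; the function names `encode_single_label` and `encode_label_set` are hypothetical helpers, not part of any library.

```python
# Label vocabulary taken from the subcellular localization example.
CLASSES = ["Nucleus", "Cytoplasm", "Mitochondria"]

def encode_single_label(label):
    """Multi-class target: the index of the one correct class."""
    return CLASSES.index(label)

def encode_label_set(labels):
    """Multi-label target: a 0/1 vector with one entry per class."""
    return [1 if name in labels else 0 for name in CLASSES]

print(encode_single_label("Cytoplasm"))            # 1
print(encode_label_set({"Nucleus", "Cytoplasm"}))  # [1, 1, 0]
```

A single-label target assumes the classes are mutually exclusive, while the 0/1 vector lets one sequence carry several labels at once.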
Sequence Regression Tasks
- Gene expression levels (e.g., FPKM prediction)
- Protein stability (ΔΔG value prediction)
- Binding affinity (Kd prediction)
- Enzyme activity (kcat prediction)
Token Classification Tasks
- Transcription factor binding site probability per nucleotide
- Splice site boundary prediction
- RNA modification sites (e.g., m6A position identification)
- Protein secondary structure (α-helix, β-sheet, random coil)
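For token classification, the training target is a label sequence aligned with the input. A minimal sketch of building such a target from annotated positions might look like this; `label_positions`, the example sequence, and the annotated indices are all assumptions made for illustration.

```python
def label_positions(sequence, site_positions,
                    site_label="splice_site", other_label="other"):
    """Return one label per position; site_positions are 0-based indices."""
    sites = set(site_positions)
    return [site_label if i in sites else other_label
            for i in range(len(sequence))]

sequence = "AAGGTAAGT"                       # made-up sequence
labels = label_positions(sequence, [3, 4])   # made-up annotated positions

# Print each base next to its label to show the one-to-one alignment.
for base, label in zip(sequence, labels):
    print(base, label)
```

The invariant worth checking in any such dataset is that the label list and the sequence have the same length.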
Sequence-to-Sequence Tasks
- RNA secondary structure prediction: Sequence → Structure notation
- Protein design: Function description → Amino acid sequence
- Sequence optimization: Wild-type → Optimized sequence
- Reverse translation: Protein sequence → DNA codon sequence
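Reverse translation is a convenient toy case of a sequence-to-sequence task, because the output is a different sequence over a different alphabet. The sketch below uses a fixed lookup rather than a learned model, and picks one codon per amino acid arbitrarily from the standard genetic code; the `CODON_CHOICE` table and `reverse_translate` function are hypothetical and cover only a few residues.

```python
# One codon per amino acid, chosen arbitrarily among the valid codons.
# Only a handful of residues are included to keep the sketch short.
CODON_CHOICE = {
    "M": "ATG",
    "K": "AAA",
    "W": "TGG",
    "F": "TTT",
    "G": "GGC",
    "A": "GCC",
    "L": "CTG",
}

def reverse_translate(protein):
    """Map each residue to a codon; raises KeyError for residues not in the table."""
    return "".join(CODON_CHOICE[aa] for aa in protein)

dna = reverse_translate("MKWF")
print(dna)  # ATGAAATGGTTT
```

Because most amino acids have several valid codons, many DNA sequences encode the same protein, which is part of what makes this a modeling problem rather than a simple lookup in practice.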
How to Select the Correct Task Type
1. Ask: What do I want to predict?
- Properties of an entire sequence → Sequence classification or regression (examples: promoter vs non-promoter, expression level prediction)
- Properties of specific positions → Token classification (examples: splice sites, binding motifs)
- An entirely new sequence as output → Sequence-to-sequence (examples: RNA structure prediction, protein design)
2. Consider the Output Type
- Category labels (Yes/No, High/Low, etc.) → Classification
- Continuous numerical values (0.1, 2.5, etc.) → Regression
- Multiple labels for the entire sequence → Multi-label classification
- One label per nucleotide/residue → Token classification
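The two questions above can be combined into a small decision helper. This is only a sketch of the selection logic described in this section; the function `select_task_type` and its string arguments are hypothetical names chosen for illustration.

```python
def select_task_type(scope, output_kind=None):
    """
    scope: "sequence" (property of the whole sequence), "position"
           (property of each position), or "new_sequence" (the output
           is itself a sequence).
    output_kind: "category", "continuous", or "label_set"; only
           consulted when scope is "sequence".
    """
    if scope == "new_sequence":
        return "sequence-to-sequence"
    if scope == "position":
        return "token classification"
    if scope == "sequence":
        if output_kind == "continuous":
            return "sequence regression"
        if output_kind == "label_set":
            return "multi-label sequence classification"
        if output_kind == "category":
            return "sequence classification"
    raise ValueError(f"unrecognized combination: {scope!r}, {output_kind!r}")

print(select_task_type("sequence", "category"))    # promoter vs background
print(select_task_type("sequence", "continuous"))  # expression level
print(select_task_type("position"))                # splice sites
print(select_task_type("new_sequence"))            # RNA structure
```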
Summary
By classifying biological problems into one of these four ML task types, you can systematically decide:
- Which model architecture is suitable
- What loss function to use
- Which evaluation metrics matter
- How to structure your dataset
This task-aware approach helps make experimentation more efficient and scientific interpretation clearer.
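To make the link between task type and downstream choices concrete, the lookup below pairs each task type with a commonly used loss and a few commonly reported metrics. These pairings are typical starting points assumed for illustration, not prescriptions from this guide, and the right choice depends on the dataset; `TASK_DEFAULTS` and `describe` are hypothetical names.

```python
# Commonly used defaults per task type (assumed starting points, not rules).
TASK_DEFAULTS = {
    "sequence classification": {
        "loss": "cross-entropy",
        "metrics": ["accuracy", "F1", "AUROC"],
        "target": "one label (or label set) per sequence",
    },
    "sequence regression": {
        "loss": "mean squared error",
        "metrics": ["Pearson r", "Spearman rho", "R^2"],
        "target": "one continuous value per sequence",
    },
    "token classification": {
        "loss": "per-position cross-entropy",
        "metrics": ["per-position F1", "per-position accuracy"],
        "target": "one label per nucleotide/residue",
    },
    "sequence-to-sequence": {
        "loss": "token-level cross-entropy",
        "metrics": ["task-specific comparison to the reference output"],
        "target": "one output sequence per input sequence",
    },
}

def describe(task_type):
    """Format the default choices for a task type as a single line."""
    d = TASK_DEFAULTS[task_type]
    return f"{task_type}: loss={d['loss']}; metrics={', '.join(d['metrics'])}"

for name in TASK_DEFAULTS:
    print(describe(name))
```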