Machine Learning Task Classification in Genomics

Understanding how biological problems map to different machine learning (ML) task types is essential for choosing the right models, loss functions, and evaluation metrics. In genomics, four major ML task categories cover most practical applications.

Four Main Task Types

Task Type	Description	Biological Example	Output Format
Sequence Classification	Assign one or more labels to an entire sequence	Is this DNA sequence a promoter?	Single label or label set
Sequence Regression	Predict continuous numerical values for a sequence	Predicting protein stability score (0.0–1.0)	Continuous values
Token Classification	Assign a label to each nucleotide/amino acid in the sequence	Identifying transcription factor binding sites	One label per position
Sequence-to-Sequence	Transform an input sequence into a different output sequence	RNA sequence → RNA secondary structure	Output sequence

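These categories mirror standard ML task types (the same ones used for text classification, tagging, and translation in natural language processing) but are adapted to the structured, position-dependent nature of biological sequences.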
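To make the output formats in the table concrete, the sketch below pairs one short RNA sequence with a target of each type. The sequence, label names, and numbers are invented for illustration and do not come from any real dataset.

```python
# One toy RNA input with an illustrative target for each task type.
# All values are made up; only the *shape* of each target matters here.
sequence = "GGGAAACCC"

targets = {
    # One label (or a set of labels) for the whole sequence.
    "sequence_classification": "high",  # e.g. translation efficiency: high vs low
    # One continuous value for the whole sequence.
    "sequence_regression": 12.5,  # e.g. an expression level
    # One label per nucleotide, so the list has the same length as the input.
    "token_classification": ["O", "O", "O", "O", "m6A", "O", "O", "O", "O"],
    # A new sequence, here dot-bracket structure notation.
    "sequence_to_sequence": "(((...)))",
}

for task, target in targets.items():
    print(f"{task:25s} {target!r}")
```

Only the token classification target is tied to the input length position by position; a sequence-to-sequence output may in general be shorter or longer than its input.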

Examples of Common Genomic ML Tasks

Sequence Classification Tasks

Translation efficiency: High vs Low
Transcription factor binding: Binding vs Non-binding
Promoter identification: Promoter vs Background
Subcellular localization: Nucleus vs Cytoplasm vs Mitochondria
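As a minimal sketch of how such labels might be encoded, the code below uses the subcellular localization example. It assumes a fixed class list and hypothetical helper names (`encode_single_label`, `encode_multi_label`); a single-label target becomes a class index, while a label set becomes a multi-hot vector.

```python
# Assumed, fixed class order for the localization example.
LOCALIZATIONS = ["nucleus", "cytoplasm", "mitochondria"]

def encode_single_label(label):
    """Return the class index for a single-label (multi-class) target."""
    return LOCALIZATIONS.index(label)

def encode_multi_label(labels):
    """Return a multi-hot vector for a target that may carry several labels."""
    return [1 if name in labels else 0 for name in LOCALIZATIONS]

print(encode_single_label("cytoplasm"))               # 1
print(encode_multi_label({"nucleus", "cytoplasm"}))   # [1, 1, 0]
```

The multi-hot form is what distinguishes a label-set output from a single-label one: several entries can be 1 at the same time.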

Sequence Regression Tasks

Gene expression levels (e.g., FPKM prediction)
Protein stability (ΔΔG value prediction)
Binding affinity (Kd prediction)
Enzyme activity (kcat prediction)
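What these tasks share is a continuous target, so predictions are compared to measurements by distance rather than by exact match. The following sketch computes mean squared error, one common choice, on invented numbers; the function name and the values are illustrative assumptions.

```python
def mean_squared_error(measured, predicted):
    """Average squared difference between measured and predicted values."""
    if len(measured) != len(predicted):
        raise ValueError("measured and predicted must have the same length")
    return sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)

# Made-up values standing in for, say, three stability measurements.
measured = [1.0, 2.0, 3.0]
predicted = [1.5, 2.0, 2.5]
print(mean_squared_error(measured, predicted))
```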

Token Classification Tasks

Transcription factor binding sites (bound vs unbound label per nucleotide)
Splice site boundary prediction
RNA modification sites (e.g., m6A position identification)
Protein secondary structure (α-helix, β-sheet, random coil)
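Annotations for these tasks often arrive as intervals rather than per-position labels. The sketch below, under the assumption of 0-based half-open intervals and a hypothetical helper `intervals_to_labels`, expands them into one label per position; the sequence and the interval are invented.

```python
def intervals_to_labels(seq_length, sites, positive="SITE", negative="O"):
    """Expand annotated (start, end) intervals into one label per position.

    Intervals are 0-based and half-open: start is included, end is excluded.
    """
    labels = [negative] * seq_length
    for start, end in sites:
        if not 0 <= start < end <= seq_length:
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        for i in range(start, end):
            labels[i] = positive
    return labels

sequence = "ACGTTGACGTCA"
labels = intervals_to_labels(len(sequence), [(4, 8)])
for base, label in zip(sequence, labels):
    print(base, label)
```

The resulting list always has exactly as many labels as the sequence has positions, which is the defining property of a token classification target.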

Sequence-to-Sequence Tasks

RNA secondary structure prediction: Sequence → Structure notation
Protein design: Function description → Amino acid sequence
Sequence optimization: Wild-type → Optimized sequence
Reverse translation: Protein sequence → DNA codon sequence
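To show what an input/output pair looks like for such a task, here is a rule-based sketch of reverse translation. It is a lookup table, not a learned model, and the one-codon-per-residue table `PREFERRED_CODON` is a hypothetical simplification covering only a few residues.

```python
# Hypothetical preference table: one fixed codon per amino acid ("*" = stop).
PREFERRED_CODON = {"M": "ATG", "K": "AAA", "L": "CTG", "S": "AGC", "*": "TAA"}

def reverse_translate(protein):
    """Map each residue to its listed codon and join the codons into DNA."""
    try:
        return "".join(PREFERRED_CODON[aa] for aa in protein)
    except KeyError as err:
        raise ValueError(f"no codon listed for residue {err.args[0]!r}") from None

source = "MKLS*"
target = reverse_translate(source)
print(source, "->", target)  # MKLS* -> ATGAAACTGAGCTAA
```

The output uses a different alphabet and a different length (three nucleotides per residue) than the input, which is why this is framed as sequence-to-sequence rather than token classification.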

How to Select the Correct Task Type

1. Consider What You Want to Predict

Properties of an entire sequence
→ Sequence classification or regression
Examples: promoter vs non-promoter, expression level prediction

Properties of specific positions
→ Token classification
Examples: splice sites, binding motifs

Entirely new sequence as output
→ Sequence-to-sequence
Examples: RNA structure prediction, protein design

2. Consider the Output Type

Category labels (Yes/No, High/Low, etc.)
→ Classification
Continuous numerical values (0.1, 2.5, etc.)
→ Regression
Multiple labels for the entire sequence
→ Multi-label classification
One label per nucleotide/residue
→ Token classification
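The two questions above can be sketched as a small decision function. The function name and the string values for `scope` and `output_kind` are illustrative assumptions, not an established API.

```python
def select_task_type(scope, output_kind=None):
    """Map the answers to the two questions onto a task type.

    scope: "whole_sequence", "per_position", or "new_sequence"
    output_kind: "category", "multi_label", or "continuous"
                 (only consulted when scope is "whole_sequence")
    """
    if scope == "new_sequence":
        return "sequence-to-sequence"
    if scope == "per_position":
        return "token classification"
    if scope == "whole_sequence":
        by_output = {
            "category": "sequence classification",
            "multi_label": "multi-label sequence classification",
            "continuous": "sequence regression",
        }
        if output_kind in by_output:
            return by_output[output_kind]
    raise ValueError(f"cannot map scope={scope!r}, output_kind={output_kind!r}")

print(select_task_type("whole_sequence", "category"))  # sequence classification
print(select_task_type("per_position"))                # token classification
print(select_task_type("new_sequence"))                # sequence-to-sequence
```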

Summary

By classifying biological problems into one of these four ML task types, you can systematically decide:

Which model architecture is suitable
What loss function to use
Which evaluation metrics matter
How to structure your dataset
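As a rough illustration of these decisions, the table below lists widely used starting points for each task type. The specific losses and metrics are assumptions based on common practice, not recommendations made by this document, and the right choice depends on the dataset.

```python
# Commonly used defaults per task type (assumed starting points, not rules).
TASK_DEFAULTS = {
    "sequence classification": {
        "loss": "cross-entropy",
        "metrics": ["accuracy", "F1", "AUROC"],
        "target": "one label (or label set) per sequence",
    },
    "sequence regression": {
        "loss": "mean squared error",
        "metrics": ["Pearson r", "Spearman rho", "R^2"],
        "target": "one continuous value per sequence",
    },
    "token classification": {
        "loss": "cross-entropy averaged over positions",
        "metrics": ["per-position accuracy", "per-position F1"],
        "target": "one label per nucleotide/residue",
    },
    "sequence-to-sequence": {
        "loss": "cross-entropy over output tokens",
        "metrics": ["exact match", "task-specific similarity to the reference"],
        "target": "one output sequence per input",
    },
}

def describe(task_type):
    """Return a one-line summary of the assumed defaults for a task type."""
    d = TASK_DEFAULTS[task_type]
    return f"{task_type}: loss={d['loss']}; metrics={', '.join(d['metrics'])}"

for name in TASK_DEFAULTS:
    print(describe(name))
```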

This task-aware approach supports more efficient experimentation and clearer scientific interpretation, because the model, loss, metrics, and data format all follow from a single up-front decision.