✨ TL;DR
MS-RCGR (Multi-Scale Reversible Chaos Game Representation) converts biological sequences (DNA or protein) into multi-resolution geometric images without losing information, so they can be classified with traditional ML, computer vision, or hybrid approaches. The authors report consistent performance gains across these paradigms, with the best results when MS-RCGR features are combined with protein language models.
Biological sequence classification faces a fundamental challenge in balancing performance with interpretability. Traditional sequence encoding methods often lose information during transformation or fail to capture patterns at multiple scales. Existing approaches typically operate within a single analytical paradigm (traditional machine learning with hand-crafted features, deep learning on raw sequences, or computer vision on sequence representations), which limits their flexibility and can miss complementary insights. A unified framework is needed that preserves complete sequence information while enabling diverse analytical approaches and providing interpretable representations.
The paper introduces Multi-Scale Reversible Chaos Game Representation (MS-RCGR), which transforms biological sequences into multi-resolution geometric representations using rational arithmetic and hierarchical k-mer decomposition. The method generates scale-invariant features through Chaos Game Representation while guaranteeing complete reversibility, meaning the original sequence can be perfectly reconstructed from the encoding. MS-RCGR creates geometric features at multiple scales, capturing patterns from individual residues (nucleotides or amino acids) to complex motif structures. The framework supports three distinct analytical paradigms: traditional machine learning using geometric features extracted from the CGR representation, computer vision models that treat CGR outputs as images, and hybrid approaches that combine protein language model embeddings (ESM2, ProtT5) with MS-RCGR features. This multi-paradigm design lets researchers choose the analytical approach best suited to their task.
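To make the reversibility idea concrete, here is a minimal sketch of a classic Chaos Game Representation for DNA computed with exact rational arithmetic. It is not the paper's implementation: the corner assignment, the start point (1/2, 1/2), and the function names `cgr_encode`, `cgr_decode`, and `multiscale_grids` are all assumptions chosen for illustration, and protein alphabets are not covered. Because every coordinate is an exact fraction, the final point alone is enough to recover the whole sequence, and binning the trajectory into 2^k x 2^k grids gives k-mer count images at several resolutions.

```python
from fractions import Fraction

# Assumed corner assignment for the four nucleotides (illustrative only).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
CORNER_TO_BASE = {xy: base for base, xy in CORNERS.items()}
HALF = Fraction(1, 2)
START = (HALF, HALF)  # assumed start point: centre of the unit square


def cgr_encode(seq):
    """Return the exact CGR trajectory, one rational point per symbol."""
    x, y = START
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        # Move halfway towards the corner of the current symbol.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_decode(point):
    """Recover the sequence from the final CGR point alone."""
    x, y = point
    out = []
    while (x, y) != START:
        # The quadrant of the point identifies the last symbol.
        cx, cy = int(x > HALF), int(y > HALF)
        out.append(CORNER_TO_BASE[(cx, cy)])
        # Invert the halving step exactly.
        x, y = 2 * x - cx, 2 * y - cy
    return "".join(reversed(out))


def multiscale_grids(points, ks=(1, 2, 3)):
    """Bin the trajectory into 2^k x 2^k count grids, one per scale k."""
    grids = {}
    for k in ks:
        n = 2 ** k
        grid = [[0] * n for _ in range(n)]
        # Only points that have seen at least k symbols encode a full k-mer.
        for x, y in points[k - 1:]:
            grid[int(y * n)][int(x * n)] += 1
        grids[k] = grid
    return grids


points = cgr_encode("ACGT")
print(points[-1])              # exact rational coordinates of the final point
print(cgr_decode(points[-1]))  # ACGT
```

Floating-point coordinates would lose low-order bits after a few dozen symbols, which is why this sketch uses `fractions.Fraction`; the cost is that the denominators grow by one bit per symbol.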