✨ TL;DR
Sessa is a new sequence model that places attention inside a recurrent feedback path, enabling power-law memory decay instead of exponential decay or 1/length dilution. The architecture achieves superior long-context performance while remaining competitive on short sequences.
Current sequence models face fundamental trade-offs in how they handle long-range dependencies. Transformers use self-attention to retrieve from context, but when attention is diffuse (not sharply focused), each token's influence dilutes as O(1/length), making old tokens increasingly weak. State-space models like Mamba process sequences recurrently with input-dependent feedback, but their long-range sensitivity decays exponentially with lag when they cannot sustain "freeze time" over long intervals. Existing architectures are thus limited to either single-read retrieval from the past (Transformers) or single-path information propagation (SSMs). Neither approach provides both flexible selective retrieval and efficient long-range memory that decays slower than 1/length for diffuse attention patterns.
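The three decay regimes above can be made concrete with a small numeric sketch. The functions and constants below are illustrative choices, not taken from any paper; what matters is the asymptotic ordering at large lag, not the prefactors.

```python
import numpy as np

def exponential_decay(lag, rate=0.05):
    # SSM-style regime: a token's influence shrinks like e^(-rate * lag)
    # (rate is a hypothetical constant for illustration).
    return np.exp(-rate * lag)

def dilution(lag):
    # Diffuse-attention Transformer regime: influence ~ O(1/lag).
    return 1.0 / lag

def power_law(lag, beta=0.5):
    # Power-law regime: influence ~ O(lag^(-beta)) with 0 < beta < 1.
    return lag ** -beta

# At large lag, exponential decay is far below 1/lag, which in turn is
# below lag^(-beta) for beta < 1:
for lag in (10, 100, 1000, 10000):
    print(f"lag={lag:>6}: exp={exponential_decay(lag):.2e}, "
          f"1/l={dilution(lag):.2e}, l^-0.5={power_law(lag):.2e}")
```

At lag 10 000, for instance, 1/ℓ gives 10⁻⁴ while ℓ^(-0.5) gives 10⁻², two orders of magnitude more residual influence, which is the gap the power-law regime is meant to exploit.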
Sessa introduces a novel architecture that embeds attention inside a recurrent feedback path, creating a "many-path aggregation" system within each layer. This design combines the selective retrieval of attention with the recurrent processing of state-space models. The key innovation is that instead of choosing between attention-based retrieval and recurrent propagation, Sessa uses attention to modulate and enrich the recurrent state updates themselves. The architecture is designed to achieve power-law memory decay O(ℓ^(-β)) for 0 < β < 1, which is asymptotically slower than the O(1/ℓ) dilution in Transformers. The authors prove this rate is tight in explicit diffuse uniform-routing settings, where influence scales as Θ(ℓ^(-β)). Under the stated assumptions, Sessa is the only model class among those compared that can realize both flexible selective retrieval and non-decaying memory profiles.
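As a rough intuition for "attention inside a recurrent feedback path", here is a toy single-layer sketch. Everything in it (the gating form, the projection matrices `Wq`/`Wk`, attending over cached states) is a hypothetical construction assumed for illustration, not Sessa's actual update equations; the point is only that the attention read re-enters the recurrence, so each new state aggregates many paths through the past rather than a single one.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8   # state/embedding dimension (toy size)
T = 16  # sequence length
X = rng.standard_normal((T, d))

# Hypothetical query/key projections for the attention read.
Wq = rng.standard_normal((d, d)) / np.sqrt(d)
Wk = rng.standard_normal((d, d)) / np.sqrt(d)

state = np.zeros(d)
history = []  # cache of past states, attended over at every step
for t in range(T):
    x = X[t]
    if history:
        H = np.stack(history)                              # (t, d)
        scores = (Wq @ x) @ (H @ Wk.T).T / np.sqrt(d)      # (t,)
        read = softmax(scores) @ H                         # attention read over past states
    else:
        read = np.zeros(d)
    # Input-dependent decay gate in (0, 1], as in SSM-style recurrences.
    decay = 1.0 / (1.0 + np.abs(x))
    # Crucially, the attention output feeds back into the state update,
    # rather than being a one-shot read at the output.
    state = decay * state + x + read
    history.append(state.copy())

print(state.shape)
```

Because each cached state already contains earlier attention reads, information from a given token can reach the present through many interleaved attend-and-recur paths, which is the "many-path aggregation" the text describes.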