new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Apr 15

SwinJSCC: Taming Swin Transformer for Deep Joint Source-Channel Coding

As one of the key techniques to realize semantic communications, end-to-end optimized neural joint source-channel coding (JSCC) has made great progress over the past few years. A general trend in many recent works pushing the model adaptability or the application diversity of neural JSCC is based on the convolutional neural network (CNN) backbone, whose model capacity is yet limited, inherently leading to inferior system coding gain against traditional coded transmission systems. In this paper, we establish a new neural JSCC backbone that can also adapt flexibly to diverse channel conditions and transmission rates within a single model, our open-source project aims to promote the research in this field. Specifically, we show that with elaborate design, neural JSCC codec built on the emerging Swin Transformer backbone achieves superior performance than conventional neural JSCC codecs built upon CNN, while also requiring lower end-to-end processing latency. Paired with two spatial modulation modules that scale latent representations based on the channel state information and target transmission rate, our baseline SwinJSCC can further upgrade to a versatile version, which increases its capability to adapt to diverse channel conditions and rate configurations. Extensive experimental results show that our SwinJSCC achieves better or comparable performance versus the state-of-the-art engineered BPG + 5G LDPC coded transmission system with much faster end-to-end coding speed, especially for high-resolution images, in which case traditional CNN-based JSCC yet falls behind due to its limited model capacity.

  • 6 authors
·
Aug 18, 2023

Semantics-Guided Diffusion for Deep Joint Source-Channel Coding in Wireless Image Transmission

Joint source-channel coding (JSCC) offers a promising avenue for enhancing transmission efficiency by jointly incorporating source and channel statistics into the system design. A key advancement in this area is the deep joint source and channel coding (DeepJSCC) technique that designs a direct mapping of input signals to channel symbols parameterized by a neural network, which can be trained for arbitrary channel models and semantic quality metrics. This paper advances the DeepJSCC framework toward a semantics-aligned, high-fidelity transmission approach, called semantics-guided diffusion DeepJSCC (SGD-JSCC). Existing schemes that integrate diffusion models (DMs) with JSCC face challenges in transforming random generation into accurate reconstruction and adapting to varying channel conditions. SGD-JSCC incorporates two key innovations: (1) utilizing some inherent information that contributes to the semantics of an image, such as text description or edge map, to guide the diffusion denoising process; and (2) enabling seamless adaptability to varying channel conditions with the help of a semantics-guided DM for channel denoising. The DM is guided by diverse semantic information and integrates seamlessly with DeepJSCC. In a slow fading channel, SGD-JSCC dynamically adapts to the instantaneous signal-to-noise ratio (SNR) directly estimated from the channel output, thereby eliminating the need for additional pilot transmissions for channel estimation. In a fast fading channel, we introduce a training-free denoising strategy, allowing SGD-JSCC to effectively adjust to fluctuations in channel gains. Numerical results demonstrate that, guided by semantic information and leveraging the powerful DM, our method outperforms existing DeepJSCC schemes, delivering satisfactory reconstruction performance even at extremely poor channel conditions.

  • 6 authors
·
Jan 2, 2025

Codebook Features: Sparse and Discrete Interpretability for Neural Networks

Understanding neural networks is challenging in part because of the dense, continuous nature of their hidden states. We explore whether we can train neural networks to have hidden states that are sparse, discrete, and more interpretable by quantizing their continuous features into what we call codebook features. Codebook features are produced by finetuning neural networks with vector quantization bottlenecks at each layer, producing a network whose hidden features are the sum of a small number of discrete vector codes chosen from a larger codebook. Surprisingly, we find that neural networks can operate under this extreme bottleneck with only modest degradation in performance. This sparse, discrete bottleneck also provides an intuitive way of controlling neural network behavior: first, find codes that activate when the desired behavior is present, then activate those same codes during generation to elicit that behavior. We validate our approach by training codebook Transformers on several different datasets. First, we explore a finite state machine dataset with far more hidden states than neurons. In this setting, our approach overcomes the superposition problem by assigning states to distinct codes, and we find that we can make the neural network behave as if it is in a different state by activating the code for that state. Second, we train Transformer language models with up to 410M parameters on two natural language datasets. We identify codes in these models representing diverse, disentangled concepts (ranging from negative emotions to months of the year) and find that we can guide the model to generate different topics by activating the appropriate codes during inference. Overall, codebook features appear to be a promising unit of analysis and control for neural networks and interpretability. Our codebase and models are open-sourced at https://github.com/taufeeque9/codebook-features.

  • 3 authors
·
Oct 26, 2023

Text-Guided Diffusion Model-based Generative Communication for Wireless Image Transmission

Reliable image transmission over wireless channels is particularly challenging at extremely low transmission rates, where conventional compression and channel coding schemes fail to preserve adequate visual quality. To address this issue, we propose a generative communication framework based on diffusion models, which integrates joint source channel coding (JSCC) with semantic-guided reconstruction leveraging a pre-trained generative model. Unlike conventional architectures that aim to recover exact pixel values of the original image, the proposed method focuses on preserving and reconstructing semantically meaningful visual content under severely constrained rates, ensuring perceptual plausibility and faithfulness to the scene intent. Specifically, the transmitter encodes the source image via JSCC and jointly transmits it with a textual prompt over the wireless channel. At the receiver, the corrupted low-rate representation is fused with the prompt and reconstructed through a Stable Diffusion model with ControlNet, enabling high-quality visual recovery. Leveraging both generative priors and semantic guidance, the proposed framework produces perceptually convincing images even under extreme bandwidth limitations. Experimental results demonstrate that the proposed method consistently outperforms conventional coding-based schemes and deep learning baselines, achieving superior perceptual quality and robustness across various channel conditions.

  • 6 authors
·
Oct 24, 2025

ISCS: Parameter-Guided Channel Ordering and Grouping for Learned Image Compression

Prior studies in learned image compression (LIC) consistently show that only a small subset of latent channels is critical for reconstruction, while many others carry limited information. Exploiting this imbalance could improve both coding and computational efficiency, yet existing approaches often rely on costly, dataset-specific ablation tests and typically analyze channels in isolation, ignoring their interdependencies. We propose a generalizable, dataset-agnostic method to identify and organize important channels in pretrained VAE-based LIC models. Instead of brute-force empirical evaluations, our approach leverages intrinsic parameter statistics-weight variances, bias magnitudes, and pairwise correlations-to estimate channel importance. This analysis reveals a consistent organizational structure, termed the Invariant Salient Channel Space (ISCS), where Salient-Core channels capture dominant structures and Salient-Auxiliary channels provide complementary details. Building on ISCS, we introduce a deterministic channel ordering and grouping strategy that enables slice-parallel decoding, reduces redundancy, and improves bitrate efficiency. Experiments across multiple LIC architectures demonstrate that our method effectively reduces bitrate and computation while maintaining reconstruction quality, providing a practical and modular enhancement to existing learned compression frameworks.

  • 5 authors
·
Sep 20, 2025

BaRISTA: Brain Scale Informed Spatiotemporal Representation of Human Intracranial Neural Activity

Intracranial recordings have opened a unique opportunity to simultaneously measure activity across multiregional networks in the human brain. Recent works have focused on developing transformer-based neurofoundation models of such recordings that can generalize across subjects and datasets. However, these recordings exhibit highly complex spatiotemporal interactions across diverse spatial scales, from the single-channel scale to the scale of brain regions. As such, there remain critical open questions regarding how best to encode spatial information and how to design self-supervision tasks that enable the learning of brain network patterns and enhance downstream decoding performance using such high-dimensional, multiregional recordings. To allow for exploring these questions, we propose a new spatiotemporal transformer model of multiregional neural activity and a corresponding self-supervised masked latent reconstruction task, designed to enable flexibility in the spatial scale used for token encoding and masking. Applying this model on publicly available multiregional intracranial electrophysiology (iEEG) data, we demonstrate that adjusting the spatial scale for both token encoding and masked reconstruction significantly impacts downstream decoding. Further, we find that spatial encoding at larger scales than channel-level encoding, which is commonly used in existing iEEG transformer models, improves downstream decoding performance. Finally, we demonstrate that our method allows for region-level token encoding while also maintaining accurate channel-level neural reconstruction. Taken together, our modeling framework enables exploration of the spatial scales used for token encoding and masking, reveals their importance towards self-supervised pretraining of neurofoundation models of multiregional human brain activity, and enhances downstream decoding performance.

  • 3 authors
·
Dec 12, 2025

Decoding Neural Responses in Mouse Visual Cortex through a Deep Neural Network

Finding a code to unravel the population of neural responses that leads to a distinct animal behavior has been a long-standing question in the field of neuroscience. With the recent advances in machine learning, it is shown that the hierarchically Deep Neural Networks (DNNs) perform optimally in decoding unique features out of complex datasets. In this study, we utilize the power of a DNN to explore the computational principles in the mammalian brain by exploiting the Neuropixel data from Allen Brain Institute. We decode the neural responses from mouse visual cortex to predict the presented stimuli to the animal for natural (bear, trees, cheetah, etc.) and artificial (drifted gratings, orientated bars, etc.) classes. Our results indicate that neurons in mouse visual cortex encode the features of natural and artificial objects in a distinct manner, and such neural code is consistent across animals. We investigate this by applying transfer learning to train a DNN on the neural responses of a single animal and test its generalized performance across multiple animals. Within a single animal, DNN is able to decode the neural responses with as much as 100% classification accuracy. Across animals, this accuracy is reduced to 91%. This study demonstrates the potential of utilizing the DNN models as a computational framework to understand the neural coding principles in the mammalian brain.

  • 4 authors
·
Oct 26, 2019

Joint encoding of "what" and "when" predictions through error-modulated plasticity in reservoir spiking networks

The brain understands the external world through an internal model that generates predictions and refines them based on prediction errors. A complete prediction specifies what will happen, when it will happen, and with what probability, which we refer to as a "prediction object". Existing models typically capture only what and when, omit probabilities, and rely on biologically-implausible algorithms. Here we show that a single population of spiking neurons can jointly encode the prediction object through a biologically grounded learning mechanism. We implement a heterogeneous Izhikevich spiking reservoir with readouts trained by an error-modulated, attention-gated three-factor Hebbian rule and test it on a novel paradigm that controls both the timing and probability of upcoming stimuli. By integrating real-time learning of "when" with offline consolidation of "what", the model encodes the complete prediction object, firing at the correct times with magnitudes proportional to the probabilities. Critically, it rapidly adapts to changes in both stimulus timing and probability, an ability that global least-squares methods such as FORCE lack without explicit resets. During learning, the model self-organizes its readout weights into near-orthogonal subspaces for "what" and "when," showing that multiplexed encoding arises naturally from generic recurrent dynamics under local, error-gated modulation. These results challenge the view that "what" and "when" predictions require separate modules, suggesting instead that mixed selectivity within shared populations supports flexible predictive cognition. The model also predicts phase-specific neuromodulation and overlapping neural subspaces, offering a parsimonious alternative to hierarchical predictive-coding accounts.

  • 2 authors
·
Oct 16, 2025

Convergent Learning: Do different neural networks learn the same representations?

Recent success in training deep neural networks have prompted active investigation into the features learned on their intermediate layers. Such research is difficult because it requires making sense of non-linear computations performed by millions of parameters, but valuable because it increases our ability to understand current models and create improved versions of them. In this paper we investigate the extent to which neural networks exhibit what we call convergent learning, which is when the representations learned by multiple nets converge to a set of features which are either individually similar between networks or where subsets of features span similar low-dimensional spaces. We propose a specific method of probing representations: training multiple networks and then comparing and contrasting their individual, learned representations at the level of neurons or groups of neurons. We begin research into this question using three techniques to approximately align different neural networks on a feature level: a bipartite matching approach that makes one-to-one assignments between neurons, a sparse prediction approach that finds one-to-many mappings, and a spectral clustering approach that finds many-to-many mappings. This initial investigation reveals a few previously unknown properties of neural networks, and we argue that future research into the question of convergent learning will yield many more. The insights described here include (1) that some features are learned reliably in multiple networks, yet other features are not consistently learned; (2) that units learn to span low-dimensional subspaces and, while these subspaces are common to multiple networks, the specific basis vectors learned are not; (3) that the representation codes show evidence of being a mix between a local code and slightly, but not fully, distributed codes across multiple units.

  • 5 authors
·
Nov 23, 2015

Adaptive coding efficiency in recurrent cortical circuits via gain control

Sensory systems across all modalities and species exhibit adaptation to continuously changing input statistics. Individual neurons have been shown to modulate their response gains so as to maximize information transmission in different stimulus contexts. Experimental measurements have revealed additional, nuanced sensory adaptation effects including changes in response maxima and minima, tuning curve repulsion from the adapter stimulus, and stimulus-driven response decorrelation. Existing explanations of these phenomena rely on changes in inter-neuronal synaptic efficacy, which, while more flexible, are unlikely to operate as rapidly or reversibly as single neuron gain modulations. Using published V1 population adaptation data, we show that propagation of single neuron gain changes in a recurrent network is sufficient to capture the entire set of observed adaptation effects. We propose a novel adaptive efficient coding objective with which single neuron gains are modulated, maximizing the fidelity of the stimulus representation while minimizing overall activity in the network. From this objective, we analytically derive a set of gains that optimize the trade-off between preserving information about the stimulus and conserving metabolic resources. Our model generalizes well-established concepts of single neuron adaptive gain control to recurrent populations, and parsimoniously explains experimental adaptation data.

  • 4 authors
·
May 31, 2023

Neural Codecs as Biosignal Tokenizers

Neurophysiological recordings such as electroencephalography (EEG) offer accessible and minimally invasive means of estimating physiological activity for applications in healthcare, diagnostic screening, and even immersive entertainment. However, these recordings yield high-dimensional, noisy time-series data that typically require extensive pre-processing and handcrafted feature extraction to reveal meaningful information. Recently, there has been a surge of interest in applying representation learning techniques from large pre-trained (foundation) models to effectively decode and interpret biosignals. We discuss the challenges posed for incorporating such methods and introduce BioCodec, an alternative representation learning framework inspired by neural codecs to capture low-level signal characteristics in the form of discrete tokens. Pre-trained on thousands of EEG hours, BioCodec shows efficacy across multiple downstream tasks, ranging from clinical diagnostic tasks and sleep physiology to decoding speech and motor imagery, particularly in low-resource settings. Additionally, we provide a qualitative analysis of codebook usage and estimate the spatial coherence of codebook embeddings from EEG connectivity. Notably, we also document the suitability of our method to other biosignal data, i.e., electromyographic (EMG) signals. Overall, the proposed approach provides a versatile solution for biosignal tokenization that performs competitively with state-of-the-art models. The source code and model checkpoints are shared.

  • 7 authors
·
Oct 10, 2025

Learning Efficient Coding of Natural Images with Maximum Manifold Capacity Representations

The efficient coding hypothesis proposes that the response properties of sensory systems are adapted to the statistics of their inputs such that they capture maximal information about the environment, subject to biological constraints. While elegant, information theoretic properties are notoriously difficult to measure in practical settings or to employ as objective functions in optimization. This difficulty has necessitated that computational models designed to test the hypothesis employ several different information metrics ranging from approximations and lower bounds to proxy measures like reconstruction error. Recent theoretical advances have characterized a novel and ecologically relevant efficiency metric, the manifold capacity, which is the number of object categories that may be represented in a linearly separable fashion. However, calculating manifold capacity is a computationally intensive iterative procedure that until now has precluded its use as an objective. Here we outline the simplifying assumptions that allow manifold capacity to be optimized directly, yielding Maximum Manifold Capacity Representations (MMCR). The resulting method is closely related to and inspired by advances in the field of self supervised learning (SSL), and we demonstrate that MMCRs are competitive with state of the art results on standard SSL benchmarks. Empirical analyses reveal differences between MMCRs and representations learned by other SSL frameworks, and suggest a mechanism by which manifold compression gives rise to class separability. Finally we evaluate a set of SSL methods on a suite of neural predictivity benchmarks, and find MMCRs are higly competitive as models of the ventral stream.

  • 4 authors
·
Mar 6, 2023

Adaptive Estimators Show Information Compression in Deep Neural Networks

To improve how neural networks function it is crucial to understand their learning process. The information bottleneck theory of deep learning proposes that neural networks achieve good generalization by compressing their representations to disregard information that is not relevant to the task. However, empirical evidence for this theory is conflicting, as compression was only observed when networks used saturating activation functions. In contrast, networks with non-saturating activation functions achieved comparable levels of task performance but did not show compression. In this paper we developed more robust mutual information estimation techniques, that adapt to hidden activity of neural networks and produce more sensitive measurements of activations from all functions, especially unbounded functions. Using these adaptive estimation techniques, we explored compression in networks with a range of different activation functions. With two improved methods of estimation, firstly, we show that saturation of the activation function is not required for compression, and the amount of compression varies between different activation functions. We also find that there is a large amount of variation in compression between different network initializations. Secondary, we see that L2 regularization leads to significantly increased compression, while preventing overfitting. Finally, we show that only compression of the last layer is positively correlated with generalization.

  • 3 authors
·
Feb 24, 2019

pyhgf: A neural network library for predictive coding

Bayesian models of cognition have gained considerable traction in computational neuroscience and psychiatry. Their scopes are now expected to expand rapidly to artificial intelligence, providing general inference frameworks to support embodied, adaptable, and energy-efficient autonomous agents. A central theory in this domain is predictive coding, which posits that learning and behaviour are driven by hierarchical probabilistic inferences about the causes of sensory inputs. Biological realism constrains these networks to rely on simple local computations in the form of precision-weighted predictions and prediction errors. This can make this framework highly efficient, but its implementation comes with unique challenges on the software development side. Embedding such models in standard neural network libraries often becomes limiting, as these libraries' compilation and differentiation backends can force a conceptual separation between optimization algorithms and the systems being optimized. This critically departs from other biological principles such as self-monitoring, self-organisation, cellular growth and functional plasticity. In this paper, we introduce pyhgf: a Python package backed by JAX and Rust for creating, manipulating and sampling dynamic networks for predictive coding. We improve over other frameworks by enclosing the network components as transparent, modular and malleable variables in the message-passing steps. The resulting graphs can implement arbitrary computational complexities as beliefs propagation. But the transparency of core variables can also translate into inference processes that leverage self-organisation principles, and express structure learning, meta-learning or causal discovery as the consequence of network structural adaptation to surprising inputs. The code, tutorials and documentation are hosted at: https://github.com/ilabcode/pyhgf.

  • 7 authors
·
Oct 11, 2024

Recurrent Neural Network Learning of Performance and Intrinsic Population Dynamics from Sparse Neural Data

Recurrent Neural Networks (RNNs) are popular models of brain function. The typical training strategy is to adjust their input-output behavior so that it matches that of the biological circuit of interest. Even though this strategy ensures that the biological and artificial networks perform the same computational task, it does not guarantee that their internal activity dynamics match. This suggests that the trained RNNs might end up performing the task employing a different internal computational mechanism, which would make them a suboptimal model of the biological circuit. In this work, we introduce a novel training strategy that allows learning not only the input-output behavior of an RNN but also its internal network dynamics, based on sparse neural recordings. We test the proposed method by training an RNN to simultaneously reproduce internal dynamics and output signals of a physiologically-inspired neural model. Specifically, this model generates the multiphasic muscle-like activity patterns typically observed during the execution of reaching movements, based on the oscillatory activation patterns concurrently observed in the motor cortex. Remarkably, we show that the reproduction of the internal dynamics is successful even when the training algorithm relies on the activities of a small subset of neurons sampled from the biological network. Furthermore, we show that training the RNNs with this method significantly improves their generalization performance. Overall, our results suggest that the proposed method is suitable for building powerful functional RNN models, which automatically capture important computational properties of the biological circuit of interest from sparse neural recordings.

  • 2 authors
·
May 5, 2020

Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals

Invasive brain-computer interfaces have garnered significant attention due to their high performance. The current intracranial stereoElectroEncephaloGraphy (sEEG) foundation models typically build univariate representations based on a single channel. Some of them further use Transformer to model the relationship among channels. However, due to the locality and specificity of brain computation, their performance on more difficult tasks, e.g., speech decoding, which demands intricate processing in specific brain regions, is yet to be fully investigated. We hypothesize that building multi-variate representations within certain brain regions can better capture the specific neural processing. To explore this hypothesis, we collect a well-annotated Chinese word-reading sEEG dataset, targeting language-related brain networks, over 12 subjects. Leveraging this benchmark dataset, we developed the Du-IN model that can extract contextual embeddings from specific brain regions through discrete codebook-guided mask modeling. Our model achieves SOTA performance on the downstream 61-word classification task, surpassing all baseline models. Model comparison and ablation analysis reveal that our design choices, including (i) multi-variate representation by fusing channels in vSMC and STG regions and (ii) self-supervision by discrete codebook-guided mask modeling, significantly contribute to these performances. Collectively, our approach, inspired by neuroscience findings, capitalizing on multi-variate neural representation from specific brain regions, is suitable for invasive brain modeling. It marks a promising neuro-inspired AI approach in BCI.

  • 9 authors
·
May 19, 2024

Lossless Compression with Probabilistic Circuits

Despite extensive progress on image generation, common deep generative model architectures are not easily applied to lossless compression. For example, VAEs suffer from a compression cost overhead due to their latent variables. This overhead can only be partially eliminated with elaborate schemes such as bits-back coding, often resulting in poor single-sample compression rates. To overcome such problems, we establish a new class of tractable lossless compression models that permit efficient encoding and decoding: Probabilistic Circuits (PCs). These are a class of neural networks involving |p| computational units that support efficient marginalization over arbitrary subsets of the D feature dimensions, enabling efficient arithmetic coding. We derive efficient encoding and decoding schemes that both have time complexity O (log(D) cdot |p|), where a naive scheme would have linear costs in D and |p|, making the approach highly scalable. Empirically, our PC-based (de)compression algorithm runs 5-40 times faster than neural compression algorithms that achieve similar bitrates. By scaling up the traditional PC structure learning pipeline, we achieve state-of-the-art results on image datasets such as MNIST. Furthermore, PCs can be naturally integrated with existing neural compression algorithms to improve the performance of these base models on natural image datasets. Our results highlight the potential impact that non-standard learning architectures may have on neural data compression.

  • 3 authors
·
Nov 22, 2021

DeeperBrain: A Neuro-Grounded EEG Foundation Model Towards Universal BCI

Electroencephalography (EEG) foundation models hold significant promise for universal Brain-Computer Interfaces (BCIs). However, existing approaches often rely on end-to-end fine-tuning and exhibit limited efficacy under frozen-probing protocols, lacking the intrinsic universality required for broad generalization. This limitation stems from adapting general-purpose sequence architectures that overlook the biophysical and dynamical principles of neural activity. To bridge this gap, we propose DeeperBrain, a neuro-grounded foundation model integrating domain-specific inductive biases into its model design and learning objectives. Architecturally, DeeperBrain incorporates a volume conduction-aware channel encoding to model spatial mixing via 3D geometry, and a neurodynamics-aware temporal encoding capturing slow adaptations using oscillatory and exponential bases. For pretraining, we introduce a dual-objective strategy combining Masked EEG Reconstruction (MER) for local fidelity and Neurodynamics Statistics Prediction (NSP). NSP enforces alignment with macroscopic brain states by predicting interpretable order parameters, including spectral power, functional connectivity, cross-frequency coupling, and dynamic complexity. Extensive experiments demonstrate that DeeperBrain achieves state-of-the-art or highly competitive performance under end-to-end fine-tuning. Crucially, it maintains superior efficacy under a rigorous frozen-probing protocol, verifying that embedding neuroscientific first principles endows learned representations with the intrinsic universality essential for universal BCI. The code will be publicly available.

  • 6 authors
·
Jan 5

NeuroRVQ: Multi-Scale EEG Tokenization for Generative Large Brainwave Models

Electroencephalography (EEG) captures neural activity across multiple temporal and spectral scales, yielding signals that are rich but complex for representation learning. Recently, EEG foundation models trained to predict masked signal-tokens have shown promise for learning generalizable representations. However, their performance is hindered by their signal tokenization modules. Existing neural tokenizers fail to preserve high-frequency dynamics, limiting their ability to reconstruct EEG signals with high fidelity. We introduce NeuroRVQ, a scalable Large Brainwave Model (LBM) centered on a codebook-based tokenizer. Our tokenizer integrates: (i) multi-scale feature extraction modules that capture the full frequency neural spectrum; (ii) hierarchical residual vector quantization (RVQ) codebooks for high-resolution encoding; and, (iii) an EEG signal phase- and amplitude-aware loss function for efficient training. This design enables efficient EEG compression while supporting accurate reconstruction across all frequency bands, leading to robust generative masked modeling. Our empirical results demonstrate that NeuroRVQ achieves lower reconstruction error and outperforms existing LBMs on a variety of downstream tasks. More broadly, NeuroRVQ tokenizer establishes a strong prior for codebook-based general-purpose brainwave models, enabling advances in neural decoding, generative modeling and multimodal biosignal integration.

  • 7 authors
·
Oct 14, 2025

How do neurons operate on sparse distributed representations? A mathematical theory of sparsity, neurons and active dendrites

We propose a formal mathematical model for sparse representations and active dendrites in neocortex. Our model is inspired by recent experimental findings on active dendritic processing and NMDA spikes in pyramidal neurons. These experimental and modeling studies suggest that the basic unit of pattern memory in the neocortex is instantiated by small clusters of synapses operated on by localized non-linear dendritic processes. We derive a number of scaling laws that characterize the accuracy of such dendrites in detecting activation patterns in a neuronal population under adverse conditions. We introduce the union property which shows that synapses for multiple patterns can be randomly mixed together within a segment and still lead to highly accurate recognition. We describe simulation results that provide further insight into sparse representations as well as two primary results. First we show that pattern recognition by a neuron with active dendrites can be extremely accurate and robust with high dimensional sparse inputs even when using a tiny number of synapses to recognize large patterns. Second, equations representing recognition accuracy of a dendrite predict optimal NMDA spiking thresholds under a generous set of assumptions. The prediction tightly matches NMDA spiking thresholds measured in the literature. Our model matches many of the known properties of pyramidal neurons. As such the theory provides a mathematical framework for understanding the benefits and limits of sparse representations in cortical networks.

  • 2 authors
·
Jan 4, 2016

Superposition as Lossy Compression: Measure with Sparse Autoencoders and Connect to Adversarial Vulnerability

Neural networks achieve remarkable performance through superposition: encoding multiple features as overlapping directions in activation space rather than dedicating individual neurons to each feature. This challenges interpretability, yet we lack principled methods to measure superposition. We present an information-theoretic framework measuring a neural representation's effective degrees of freedom. We apply Shannon entropy to sparse autoencoder activations to compute the number of effective features as the minimum neurons needed for interference-free encoding. Equivalently, this measures how many "virtual neurons" the network simulates through superposition. When networks encode more effective features than actual neurons, they must accept interference as the price of compression. Our metric strongly correlates with ground truth in toy models, detects minimal superposition in algorithmic tasks, and reveals systematic reduction under dropout. Layer-wise patterns mirror intrinsic dimensionality studies on Pythia-70M. The metric also captures developmental dynamics, detecting sharp feature consolidation during grokking. Surprisingly, adversarial training can increase effective features while improving robustness, contradicting the hypothesis that superposition causes vulnerability. Instead, the effect depends on task complexity and network capacity: simple tasks with ample capacity allow feature expansion (abundance regime), while complex tasks or limited capacity force reduction (scarcity regime). By defining superposition as lossy compression, this work enables principled measurement of how neural networks organize information under computational constraints, connecting superposition to adversarial robustness.

  • 4 authors
·
Dec 15, 2025

Dual-Representation Image Compression at Ultra-Low Bitrates via Explicit Semantics and Implicit Textures

While recent neural codecs achieve strong performance at low bitrates when optimized for perceptual quality, their effectiveness deteriorates significantly under ultra-low bitrate conditions. To mitigate this, generative compression methods leveraging semantic priors from pretrained models have emerged as a promising paradigm. However, existing approaches are fundamentally constrained by a tradeoff between semantic faithfulness and perceptual realism. Methods based on explicit representations preserve content structure but often lack fine-grained textures, whereas implicit methods can synthesize visually plausible details at the cost of semantic drift. In this work, we propose a unified framework that bridges this gap by coherently integrating explicit and implicit representations in a training-free manner. Specifically, We condition a diffusion model on explicit high-level semantics while employing reverse-channel coding to implicitly convey fine-grained details. Moreover, we introduce a plug-in encoder that enables flexible control of the distortion-perception tradeoff by modulating the implicit information. Extensive experiments demonstrate that the proposed framework achieves state-of-the-art rate-perception performance, outperforming existing methods and surpassing DiffC by 29.92%, 19.33%, and 20.89% in DISTS BD-Rate on the Kodak, DIV2K, and CLIC2020 datasets, respectively.

  • 6 authors
·
Feb 4

MLICv2: Enhanced Multi-Reference Entropy Modeling for Learned Image Compression

Recent advancements in learned image compression (LIC) have yielded impressive performance gains. Notably, the learned image compression models with multi-reference entropy models (MLIC series) have significantly outperformed existing traditional image codecs such as the Versatile Video Coding (VVC) Intra. In this paper, we present MLICv2 and MLICv2^+, enhanced versions of the MLIC series, featuring improved transform techniques, entropy modeling, and instance adaptability. For better transform, we introduce a simple token mixing transform block inspired by the meta transformer architecture, addressing the performance degradation at high bit-rates observed in previous MLIC series while maintaining computational efficiency. To enhance entropy modeling, we propose a hyperprior-guided global correlation prediction, enabling the capture of global contexts in the initial slice of the latent representation. We also develop a channel reweighting module to dynamically prioritize important channels within each context. Additionally, advanced positional embedding for context modeling and selective compression with guided optimization are investigated. To boost instance adaptability, we employ stochastic Gumbel annealing to iteratively refine the latent representation according to the rate-distortion optimization of a specific input image. This approach further enhances performance without impacting decoding speed. Experimental results demonstrate that our MLICv2 and MLICv2^+ achieve state-of-the-art performance, reducing Bjontegaard-Delta rate (BD-rate) by 16.54%, 21.61%, 16.05% and 20.46%, 24.35%, 19.14% respectively, compared to VTM-17.0 Intra on the Kodak, Tecnick, CLIC Pro Val dataset, respectively.

  • 5 authors
·
Apr 27, 2025

RECOMBINER: Robust and Enhanced Compression with Bayesian Implicit Neural Representations

COMpression with Bayesian Implicit NEural Representations (COMBINER) is a recent data compression method that addresses a key inefficiency of previous Implicit Neural Representation (INR)-based approaches: it avoids quantization and enables direct optimization of the rate-distortion performance. However, COMBINER still has significant limitations: 1) it uses factorized priors and posterior approximations that lack flexibility; 2) it cannot effectively adapt to local deviations from global patterns in the data; and 3) its performance can be susceptible to modeling choices and the variational parameters' initializations. Our proposed method, Robust and Enhanced COMBINER (RECOMBINER), addresses these issues by 1) enriching the variational approximation while retaining a low computational cost via a linear reparameterization of the INR weights, 2) augmenting our INRs with learnable positional encodings that enable them to adapt to local details and 3) splitting high-resolution data into patches to increase robustness and utilizing expressive hierarchical priors to capture dependency across patches. We conduct extensive experiments across several data modalities, showcasing that RECOMBINER achieves competitive results with the best INR-based methods and even outperforms autoencoder-based codecs on low-resolution images at low bitrates. Our PyTorch implementation is available at https://github.com/cambridge-mlg/RECOMBINER/.

  • 4 authors
·
Sep 29, 2023

Perforated Backpropagation: A Neuroscience Inspired Extension to Artificial Neural Networks

The neurons of artificial neural networks were originally invented when much less was known about biological neurons than is known today. Our work explores a modification to the core neuron unit to make it more parallel to a biological neuron. The modification is made with the knowledge that biological dendrites are not simply passive activation funnels, but also compute complex non-linear functions as they transmit activation to the cell body. The paper explores a novel system of "Perforated" backpropagation empowering the artificial neurons of deep neural networks to achieve better performance coding for the same features they coded for in the original architecture. After an initial network training phase, additional "Dendrite Nodes" are added to the network and separately trained with a different objective: to correlate their output with the remaining error of the original neurons. The trained Dendrite Nodes are then frozen, and the original neurons are further trained, now taking into account the additional error signals provided by the Dendrite Nodes. The cycle of training the original neurons and then adding and training Dendrite Nodes can be repeated several times until satisfactory performance is achieved. Our algorithm was successfully added to modern state-of-the-art PyTorch networks across multiple domains, improving upon original accuracies and allowing for significant model compression without a loss in accuracy.

  • 2 authors
·
Jan 29, 2025

NERV++: An Enhanced Implicit Neural Video Representation

Neural fields, also known as implicit neural representations (INRs), have shown a remarkable capability of representing, generating, and manipulating various data types, allowing for continuous data reconstruction at a low memory footprint. Though promising, INRs applied to video compression still need to improve their rate-distortion performance by a large margin, and require a huge number of parameters and long training iterations to capture high-frequency details, limiting their wider applicability. Resolving this problem remains a quite challenging task, which would make INRs more accessible in compression tasks. We take a step towards resolving these shortcomings by introducing neural representations for videos NeRV++, an enhanced implicit neural video representation, as more straightforward yet effective enhancement over the original NeRV decoder architecture, featuring separable conv2d residual blocks (SCRBs) that sandwiches the upsampling block (UB), and a bilinear interpolation skip layer for improved feature representation. NeRV++ allows videos to be directly represented as a function approximated by a neural network, and significantly enhance the representation capacity beyond current INR-based video codecs. We evaluate our method on UVG, MCL JVC, and Bunny datasets, achieving competitive results for video compression with INRs. This achievement narrows the gap to autoencoder-based video coding, marking a significant stride in INR-based video compression research.

  • 3 authors
·
Feb 28, 2024

Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns

Large language models deployed in the wild must adapt to evolving data, user behavior, and task mixtures without erasing previously acquired capabilities. In practice, this remains difficult: sequential updates induce catastrophic forgetting, while many stabilization methods rely on external procedures that are costly, brittle, or difficult to scale. We present TRC^{2} (Thalamically Routed Cortical Columns), a decoder-only architecture that makes continual adaptation a property of the backbone itself. TRC^{2} combines stacked cortical columns with a thalamic modulatory pathway for selective inter-column communication and a hippocampal pathway for event-selective retrieval, delayed surprise-based writing, and replay-driven consolidation. This design localizes fast plasticity while preserving a slower stable computation pathway. We further introduce a causal memory-update scheme and an online replay controller that adjusts consolidation strength from measured forgetting. Across a task-sequential language-modeling stream over C4, WikiText-103, and GSM8K, TRC^{2} consistently improves task-boundary modeling quality and substantially reduces cumulative forgetting relative to Transformer, Mamba, MoE, and DeepSeek baselines trained under the same pipeline. Ablations show that the thalamic and hippocampal components are central to the retention gains, while the full model remains competitive in throughput and training cost.

  • 1 authors
·
Feb 25 2

On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking

We present a comprehensive analysis of how two-layer neural networks learn features to solve the modular addition task. Our work provides a full mechanistic interpretation of the learned model and a theoretical explanation of its training dynamics. While prior work has identified that individual neurons learn single-frequency Fourier features and phase alignment, it does not fully explain how these features combine into a global solution. We bridge this gap by formalizing a diversification condition that emerges during training when overparametrized, consisting of two parts: phase symmetry and frequency diversification. We prove that these properties allow the network to collectively approximate a flawed indicator function on the correct logic for the modular addition task. While individual neurons produce noisy signals, the phase symmetry enables a majority-voting scheme that cancels out noise, allowing the network to robustly identify the correct sum. Furthermore, we explain the emergence of these features under random initialization via a lottery ticket mechanism. Our gradient flow analysis proves that frequencies compete within each neuron, with the "winner" determined by its initial spectral magnitude and phase alignment. From a technical standpoint, we provide a rigorous characterization of the layer-wise phase coupling dynamics and formalize the competitive landscape using the ODE comparison lemma. Finally, we use these insights to demystify grokking, characterizing it as a three-stage process involving memorization followed by two generalization phases, driven by the competition between loss minimization and weight decay.

Neuro-inspired Ensemble-to-Ensemble Communication Primitives for Sparse and Efficient ANNs

The structure of biological neural circuits-modular, hierarchical, and sparsely interconnected-reflects an efficient trade-off between wiring cost, functional specialization, and robustness. These principles offer valuable insights for artificial neural network (ANN) design, especially as networks grow in depth and scale. Sparsity, in particular, has been widely explored for reducing memory and computation, improving speed, and enhancing generalization. Motivated by systems neuroscience findings, we explore how patterns of functional connectivity in the mouse visual cortex-specifically, ensemble-to-ensemble communication, can inform ANN design. We introduce G2GNet, a novel architecture that imposes sparse, modular connectivity across feedforward layers. Despite having significantly fewer parameters than fully connected models, G2GNet achieves superior accuracy on standard vision benchmarks. To our knowledge, this is the first architecture to incorporate biologically observed functional connectivity patterns as a structural bias in ANN design. We complement this static bias with a dynamic sparse training (DST) mechanism that prunes and regrows edges during training. We also propose a Hebbian-inspired rewiring rule based on activation correlations, drawing on principles of biological plasticity. G2GNet achieves up to 75% sparsity while improving accuracy by up to 4.3% on benchmarks, including Fashion-MNIST, CIFAR-10, and CIFAR-100, outperforming dense baselines with far fewer computations.

  • 3 authors
·
Aug 19, 2025

JPmHC Dynamical Isometry via Orthogonal Hyper-Connections

Recent advances in deep learning, exemplified by Hyper-Connections (HC), have expanded the residual connection paradigm by introducing wider residual streams and diverse connectivity patterns. While these innovations yield significant performance gains, they compromise the identity mapping property of residual connections, leading to training instability, limited scalability, and increased memory overhead. To address these challenges, we propose JPmHC (Jacobian-spectrum Preserving manifold-constrained Hyper-Connections), a framework that replaces identity skips with a trainable linear mixer acting on n parallel streams while explicitly controlling gradient conditioning. By constraining the mixer M on operator-norm-bounded manifolds (e.g., bistochastic, Stiefel, Grassmann), JPmHC prevents gradient pathologies and enhances stability. JPmHC introduces three key contributions: (i) a free-probability analysis that predicts Jacobian spectra for structured skips, providing actionable design rules for mixer selection; (ii) memory-efficient implicit differentiation for fixed-point projections, reducing activation memory and synchronization overhead; and (iii) a Stiefel-constrained mixer via Cayley transforms, ensuring orthogonality without post-hoc normalization. Empirical evaluations on ARC-AGI demonstrate that JPmHC achieves faster convergence, higher accuracy, and lower computational cost compared to bistochastic baselines, with a rank-p Grassmannian variant tracking between the two -- consistent with the spectral theory predictions. As a flexible and scalable extension of HC, JPmHC advances spectrum-aware, stable, and efficient deep learning, offering insights into topological architecture design and foundational model evolution. \newline \newline

  • 3 authors
·
Feb 20

NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks

We introduce NerVE, a unified eigenspectral framework for understanding how feed-forward networks (FFNs) in large language models (LLMs) organize and regulate information flow in high-dimensional latent space. Despite FFNs dominating the parameter budget, their high-dimensional dynamics remain poorly understood. NerVE addresses this gap through lightweight, memory-efficient tracking of eigenspectrum dynamics via four complementary metrics: Spectral Entropy (dispersion), Participation Ratio (effective dimensionality), Eigenvalue Early Enrichment (top-heaviness), and Jensen-Shannon divergence (distributional shifts). Our key insight is that FFN nonlinearities reinject variance across eigenmodes, fundamentally governing latent dimension utilization, and that optimizer geometry strongly modulates the extent of this variance reinjection. We validate NerVE across model scales, and diverse architectural and optimizer configurations, each uniquely shaping FFN dynamics: normalization schemes controlling variance flow; FFN weight geometries constraining latent space; positional encoding and activation functions regulating information flow; and optimizer choices redistributing effective capacity across depth. Across these settings, NerVE consistently recovers stable spectral signatures that correlate with model's generalization ability and respond predictably to design choices, generalizing beyond transformer to MLP-Mixer architectures, providing actionable insights for architectural and optimizer choices beyond trial-and-error.

HyperZcdotZcdotW Operator Connects Slow-Fast Networks for Full Context Interaction

The self-attention mechanism utilizes large implicit weight matrices, programmed through dot product-based activations with very few trainable parameters, to enable long sequence modeling. In this paper, we investigate the possibility of discarding residual learning by employing large implicit kernels to achieve full context interaction at each layer of the network. To accomplish it, we introduce coordinate-based implicit MLPs as a slow network to generate hyper-kernels for another fast convolutional network. To get context-varying weights for fast dynamic encoding, we propose a HyperZ{cdotZ{cdot}W} operator that connects hyper-kernels (W) and hidden activations (Z) through simple elementwise multiplication, followed by convolution of Z using the context-dependent W. Based on this design, we present a novel Terminator architecture that integrates hyper-kernels of different sizes to produce multi-branch hidden representations for enhancing the feature extraction capability of each layer. Additionally, a bottleneck layer is employed to compress the concatenated channels, allowing only valuable information to propagate to the subsequent layers. Notably, our model incorporates several innovative components and exhibits excellent properties, such as introducing local feedback error for updating the slow network, stable zero-mean features, faster training convergence, and fewer model parameters. Extensive experimental results on pixel-level 1D and 2D image classification benchmarks demonstrate the superior performance of our architecture.

  • 1 authors
·
Jan 31, 2024 1

TeCoNeRV: Leveraging Temporal Coherence for Compressible Neural Representations for Videos

Implicit Neural Representations (INRs) have recently demonstrated impressive performance for video compression. However, since a separate INR must be overfit for each video, scaling to high-resolution videos while maintaining encoding efficiency remains a significant challenge. Hypernetwork-based approaches predict INR weights (hyponetworks) for unseen videos at high speeds, but with low quality, large compressed size, and prohibitive memory needs at higher resolutions. We address these fundamental limitations through three key contributions: (1) an approach that decomposes the weight prediction task spatially and temporally, by breaking short video segments into patch tubelets, to reduce the pretraining memory overhead by 20times; (2) a residual-based storage scheme that captures only differences between consecutive segment representations, significantly reducing bitstream size; and (3) a temporal coherence regularization framework that encourages changes in the weight space to be correlated with video content. Our proposed method, TeCoNeRV, achieves substantial improvements of 2.47dB and 5.35dB PSNR over the baseline at 480p and 720p on UVG, with 36% lower bitrates and 1.5-3times faster encoding speeds. With our low memory usage, we are the first hypernetwork approach to demonstrate results at 480p, 720p and 1080p on UVG, HEVC and MCL-JCV. Our project page is available at https://namithap10.github.io/teconerv/ .

  • 3 authors
·
Feb 18

Learned Compression for Compressed Learning

Modern sensors produce increasingly rich streams of high-resolution data. Due to resource constraints, machine learning systems discard the vast majority of this information via resolution reduction. Compressed-domain learning allows models to operate on compact latent representations, allowing higher effective resolution for the same budget. However, existing compression systems are not ideal for compressed learning. Linear transform coding and end-to-end learned compression systems reduce bitrate, but do not uniformly reduce dimensionality; thus, they do not meaningfully increase efficiency. Generative autoencoders reduce dimensionality, but their adversarial or perceptual objectives lead to significant information loss. To address these limitations, we introduce WaLLoC (Wavelet Learned Lossy Compression), a neural codec architecture that combines linear transform coding with nonlinear dimensionality-reducing autoencoders. WaLLoC sandwiches a shallow, asymmetric autoencoder and entropy bottleneck between an invertible wavelet packet transform. Across several key metrics, WaLLoC outperforms the autoencoders used in state-of-the-art latent diffusion models. WaLLoC does not require perceptual or adversarial losses to represent high-frequency detail, providing compatibility with modalities beyond RGB images and stereo audio. WaLLoC's encoder consists almost entirely of linear operations, making it exceptionally efficient and suitable for mobile computing, remote sensing, and learning directly from compressed data. We demonstrate WaLLoC's capability for compressed-domain learning across several tasks, including image classification, colorization, document understanding, and music source separation. Our code, experiments, and pre-trained audio and image codecs are available at https://ut-sysml.org/walloc

  • 2 authors
·
Dec 12, 2024 2

Machine Perceptual Quality: Evaluating the Impact of Severe Lossy Compression on Audio and Image Models

In the field of neural data compression, the prevailing focus has been on optimizing algorithms for either classical distortion metrics, such as PSNR or SSIM, or human perceptual quality. With increasing amounts of data consumed by machines rather than humans, a new paradigm of machine-oriented compressionx2013which prioritizes the retention of features salient for machine perception over traditional human-centric criteriax2013has emerged, creating several new challenges to the development, evaluation, and deployment of systems utilizing lossy compression. In particular, it is unclear how different approaches to lossy compression will affect the performance of downstream machine perception tasks. To address this under-explored area, we evaluate various perception modelsx2013including image classification, image segmentation, speech recognition, and music source separationx2013under severe lossy compression. We utilize several popular codecs spanning conventional, neural, and generative compression architectures. Our results indicate three key findings: (1) using generative compression, it is feasible to leverage highly compressed data while incurring a negligible impact on machine perceptual quality; (2) machine perceptual quality correlates strongly with deep similarity metrics, indicating a crucial role of these metrics in the development of machine-oriented codecs; and (3) using lossy compressed datasets, (e.g. ImageNet) for pre-training can lead to counter-intuitive scenarios where lossy compression increases machine perceptual quality rather than degrading it. To encourage engagement on this growing area of research, our code and experiments are available at: https://github.com/danjacobellis/MPQ.

  • 3 authors
·
Jan 15, 2024

PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders

Neural speech codecs have recently emerged as a focal point in the fields of speech compression and generation. Despite this progress, achieving high-quality speech reconstruction under low-bitrate scenarios remains a significant challenge. In this paper, we propose PSCodec, a series of neural speech codecs based on prompt encoders, comprising PSCodec-Base, PSCodec-DRL-ICT, and PSCodec-CasAN, which are capable of delivering high-performance speech reconstruction with low bandwidths. Specifically, we first introduce PSCodec-Base, which leverages a pretrained speaker verification model-based prompt encoder (VPP-Enc) and a learnable Mel-spectrogram-based prompt encoder (MelP-Enc) to effectively disentangle and integrate voiceprint and Mel-related features in utterances. To further enhance feature utilization efficiency, we propose PSCodec-DRL-ICT, incorporating a structural similarity (SSIM) based disentangled representation loss (DRL) and an incremental continuous training (ICT) strategy. While PSCodec-DRL-ICT demonstrates impressive performance, its reliance on extensive hyperparameter tuning and multi-stage training makes it somewhat labor-intensive. To circumvent these limitations, we propose PSCodec-CasAN, utilizing an advanced cascaded attention network (CasAN) to enhance representational capacity of the entire system. Extensive experiments show that our proposed PSCodec-Base, PSCodec-DRL-ICT, and PSCodec-CasAN all significantly outperform several state-of-the-art neural codecs, exhibiting substantial improvements in both speech reconstruction quality and speaker similarity under low-bitrate conditions.

  • 9 authors
·
Apr 3, 2024

Learning dynamic representations of the functional connectome in neurobiological networks

The static synaptic connectivity of neuronal circuits stands in direct contrast to the dynamics of their function. As in changing community interactions, different neurons can participate actively in various combinations to effect behaviors at different times. We introduce an unsupervised approach to learn the dynamic affinities between neurons in live, behaving animals, and to reveal which communities form among neurons at different times. The inference occurs in two major steps. First, pairwise non-linear affinities between neuronal traces from brain-wide calcium activity are organized by non-negative tensor factorization (NTF). Each factor specifies which groups of neurons are most likely interacting for an inferred interval in time, and for which animals. Finally, a generative model that allows for weighted community detection is applied to the functional motifs produced by NTF to reveal a dynamic functional connectome. Since time codes the different experimental variables (e.g., application of chemical stimuli), this provides an atlas of neural motifs active during separate stages of an experiment (e.g., stimulus application or spontaneous behaviors). Results from our analysis are experimentally validated, confirming that our method is able to robustly predict causal interactions between neurons to generate behavior. Code is available at https://github.com/dyballa/dynamic-connectomes.

  • 5 authors
·
Feb 21, 2024

Pointer Networks

We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. Such problems cannot be trivially addressed by existent approaches such as sequence-to-sequence and Neural Turing Machines, because the number of target classes in each step of the output depends on the length of the input, which is variable. Problems such as sorting variable sized sequences, and various combinatorial optimization problems belong to this class. Our model solves the problem of variable size output dictionaries using a recently proposed mechanism of neural attention. It differs from the previous attention attempts in that, instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output. We call this architecture a Pointer Net (Ptr-Net). We show Ptr-Nets can be used to learn approximate solutions to three challenging geometric problems -- finding planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem -- using training examples alone. Ptr-Nets not only improve over sequence-to-sequence with input attention, but also allow us to generalize to variable size output dictionaries. We show that the learnt models generalize beyond the maximum lengths they were trained on. We hope our results on these tasks will encourage a broader exploration of neural learning for discrete problems.

  • 3 authors
·
Jun 9, 2015

LMUFormer: Low Complexity Yet Powerful Spiking Model With Legendre Memory Units

Transformer models have demonstrated high accuracy in numerous applications but have high complexity and lack sequential processing capability making them ill-suited for many streaming applications at the edge where devices are heavily resource-constrained. Thus motivated, many researchers have proposed reformulating the transformer models as RNN modules which modify the self-attention computation with explicit states. However, these approaches often incur significant performance degradation. The ultimate goal is to develop a model that has the following properties: parallel training, streaming and low-cost inference, and SOTA performance. In this paper, we propose a new direction to achieve this goal. We show how architectural modifications to a recurrent model can help push its performance toward Transformer models while retaining its sequential processing capability. Specifically, inspired by the recent success of Legendre Memory Units (LMU) in sequence learning tasks, we propose LMUFormer, which augments the LMU with convolutional patch embedding and convolutional channel mixer. Moreover, we present a spiking version of this architecture, which introduces the benefit of states within the patch embedding and channel mixer modules while simultaneously reducing the computing complexity. We evaluated our architectures on multiple sequence datasets. In comparison to SOTA transformer-based models within the ANN domain on the SCv2 dataset, our LMUFormer demonstrates comparable performance while necessitating a remarkable 53 times reduction in parameters and a substantial 65 times decrement in FLOPs. Additionally, owing to our model's proficiency in real-time data processing, we can achieve a 32.03% reduction in sequence length, all while incurring an inconsequential decline in performance. Our code is publicly available at https://github.com/zeyuliu1037/LMUFormer.git.

  • 4 authors
·
Jan 19, 2024

Exploring the encoding of linguistic representations in the Fully-Connected Layer of generative CNNs for Speech

Interpretability work on the convolutional layers of CNNs has primarily focused on computer vision, but some studies also explore correspondences between the latent space and the output in the audio domain. However, it has not been thoroughly examined how acoustic and linguistic information is represented in the fully connected (FC) layer that bridges the latent space and convolutional layers. The current study presents the first exploration of how the FC layer of CNNs for speech synthesis encodes linguistically relevant information. We propose two techniques for exploration of the fully connected layer. In Experiment 1, we use weight matrices as inputs into convolutional layers. In Experiment 2, we manipulate the FC layer to explore how symbolic-like representations are encoded in CNNs. We leverage the fact that the FC layer outputs a feature map and that variable-specific weight matrices are temporally structured to (1) demonstrate how the distribution of learned weights varies between latent variables in systematic ways and (2) demonstrate how manipulating the FC layer while holding constant subsequent model parameters affects the output. We ultimately present an FC manipulation that can output a single segment. Using this technique, we show that lexically specific latent codes in generative CNNs (ciwGAN) have shared lexically invariant sublexical representations in the FC-layer weights, showing that ciwGAN encodes lexical information in a linguistically principled manner.

  • 2 authors
·
Jan 13, 2025

How Many Heads Make an SSM? A Unified Framework for Attention and State Space Models

Sequence modeling has produced diverse architectures -- from classical recurrent neural networks to modern Transformers and state space models (SSMs) -- yet a unified theoretical understanding of expressivity and trainability trade-offs remains limited. We introduce a unified framework that represents a broad class of sequence maps via an input-dependent effective interaction operator W_{ij}(X), making explicit two recurring construction patterns: (i) the Unified Factorized Framework (Explicit) (attention-style mixing), in which W_{ij}(X) varies through scalar coefficients applied to shared value maps, and (ii) Structured Dynamics (Implicit) (state-space recurrences), in which W_{ij} is induced by a latent dynamical system. Using this framework, we derive three theoretical results. First, we establish the Interaction Rank Gap: models in the Unified Factorized Framework, such as single-head attention, are constrained to a low-dimensional operator span and cannot represent certain structured dynamical maps. Second, we prove an Equivalence (Head-Count) Theorem showing that, within our multi-head factorized class, representing a linear SSM whose lag operators span a k-dimensional subspace on length-n sequences requires and is achievable with H=k heads. Third, we prove a Gradient Highway Result, showing that attention layers admit inputs with distance-independent gradient paths, whereas stable linear dynamics exhibit distance-dependent gradient attenuation. Together, these results formalize a fundamental trade-off between algebraic expressivity (interaction/operator span) and long-range gradient propagation, providing theoretical grounding for modern sequence architecture design.

  • 1 authors
·
Dec 17, 2025

S2LIC: Learned Image Compression with the SwinV2 Block, Adaptive Channel-wise and Global-inter Attention Context

Recently, deep learning technology has been successfully applied in the field of image compression, leading to superior rate-distortion performance. It is crucial to design an effective and efficient entropy model to estimate the probability distribution of the latent representation. However, the majority of entropy models primarily focus on one-dimensional correlation processing between channel and spatial information. In this paper, we propose an Adaptive Channel-wise and Global-inter attention Context (ACGC) entropy model, which can efficiently achieve dual feature aggregation in both inter-slice and intraslice contexts. Specifically, we divide the latent representation into different slices and then apply the ACGC model in a parallel checkerboard context to achieve faster decoding speed and higher rate-distortion performance. In order to capture redundant global features across different slices, we utilize deformable attention in adaptive global-inter attention to dynamically refine the attention weights based on the actual spatial relationships and context. Furthermore, in the main transformation structure, we propose a high-performance S2LIC model. We introduce the residual SwinV2 Transformer model to capture global feature information and utilize a dense block network as the feature enhancement module to improve the nonlinear representation of the image within the transformation structure. Experimental results demonstrate that our method achieves faster encoding and decoding speeds and outperforms VTM-17.1 and some recent learned image compression methods in both PSNR and MS-SSIM metrics.

  • 4 authors
·
Mar 21, 2024

Learning Delays in Spiking Neural Networks using Dilated Convolutions with Learnable Spacings

Spiking Neural Networks (SNNs) are a promising research direction for building power-efficient information processing systems, especially for temporal tasks such as speech recognition. In SNNs, delays refer to the time needed for one spike to travel from one neuron to another. These delays matter because they influence the spike arrival times, and it is well-known that spiking neurons respond more strongly to coincident input spikes. More formally, it has been shown theoretically that plastic delays greatly increase the expressivity in SNNs. Yet, efficient algorithms to learn these delays have been lacking. Here, we propose a new discrete-time algorithm that addresses this issue in deep feedforward SNNs using backpropagation, in an offline manner. To simulate delays between consecutive layers, we use 1D convolutions across time. The kernels contain only a few non-zero weights - one per synapse - whose positions correspond to the delays. These positions are learned together with the weights using the recently proposed Dilated Convolution with Learnable Spacings (DCLS). We evaluated our method on three datasets: the Spiking Heidelberg Dataset (SHD), the Spiking Speech Commands (SSC) and its non-spiking version Google Speech Commands v0.02 (GSC) benchmarks, which require detecting temporal patterns. We used feedforward SNNs with two or three hidden fully connected layers, and vanilla leaky integrate-and-fire neurons. We showed that fixed random delays help and that learning them helps even more. Furthermore, our method outperformed the state-of-the-art in the three datasets without using recurrent connections and with substantially fewer parameters. Our work demonstrates the potential of delay learning in developing accurate and precise models for temporal data processing. Our code is based on PyTorch / SpikingJelly and available at: https://github.com/Thvnvtos/SNN-delays

  • 3 authors
·
Jun 30, 2023