EE5311 CA1 · Group 21
From Physics to Silicon: Validating and Profiling Liquid Time-Constant Networks
Dynamic time constants buy balanced adaptability across physical regimes — but what does that cost when you cross framework boundaries?
April 2026 · GitHub Repository
Gong Chang · Yan Yihan · Zhang Xirui · Shen Yaheng · Yu Xintong
01 · From Discrete Steps to Continuous Time
The deep learning revolution was built on discrete operations. A residual network stacks layers; a recurrent network ticks through time steps; a Transformer attends over positions in a sequence. These architectures are extraordinarily powerful — but they share a quiet assumption: time is a grid.
Physical systems do not live on grids. A pendulum does not wait for the next clock tick to swing. An electrical circuit does not discretise its own transient response. The dynamics are continuous, and any grid we impose is an approximation — one whose resolution we must choose before training, and whose errors we must accept at inference.
Neural ODEs (Chen et al., 2018) broke this assumption. By replacing discrete layers with a continuous-depth formulation $dx/dt = f_\theta(x, I)$ and delegating integration to an adaptive ODE solver, they let the data dictate when the system needs fine resolution and when it can coast. This is elegant — but also generic. The right-hand side $f_\theta$ has no structural notion of how fast the system should respond. When the same network is asked to model a lightly damped spring (ringing for seconds) and an overdamped door closer (settling in a fraction of one), it must discover temporal scale entirely from gradient signal, with no inductive bias to guide it.
This is the gap that Liquid Time-Constant (LTC) networks address. By making the time constant τ an explicit, input-dependent variable:
$$\frac{dx}{dt} = -\frac{x}{\tau(x, I)} + f(x, I)$$

the model gains a built-in speed dial. It can slow itself down for gentle dynamics and speed up for violent ones — mirroring how biological neurons modulate their membrane time constants in response to synaptic input.
But does this actually help in practice? And if it does — what does the extra expressiveness cost when the same model is compiled and executed across Julia, PyTorch, and JAX?
This post answers both questions. We design a systematic benchmark on a damped harmonic oscillator spanning five physical regimes, compare LTC against two baselines (Neural ODE and CTRNN), and then profile the identical model across three frameworks — exposing not just runtime differences, but fundamentally different approaches to compiling continuous-time differentiable programs.
02 · Three Ways to Model Time
We compare three continuous-time architectures that differ in exactly one design choice: how they handle temporal scale.
Neural ODE replaces discrete layers with a learned vector field. It is maximally flexible but temporally agnostic — there is no parameter that encodes how quickly the system should respond to a change in input:
```julia
function neural_ode_dynamics(x, cell, I)
    h = tanh.(cell.Wx1 * x .+ cell.Ui1 * I .+ cell.b1)
    cell.Wx2 * h .+ cell.b2   # dx/dt — no τ
end
```
CTRNN (Beer, 1995) introduces a learnable time constant per neuron, adding an explicit leak term:
$$\frac{dx}{dt} = -\frac{x}{\tau} + \tanh(Wx + UI + b)$$

Each neuron owns a scalar τ trained alongside the weights. After training, τ is frozen — a neuron that learned to be "slow" stays slow regardless of what signal it receives:
```julia
function ctrnn_dynamics(x, cell, I)
    τ = exp.(cell.log_τ)   # fixed after training
    drift = tanh.(cell.W * x .+ cell.U * I .+ cell.b)
    @. -x / τ + drift
end
```
LTC (Hasani et al., 2021) makes τ a dynamic function of state and input:
$$\tau(x, I) = \tau_0 + \text{softplus}(W_\tau x + U_\tau I + b_\tau)$$
$$f(x, I) = \tanh(W_f x + U_f I + b_f)$$
$$\frac{dx}{dt} = -\frac{x}{\tau(x, I)} + f(x, I)$$

The softplus ensures τ > τ₀ > 0, preventing degenerate zero-timescale dynamics. Because τ now depends on I, the same neuron can respond quickly to a sharp transient and slowly to a gentle drift:
```julia
function ltc_dynamics(x, cell, I)
    τ_val = cell.τ₀ .+ softplus.(cell.Wτ * x .+ cell.Uτ * I .+ cell.bτ)  # τ(x, I)
    f_val = tanh.(cell.Wf * x .+ cell.Uf * I .+ cell.bf)                 # f(x, I)
    @. -x / τ_val + f_val                                                # dx/dt
end
```
The progression is clean: no τ → fixed τ → dynamic τ(x, I). Each step adds temporal structure to the ODE right-hand side, with a corresponding increase in parameters and per-evaluation cost. Every LTC evaluation must compute τ(x, I) alongside f(x, I), roughly doubling per-step FLOP count and introducing an extra set of weight matrices (Wτ, Uτ, bτ, τ₀).
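The overhead is easy to tally from the three dynamics functions. A back-of-envelope parameter count in plain Python, assuming an input dimension of 1 and the shared hidden dimension H = 8 (the input width and a per-neuron τ₀ are assumptions for illustration, not stated explicitly above):

```python
# Rough parameter counts for the three dynamics functions.
# Assumptions: input dim D = 1, hidden dim H = 8 (HIDDEN_DIM),
# per-neuron τ₀ for LTC.
H, D = 8, 1

node  = (H*H + H*D + H) + (H*H + H)   # Wx1, Ui1, b1, Wx2, b2
ctrnn = (H*H + H*D + H) + H           # W, U, b, log_τ
ltc   = 2 * (H*H + H*D + H) + H       # (Wf, Uf, bf) + (Wτ, Uτ, bτ) + τ₀

print(node, ctrnn, ltc)   # 152 88 168
```

Under these assumptions the LTC dynamics carry roughly twice CTRNN's weight matrices, in line with the doubled per-step FLOP count.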
MDL-Inspired τ Regularisation
Dynamic τ is more expressive, but expressiveness unchecked can overfit. We draw on the Minimum Description Length (MDL) principle — select the model that compresses data plus model description most efficiently — to penalise unnecessarily reactive time constants:
$$\mathcal{L} = \text{MSE} + \lambda \cdot \Omega(\tau), \qquad \Omega(\tau) = -\,\text{mean}(\log \tau)$$

Small τ values encode a model that reacts to every fluctuation — high description complexity. The −log τ penalty discourages this, preferring the simplest dynamics consistent with the data.
We test a physics-guided adaptive rule λ = λ₀ · ζ: overdamped systems, whose dynamics are intrinsically simpler, receive stronger complexity pressure. This directly encodes the prior that simpler physics should yield simpler models.
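As a concrete sketch, the regularised objective and the adaptive rule fit in a few lines of plain Python (λ₀ = 0.01 here is an illustrative value, not necessarily the trained one):

```python
import math

# MDL-inspired loss: L = MSE + λ · Ω(τ), with Ω(τ) = -mean(log τ).
# Small τ (hyper-reactive dynamics) incurs a large penalty.
def mdl_loss(preds, targets, taus, lam):
    mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)
    omega = -sum(math.log(t) for t in taus) / len(taus)
    return mse + lam * omega

# Physics-guided adaptive rule λ = λ₀ · ζ (λ₀ = 0.01 assumed):
# simpler, overdamped regimes receive stronger complexity pressure.
lam_overdamped  = 0.01 * 1.2
lam_underdamped = 0.01 * 0.1
```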
03 · Experiment Design
We benchmark all three models on a damped harmonic oscillator across five qualitatively distinct regimes spanning the full physical spectrum:
| ζ | Character | Behaviour |
|---|---|---|
| 0.1 | Strongly underdamped | Sustained oscillations, slow envelope decay |
| 0.3 | Underdamped | Visible oscillations, moderate decay |
| 0.5 | Moderately damped | Transition zone |
| 0.8 | Lightly overdamped | Monotone decay, slight initial overshoot |
| 1.2 | Overdamped | Pure exponential decay |
Data are uniformly sampled on $t \in [0, 20]$ (200 time points), split into interleaved training and test sets that cover the full time interval. All models share identical hyperparameters:
```julia
const EPOCHS = 300; const HIDDEN_DIM = 8; const LR = 1f-3; const SEED = 42
```
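The sampling and split can be sketched as follows. The natural frequency ω₀ = 1 and initial condition x(0) = 1, ẋ(0) = 0 are assumptions for illustration (the post fixes neither), and the closed form below holds only for the underdamped case ζ < 1:

```python
import math

# Damped harmonic oscillator, underdamped closed form (ζ < 1), with
# assumed ω₀ = 1, x(0) = 1, v(0) = 0:
#   x(t) = exp(-ζt) (cos(ω_d t) + (ζ/ω_d) sin(ω_d t)),  ω_d = sqrt(1 - ζ²)
def oscillator(t, zeta):
    wd = math.sqrt(1.0 - zeta ** 2)
    return math.exp(-zeta * t) * (math.cos(wd * t) + (zeta / wd) * math.sin(wd * t))

ts = [20.0 * i / 199 for i in range(200)]        # 200 uniform samples on [0, 20]
xs = [oscillator(t, 0.1) for t in ts]

# Interleaved train/test split: both halves cover the full time interval.
train = [(t, x) for i, (t, x) in enumerate(zip(ts, xs)) if i % 2 == 0]
test  = [(t, x) for i, (t, x) in enumerate(zip(ts, xs)) if i % 2 == 1]
```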
Evaluation protocol. All evaluations use autonomous closed-loop rollout, not teacher forcing. The model is warm-started by broadcasting $x_0$ over all time steps, then rolled out in closed loop using its own previous predictions as subsequent inputs. This is far more demanding than single-step prediction — it exposes whether the model has learned stable, self-consistent dynamics or merely curve-fitted local gradients.
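A minimal sketch of this rollout protocol, with a toy decay map standing in for the trained model and solver step:

```python
# Autonomous closed-loop rollout: warm-start by broadcasting x0 over all
# steps, then feed each prediction back as the next input.  `step` is a
# stand-in for one model + solver step, NOT the trained network.
def rollout(step, x0, n_steps):
    xs = [x0] * n_steps              # warm start: broadcast x0
    for i in range(1, n_steps):
        xs[i] = step(xs[i - 1])      # model consumes its own prediction
    return xs

traj = rollout(lambda x: 0.5 * x, 1.0, 5)
print(traj)   # [1.0, 0.5, 0.25, 0.125, 0.0625]
```

Errors compound multiplicatively along the trajectory, which is why closed-loop rollout exposes unstable dynamics that single-step prediction hides.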
We assess four capabilities: base prediction accuracy, generalisation to unseen regimes, robustness to initial-state noise, and regime-switch adaptability.
04 · Does Dynamic τ Help?
Base Prediction Accuracy
Dynamic τ does not dominate every regime — but it never fails catastrophically.
LTC and CTRNN tend to outperform Neural ODE, confirming that explicitly modelling temporal scale is beneficial. LTC's advantage is not uniform: it excels near ζ = 0.5, yet struggles at ζ = 0.1 (strong oscillations), where model capacity — not τ design — is the bottleneck. Dynamic τ provides stability rather than dominance at the in-distribution level. The picture changes substantially when models are pushed out of distribution.
Generalisation
The transfer gap is where dynamic τ pulls ahead.
In the full cross-ζ transfer setting, LTC achieves the most stable trend across test regimes. CTRNN also performs well in some intervals, but both LTC and CTRNN clearly outperform Neural ODE — confirming that temporal inductive bias is a genuine asset under distribution shift.
In the extrapolation setting (train on 0.1, 0.3, 0.5, 0.8, test on unseen ζ = 1.2), performance depends strongly on training regime. When trained on ζ = 0.5, LTC achieves the lowest transfer error — exactly where dynamic τ has the most room to adapt without being constrained by either extreme.
Noise Robustness
Gaussian noise is injected into the initial state. Degradation ratios stay close to 1.0 for all three models — perturbing only the starting point does not destabilise autonomous rollout. LTC exhibits a slightly smaller average degradation and a smoother trend; Neural ODE degrades more noticeably at higher noise levels. Differences are mild overall.
Regime-Switch Adaptability
The strongest evidence: LTC is the only architecture that never fails catastrophically across all four regime switches.
At $t = 10$, the damping ratio ζ changes abruptly. Each model was trained on single-ζ data for the pre-switch regime and must adapt without any explicit switch signal. This directly probes out-of-distribution generalisation at inference time.
| Switch | LTC | Neural ODE | CTRNN | Worst performer |
|---|---|---|---|---|
| 0.1 → 1.2 | 0.093 | 2.250 | **0.064** | NODE — catastrophic |
| 1.2 → 0.1 | 0.089 | **0.003** | 0.045 | LTC |
| 0.3 → 0.8 | 0.019 | 0.077 | **0.002** | NODE |
| 0.8 → 0.3 | 0.025 | **0.020** | 0.092 | CTRNN |

Bold = best per row.
In the extreme switch ζ = 0.1 → 1.2 (strongly oscillatory to overdamped), Neural ODE diverges severely — post-switch MSE reaches 2.250, an order of magnitude worse than LTC (0.093) and CTRNN (0.064). The reverse case ζ = 1.2 → 0.1 is harder for all models since none can generate oscillations absent from their training regime; here NODE achieves the lowest post-switch MSE (0.003) while LTC and CTRNN plateau at a non-zero offset.
No single model wins every transition — CTRNN takes two, NODE takes two, LTC takes none outright. But the failure modes are asymmetric: LTC is consistently second-best or close in every case. It is the only architecture that never breaks. The full quantitative table:
| Switch | Model | Pre-switch MSE | Post-switch MSE |
|---|---|---|---|
| 0.1 → 1.2 | LTC | 0.2434 | 0.0928 |
| | NODE | 0.6568 | 2.2499 |
| | CTRNN | 0.2375 | 0.0639 |
| 1.2 → 0.1 | LTC | 0.0879 | 0.0886 |
| | NODE | 0.0450 | 0.0030 |
| | CTRNN | 0.0685 | 0.0447 |
| 0.3 → 0.8 | LTC | 0.0955 | 0.0186 |
| | NODE | 0.1143 | 0.0775 |
| | CTRNN | 0.0647 | 0.0024 |
| 0.8 → 0.3 | LTC | 0.0645 | 0.0249 |
| | NODE | 0.0638 | 0.0197 |
| | CTRNN | 0.1047 | 0.0918 |
Fig 9 · Autonomous rollout under abrupt regime switch at t = 10. Dashed = ground truth · Blue = LTC · Red = NODE · Green = CTRNN.
How τ Adapts
Visualising mean τ(t) during the post-switch window reveals a physically interpretable pattern. Across all three representative regimes, τ increases monotonically after the switch — rising from roughly 0.736 to 0.740 (ζ = 0.1), 0.727 to 0.730 (ζ = 0.5), and 0.690 to 0.693 (ζ = 1.2).
This is physically interpretable: after an abrupt regime change, the network responds by slowing down. Increasing τ attenuates dx/dt, acting as a low-pass filter that smooths the network's reaction to unexpected signals. The adaptation is modest in magnitude (a relative shift of roughly 0.5%), but the fact that this cautious behaviour emerges without any explicit switch signal confirms that the LTC architecture has learned a structurally meaningful temporal adjustment strategy.
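The low-pass intuition can be checked with the step response of the leak term alone: under dx/dt = −x/τ + f with constant f, the fraction of the steady state reached by time t is 1 − e^(−t/τ), which shrinks as τ grows. A small sketch using τ values near the post-switch range reported above:

```python
import math

# Fraction of the steady state τ·f reached by time t for the leaky
# integrator dx/dt = -x/τ + f:  1 - exp(-t/τ).  Larger τ reacts slower.
def fraction_settled(t, tau):
    return 1.0 - math.exp(-t / tau)

fast = fraction_settled(1.0, 0.69)   # τ near the ζ = 1.2 values above
slow = fraction_settled(1.0, 0.74)   # τ near the ζ = 0.1 values above
print(fast > slow)   # True: the larger-τ neuron filters more aggressively
```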
MDL Regularisation Amplifies the Advantage
| Switch | Vanilla LTC | LTC + MDL | Improvement |
|---|---|---|---|
| 0.1 → 1.2 | 0.0928 | 0.0379 | +59% |
| 1.2 → 0.1 | 0.0886 | 0.0408 | +54% |
| 0.3 → 0.8 | 0.0186 | 0.0307 | −65% |
| 0.8 → 0.3 | 0.0249 | 0.0003 | +99% |
MDL regularisation delivers 50–99% improvement on three of four switch pairs. The adaptive rule λ = λ₀ · ζ applies stronger regularisation to higher-ζ regimes: when the model trains on overdamped dynamics, the complexity pressure prevents τ from overfitting to a single regime, yielding dramatically better transfer at the switch point.
The one exception (0.3 → 0.8) is a boundary case. The pre-switch regime ζ = 0.3 receives only light regularisation (λ = 0.003), so the model retains a reactive τ that is appropriate for underdamped dynamics but marginally hinders on the nearby overdamped side. Vanilla LTC already handles this mild transition well (post-MSE 0.019), so the regulariser's slight extra constraint does more harm than good. This highlights that the MDL approach is most beneficial when the regime switch crosses a large dynamical gap. Principled complexity control and dynamic expressiveness are complementary, not competing.
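For reference, the λ values implied by the adaptive rule across the five regimes, taking λ₀ = 0.01 (inferred from the λ = 0.003 quoted above for ζ = 0.3; treat it as illustrative rather than the exact training constant):

```python
# λ = λ₀ · ζ across the five benchmark regimes (λ₀ = 0.01 assumed).
LAMBDA0 = 0.01
lambdas = {zeta: LAMBDA0 * zeta for zeta in (0.1, 0.3, 0.5, 0.8, 1.2)}
for zeta, lam in lambdas.items():
    print(f"ζ = {zeta}: λ = {lam:.4f}")
# The ζ = 1.2 regime receives 12x the complexity pressure of ζ = 0.1.
```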
05 · What Does τ Cost? Solvers, Compilers, and Ecosystems
The same LTC dynamics, implemented in three frameworks, expose fundamentally different compilation and execution trade-offs.
We port the identical LTC formulation — same parameters, same hand-written Adam, same hyperparameters — across Julia, PyTorch, and JAX. Cross-framework alignment verified to max abs error < $10^{-4}$. What differs is everything beneath:
| Framework | Solver | AD Strategy | Compile |
|---|---|---|---|
| Julia | Tsit5 (adaptive) | ForwardDiffSensitivity | LLVM |
| PyTorch | RK4 (fixed-step) | torch.autograd | eager |
| JAX | odeint (adaptive) | jax.grad → XLA | XLA JIT |
Julia:
```julia
τ_val = cell.τ₀ .+ softplus.(cell.Wτ * x .+ cell.Uτ * I .+ cell.bτ)  # τ(x, I)
f_val = tanh.(cell.Wf * x .+ cell.Uf * I .+ cell.bf)                 # f(x, I)
@. -x / τ_val + f_val                                                # dx/dt
# Solver: Tsit5() · AD: ForwardDiffSensitivity()
```
PyTorch:
```python
tau_val = self.tau0 + F.softplus(self.W_tau @ x + self.U_tau @ I + self.b_tau)  # τ(x, I)
f_val = torch.tanh(self.W_f @ x + self.U_f @ I + self.b_f)                      # f(x, I)
return -x / tau_val + f_val                                                     # dx/dt
# Solver: RK4 (torchdiffeq) · AD: torch.autograd
```
JAX:
```python
tau_val = cell.tau0 + softplus(cell.W_tau @ x + cell.U_tau @ inputs + cell.b_tau)  # τ(x, I)
f_val = jnp.tanh(cell.W_f @ x + cell.U_f @ inputs + cell.b_f)                      # f(x, I)
return -x / tau_val + f_val                                                        # dx/dt
# Solver: jax.experimental.ode.odeint · AD: jax.grad → XLA
```
Solvers Are Not Interchangeable
Julia's DifferentialEquations.jl gives the programmer a solver-first interface. Choosing Tsit5() (adaptive 5th-order) is not an implementation detail — it determines step-size adaptation, error control, and the number of ltc_dynamics evaluations per time unit. With ForwardDiffSensitivity, gradients differentiate through the solver's internal steps, faithfully reflecting its numerical decisions. When the solver takes finer steps near a sharp transient, gradient computation automatically becomes denser there too. The solver is a first-class citizen in the computational graph.
PyTorch's torchdiffeq uses fixed-step RK4 in our setup — simpler and predictable, but without adaptive step-size control. The AD path records operations at the Python level and replays them backward; the solver's internal structure remains largely opaque to the gradient tape.
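For readers who have not met it, the fixed-step scheme is tiny. A self-contained sketch of one classical RK4 step (the textbook method, not torchdiffeq's exact internals):

```python
import math

# One classical fixed-step RK4 step: four evaluations of the dynamics f,
# no error estimate, no step-size control.
def rk4_step(f, x, t, h):
    k1 = f(t, x)
    k2 = f(t + h / 2, x + h / 2 * k1)
    k3 = f(t + h / 2, x + h / 2 * k2)
    k4 = f(t + h, x + h * k3)
    return x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# Sanity check on dx/dt = -x: one step of size 0.1 from x = 1
x1 = rk4_step(lambda t, x: -x, 1.0, 0.0, 0.1)
print(abs(x1 - math.exp(-0.1)) < 1e-6)   # True: 4th-order accurate
```

Four dynamics evaluations per step regardless of how sharp the local transient is, which is exactly the contrast with Tsit5's adaptive error control.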
JAX's jax.experimental.ode.odeint occupies a middle ground: adaptive integration, and because JAX traces the entire computation including solver control flow, the resulting JAXPR captures loops, step-size decisions, and all — enabling XLA to compile the full forward-backward-optimise loop as a single executable.
PyTorch Validation
The PyTorch port reproduces the same qualitative pattern as Julia. For low damping (ζ = 0.1), the model fails to capture high-frequency oscillatory components and instead learns a smoother envelope. As damping increases (ζ = 0.3–0.5), predictions improve but some oscillation detail is still missing. For ζ = 0.8 and 1.2, the model matches ground truth closely. This confirms the behaviour is model-inherent — the limitation comes from hidden dimension (H = 8) and numerical smoothing introduced by ODE integration, not from any framework-specific artefact.
Inside the JAX Compilation Pipeline
In JAX, the LTC training process is not executed step by step by the Python interpreter. Python/JAX functions are first represented as JAXPR (a functional intermediate representation), then lowered to StableHLO (an MLIR dialect), and finally compiled by XLA into a native executable.
Python/JAX → JAXPR. When jax.jit or jax.grad is invoked, JAX traces the function with abstract values, recording the full dataflow and control-flow structure. The forward LTC rollout produces a JAXPR containing dynamic_slice operations (fetching the current input I from the sequence at the current solve time), scan loops (time-stepping and sequential accumulation), while loops (adaptive solver step-size control), and the neural network primitives (dot_general, tanh, exp, log).
After jax.grad, the program expands substantially:
| Primitive | Forward | After AD |
|---|---|---|
dot_general | 16 | 55 |
scan | 6 | 14 |
while | 1 | 2 |
tanh | 3 | 7 |
transpose | 12 | 40 |
This is not "a gradient module hanging next to the forward graph." AD expands the program into: forward residue + local derivative computation (1 − tanh², τ⁻²) + weight-gradient propagation + extra time-axis scans. A single forward matrix multiplication often gives rise to multiple backward matrix multiplications; transpose grows along with dot_general as gradients are routed back to weight matrices.
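The dot_general growth has a simple root cause that a two-by-two toy case makes visible: one forward product y = Wx requires two backward products, Wᵀg for the input gradient and gxᵀ for the weight gradient (plus a transpose). A hand-rolled sketch with plain lists:

```python
# One forward matmul y = W @ x expands into two backward products under
# reverse-mode AD: dL/dx = Wᵀ g and dL/dW = g xᵀ, with g = dL/dy.
W = [[1.0, 2.0], [3.0, 4.0]]
x = [0.5, -1.0]
g = [1.0, 1.0]                       # upstream gradient dL/dy

y     = [sum(W[i][j] * x[j] for j in range(2)) for i in range(2)]   # forward
dL_dx = [sum(W[i][j] * g[i] for i in range(2)) for j in range(2)]   # Wᵀ g
dL_dW = [[g[i] * x[j] for j in range(2)] for i in range(2)]         # g xᵀ

print(y, dL_dx, dL_dW)
```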
JAXPR → StableHLO. JAX lowers the differentiated program into StableHLO, a compiler-friendly intermediate representation. High-level primitives become explicit control flow, function calls, and tensor operations — the compiler does not see an abstract notion of ODE solving, but explicit stablehlo.while loops, call @_odeint_wrapper, stablehlo.dot_general, stablehlo.tanh, and time-indexing logic. Forward rollout occupies 665 lines with 5 while loops; the gradient program expands to 1599 lines with 11 while loops.
StableHLO → XLA Executable. XLA compiles this into a native executable via LLVM on CPU. One-time compilation cost: 1918 ms. After that, each train step: 19 ms — a ~95× speedup over eager execution.
Full JAXPR and StableHLO analysis →
Profiling
| Framework | Mode | Per-step time | Notes |
|---|---|---|---|
| Julia (Tsit5) | steady-state | 51.8 ms | Adaptive solver; no JIT overhead |
| JAX | eager (no JIT) | 1820 ms | ~35× slower than Julia |
| JAX | after XLA JIT | 19.3 ms | ~95× faster than eager; one-time compile: 1.9 s |
| PyTorch | RK4 | 217.6 ms | Fixed-step solver |
Julia's SciML ecosystem provides the most natural expression — solver and AD are first-class citizens, and 51.8 ms reflects a mature runtime with no compilation overhead at steady state. JAX's value proposition is different: a visible one-time tax (~1.9 s) buys a compiled steady-state step of 19 ms — faster than Julia for this specific workload. Without JIT, JAX runs at 1820 ms, demonstrating that the XLA compilation path is not optional but essential. JAX's performance characteristic is not simply "how fast is one step," but whether paying a compilation tax upfront is worth the sustained speedup.
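The break-even arithmetic behind that trade-off, using the timings in the table:

```python
# Amortisation of the one-time XLA compile, from the profiling table:
# compile 1918 ms, jitted step 19.3 ms, eager step 1820 ms, Julia 51.8 ms.
COMPILE, JIT_STEP, EAGER_STEP, JULIA_STEP = 1918.0, 19.3, 1820.0, 51.8

# vs. eager JAX: the compile tax pays for itself after ~2 steps
breakeven_eager = COMPILE / (EAGER_STEP - JIT_STEP)

# vs. Julia: after ~59 steps of sustained training
breakeven_julia = COMPILE / (JULIA_STEP - JIT_STEP)

print(round(breakeven_eager, 1), round(breakeven_julia, 1))
```

At 300 epochs of training, both break-even points are cleared almost immediately, which is why the compiled path wins for this workload.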
06 · Closing the Loop: From Physics to Silicon
This post traced a complete arc — from a physical equation to a trained model to a compiled executable — and at every level, design choices propagated consequences.
At the physics level, encoding temporal scale as a dynamic variable τ(x, I) buys something specific: not dominance on any single metric, but robust adaptability across diverse physical regimes. LTC never fails catastrophically. Its τ exhibits physically meaningful self-regulation — slowing down autonomously in response to unfamiliar dynamics. When combined with MDL-inspired complexity control, post-switch performance improves by 50–99% on the hardest transitions, confirming that principled regularisation and dynamic expressiveness are complementary.
At the framework level, the "same model" is not really the same once compiled. Julia's SciML ecosystem treats the ODE solver as a first-class participant in the computational graph — the programmer chooses a solver, an AD strategy, and a sensitivity algorithm, and these choices are all visible and composable. JAX's tracing-based compilation exposes a different structure: the entire forward-backward-optimisation loop is captured as JAXPR, lowered to StableHLO, and compiled by XLA into a single executable — paying a one-time tax for dramatic steady-state speedup. The JAXPR analysis makes this concrete: AD expands the forward program 3.4× in matrix operations, and StableHLO shows the compiler receiving not a static feedforward graph but a differentiable program with control flow, solver calls, and time-indexing logic. PyTorch provides the most familiar workflow and largest ecosystem, but its eager execution model keeps the solver relatively opaque to the gradient tape.
For EE5311, the broader takeaway: differentiable programming is not just automatic differentiation plus a loss function. It is a stack — physics prior, model architecture, regulariser, ODE solver, AD strategy, compiler. Each layer is a design choice, and each choice propagates consequences through every layer above and below it. The damped harmonic oscillator is a simple system, but the methodology — systematic ablation across physical regimes and computational frameworks — scales to any setting where continuous-time dynamics meet gradient-based learning.
Full code, data, and results: github.com/eeessay/physics-to-silicon
References
- Chen, R. T. Q., Rubanova, Y., Bettencourt, J., & Duvenaud, D. (2018). Neural Ordinary Differential Equations. NeurIPS.
- Hasani, R., Lechner, M., Amini, A., et al. (2021). Liquid Time-constant Networks. AAAI.
- Beer, R. D. (1995). On the dynamics of small continuous-time recurrent neural networks. Adaptive Behavior, 3(4).
- Grünwald, P. D. (2007). The Minimum Description Length Principle. MIT Press.
- Rackauckas, C., & Nie, Q. (2017). DifferentialEquations.jl — A Performant and Feature-Rich Ecosystem for Solving Differential Equations in Julia. JORS.
- Kidger, P. (2022). On Neural Differential Equations. DPhil Thesis, University of Oxford.
Contributions
Gong Chang implemented the canonical Julia LTC reference, conducted Julia profiling and type-stability analysis, designed the MDL-inspired τ regularisation experiments, and led the writing and final integration of the blog.
Yan YiHan implemented the Neural ODE and CTRNN baselines, ran the core evaluation experiments including generalisation, noise robustness, and MSE comparisons, and authored the corresponding blog sections.
Zhang XiRui implemented and validated the JAX LTC port, profiled recompilation cost under static compilation using jax.make_jaxpr, and authored the JAX subsection.
Shen YaHeng implemented and validated the PyTorch LTC port, profiled memory overhead and dynamic graph rebuild cost using torch.profiler, and authored the PyTorch subsection.
Yu XinTong built and deployed the blog website, integrating all figures, LaTeX rendering, code highlighting, and interactive dynamic visualisations.