Iterative generative policies, such as diffusion models and flow matching, offer superior expressivity for continuous control but complicate Maximum Entropy Reinforcement Learning because their action log-densities are not directly accessible. To address this, we propose Field Least-Energy Actor-Critic (FLAC), a likelihood-free framework that regulates policy stochasticity by penalizing the kinetic energy of the velocity field.
Our key insight is to formulate policy optimization as a Generalized Schrödinger Bridge (GSB) problem relative to a high-entropy reference process (e.g., uniform). Under this view, the maximum-entropy principle emerges naturally as staying close to a high-entropy reference while optimizing return, without requiring explicit action densities. In this framework, kinetic energy serves as a physically grounded proxy for divergence from the reference: minimizing path-space energy bounds the deviation of the induced terminal action distribution.
Building on this view, we derive an energy-regularized policy iteration scheme and a practical off-policy algorithm that automatically tunes the kinetic energy penalty via a Lagrangian dual mechanism. Empirically, FLAC matches or outperforms strong baselines on high-dimensional benchmarks while avoiding explicit density estimation.
The Schrödinger Bridge Problem (SBP) finds the most likely stochastic process connecting two distributions. Given initial distribution $\mu_0$ and terminal distribution $\mu_1$, it solves:
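$$\min_{\mathbb{P}} \;\; \mathcal{D}_{\mathrm{KL}}(\mathbb{P} \,\|\, \mathbb{P}^{\mathrm{ref}}) \quad \text{s.t.} \quad \mathbb{P}_0 = \mu_0, \;\; \mathbb{P}_1 = \mu_1,$$

where the minimum is taken over path measures $\mathbb{P}$, the subscripts denote time-marginals, and $\mathbb{P}^{\mathrm{ref}}$ is a reference process such as Brownian motion.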
The Generalized Schrödinger Bridge (GSB) relaxes the hard terminal constraint to a soft potential. This is crucial: we no longer need samples from the target distribution $\mu_1$, only a potential function $\mathcal{G}$ that scores terminal states.
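In one common formulation, the hard constraint is replaced by a terminal reward term:

$$\min_{\mathbb{P}} \;\; \mathcal{D}_{\mathrm{KL}}(\mathbb{P} \,\|\, \mathbb{P}^{\mathrm{ref}}) \;-\; \mathbb{E}_{X_1 \sim \mathbb{P}_1}\big[\mathcal{G}(X_1)\big].$$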
The optimal solution has a closed form—the terminal distribution is an exponential tilting of the reference:
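$$\mu_1^*(x) \;\propto\; \mu_1^{\mathrm{ref}}(x)\, \exp\!\big(\mathcal{G}(x)\big),$$

where $\mu_1^{\mathrm{ref}}$ is the terminal marginal of the reference process.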
Let's start with what we want. Maximum Entropy RL seeks policies that maximize return while maintaining high entropy:
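$$J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t} r(s_t, a_t) \;+\; \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right],$$

where $\alpha > 0$ is the temperature. The per-state optimum of this objective is the familiar Boltzmann policy $\pi^*(a \mid s) \propto \exp\!\big(Q^*(s, a)/\alpha\big)$.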
This is elegant in theory. But here's the problem: for iterative generative policies like diffusion models, we can't compute $\log \pi(a|s)$—actions emerge from a multi-step stochastic process. The entropy term becomes intractable.
Now, recall the GSB optimal solution from Section 1:
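$$\mu_1^*(a) \;\propto\; \mu_1^{\mathrm{ref}}(a)\, \exp\!\big(\mathcal{G}(a)\big).$$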
Notice something? These have the same form! If we choose:
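$$\mathcal{G}(a) \;=\; \frac{Q(s, a)}{\alpha}, \qquad \mu_1^{\mathrm{ref}}(a) \;\propto\; 1 \;\; \text{(a uniform reference over actions)}.$$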
Then GSB directly gives us the Boltzmann policy—without ever computing $\log \pi(a|s)$.
Here's the key insight: in RL, we don't have samples from the optimal policy $\pi^*$—we only have a scoring function (Q-value). This matches GSB exactly: no target samples needed, just a potential $\mathcal{G}$. The path divergence term implicitly regularizes entropy through the generation process itself.
This gives us three benefits: no explicit action densities, no samples from a target distribution, and entropy regularization that arises implicitly through the generation process itself.
So far so good. But the GSB objective involves a path constraint—staying close to the reference process. This is defined on the space of entire trajectories, which sounds abstract and hard to optimize. Can we turn it into something concrete?
Consider a controlled process with drift $u_\theta$ and a reference process with zero drift (pure noise):
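$$dX_\tau \;=\; u_\theta(s, \tau, X_\tau)\, d\tau \;+\; \sigma\, dW_\tau, \qquad dX^{\mathrm{ref}}_\tau \;=\; \sigma\, dW_\tau, \qquad \tau \in [0, 1],$$

both started from the same initial noise distribution, with the action taken as the terminal state $a = X_1$.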
The two processes share the same noise but differ in drift. How far apart are they? This depends on whether we work with SDEs or ODEs:
For stochastic processes, Girsanov's theorem gives an exact identity: the KL divergence on path space equals the expected kinetic energy of the drift:
$$\mathcal{D}_{\mathrm{KL}}(\mathbb{P}^\theta \| \mathbb{P}^{\mathrm{ref}}) = \frac{1}{2\sigma^2} \mathbb{E}_{\mathbb{P}^\theta} \left[ \int_0^1 \|u_\theta(s, \tau, X_\tau)\|^2 d\tau \right]$$
This is not an approximation—it is an equality. Minimizing kinetic energy is exactly equivalent to minimizing path-space KL divergence.
For deterministic flows, the KL interpretation no longer applies. Instead, kinetic energy upper-bounds the squared $W_2$ distance between the induced terminal distribution and the reference:
$$W_2^2(p_1^\theta, \mu_1^{\mathrm{ref}}) \leq \mathbb{E}\left[\int_0^1 \|u_\theta\|^2 d\tau\right]$$
This acts as a geometric proximity constraint: penalizing kinetic energy prevents aggressive, large-scale transport that would concentrate probability mass. While this does not directly bound entropy, it empirically discourages rapid mode collapse and promotes broad action coverage.
In both cases, kinetic energy provides a tractable, computable measure of how far the policy strays from the high-entropy reference: an exact path-space KL divergence in the SDE case, and an upper bound on the squared $W_2$ distance to the reference in the ODE case.
Think of it this way: kinetic energy $\|u_\theta\|^2$ measures the "effort" to transport noise to actions, while the terminal potential $-Q(s,a)$ acts like a potential energy landscape guiding where actions should land. The policy seeks paths that minimize total action—balancing efficient transport (low kinetic energy) against reaching high-reward regions (low potential energy). This echoes the principle of least action in physics: among all possible paths, nature chooses the one that minimizes the integrated difference between kinetic and potential energy.
Now we have all the pieces. Combining GSB formulation + RL potential + kinetic energy regularization, we arrive at the FLAC objective:
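$$\max_\theta \;\; \mathbb{E}_{s,\; X \sim \mathbb{P}^\theta(\cdot \mid s)} \left[\, Q(s, X_1) \;-\; \frac{\alpha}{2} \int_0^1 \|u_\theta(s, \tau, X_\tau)\|^2 \, d\tau \,\right]$$

(written schematically here: $Q$ is the critic and $\alpha$ is the energy weight tuned by the dual mechanism described below; see the paper for the precise off-policy form).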
That's it. We can sample trajectories, compute kinetic energy along the path, and optimize with standard gradient descent. No likelihood computation required.
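To make this concrete, here is a minimal PyTorch-style sketch of the actor update under the objective above. The drift network `policy_drift(states, tau, x)`, the critic `critic(states, actions)`, and the attribute `critic.action_dim` are hypothetical names, and the rollout uses a plain fixed-step Euler-Maruyama scheme; treat it as an illustration rather than the reference implementation.

```python
import torch

def actor_loss(policy_drift, critic, states, alpha, n_steps=10, sigma=0.1):
    """Kinetic-energy-regularized actor loss (illustrative sketch).

    Rolls the controlled SDE forward with fixed-step Euler-Maruyama updates,
    accumulates (1/2) * ||u_theta||^2 along the path, and trades that energy
    off against the critic value of the terminal action.
    """
    batch = states.shape[0]
    act_dim = critic.action_dim                      # assumed attribute on the critic
    dt = 1.0 / n_steps

    x = torch.randn(batch, act_dim, device=states.device)   # X_0: pure noise
    kinetic = torch.zeros(batch, device=states.device)

    for k in range(n_steps):
        tau = torch.full((batch, 1), k * dt, device=states.device)
        u = policy_drift(states, tau, x)             # drift u_theta(s, tau, X_tau)
        kinetic = kinetic + 0.5 * (u ** 2).sum(dim=-1) * dt
        x = x + u * dt + sigma * (dt ** 0.5) * torch.randn_like(x)

    q = critic(states, x).squeeze(-1)                # Q(s, a) with a = X_1
    return (alpha * kinetic - q).mean()              # minimize energy, maximize Q
```

Because the noise enters through reparameterized Gaussian samples, gradients flow from the terminal Q-value and the accumulated energy back into the drift network.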
We derive energy-regularized Bellman operators with $\gamma$-contraction guarantees, extending classical policy iteration to generative policies.
A Lagrangian dual mechanism learns $\alpha$ to maintain a target energy level $E_{\mathrm{tgt}}$, adapting exploration automatically (a sketch of this update appears below).
Differentiable ODE solvers enable end-to-end gradient flow, compatible with standard off-policy replay buffers.
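As an illustration of the Lagrangian dual mechanism, here is a minimal sketch of the $\alpha$ update, in the spirit of SAC's automatic temperature tuning. The names `log_alpha`, `kinetic` (the per-sample path energy from the actor rollout above), and `target_energy` ($E_{\mathrm{tgt}}$) are assumptions for illustration, as is the sign convention.

```python
import torch

# Parametrize alpha through log_alpha so the penalty weight stays positive.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(kinetic, target_energy):
    """One dual step: raise alpha when the measured kinetic energy exceeds
    the target budget E_tgt, lower it otherwise (illustrative sketch)."""
    energy_gap = target_energy - kinetic.detach().mean()
    alpha_loss = (log_alpha * energy_gap).sum()   # descent raises alpha when gap < 0
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()
```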
Currently, FLAC applies a uniform kinetic energy penalty across all action dimensions. However, in practice different dimensions often require different levels of exploration—for instance, a humanoid's hip joint may need precise control while the arm joints benefit from more exploratory behavior. Learning dimension-specific energy budgets, rather than a single scalar $\alpha$, is a promising direction for improving performance.
We are deeply grateful to Xiao Ma, Yunfei Li, and Yu Luo for their continuous support and guidance throughout this project. This work would not have been possible without their invaluable contributions.
If you find this work useful, please consider citing:
@article{lv2026flac,
  title   = {FLAC: Maximum Entropy RL via Kinetic Energy Regularized Bridge Matching},
  author  = {Lv, Lei and Li, Yunfei and Luo, Yu and Sun, Fuchun and Ma, Xiao},
  journal = {arXiv preprint arXiv:2602.12829},
  year    = {2026}
}