4.32. Wing Loss for Precise Regression

This example shows how to package the Wing loss — a loss function from face landmark localization [Feng2018] — as a gradient-based UDF. The full model is

\[\begin{array}{ll} \min\limits_\beta & \displaystyle\sum_{i=1}^{n} w \cdot \ln\!\Bigl(1 + \frac{\sqrt{(a_i^\top \beta - b_i)^2 + \delta^2}}{\varepsilon}\Bigr). \end{array}\]

Here \(w\) controls the overall loss magnitude, \(\varepsilon\) sets the linear-to-logarithmic transition scale, and \(\delta > 0\) smooths the loss at the origin. Wing loss is designed to amplify attention to small errors while handling large errors gracefully — useful when precision at every data point matters.

The function value returned by UDFBase.eval() sums the Wing loss over all residuals. To keep the function smooth at \(r = 0\), the squared residual is regularized by \(\delta^2\) inside the square root:

\[f(r) = \sum_i w \cdot \ln\!\Bigl(1 + \frac{\sqrt{r_i^2 + \delta^2}}{\varepsilon}\Bigr).\]

For small residuals \(|r| \ll \varepsilon\), this behaves like \(\frac{w}{\varepsilon}|r|\) (steep gradient), and for large residuals \(|r| \gg \varepsilon\), it grows as \(w \cdot \ln(|r|/\varepsilon)\) (logarithmic growth, much slower than the quadratic L2 loss). Accordingly, eval computes:

def eval(self, arglist):
    r = np.asarray(arglist[0], dtype=float)
    s = np.sqrt(r ** 2 + self.delta ** 2)
    return float(np.sum(self.w * np.log(1 + s / self.eps)))
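The two regimes are easy to verify numerically. A standalone sketch, using the same default hyperparameters \(w=10\), \(\varepsilon=2\), \(\delta=0.01\) as the class below:

```python
import numpy as np

w, eps, delta = 10.0, 2.0, 0.01

def wing(r):
    # Scalar Wing loss: w * ln(1 + sqrt(r^2 + delta^2) / eps)
    return w * np.log(1 + np.sqrt(r ** 2 + delta ** 2) / eps)

# Small-residual regime: f(r)/|r| should approach w/eps = 5
print(wing(0.1) / 0.1)                          # close to 5

# Large-residual regime: f(r) - w*ln(r/eps) should vanish
print(wing(1000.0) - w * np.log(1000.0 / eps))  # close to 0
```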

The gradient returned by UDFBase.grad() uses the chain rule through the square root and the logarithm:

\[\nabla f(r)_i = w \cdot \frac{r_i}{\sqrt{r_i^2 + \delta^2}} \cdot \frac{1}{\varepsilon + \sqrt{r_i^2 + \delta^2}}.\]

Near zero, the first factor \(r/\sqrt{r^2+\delta^2}\) is approximately \(r/\delta\) (linear), and the second factor is approximately \(1/(\varepsilon + \delta)\) (constant), so the gradient is proportional to \(r\) with a steep coefficient \(w/(\delta(\varepsilon+\delta))\). This steep gradient near zero is what makes Wing loss “pay more attention” to reducing small errors. The implementation:

def grad(self, arglist):
    r = np.asarray(arglist[0], dtype=float)
    s = np.sqrt(r ** 2 + self.delta ** 2)
    return [self.w * (r / s) / (self.eps + s)]
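A finite-difference check is a quick way to confirm the gradient formula matches eval. This is a standalone sketch; the step size 1e-6 is a conventional choice, not required by the API:

```python
import numpy as np

w, eps, delta = 10.0, 2.0, 0.01

def f(r):
    # Same expression as eval()
    return float(np.sum(w * np.log(1 + np.sqrt(r ** 2 + delta ** 2) / eps)))

def g(r):
    # Same expression as grad()
    s = np.sqrt(r ** 2 + delta ** 2)
    return w * (r / s) / (eps + s)

rng = np.random.default_rng(0)
r = rng.standard_normal(4)
h = 1e-6

# Central differences, one coordinate at a time
fd = np.array([(f(r + h * e) - f(r - h * e)) / (2 * h) for e in np.eye(4)])
print(np.max(np.abs(fd - g(r))))  # should be tiny
```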

The UDFBase.arguments() method returns the single expression this UDF depends on. The hyperparameters \(w\), \(\varepsilon\), and \(\delta\) are stored as instance attributes and do not appear in arguments:

def arguments(self):
    return [self.arg]

Complete runnable example:

import admm
import numpy as np

class WingLoss(admm.UDFBase):
    """Wing loss: f(r) = sum(w * ln(1 + sqrt(r^2 + delta^2) / eps)).

    Properties:
        - Near zero: gradient ≈ w*r / (delta*(eps+delta)) — steep
        - Large |r|: grows as w*ln(|r|/eps) — logarithmic
        - Everywhere smooth (delta > 0)

    Parameters
    ----------
    arg : admm.Var or expression
        The residual vector.
    w : float
        Loss magnitude.
    eps : float
        Transition scale.
    delta : float
        Smoothing at zero.
    """
    def __init__(self, arg, w=10.0, eps=2.0, delta=0.01):
        self.arg = arg
        self.w = w
        self.eps = eps
        self.delta = delta

    def arguments(self):
        return [self.arg]

    def eval(self, arglist):
        r = np.asarray(arglist[0], dtype=float)
        s = np.sqrt(r ** 2 + self.delta ** 2)
        return float(np.sum(self.w * np.log(1 + s / self.eps)))

    def grad(self, arglist):
        r = np.asarray(arglist[0], dtype=float)
        s = np.sqrt(r ** 2 + self.delta ** 2)
        return [self.w * (r / s) / (self.eps + s)]

# Regression data with moderate outliers
np.random.seed(2026)
n, p = 60, 5
A = np.random.randn(n, p)
beta_true = np.array([1.5, -0.8, 2.0, 0.3, -1.2])
b = A @ beta_true + 0.3 * np.random.randn(n)
outlier_idx = np.random.choice(n, size=6, replace=False)
b[outlier_idx] += np.random.choice([-1, 1], size=6) * np.random.uniform(3, 6, size=6)

model = admm.Model()
beta = admm.Var("beta", p)
model.setObjective(WingLoss(A @ beta - b, w=10.0, eps=2.0))
model.setOption(admm.Options.admm_max_iteration, 5000)
model.optimize()

print(" * status:", model.StatusString)            # Expected: SOLVE_OPT_SUCCESS
print(" * beta:", np.asarray(beta.X))              # Expected: ≈ [1.5, -0.8, 2, 0.3, -1.2]
print(" * Wing error:", np.linalg.norm(np.asarray(beta.X) - beta_true))  # Expected: ≈ 0.13

This example is available as a standalone script in the examples/ folder of the ADMM repository:

python examples/udf_grad_wing_loss.py

In this concrete example, Wing loss achieves \(\|\beta_{\text{Wing}} - \beta_{\text{true}}\| \approx 0.13\), compared with \(\|\beta_{\text{OLS}} - \beta_{\text{true}}\| \approx 0.63\) for ordinary least squares — roughly a 5x improvement. The median absolute residual is also halved (0.20 vs 0.43), confirming that Wing loss prioritizes reducing residuals at every observation rather than just the largest ones.
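The OLS baseline can be reproduced with plain NumPy on the same synthetic data, without the admm dependency. A minimal sketch (the data generation is copied verbatim from the example above):

```python
import numpy as np

# Same data generation as the Wing loss example
np.random.seed(2026)
n, p = 60, 5
A = np.random.randn(n, p)
beta_true = np.array([1.5, -0.8, 2.0, 0.3, -1.2])
b = A @ beta_true + 0.3 * np.random.randn(n)
outlier_idx = np.random.choice(n, size=6, replace=False)
b[outlier_idx] += np.random.choice([-1, 1], size=6) * np.random.uniform(3, 6, size=6)

# Ordinary least squares via the normal equations solver
beta_ols, *_ = np.linalg.lstsq(A, b, rcond=None)
print("OLS error:", np.linalg.norm(beta_ols - beta_true))
```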

[Feng2018]

Z.-H. Feng, J. Kittler, M. Awais, P. Huber, X.-J. Wu. “Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks.” CVPR, 2018.