Box–Cox Transformation

Key idea

The Box–Cox transformation is a family of power transformations, indexed by a single parameter $\lambda$, that reshapes a positive variable into a more model-friendly form.

Its main goals are to reduce skewness, stabilize variance, and bring the distribution closer to Gaussian.

It is commonly used before regression, statistical modeling, and occasionally factor preprocessing.

For a positive variable $y>0$:

\[y^{(\lambda)}= \begin{cases} \frac{y^\lambda-1}{\lambda}, & \lambda\neq0 \\ \ln(y), & \lambda=0 \end{cases}\]

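where $y$ is the original observation, $\lambda$ is the transformation parameter, and $y^{(\lambda)}$ is the transformed value.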

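Common values: $\lambda=1$ gives $y-1$ (a shift, shape unchanged), $\lambda=0.5$ gives $2(\sqrt{y}-1)$ (square-root type), $\lambda=0$ gives $\ln(y)$ (log), and $\lambda=-1$ gives $1-1/y$ (reciprocal type).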

As $\lambda$ decreases, large observations are compressed more aggressively.
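The piecewise definition can be written out directly. The following is a minimal sketch, not a library implementation; the function name `box_cox` and the sample values are chosen here for illustration only.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform from the piecewise definition; requires y > 0."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(y)            # the lambda = 0 branch
    return (y ** lam - 1.0) / lam   # the lambda != 0 branch

# Illustrative values spanning two orders of magnitude
y = np.array([1.0, 10.0, 100.0])
for lam in (1.0, 0.5, 0.0, -1.0):
    print(lam, np.round(box_cox(y, lam), 3))

# The log branch is the limit of the power branch as lambda -> 0
near_zero = box_cox(y, 1e-6)
print(np.max(np.abs(near_zero - np.log(y))))
```

Printing the rows from $\lambda=1$ down to $\lambda=-1$ shows the largest value shrinking fastest, which is the compression described above.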

Original data:

\[[1,2,4,8,16]\]

If $\lambda=0$:

\[[0,0.69,1.39,2.08,2.77]\]

(log transform)

If $\lambda=0.5$:

\[[0,0.83,2,3.66,6]\]

(square-root-type transform: $2(\sqrt{y}-1)$, an affine rescaling of $\sqrt{y}=[1,1.41,2,2.83,4]$)

Large values become less dominant.
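The worked example can be checked numerically. The sketch below assumes SciPy is available and uses `scipy.special.boxcox`, which applies the transform for a fixed $\lambda$. It also prints the plain square roots $[1,1.41,2,2.83,4]$ to show that the $\lambda=0.5$ Box–Cox values are those roots mapped through $2(\sqrt{y}-1)$.

```python
import numpy as np
from scipy.special import boxcox  # elementwise transform for a given lambda

y = np.array([1.0, 2.0, 4.0, 8.0, 16.0])

log_version = boxcox(y, 0.0)    # lambda = 0: natural log
half_version = boxcox(y, 0.5)   # lambda = 0.5: (sqrt(y) - 1) / 0.5
plain_sqrt = np.sqrt(y)         # un-shifted square root, for comparison

print(np.round(log_version, 2))
print(np.round(half_version, 2))
print(np.round(plain_sqrt, 2))

# Ratio of largest to second-largest value, before and after:
print(y[-1] / y[-2], plain_sqrt[-1] / plain_sqrt[-2], log_version[-1] / log_version[-2])
```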

In practice, $\lambda$ is usually chosen by maximum likelihood estimation (MLE):

\[\lambda^* = \arg\max_\lambda L(\lambda)\]

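Here $L(\lambda)$ is the likelihood of the data under the assumption that the transformed values are normally distributed, so maximizing it selects the transformation that makes the transformed data as close to Gaussian as possible.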

Example:

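```python
import numpy as np
from scipy.stats import boxcox

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])  # sample data; must be strictly positive
x_transformed, lam = boxcox(x)  # with lmbda omitted, lam is the MLE estimate
print(lam)
```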
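To see what the MLE step is doing, the log-likelihood can be evaluated on a grid of candidate $\lambda$ values and compared with the estimate SciPy returns. This is a sketch under assumptions: the data are synthetic log-normal draws, and the grid bounds $[-2, 2]$ are an arbitrary choice made here, not part of the method.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample (assumption: log-normal, so the best lambda is near 0)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.75, size=2000)

# Evaluate the Box-Cox log-likelihood on a grid of candidate lambdas
grid = np.linspace(-2.0, 2.0, 401)
llf = np.array([stats.boxcox_llf(lam, x) for lam in grid])
lam_grid = grid[np.argmax(llf)]

# SciPy's own estimate, obtained by numerically maximizing the same objective
x_transformed, lam_mle = stats.boxcox(x)

print(f"grid search: {lam_grid:.2f}, scipy MLE: {lam_mle:.2f}")
print(f"skewness before: {stats.skew(x):.2f}, after: {stats.skew(x_transformed):.2f}")
```

The grid search is only for inspection; in normal use the single `boxcox(x)` call is enough.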

Box–Cox requires:

\[y>0\]

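If the data contain zero or negative values, shift them so that the minimum becomes 1 before transforming:

```python
x = x - x.min() + 1  # shift so that min(x) == 1
```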
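The size of the shift is a modeling choice, and the fitted $\lambda$ generally depends on it. The sketch below illustrates this on synthetic data; the helper `shift_to_positive` and its `offset` parameter are hypothetical names introduced here. Returning the shift lets the same constant be reapplied to new data.

```python
import numpy as np
from scipy import stats

def shift_to_positive(x, offset=1.0):
    """Return (shifted, shift) where shifted = x + shift and shifted.min() == offset."""
    x = np.asarray(x, dtype=float)
    shift = offset - x.min()
    return x + shift, shift

# Synthetic skewed sample that includes negative values
rng = np.random.default_rng(0)
x = rng.lognormal(size=500) - 2.0

# Refit lambda for several offsets to see how sensitive the estimate is
for offset in (0.1, 1.0, 10.0):
    shifted, shift = shift_to_positive(x, offset)
    _, lam = stats.boxcox(shifted)
    print(f"offset={offset:5.1f}  shift={shift:6.3f}  lambda={lam:.3f}")
```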

Box–Cox is less common than rank normalization and z-score normalization in cross-sectional equity models.

Typical applications:

Example preprocessing pipeline:

\[\text{Raw Factor} \rightarrow \text{Winsorize} \rightarrow \text{Box–Cox} \rightarrow \text{Z-score} \rightarrow \text{Neutralize}\]

In cross-sectional factor research, Box–Cox is usually estimated independently for each date rather than globally across time.
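A per-date version of the first three pipeline steps might look like the sketch below. It rests on several assumptions made for illustration: a long-format pandas DataFrame with hypothetical columns `date`, `asset`, and `factor`; strictly positive factor values; 1%/99% winsorization limits; and synthetic data. The neutralization step is omitted.

```python
import numpy as np
import pandas as pd
from scipy import stats

def transform_cross_section(values: pd.Series) -> pd.Series:
    """Winsorize -> Box-Cox -> z-score for a single date's cross-section."""
    lo, hi = values.quantile([0.01, 0.99])            # illustrative limits
    clipped = values.clip(lower=lo, upper=hi)
    transformed, _lam = stats.boxcox(clipped.to_numpy())  # lambda fitted on this date only
    z = (transformed - transformed.mean()) / transformed.std(ddof=0)
    return pd.Series(z, index=values.index)

# Synthetic panel: 3 dates x 200 assets, positive and right-skewed
rng = np.random.default_rng(1)
panel = pd.DataFrame({
    "date": np.repeat(["2024-01-31", "2024-02-29", "2024-03-29"], 200),
    "asset": np.tile(np.arange(200), 3),
    "factor": rng.lognormal(size=600),
})

# Each date gets its own lambda, mean, and standard deviation
panel["factor_z"] = panel.groupby("date")["factor"].transform(transform_cross_section)
print(panel.groupby("date")["factor_z"].agg(["mean", "std"]))
```

Fitting inside the `groupby` means each date's transform uses only that date's cross-section.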