7. Safe and Strong Screening for Generalized LASSO

Safe [GVR11] and strong [TBF+12] screening rules may be used to filter out variables before fitting a LASSO regression. Both are a form of feature selection that relies on inner products between the features and the dependent variable to reduce the feature space. The safe screening rule discards the j-th variable when the following inequality holds.

\(|x_j^T y| < \lambda - ||x_j||_2 ||y||_2 \dfrac{\lambda_{\mathrm{max}} - \lambda}{\lambda_{\mathrm{max}}}\)

The strong screening rule tends to discard more variables and removes the j-th variable when the following inequality holds.

\(|x_j^T y| < 2\lambda - \lambda_{\mathrm{max}}\)

For these inequalities,

  • \(x_j\) is the vector of values for the j-th variable,

  • \(y\) is the vector of the dependent variable,

  • \(\lambda_{\mathrm{max}} = \max_j |x_j^T y|\), and

  • \(\lambda\) is the regularization parameter.
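As a minimal numpy-only sketch, the two rules can be wrapped in small helper functions. The function names `safe_screen` and `strong_screen` and the tiny random data are illustrative, not part of any library.

```python
import numpy as np

def safe_screen(X, y, lam):
    """True where the safe rule says variable j can be discarded."""
    ip = np.abs(X.T @ y)                        # |x_j^T y| per column
    lam_max = ip.max()                          # lambda_max = max_j |x_j^T y|
    x_norms = np.linalg.norm(X, axis=0)         # ||x_j||_2 per column
    y_norm = np.linalg.norm(y)                  # ||y||_2
    return ip < lam - x_norms * y_norm * (lam_max - lam) / lam_max

def strong_screen(X, y, lam):
    """True where the strong rule says variable j can be discarded."""
    ip = np.abs(X.T @ y)
    lam_max = ip.max()
    return ip < 2 * lam - lam_max

# illustrative data: five binary features, binary response
rng = np.random.default_rng(37)
X = rng.binomial(1, 0.5, size=(100, 5)).astype(float)
y = rng.binomial(1, 0.5, size=100).astype(float)
lam = 0.9 * np.abs(X.T @ y).max()               # lambda at 90% of lambda_max

print(safe_screen(X, y, lam))
print(strong_screen(X, y, lam))
```

Note that the column achieving \(\lambda_{\mathrm{max}}\) can never be discarded by either rule, since both right-hand sides are strictly below \(\lambda_{\mathrm{max}}\) whenever \(\lambda < \lambda_{\mathrm{max}}\).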

7.1. Data

Let’s simulate some data.

The variables \(X_1, X_2, X_3, X_4, X_5\) are sampled as follows.

  • \(X_1 \sim \mathcal{B}(1, 0.8)\)

  • \(X_2 \sim \mathcal{B}(1, 0.7)\)

  • \(X_3 \sim \mathcal{B}(1, 0.5)\)

  • \(X_4 \sim \mathcal{B}(1, 0.3)\)

  • \(X_5 \sim \mathcal{B}(1, 0.2)\)

The dependent variable \(y\) is sampled as follows.

  • \(z = 2.0 + 0.5 X_1 + 0.8 X_2\)

  • \(p = \dfrac{1.0}{1.0 + \exp(-z)}\)

  • \(y \sim \mathcal{B}(1, p)\)

There are 10,000 sampled data points.

[1]:
import numpy as np
import random
from scipy.stats import binom

random.seed(37)
np.random.seed(37)

N = 10_000

x1 = np.random.binomial(1, 0.8, N)
x2 = np.random.binomial(1, 0.7, N)
x3 = np.random.binomial(1, 0.5, N)
x4 = np.random.binomial(1, 0.3, N)
x5 = np.random.binomial(1, 0.2, N)

X = np.hstack((
    x1.reshape(-1, 1),
    x2.reshape(-1, 1),
    x3.reshape(-1, 1),
    x4.reshape(-1, 1),
    x5.reshape(-1, 1)
))

z = 2.0 + 0.5 * x1 + 0.8 * x2
p = 1.0 / (1.0 + np.exp(-z))
y = binom.rvs(1, p)

7.2. Feature selection

Let’s compute what we need for the inequalities.

[2]:
x_ip = np.abs(X.T.dot(y))                    # inner-product |x_i^T y|
lambda_max = np.max(x_ip)                    # maximum value of inner-products
lambda_v = lambda_max - (0.1 * lambda_max)   # lambda is selected to be lambda_max - (0.1 lambda_max)

x_norm = np.array([np.linalg.norm(X[:,i], 2) for i in range(X.shape[1])])  # l2 norms of each variable/column
y_norm = np.linalg.norm(y, 2)                                              # l2 norm of y
[3]:
lambda_v
[3]:
6833.7
[4]:
lambda_max
[4]:
7593
[5]:
x_ip
[5]:
array([7593, 6770, 4740, 2834, 1890])
[6]:
x_norm
[6]:
array([89.35323161, 83.98214096, 70.79548008, 54.91812087, 44.754888  ])
[7]:
y_norm
[7]:
97.22139682189307
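A side note on these numbers: because \(X\) and \(y\) are 0/1 vectors, each inner product \(x_j^T y\) is simply the count of rows where both \(x_j\) and \(y\) equal 1, and each \(\ell_2\) norm is the square root of a count of ones. A quick self-contained check (with freshly simulated data, so the exact values differ from the cells above):

```python
import numpy as np

rng = np.random.default_rng(37)
x = rng.binomial(1, 0.8, 1_000)
y = rng.binomial(1, 0.7, 1_000)

# for 0/1 vectors, the inner product counts co-occurring ones
print(x.dot(y), np.sum((x == 1) & (y == 1)))

# and the l2 norm is the square root of the number of ones
print(np.linalg.norm(y), np.sqrt(y.sum()))
```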

7.2.1. Safe screening rule

\(\lambda - ||x_j||_2 ||y||_2 \dfrac{\lambda_{\mathrm{max}} - \lambda}{\lambda_{\mathrm{max}}}\)

[8]:
lambda_v - x_norm * y_norm * ((lambda_max - lambda_v)/(lambda_max))
[8]:
array([5964.99540119, 6017.21389479, 6145.41645378, 6299.77835781,
       6398.58672739])

The safe screening rule inequality.

\(|x_j^T y| < \lambda - ||x_j||_2 ||y||_2 \dfrac{\lambda_{\mathrm{max}} - \lambda}{\lambda_{\mathrm{max}}}\)

As you can see, \(X_1\) and \(X_2\) are kept, while \(X_3\), \(X_4\), and \(X_5\) satisfy the inequality and are removed.

[9]:
x_ip < lambda_v - x_norm * y_norm * ((lambda_max - lambda_v)/(lambda_max))
[9]:
array([False, False,  True,  True,  True])
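The boolean array maps directly to a column filter. A sketch using the values printed above, restated as literals so the snippet runs on its own (`x_norm` and `y_norm` are rounded, which does not change the outcome here):

```python
import numpy as np

# values from the cells above, restated as literals (norms rounded)
x_ip = np.array([7593.0, 6770.0, 4740.0, 2834.0, 1890.0])  # |x_j^T y|
x_norm = np.array([89.35, 83.98, 70.80, 54.92, 44.75])     # ||x_j||_2
y_norm = 97.22                                             # ||y||_2
lambda_max = x_ip.max()
lambda_v = 0.9 * lambda_max

# True means the safe rule removes the variable
removed = x_ip < lambda_v - x_norm * y_norm * (lambda_max - lambda_v) / lambda_max
kept = np.flatnonzero(~removed)   # column indices to keep, e.g. X[:, kept]

print(removed)   # [False False  True  True  True]
print(kept)      # [0 1]
```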

7.2.2. Strong screening rule

\(2\lambda - \lambda_{\mathrm{max}}\)

[10]:
2 * lambda_v - lambda_max
[10]:
6074.4

The strong screening rule inequality.

\(|x_j^T y| < 2\lambda - \lambda_{\mathrm{max}}\)

Again, \(X_1\) and \(X_2\) are kept, while \(X_3\), \(X_4\), and \(X_5\) satisfy the inequality and are removed.

[11]:
x_ip < 2 * lambda_v - lambda_max
[11]:
array([False, False,  True,  True,  True])
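To see how the choice of \(\lambda\) affects each rule, the thresholds can be swept over a few fractions of \(\lambda_{\mathrm{max}}\). This sketch reuses the printed inner products and (rounded) norms as literals so it runs on its own; as \(\lambda\) shrinks away from \(\lambda_{\mathrm{max}}\), both rules discard fewer variables.

```python
import numpy as np

# inner products and (rounded) norms from the cells above
x_ip = np.array([7593.0, 6770.0, 4740.0, 2834.0, 1890.0])
x_norm = np.array([89.35, 83.98, 70.80, 54.92, 44.75])
y_norm = 97.22
lambda_max = x_ip.max()

for frac in (0.95, 0.90, 0.80):
    lam = frac * lambda_max
    safe = x_ip < lam - x_norm * y_norm * (lambda_max - lam) / lambda_max
    strong = x_ip < 2 * lam - lambda_max
    print(f'lambda = {frac:.2f} * lambda_max -> '
          f'safe removes {safe.sum()}, strong removes {strong.sum()}')
```

On this particular data the two rules happen to agree at each of these \(\lambda\) values; in general the strong rule's simpler threshold tends to flag at least as many variables, at the cost of not being guaranteed safe.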