7. Safe and Strong Screening for Generalized LASSO

Safe [GVR11] and strong [TBF+12] screening rules may be used to filter out variables before fitting a LASSO regression. Both are a form of feature selection that relies on inner products between the features and the dependent variable to reduce the feature space. The safe screening rule discards the j-th variable when the following inequality holds.

\(|x_j^T y| < \lambda - ||x_j||_2 ||y||_2 \dfrac{\lambda_{\mathrm{max}} - \lambda}{\lambda_{\mathrm{max}}}\)

The strong screening rule tends to discard more variables and removes the j-th variable when the following inequality holds.

\(|x_j^T y| < 2\lambda - \lambda_{\mathrm{max}}\)

For these inequalities,

  • \(x_j\) is the vector of values for the j-th variable,

  • \(y\) is the vector of the dependent variable,

  • \(\lambda_{\mathrm{max}} = \max_j |x_j^T y|\), and

  • \(\lambda\) is the regularization parameter.
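As a minimal numpy-only sketch, the two rules can be wrapped in small helper functions. The function names `safe_screen` and `strong_screen` and the tiny random data are illustrative, not part of any library.

```python
import numpy as np

def safe_screen(X, y, lam):
    """True where the safe rule says variable j can be discarded."""
    ip = np.abs(X.T @ y)                        # |x_j^T y| per column
    lam_max = ip.max()                          # lambda_max = max_j |x_j^T y|
    x_norms = np.linalg.norm(X, axis=0)         # ||x_j||_2 per column
    y_norm = np.linalg.norm(y)                  # ||y||_2
    return ip < lam - x_norms * y_norm * (lam_max - lam) / lam_max

def strong_screen(X, y, lam):
    """True where the strong rule says variable j can be discarded."""
    ip = np.abs(X.T @ y)
    lam_max = ip.max()
    return ip < 2 * lam - lam_max

# illustrative data: five binary features, binary response
rng = np.random.default_rng(37)
X = rng.binomial(1, 0.5, size=(100, 5)).astype(float)
y = rng.binomial(1, 0.5, size=100).astype(float)
lam = 0.9 * np.abs(X.T @ y).max()               # lambda at 90% of lambda_max

print(safe_screen(X, y, lam))
print(strong_screen(X, y, lam))
```

Note that the column achieving \(\lambda_{\mathrm{max}}\) can never be discarded by either rule, since both right-hand sides are strictly below \(\lambda_{\mathrm{max}}\) whenever \(\lambda < \lambda_{\mathrm{max}}\).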

7.1. Data

Let’s simulate some data.

The variables \(X_1, X_2, X_3, X_4, X_5\) are sampled as follows.

  • \(X_1 \sim \mathcal{B}(1, 0.8)\)

  • \(X_2 \sim \mathcal{B}(1, 0.7)\)

  • \(X_3 \sim \mathcal{B}(1, 0.5)\)

  • \(X_4 \sim \mathcal{B}(1, 0.3)\)

  • \(X_5 \sim \mathcal{B}(1, 0.2)\)

The dependent variable \(y\) is sampled as follows.

  • \(z = 2.0 + 0.5 X_1 + 0.8 X_2\)

  • \(p = \dfrac{1.0}{1.0 + \exp(-z)}\)

  • \(y \sim \mathcal{B}(1, p)\)

There are 10,000 sampled data points.

[1]:
import numpy as np
import random
from scipy.stats import binom

random.seed(37)
np.random.seed(37)

N = 10_000

x1 = np.random.binomial(1, 0.8, N)
x2 = np.random.binomial(1, 0.7, N)
x3 = np.random.binomial(1, 0.5, N)
x4 = np.random.binomial(1, 0.3, N)
x5 = np.random.binomial(1, 0.2, N)

X = np.hstack((
    x1.reshape(-1, 1),
    x2.reshape(-1, 1),
    x3.reshape(-1, 1),
    x4.reshape(-1, 1),
    x5.reshape(-1, 1)
))

z = 2.0 + 0.5 * x1 + 0.8 * x2
p = 1.0 / (1.0 + np.exp(-z))
y = binom.rvs(1, p)

7.2. Feature selection

Let’s compute what we need for the inequalities.

[2]:
x_ip = np.abs(X.T.dot(y))                    # inner-product |x_i^T y|
lambda_max = np.max(x_ip)                    # maximum value of inner-products
lambda_v = lambda_max - (0.1 * lambda_max)   # lambda is selected to be lambda_max - (0.1 lambda_max)

x_norm = np.array([np.linalg.norm(X[:,i], 2) for i in range(X.shape[1])])  # l2 norms of each variable/column
y_norm = np.linalg.norm(y, 2)                                              # l2 norm of y
[3]:
lambda_v
[3]:
6833.7
[4]:
lambda_max
[4]:
7593
[5]:
x_ip
[5]:
array([7593, 6770, 4740, 2834, 1890])
[6]:
x_norm
[6]:
array([89.35323161, 83.98214096, 70.79548008, 54.91812087, 44.754888  ])
[7]:
y_norm
[7]:
97.22139682189307
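A side note on these numbers: because \(X\) and \(y\) are 0/1 vectors, each inner product \(x_j^T y\) is simply the count of rows where both \(x_j\) and \(y\) equal 1, and each \(\ell_2\) norm is the square root of a count of ones. A quick self-contained check (with freshly simulated data, so the exact values differ from the cells above):

```python
import numpy as np

rng = np.random.default_rng(37)
x = rng.binomial(1, 0.8, 1_000)
y = rng.binomial(1, 0.7, 1_000)

# for 0/1 vectors, the inner product counts co-occurring ones
print(x.dot(y), np.sum((x == 1) & (y == 1)))

# and the l2 norm is the square root of the number of ones
print(np.linalg.norm(y), np.sqrt(y.sum()))
```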

7.2.1. Safe screening rule

\(\lambda - ||x_j||_2 ||y||_2 \dfrac{\lambda_{\mathrm{max}} - \lambda}{\lambda_{\mathrm{max}}}\)

[8]:
lambda_v - x_norm * y_norm * ((lambda_max - lambda_v)/(lambda_max))
[8]:
array([5964.99540119, 6017.21389479, 6145.41645378, 6299.77835781,
       6398.58672739])

The safe screening rule inequality.

\(|x_j^T y| < \lambda - ||x_j||_2 ||y||_2 \dfrac{\lambda_{\mathrm{max}} - \lambda}{\lambda_{\mathrm{max}}}\)

As you can see, \(X_1\) and \(X_2\) are kept, while \(X_3\), \(X_4\), and \(X_5\) satisfy the inequality and are removed.

[9]:
x_ip < lambda_v - x_norm * y_norm * ((lambda_max - lambda_v)/(lambda_max))
[9]:
array([False, False,  True,  True,  True])
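The boolean array maps directly to a column filter. A sketch using the values printed above, restated as literals so the snippet runs on its own (`x_norm` and `y_norm` are rounded, which does not change the outcome here):

```python
import numpy as np

# values from the cells above, restated as literals (norms rounded)
x_ip = np.array([7593.0, 6770.0, 4740.0, 2834.0, 1890.0])  # |x_j^T y|
x_norm = np.array([89.35, 83.98, 70.80, 54.92, 44.75])     # ||x_j||_2
y_norm = 97.22                                             # ||y||_2
lambda_max = x_ip.max()
lambda_v = 0.9 * lambda_max

# True means the safe rule removes the variable
removed = x_ip < lambda_v - x_norm * y_norm * (lambda_max - lambda_v) / lambda_max
kept = np.flatnonzero(~removed)   # column indices to keep, e.g. X[:, kept]

print(removed)   # [False False  True  True  True]
print(kept)      # [0 1]
```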

7.2.2. Strong screening rule

\(2\lambda - \lambda_{\mathrm{max}}\)

[10]:
2 * lambda_v - lambda_max
[10]:
6074.4

The strong screening rule inequality.

\(|x_j^T y| < 2\lambda - \lambda_{\mathrm{max}}\)

Again, \(X_1\) and \(X_2\) are kept, while \(X_3\), \(X_4\), and \(X_5\) satisfy the inequality and are removed.

[11]:
x_ip < 2 * lambda_v - lambda_max
[11]:
array([False, False,  True,  True,  True])
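To see how the choice of \(\lambda\) affects each rule, the thresholds can be swept over a few fractions of \(\lambda_{\mathrm{max}}\). This sketch reuses the printed inner products and (rounded) norms as literals so it runs on its own; as \(\lambda\) shrinks away from \(\lambda_{\mathrm{max}}\), both rules discard fewer variables.

```python
import numpy as np

# inner products and (rounded) norms from the cells above
x_ip = np.array([7593.0, 6770.0, 4740.0, 2834.0, 1890.0])
x_norm = np.array([89.35, 83.98, 70.80, 54.92, 44.75])
y_norm = 97.22
lambda_max = x_ip.max()

for frac in (0.95, 0.90, 0.80):
    lam = frac * lambda_max
    safe = x_ip < lam - x_norm * y_norm * (lambda_max - lam) / lambda_max
    strong = x_ip < 2 * lam - lambda_max
    print(f'lambda = {frac:.2f} * lambda_max -> '
          f'safe removes {safe.sum()}, strong removes {strong.sum()}')
```

On this particular data the two rules happen to agree at each of these \(\lambda\) values; in general the strong rule's simpler threshold tends to flag at least as many variables, at the cost of not being guaranteed safe.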