# 2. Pricing Elasticity of Demand Modeling

Pricing Elasticity of Demand (PED) is a measure of how responsive demand (or quantity sold) is to price change. When demand (y-axis) is plotted against price (x-axis), a demand curve is drawn. From the demand curve, we can compute the PED using pairs of points or at a particular point. Typically, elasticity changes along every point (or between every pair of points) on the demand curve. When we assume that elasticity remains constant along the demand curve, we can use regression techniques to estimate the constant elasticity. Let’s take a look to see how regression is used to estimate the constant elasticity.

This data is taken from Kaggle.

[1]:

import pandas as pd
import numpy as np

def get_text(r):
product_category_name = r['product_category_name']

.assign(
month_year=lambda d: pd.to_datetime(d['month_year']),
year=lambda d: d['month_year'].dt.year,
month=lambda d: d['month_year'].dt.month,
weekend=lambda d: d['weekday'].apply(lambda x: 1 if x >= 5 else 0),
text=lambda d: d['product_category_name'].apply(lambda s: ' '.join(s.split('_')))
)
df.shape

[1]:

(676, 31)

[2]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 676 entries, 0 to 675
Data columns (total 31 columns):
#   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
0   product_id                  676 non-null    object
1   product_category_name       676 non-null    object
2   month_year                  676 non-null    datetime64[ns]
3   qty                         676 non-null    int64
4   total_price                 676 non-null    float64
5   freight_price               676 non-null    float64
6   unit_price                  676 non-null    float64
7   product_name_lenght         676 non-null    int64
8   product_description_lenght  676 non-null    int64
9   product_photos_qty          676 non-null    int64
10  product_weight_g            676 non-null    int64
11  product_score               676 non-null    float64
12  customers                   676 non-null    int64
13  weekday                     676 non-null    int64
14  weekend                     676 non-null    int64
15  holiday                     676 non-null    int64
16  month                       676 non-null    int64
17  year                        676 non-null    int64
18  s                           676 non-null    float64
19  volume                      676 non-null    int64
20  comp_1                      676 non-null    float64
21  ps1                         676 non-null    float64
22  fp1                         676 non-null    float64
23  comp_2                      676 non-null    float64
24  ps2                         676 non-null    float64
25  fp2                         676 non-null    float64
26  comp_3                      676 non-null    float64
27  ps3                         676 non-null    float64
28  fp3                         676 non-null    float64
29  lag_price                   676 non-null    float64
30  text                        676 non-null    object
dtypes: datetime64[ns](1), float64(15), int64(12), object(3)
memory usage: 163.8+ KB


## 2.2. Visualize pricing elasticity of demand

There’s already a lot of exploratory data analysis (EDA) and visualization on the data on Kaggle, and so we will not plot too much. Below, we are demand (quantity) versus price over all products. You can see that there tends to be a downward slope. Also note, we take the log of the demand and price.

[3]:

df[['unit_price', 'qty']] \
.groupby(['unit_price']) \
.sum() \
.reset_index() \
.assign(
unit_price=lambda d: np.log(d['unit_price']),
qty=lambda d: np.log(d['qty'])
) \
.plot(kind='scatter', x='unit_price', y='qty')

[3]:

<Axes: xlabel='unit_price', ylabel='qty'>


## 2.3. Split into training/testing

It seems that there is data in 2017 for January and also in 2018 for January. Let’s use the 2017 data for training and the 2018 data for testing/validation.

[4]:

df['month_year'].value_counts().sort_index()

[4]:

2017-01-01     2
2017-01-02     9
2017-01-03    13
2017-01-04    15
2017-01-05    20
2017-01-06    25
2017-01-07    33
2017-01-08    37
2017-01-09    36
2017-01-10    43
2017-01-11    44
2017-01-12    44
2018-01-01    48
2018-01-02    49
2018-01-03    50
2018-01-04    48
2018-01-05    40
2018-01-06    42
2018-01-07    40
2018-01-08    38
Name: month_year, dtype: int64

[5]:

df_tr = df[df['month_year'] <= '2017-12-31']
df_te = df[df['month_year'] >= '2018-01-01']

df_tr.shape, df_te.shape

[5]:

((321, 31), (355, 31))


## 2.4. Simple log-log model

An easy way to model the relationship between price and quantity is using log-log regression. Note that the coefficient associated with price is the estimated constant elasticity.

$$\log{Q} \sim \log{P}$$

[6]:

Xy_tr = df_tr.groupby(['unit_price']) \
.agg(qty_sum=pd.NamedAgg(column='qty', aggfunc='sum')) \
.assign(qty_sum=lambda d: np.log(d['qty_sum'])) \
.reset_index() \
.assign(unit_price=lambda d: np.log(d['unit_price']))

Xy_te = df_te.groupby(['unit_price']) \
.agg(qty_sum=pd.NamedAgg(column='qty', aggfunc='sum')) \
.assign(qty_sum=lambda d: np.log(d['qty_sum'])) \
.reset_index() \
.assign(unit_price=lambda d: np.log(d['unit_price']))

X_tr = Xy_tr[['unit_price']]
X_te = Xy_te[['unit_price']]

y_tr = Xy_tr['qty_sum']
y_te = Xy_te['qty_sum']

X_tr.shape, X_te.shape, y_tr.shape, y_te.shape

[6]:

((128, 1), (181, 1), (128,), (181,))


Since we want to have fun and suspect that quantity and price may have linear or non-linear relationships, we will try out a few regression models.

[7]:

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_absolute_error

def get_performance(y_tr, z_tr, y_te, z_te):
y_tr = np.exp(y_tr)
z_tr = np.exp(z_tr)

y_te = np.exp(y_te)
z_te = np.exp(z_te)

mae_tr = mean_absolute_error(y_tr, z_tr)
mae_te = mean_absolute_error(y_te, z_te)

mape_tr = np.mean(np.abs(y_tr - z_tr) / y_tr)
mape_te = np.mean(np.abs(y_te - z_te) / y_te)

wmape_tr = np.abs(y_tr - z_tr).sum() / y_tr.sum()
wmape_te = np.abs(y_te - z_te).sum() / y_te.sum()

s = pd.Series([
mae_tr,
mae_te,
mape_tr,
mape_te,
wmape_tr,
wmape_te,
mae_tr / (y_tr.max() - y_tr.min()),
mae_te / (y_te.max() - y_te.min()),
mae_tr / y_tr.std(),
mae_te / y_te.std(),
mae_tr / y_tr.mean(),
mae_te / y_te.mean()
], index=['mae_tr', 'mae_te',
'mape_tr', 'mape_te',
'wmape_tr', 'wmape_te',
'range_mae_tr', 'range_mae_te',
'std_mae_tr', 'std_mae_te',
'mean_mae_tr', 'mean_mae_te'])

return s

def get_performance_summary(models):
for m in models:
m.fit(X_tr, y_tr)

df = pd.DataFrame([get_performance(y_tr, m.predict(X_tr), y_te, m.predict(X_te))
for m in models], index=['OLS', 'RF', 'GBT'])
return df

loglog_models = [
LinearRegression(),
RandomForestRegressor(n_jobs=-1, random_state=37, n_estimators=25),
LGBMRegressor(n_jobs=-1, random_state=37, num_leaves=10, n_estimators=50, max_depth=2)
]

get_performance_summary(loglog_models)

[7]:

mae_tr mae_te mape_tr mape_te wmape_tr wmape_te range_mae_tr range_mae_te std_mae_tr std_mae_te mean_mae_tr mean_mae_te
OLS 24.973284 21.790472 1.452983 1.164152 0.735184 0.723551 0.046679 0.076998 0.392888 0.509473 0.735184 0.723551
RF 12.091052 31.956025 0.449581 2.286806 0.355946 1.061097 0.022600 0.112919 0.190221 0.747149 0.355946 1.061097
GBT 24.513070 22.900274 1.366331 1.335325 0.721636 0.760402 0.045819 0.080920 0.385648 0.535421 0.721636 0.760402

GBT seems to have the lowest validated MAE at 22.9.

[8]:

loglog_tr = pd.DataFrame({
'p': X_tr['unit_price'].values,
'q_true': y_tr.values,
'q_lr': loglog_models[0].predict(X_tr),
'q_rf': loglog_models[1].predict(X_tr),
'q_gb': loglog_models[2].predict(X_tr)
})

loglog_te = pd.DataFrame({
'p': X_te['unit_price'].values,
'q_true': y_te.values,
'q_lr': loglog_models[0].predict(X_te),
'q_rf': loglog_models[1].predict(X_te),
'q_gb': loglog_models[2].predict(X_te)
})

[9]:

import matplotlib.pyplot as plt

def do_scatter(df, title_suffix, p_field='p', q_fields=['q_true', 'q_lr', 'q_rf', 'q_gb']):
fig, axes = plt.subplots(1, 4, figsize=(15, 3.5))

for q_field, ax in zip(q_fields, np.ravel(axes)):
df.plot(
kind='scatter',
x=p_field,
y=q_field,
ax=ax,
title=f'{q_field} vs {p_field}, {title_suffix}'
)

fig.tight_layout()

def do_scatter_overlay(df_tr, df_te, p_field='p', q_fields=['q_true', 'q_lr', 'q_rf', 'q_gb']):
fig, axes = plt.subplots(1, 4, figsize=(15, 3.5))

for q_field, ax in zip(q_fields, np.ravel(axes)):
df_tr.plot(kind='scatter', x=p_field, y=q_field, ax=ax, color='r', label='tr')
df_te.plot(kind='scatter', x=p_field, y=q_field, ax=ax, color='b', label='te')
ax.set_title(f'{q_field} vs {p_field}')

fig.tight_layout()

[10]:

do_scatter(loglog_tr, 'TR')
do_scatter(loglog_te, 'TE')
do_scatter_overlay(loglog_tr, loglog_te)


## 2.5. Log-log model with features

In the log-log model with features, we’ll add additional features to see if it helps in modeling and the models’ performances.

$$\log{Q} \sim \log{P} + X_1 + X_2 + \ldots + X_n$$

[11]:

def get_Xy(df):
a = df.groupby(['product_category_name', 'unit_price']) \
.agg(qty_sum=pd.NamedAgg(column='qty', aggfunc='sum')) \
.assign(qty_sum=lambda d: np.log(d['qty_sum'])) \
.rename(columns={'qty_sum': 'quantity'})

b = df.groupby(['product_category_name', 'unit_price']) \
.agg(
freight_price=pd.NamedAgg(column='freight_price', aggfunc='mean'),
product_score=pd.NamedAgg(column='product_score', aggfunc='mean'),
weekday=pd.NamedAgg(column='weekday', aggfunc='mean'),
holiday=pd.NamedAgg(column='holiday', aggfunc='mean'),
customers=pd.NamedAgg(column='customers', aggfunc='mean'),
volume=pd.NamedAgg(column='volume', aggfunc='mean'),
s=pd.NamedAgg(column='s', aggfunc='mean'),
comp_1=pd.NamedAgg(column='comp_1', aggfunc='mean'),
ps1=pd.NamedAgg(column='ps1', aggfunc='mean'),
fp1=pd.NamedAgg(column='fp1', aggfunc='mean'),
comp_2=pd.NamedAgg(column='comp_2', aggfunc='mean'),
ps2=pd.NamedAgg(column='ps2', aggfunc='mean'),
fp2=pd.NamedAgg(column='fp2', aggfunc='mean'),
comp_3=pd.NamedAgg(column='comp_3', aggfunc='mean'),
ps3=pd.NamedAgg(column='ps3', aggfunc='mean'),
fp3=pd.NamedAgg(column='fp3', aggfunc='mean'),
lag_price=pd.NamedAgg(column='lag_price', aggfunc='mean')
)

Xy = a \
.join(b, how='left') \
.reset_index() \
.assign(unit_price=lambda d: np.log(d['unit_price']))

c = pd.get_dummies(Xy[['product_category_name']]).iloc[:,1:]

Xy = Xy.join(c, how='left').drop(columns=['product_category_name'])

X = Xy[Xy.columns.drop(['quantity'])]
y = Xy['quantity']

return X, y

X_tr, y_tr = get_Xy(df_tr)
X_te, y_te = get_Xy(df_te)

X_tr.shape, X_te.shape, y_tr.shape, y_te.shape

[11]:

((133, 26), (188, 26), (133,), (188,))

[12]:

loglog_plus_models = [
LinearRegression(),
RandomForestRegressor(n_jobs=-1, random_state=37, n_estimators=25),
LGBMRegressor(n_jobs=-1, random_state=37, num_leaves=10, n_estimators=50, max_depth=2)
]

get_performance_summary(loglog_plus_models)

[12]:

mae_tr mae_te mape_tr mape_te wmape_tr wmape_te range_mae_tr range_mae_te std_mae_tr std_mae_te mean_mae_tr mean_mae_te
OLS 21.592945 28.363898 1.090862 2.246045 0.660502 0.978245 0.040361 0.100226 0.357261 0.677601 0.660502 0.978245
RF 10.259959 20.777510 0.281573 1.100575 0.313840 0.716597 0.019177 0.073419 0.169754 0.496366 0.313840 0.716597
GBT 15.886894 19.211437 0.611473 1.051234 0.485961 0.662585 0.029695 0.067885 0.262853 0.458953 0.485961 0.662585

GBT seems to have the lowest validated MAE at 19.2.

[13]:

loglog_plus_tr = pd.DataFrame({
'p': X_tr['unit_price'].values,
'q_true': y_tr.values,
'q_lr': loglog_plus_models[0].predict(X_tr),
'q_rf': loglog_plus_models[1].predict(X_tr),
'q_gb': loglog_plus_models[2].predict(X_tr)
})

loglog_plus_te = pd.DataFrame({
'p': X_te['unit_price'].values,
'q_true': y_te.values,
'q_lr': loglog_plus_models[0].predict(X_te),
'q_rf': loglog_plus_models[1].predict(X_te),
'q_gb': loglog_plus_models[2].predict(X_te)
})

[14]:

do_scatter(loglog_plus_tr, 'TR')
do_scatter(loglog_plus_te, 'TE')
do_scatter_overlay(loglog_plus_tr, loglog_plus_te)


## 2.6. MERF model

The Mixed-Effect Random Forest (MERF) model uses a non-linear model to model the fixed effects but leaves the random effects as linear interaction terms.

$$y_i = f(X_i) + Z_i u_i + e_i$$

To use the MERF model, install it from git as follows.

pip install git+https://github.com/manifoldai/merf.git

[15]:

p2i = {c: i for i, c in enumerate(list(df['product_category_name'].unique()))}
i2p = {v: k for k, v in p2i.items()}

[16]:

def get_Xy(df):
a = df.groupby(['product_id', 'product_category_name', 'unit_price']) \
.agg(
qty_sum=pd.NamedAgg(column='qty', aggfunc='sum'),
freight_price=pd.NamedAgg(column='freight_price', aggfunc='mean'),
product_score=pd.NamedAgg(column='product_score', aggfunc='mean'),
comp_1=pd.NamedAgg(column='comp_1', aggfunc='mean'),
ps1=pd.NamedAgg(column='ps1', aggfunc='mean'),
fp1=pd.NamedAgg(column='fp1', aggfunc='mean'),
comp_2=pd.NamedAgg(column='comp_2', aggfunc='mean'),
ps2=pd.NamedAgg(column='ps2', aggfunc='mean'),
fp2=pd.NamedAgg(column='fp2', aggfunc='mean'),
comp_3=pd.NamedAgg(column='comp_3', aggfunc='mean'),
ps3=pd.NamedAgg(column='ps3', aggfunc='mean'),
fp3=pd.NamedAgg(column='fp3', aggfunc='mean'),
lag_price=pd.NamedAgg(column='lag_price', aggfunc='mean')
) \
.assign(
qty_sum=lambda d: np.log(d['qty_sum'])
) \
.rename(columns={'qty_sum': 'quantity'}) \
.reset_index() \
.drop(columns=['product_id']) \
.set_index(['product_category_name', 'unit_price'])

b = df.groupby(['product_category_name', 'unit_price']) \
.agg(
weekday=pd.NamedAgg(column='weekday', aggfunc='mean'),
holiday=pd.NamedAgg(column='holiday', aggfunc='mean'),
customers=pd.NamedAgg(column='customers', aggfunc='mean'),
volume=pd.NamedAgg(column='volume', aggfunc='mean'),
s=pd.NamedAgg(column='s', aggfunc='mean')
)

Xy = a \
.join(b, how='inner') \
.reset_index() \
.assign(unit_price=lambda d: np.log(d['unit_price']))

Xy = Xy.assign(product_category_name=lambda d: d['product_category_name'].map(p2i))

X = Xy[Xy.columns.drop(['quantity', 'product_category_name', 'weekday', 'holiday', 'customers', 'volume', 's'])]
Z = Xy[['weekday', 'holiday', 'customers', 's', 'lag_price']]
C = Xy['product_category_name']
y = Xy['quantity']

return X, Z, C, y

X_tr, Z_tr, C_tr, y_tr = get_Xy(df_tr)
X_te, Z_te, C_te, y_te = get_Xy(df_te)

[17]:

X_tr.shape, Z_tr.shape, C_tr.shape, y_tr.shape

[17]:

((148, 13), (148, 5), (148,), (148,))

[18]:

X_te.shape, Z_te.shape, C_te.shape, y_te.shape

[18]:

((208, 13), (208, 5), (208,), (208,))

[19]:

from merf import MERF

merf = MERF(
fixed_effects_model=LGBMRegressor(n_jobs=-1, random_state=37, num_leaves=10, n_estimators=25, max_depth=2),
max_iterations=30
)
merf.fit(X_tr, Z_tr, C_tr, y_tr)
get_performance(y_tr, merf.predict(X_tr, Z_tr, C_tr), y_te, merf.predict(X_te, Z_te, C_te))

INFO     [merf.py:307] Training GLL is -101.1968163425363 at iteration 1.
INFO     [merf.py:307] Training GLL is -171.16405780723312 at iteration 2.
INFO     [merf.py:307] Training GLL is -212.75117155142982 at iteration 3.
INFO     [merf.py:307] Training GLL is -237.3949966765289 at iteration 4.
INFO     [merf.py:307] Training GLL is -251.62779481924196 at iteration 5.
INFO     [merf.py:307] Training GLL is -261.7722548752886 at iteration 6.
INFO     [merf.py:307] Training GLL is -270.91408548631705 at iteration 7.
INFO     [merf.py:307] Training GLL is -277.8358575135619 at iteration 8.
INFO     [merf.py:307] Training GLL is -284.2161675842258 at iteration 9.
INFO     [merf.py:307] Training GLL is -291.0438736305228 at iteration 10.
INFO     [merf.py:307] Training GLL is -296.86277233482656 at iteration 11.
INFO     [merf.py:307] Training GLL is -301.2918249236685 at iteration 12.
INFO     [merf.py:307] Training GLL is -309.2639355840515 at iteration 13.
INFO     [merf.py:307] Training GLL is -313.2097808051912 at iteration 14.
INFO     [merf.py:307] Training GLL is -317.2383961719355 at iteration 15.
INFO     [merf.py:307] Training GLL is -321.7435059133349 at iteration 16.
INFO     [merf.py:307] Training GLL is -325.2538612240947 at iteration 17.
INFO     [merf.py:307] Training GLL is -328.3921568512495 at iteration 18.
INFO     [merf.py:307] Training GLL is -331.5095703868812 at iteration 19.
INFO     [merf.py:307] Training GLL is -335.3589905785209 at iteration 20.
INFO     [merf.py:307] Training GLL is -338.9548103244854 at iteration 21.
INFO     [merf.py:307] Training GLL is -342.3731168294115 at iteration 22.
INFO     [merf.py:307] Training GLL is -345.4442477917209 at iteration 23.
INFO     [merf.py:307] Training GLL is -348.75818816407207 at iteration 24.
INFO     [merf.py:307] Training GLL is -352.5510448500377 at iteration 25.
INFO     [merf.py:307] Training GLL is -355.102251919231 at iteration 26.
INFO     [merf.py:307] Training GLL is -357.6226061099023 at iteration 27.
INFO     [merf.py:307] Training GLL is -358.64542456205777 at iteration 28.
INFO     [merf.py:307] Training GLL is -361.5995433713272 at iteration 29.
INFO     [merf.py:307] Training GLL is -362.07091949111447 at iteration 30.

[19]:

mae_tr          15.590045
mae_te          18.394509
mape_tr          0.750941
mape_te          1.331177
wmape_tr         0.530664
wmape_te         0.701900
range_mae_tr     0.045853
range_mae_te     0.064998
std_mae_tr       0.388801
std_mae_te       0.529736
mean_mae_tr      0.530664
mean_mae_te      0.701900
dtype: float64


We get a slight improvement with the MAE at 18.4. So far, the MAEs for the different models are as follows.

• Log-log MAE: 22.9

• Log-log + Features MAE: 19.2

• MERF MAE: 18.4

Below are the feature importances for the fixed effects variables.

[20]:

pd.Series(merf.fe_model.feature_importances_, X_tr.columns) \
.sort_values(ascending=False)

[20]:

unit_price       25
freight_price    11
fp1               9
fp2               7
comp_3            7
ps3               4
ps2               3
fp3               2
lag_price         2
product_score     0
comp_1            0
ps1               0
comp_2            0
dtype: int32


Below are the coefficients for the random effects by the grouping variable (which are the product categories).

[21]:

merf.trained_b.sort_index() \
.rename(columns={0: 'weekday', 1: 'holiday', 2: 'customers', 3: 'volume', 4: 's', 5: 'lag_price'}) \
.T \
.rename(columns=i2p).T

[21]:

weekday holiday customers volume s
bed_bath_table -0.007945 -0.186044 0.000864 0.035106 -0.002121
garden_tools -0.030065 -0.062398 0.001513 0.052798 -0.004259
consoles_games -0.004937 -0.007303 0.000203 0.009453 -0.000838
health_beauty 0.016584 -0.038887 -0.000543 -0.018656 0.001668
cool_stuff 0.003853 0.133319 0.000107 -0.026230 0.001764
perfumery -0.003835 -0.027635 0.000262 0.009871 -0.000816
computers_accessories -0.028299 0.319809 0.001429 -0.003708 -0.001072
watches_gifts -0.030244 0.035122 0.002325 0.034548 -0.002885
furniture_decor 0.007239 0.042501 -0.000677 -0.013164 0.000763
[22]:

def do_merf_scatter(df, title_suffix):
fig, ax = plt.subplots(figsize=(15, 3.5))

df.plot(kind='scatter', x='q_true', y='p', ax=ax, color='r')
df.plot(kind='scatter', x='q_pred', y='p', ax=ax, color='b')

ax.set_title(f'q vs p, {title_suffix}')

fig.tight_layout()

[23]:

do_merf_scatter(pd.DataFrame({
'p': X_tr['unit_price'].values,
'q_true': y_tr.values,
'q_pred': merf.predict(X_tr, Z_tr, C_tr)
}), 'TR')

do_merf_scatter(pd.DataFrame({
'p': X_te['unit_price'].values,
'q_true': y_te.values,
'q_pred': merf.predict(X_te, Z_te, C_te)
}), 'TE')


## 2.7. MERF + features

Let’s make the MERF models more sophisticated by adding in more fixed and random effects.

[24]:

p2i = {c: i for i, c in enumerate(list(df['product_id'].unique()))}
i2p = {v: k for k, v in p2i.items()}

[25]:

c2i = {c: i for i, c in enumerate(list(df['product_category_name'].unique()))}
i2c = {v: k for k, v in p2i.items()}

[26]:

def get_Xy(df):
a = df.groupby(['product_id', 'product_category_name', 'unit_price']) \
.agg(
qty_sum=pd.NamedAgg(column='qty', aggfunc='sum'),
freight_price=pd.NamedAgg(column='freight_price', aggfunc='mean'),
product_score=pd.NamedAgg(column='product_score', aggfunc='mean'),
comp_1=pd.NamedAgg(column='comp_1', aggfunc='mean'),
ps1=pd.NamedAgg(column='ps1', aggfunc='mean'),
fp1=pd.NamedAgg(column='fp1', aggfunc='mean'),
comp_2=pd.NamedAgg(column='comp_2', aggfunc='mean'),
ps2=pd.NamedAgg(column='ps2', aggfunc='mean'),
fp2=pd.NamedAgg(column='fp2', aggfunc='mean'),
comp_3=pd.NamedAgg(column='comp_3', aggfunc='mean'),
ps3=pd.NamedAgg(column='ps3', aggfunc='mean'),
fp3=pd.NamedAgg(column='fp3', aggfunc='mean'),
lag_price=pd.NamedAgg(column='lag_price', aggfunc='mean')
) \
.assign(
qty_sum=lambda d: np.log(d['qty_sum'])
) \
.rename(columns={'qty_sum': 'quantity'}) \
.reset_index() \
.assign(
bed1=lambda d: np.where(d['product_id']=='bed1', 1, 0),
garden5=lambda d: np.where(d['product_id']=='garden5', 1, 0),
consoles1=lambda d: np.where(d['product_id']=='consoles1', 1, 0),
garden7=lambda d: np.where(d['product_id']=='garden7', 1, 0),
health9=lambda d: np.where(d['product_id']=='health9', 1, 0),
cool4=lambda d: np.where(d['product_id']=='cool4', 1, 0),
health3=lambda d: np.where(d['product_id']=='health3', 1, 0),
perfumery1=lambda d: np.where(d['product_id']=='perfumery1', 1, 0),
cool5=lambda d: np.where(d['product_id']=='cool5', 1, 0),
health8=lambda d: np.where(d['product_id']=='health8', 1, 0),
garden4=lambda d: np.where(d['product_id']=='garden4', 1, 0),
computers5=lambda d: np.where(d['product_id']=='computers5', 1, 0),
garden10=lambda d: np.where(d['product_id']=='garden10', 1, 0),
computers6=lambda d: np.where(d['product_id']=='computers6', 1, 0),
health6=lambda d: np.where(d['product_id']=='health6', 1, 0),
garden6=lambda d: np.where(d['product_id']=='garden6', 1, 0),
health10=lambda d: np.where(d['product_id']=='health10', 1, 0),
watches2=lambda d: np.where(d['product_id']=='watches2', 1, 0),
health1=lambda d: np.where(d['product_id']=='health1', 1, 0),
garden8=lambda d: np.where(d['product_id']=='garden8', 1, 0),
garden9=lambda d: np.where(d['product_id']=='garden9', 1, 0),
watches6=lambda d: np.where(d['product_id']=='watches6', 1, 0),
cool3=lambda d: np.where(d['product_id']=='cool3', 1, 0),
perfumery2=lambda d: np.where(d['product_id']=='perfumery2', 1, 0),
cool2=lambda d: np.where(d['product_id']=='cool2', 1, 0),
computers1=lambda d: np.where(d['product_id']=='computers1', 1, 0),
consoles2=lambda d: np.where(d['product_id']=='consoles2', 1, 0),
health5=lambda d: np.where(d['product_id']=='health5', 1, 0),
watches8=lambda d: np.where(d['product_id']=='watches8', 1, 0),
furniture4=lambda d: np.where(d['product_id']=='furniture4', 1, 0),
watches5=lambda d: np.where(d['product_id']=='watches5', 1, 0),
health7=lambda d: np.where(d['product_id']=='health7', 1, 0),
bed3=lambda d: np.where(d['product_id']=='bed3', 1, 0),
garden3=lambda d: np.where(d['product_id']=='garden3', 1, 0),
bed2=lambda d: np.where(d['product_id']=='bed2', 1, 0),
furniture3=lambda d: np.where(d['product_id']=='furniture3', 1, 0),
watches4=lambda d: np.where(d['product_id']=='watches4', 1, 0),
watches3=lambda d: np.where(d['product_id']=='watches3', 1, 0),
furniture2=lambda d: np.where(d['product_id']=='furniture2', 1, 0),
garden2=lambda d: np.where(d['product_id']=='garden2', 1, 0),
furniture1=lambda d: np.where(d['product_id']=='furniture1', 1, 0),
health2=lambda d: np.where(d['product_id']=='health2', 1, 0),
garden1=lambda d: np.where(d['product_id']=='garden1', 1, 0),
cool1=lambda d: np.where(d['product_id']=='cool1', 1, 0),
computers4=lambda d: np.where(d['product_id']=='computers4', 1, 0),
watches7=lambda d: np.where(d['product_id']=='watches7', 1, 0),
computers3=lambda d: np.where(d['product_id']=='computers3', 1, 0),
health4=lambda d: np.where(d['product_id']=='health4', 1, 0),
watches1=lambda d: np.where(d['product_id']=='watches1', 1, 0),
computers2=lambda d: np.where(d['product_id']=='computers2', 1, 0),
bed4=lambda d: np.where(d['product_id']=='bed4', 1, 0),
bed5=lambda d: np.where(d['product_id']=='bed5', 1, 0)
) \
.drop(columns=['product_id']) \
.set_index(['product_category_name', 'unit_price'])

b = df.groupby(['product_category_name', 'unit_price']) \
.agg(
weekday=pd.NamedAgg(column='weekday', aggfunc='mean'),
holiday=pd.NamedAgg(column='holiday', aggfunc='mean'),
customers=pd.NamedAgg(column='customers', aggfunc='mean'),
volume=pd.NamedAgg(column='volume', aggfunc='mean'),
s=pd.NamedAgg(column='s', aggfunc='mean')
) \
.reset_index() \
.assign(
bed_bath_table=lambda d: np.where(d['product_category_name']=='bed_bath_table', 1, 0),
garden_tools=lambda d: np.where(d['product_category_name']=='garden_tools', 1, 0),
consoles_games=lambda d: np.where(d['product_category_name']=='consoles_games', 1, 0),
health_beauty=lambda d: np.where(d['product_category_name']=='health_beauty', 1, 0),
cool_stuff=lambda d: np.where(d['product_category_name']=='cool_stuff', 1, 0),
perfumery=lambda d: np.where(d['product_category_name']=='perfumery', 1, 0),
computers_accessories=lambda d: np.where(d['product_category_name']=='computers_accessories', 1, 0),
furniture_decor=lambda d: np.where(d['product_category_name']=='furniture_decor', 1, 0)
) \
.set_index(['product_category_name', 'unit_price'])

Xy = a \
.join(b, how='inner') \
.reset_index() \
.assign(unit_price=lambda d: np.log(d['unit_price']))

Xy = Xy.assign(product_category_name=lambda d: d['product_category_name'].map(c2i)) \
.query('unit_price < 5.7') \
.query('quantity < 4.3')

X = Xy[Xy.columns.drop(['quantity', 'product_category_name', 'bed_bath_table', 'garden_tools', 'consoles_games', 'health_beauty', 'cool_stuff', 'perfumery', 'computers_accessories', 'watches_gifts', 'furniture_decor', 'weekday', 'holiday', 'customers', 'volume', 's'])]
Z = Xy[['bed_bath_table', 'garden_tools', 'consoles_games', 'health_beauty', 'cool_stuff', 'perfumery', 'computers_accessories', 'watches_gifts', 'furniture_decor', 'weekday', 'holiday', 'customers', 's']]
C = Xy['product_category_name']
y = Xy['quantity']

return X, Z, C, y

X_tr, Z_tr, C_tr, y_tr = get_Xy(df_tr)
X_te, Z_te, C_te, y_te = get_Xy(df_te)

[27]:

X_tr.shape, Z_tr.shape, C_tr.shape, y_tr.shape

[27]:

((123, 65), (123, 13), (123,), (123,))

[28]:

X_te.shape, Z_te.shape, C_te.shape, y_te.shape

[28]:

((184, 65), (184, 13), (184,), (184,))

[29]:

merf = MERF(
fixed_effects_model=LGBMRegressor(n_jobs=-1, random_state=37, num_leaves=500, n_estimators=5, max_depth=100),
max_iterations=30
)
merf.fit(X_tr, Z_tr, C_tr, y_tr)
get_performance(y_tr, merf.predict(X_tr, Z_tr, C_tr), y_te, merf.predict(X_te, Z_te, C_te))

INFO     [merf.py:307] Training GLL is -88.13270721590854 at iteration 1.
INFO     [merf.py:307] Training GLL is -131.66918167831366 at iteration 2.
INFO     [merf.py:307] Training GLL is -158.3396085429746 at iteration 3.
INFO     [merf.py:307] Training GLL is -172.36626681950983 at iteration 4.
INFO     [merf.py:307] Training GLL is -183.273845220244 at iteration 5.
INFO     [merf.py:307] Training GLL is -195.50527698451978 at iteration 6.
INFO     [merf.py:307] Training GLL is -206.25905996127523 at iteration 7.
INFO     [merf.py:307] Training GLL is -216.38182995185898 at iteration 8.
INFO     [merf.py:307] Training GLL is -226.02365508411302 at iteration 9.
INFO     [merf.py:307] Training GLL is -235.22712598205717 at iteration 10.
INFO     [merf.py:307] Training GLL is -244.08248099738415 at iteration 11.
INFO     [merf.py:307] Training GLL is -252.6305118737953 at iteration 12.
INFO     [merf.py:307] Training GLL is -260.89381991319885 at iteration 13.
INFO     [merf.py:307] Training GLL is -268.9004674811783 at iteration 14.
INFO     [merf.py:307] Training GLL is -276.6759567855772 at iteration 15.
INFO     [merf.py:307] Training GLL is -284.2689734523188 at iteration 16.
INFO     [merf.py:307] Training GLL is -291.6957739036444 at iteration 17.
INFO     [merf.py:307] Training GLL is -300.27162761932044 at iteration 18.
INFO     [merf.py:307] Training GLL is -307.3709908011281 at iteration 19.
INFO     [merf.py:307] Training GLL is -314.25093506142235 at iteration 20.
INFO     [merf.py:307] Training GLL is -320.95384324169385 at iteration 21.
INFO     [merf.py:307] Training GLL is -327.49634087153373 at iteration 22.
INFO     [merf.py:307] Training GLL is -334.8272808170837 at iteration 23.
INFO     [merf.py:307] Training GLL is -339.9610697921489 at iteration 24.
INFO     [merf.py:307] Training GLL is -347.1184754251438 at iteration 25.
INFO     [merf.py:307] Training GLL is -353.1474426017086 at iteration 26.
INFO     [merf.py:307] Training GLL is -358.9569538145549 at iteration 27.
INFO     [merf.py:307] Training GLL is -364.6142070424412 at iteration 28.
INFO     [merf.py:307] Training GLL is -370.1421471030787 at iteration 29.
INFO     [merf.py:307] Training GLL is -375.51542805296117 at iteration 30.

[29]:

mae_tr           7.865860
mae_te          12.273571
mape_tr          0.727956
mape_te          1.127332
wmape_tr         0.446264
wmape_te         0.663436
range_mae_tr     0.109248
range_mae_te     0.175337
std_mae_tr       0.506936
std_mae_te       0.772760
mean_mae_tr      0.446264
mean_mae_te      0.663436
dtype: float64


So far, the MAEs for the different models are as follows.

• Log-log MAE: 22.9

• Log-log + Features MAE: 19.2

• MERF MAE: 18.4

• MERF + Features MAE: 12.3

[30]:

gbm = LGBMRegressor(n_jobs=-1, random_state=37, num_leaves=10, n_estimators=50, max_depth=100)
gbm.fit(X_tr.join(Z_tr), y_tr)

[30]:

LGBMRegressor(max_depth=100, n_estimators=50, num_leaves=10, random_state=37)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
[31]:

get_performance(y_tr, gbm.predict(X_tr.join(Z_tr)), y_te, gbm.predict(X_te.join(Z_te)))

[31]:

mae_tr           6.141472
mae_te          10.011228
mape_tr          0.483427
mape_te          1.043864
wmape_tr         0.348432
wmape_te         0.541147
range_mae_tr     0.085298
range_mae_te     0.143018
std_mae_tr       0.395804
std_mae_te       0.630320
mean_mae_tr      0.348432
mean_mae_te      0.541147
dtype: float64


So far, the MAEs for the different models are as follows.

• Log-log MAE: 22.9

• Log-log + Features MAE: 19.2

• MERF MAE: 18.4

• MERF + Features MAE: 12.3

• MERF + Features (GB) MAE: 10.0

## 2.8. Random forest

Let’s just use a random forest regressor without any mixed modelling.

[32]:

rf = RandomForestRegressor(
n_jobs=-1,
random_state=37,
n_estimators=500,
criterion='absolute_error',
bootstrap=True,
warm_start=True,
min_samples_split=5
)
rf.fit(X_tr.join(Z_tr), y_tr)

get_performance(y_tr, rf.predict(X_tr.join(Z_tr)), y_te, rf.predict(X_te.join(Z_te)))

[32]:

mae_tr           3.733740
mae_te          10.506863
mape_tr          0.264830
mape_te          0.974256
wmape_tr         0.211831
wmape_te         0.567939
range_mae_tr     0.051857
range_mae_te     0.150098
std_mae_tr       0.240631
std_mae_te       0.661526
mean_mae_tr      0.211831
mean_mae_te      0.567939
dtype: float64


So far, the MAEs for the different models are as follows.

• Log-log MAE: 22.9

• Log-log + Features MAE: 19.2

• MERF MAE: 18.4

• MERF + Features MAE: 12.3

• MERF + Features (GB) MAE: 10.0

• RF MAE: 10.5

[33]:

do_merf_scatter(pd.DataFrame({
'p': np.exp(X_tr['unit_price'].values),
'q_true': np.exp(y_tr.values),
'q_pred': np.exp(rf.predict(X_tr.join(Z_tr)))
}), 'TR')

do_merf_scatter(pd.DataFrame({
'p': np.exp(X_te['unit_price'].values),
'q_true': np.exp(y_te.values),
'q_pred': np.exp(rf.predict(X_te.join(Z_te)))
}), 'TE')