9. PDFs and CDFs
This notebook demonstrates how to move between a probability density function (PDF) and a cumulative distribution function (CDF). Given a PDF, the CDF may be derived by integrating the PDF; given a CDF, the PDF may be derived by differentiating the CDF.
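As a quick numerical sanity check (a sketch using scipy, separate from the notebook's cells), integrating the standard normal PDF from \(-\infty\) up to a point reproduces the CDF at that point:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

x = 1.5
# CDF(x) is the integral of the PDF from -infinity to x
area, _ = quad(norm.pdf, -np.inf, x)
assert abs(area - norm.cdf(x)) < 1e-8
```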
9.1. Standard normal distribution
Here, we visualize the PDF and CDF of the standard normal distribution. The functions scipy.stats.norm.pdf and scipy.stats.norm.cdf will be used to generate the curves.
[1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import norm
import warnings
plt.style.use('ggplot')
np.random.seed(37)
warnings.filterwarnings('ignore')
[2]:
x = np.arange(-6, 6.1, 0.1)
y_pdf = norm.pdf(x)
y_cdf = norm.cdf(x)
fig, ax = plt.subplots(figsize=(15, 6))
ax = [ax, ax.twinx()]
_ = ax[0].plot(x, y_pdf, label='pdf', color='r')
_ = ax[1].plot(x, y_cdf, label='cdf', color='b')
_ = ax[0].tick_params(axis='y', labelcolor='r')
_ = ax[1].tick_params(axis='y', labelcolor='b')
_ = ax[0].set_ylabel('pdf', color='r')
_ = ax[1].set_ylabel('cdf', color='b')
_ = ax[0].set_title('PDF and CDF of standard normal')
We will use scipy.misc.derivative (deprecated in recent SciPy releases) and scipy.integrate.quad to take the derivative of the CDF to recover the PDF, and to integrate the PDF to recover the CDF, respectively.
[3]:
from scipy.misc import derivative
from scipy.integrate import quad
pieces = [quad(norm.pdf, a, b)[0] for a, b in zip(x, x[1:])]
y_cdf = np.concatenate(([0.0], np.cumsum(pieces)))  # prepend 0 so y_cdf[0] = CDF(x[0]) ~ 0
y_pdf = derivative(norm.cdf, x, dx=1e-6)
fig, ax = plt.subplots(1, 2, figsize=(15, 6))
_ = ax[0].plot(x, y_pdf, color='r')
_ = ax[1].plot(x, y_cdf, color='b')
_ = ax[0].set_title('PDF from derivative of CDF')
_ = ax[1].set_title('CDF from integration of PDF')
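One way to confirm the reconstruction (a sketch, not an original cell, using a central-difference approximation in place of scipy.misc.derivative): the recovered values should agree with scipy's closed-form norm.pdf and norm.cdf.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

x = np.arange(-6, 6.1, 0.1)

# central-difference derivative of the CDF should match the closed-form PDF
dx = 1e-6
y_pdf = (norm.cdf(x + dx) - norm.cdf(x - dx)) / (2 * dx)
assert np.allclose(y_pdf, norm.pdf(x), atol=1e-4)

# piecewise integrals of the PDF, accumulated, should match the CDF
pieces = [quad(norm.pdf, a, b)[0] for a, b in zip(x, x[1:])]
y_cdf = norm.cdf(x[0]) + np.concatenate(([0.0], np.cumsum(pieces)))
assert np.allclose(y_cdf, norm.cdf(x), atol=1e-4)
```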
9.2. Log-normal distribution
[4]:
from scipy.stats import lognorm
lognorm_pdf = lambda x: lognorm.pdf(x, 1)
lognorm_cdf = lambda x: lognorm.cdf(x, 1)
x = np.arange(0, 10.1, 0.05)
y_pdf = lognorm_pdf(x)
y_cdf = lognorm_cdf(x)
fig, ax = plt.subplots(figsize=(15, 6))
ax = [ax, ax.twinx()]
_ = ax[0].plot(x, y_pdf, label='pdf', color='r')
_ = ax[1].plot(x, y_cdf, label='cdf', color='b')
_ = ax[0].tick_params(axis='y', labelcolor='r')
_ = ax[1].tick_params(axis='y', labelcolor='b')
_ = ax[0].set_ylabel('pdf', color='r')
_ = ax[1].set_ylabel('cdf', color='b')
_ = ax[0].set_title('PDF and CDF of log-normal')
[5]:
pieces = [quad(lognorm_pdf, a, b)[0] for a, b in zip(x, x[1:])]
y_cdf = np.concatenate(([0.0], np.cumsum(pieces)))  # prepend 0 so y_cdf[0] = CDF(x[0]) = 0
y_pdf = derivative(lognorm_cdf, x, dx=1e-6)
fig, ax = plt.subplots(1, 2, figsize=(15, 6))
_ = ax[0].plot(x, y_pdf, color='r')
_ = ax[1].plot(x, y_cdf, color='b')
_ = ax[0].set_title('PDF from derivative of CDF')
_ = ax[1].set_title('CDF from integration of PDF')
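A related identity gives an independent check on the curves above (a sketch, not an original cell): with shape parameter 1 and scale 1, the log-normal PDF is the standard normal PDF evaluated at \(\log x\), divided by \(x\).

```python
import numpy as np
from scipy.stats import lognorm, norm

x = np.arange(0.05, 10.05, 0.05)
# change of variables: if log X ~ N(0, 1), then pdf_X(x) = pdf_N(log x) / x
assert np.allclose(lognorm.pdf(x, 1), norm.pdf(np.log(x)) / x)
```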
9.3. Learn a PDF from arbitrary CDF
We will generate an arbitrary CDF using the logistic function.
[6]:
logistic = lambda x, L=1, x_0=0, k=1: L / (1 + np.exp(-k * (x - x_0)))
x = np.arange(-6, 6.1, 0.1)
y = logistic(x)
x = x + 6.0
fig, ax = plt.subplots(figsize=(15, 6))
_ = ax.plot(x, y, color='b')
_ = ax.set_title('Basic s-curve using logistic function')
The parameters \(L\), \(x_0\), and \(k\) of the logistic function will be learned with scipy.optimize.curve_fit.
[7]:
from scipy.optimize import curve_fit
L_estimate = y.max()
x_0_estimate = np.median(x)
k_estimate = 1.0
p_0 = [L_estimate, x_0_estimate, k_estimate]
popt, pcov = curve_fit(logistic, x, y, p_0, method='dogbox')
L, x_0, k = popt[0], popt[1], popt[2]
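As a sanity check on the fit (a sketch reproducing the cells above), the recovered parameters should land near the generating values \(L=1\), \(x_0=6\), \(k=1\), since the curve was generated on a grid centered at 0 and then shifted by 6:

```python
import numpy as np
from scipy.optimize import curve_fit

logistic = lambda x, L=1, x_0=0, k=1: L / (1 + np.exp(-k * (x - x_0)))

x = np.arange(-6, 6.1, 0.1) + 6.0   # shifted grid, as above
y = logistic(x - 6.0)               # curve was generated before the shift

p_0 = [y.max(), np.median(x), 1.0]
popt, _ = curve_fit(logistic, x, y, p_0, method='dogbox')
L, x_0, k = popt
assert abs(L - 1.0) < 0.05 and abs(x_0 - 6.0) < 0.05 and abs(k - 1.0) < 0.05
```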
We then take the derivative of the fitted logistic CDF to estimate its PDF.
[8]:
logistic = lambda x, L=L, x_0=x_0, k=k: L / (1 + np.exp(-k * (x - x_0)))
y_pdf = derivative(logistic, x, dx=1e-6)
fig, ax = plt.subplots(figsize=(15, 6))
_ = ax.plot(x, y_pdf, color='r')
_ = ax.set_title('PDF from derivative of fitted logistic CDF')
9.4. Learn a CDF from arbitrary PDF
We will generate a Gaussian-mixture (GM) PDF and derive the CDF using integration.
[9]:
N = 1000
X = np.concatenate((np.random.normal(0, 1, int(0.3 * N)), np.random.normal(5, 1, int(0.7 * N))))
[10]:
s = pd.Series(X)
fig, ax = plt.subplots(figsize=(15, 6))
_ = s.plot(kind='kde', bw_method='scott', ax=ax)
_ = ax.set_title('Gaussian-mixture PDF')
We then use a kernel density estimator to learn the PDF.
[11]:
from sklearn.neighbors import KernelDensity
kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(X[:, np.newaxis])
gmm_pdf = lambda x: np.exp(kde.score(np.array([[x]])))  # score of a single sample is its log-density
Finally, the CDF will be estimated by integrating the learned PDF.
[12]:
%%time
x = np.arange(-5, 10.1, 0.1)
pieces = [quad(gmm_pdf, a, b)[0] for a, b in zip(x, x[1:])]
y_cdf = np.concatenate(([0.0], np.cumsum(pieces)))  # prepend 0 so lengths match len(x)
CPU times: user 522 ms, sys: 0 ns, total: 522 ms
Wall time: 548 ms
[13]:
fig, ax = plt.subplots(figsize=(15, 6))
_ = ax.plot(x, y_cdf, color='b')
_ = ax.set_title('CDF from integration of Gaussian-mixture PDF')
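As a closing check (a sketch re-running the cells above), the learned density should integrate to roughly 1 over the plotted range, so the CDF reconstructed by integration should top out near 1:

```python
import numpy as np
from scipy.integrate import quad
from sklearn.neighbors import KernelDensity

np.random.seed(37)
N = 1000
X = np.concatenate((np.random.normal(0, 1, int(0.3 * N)),
                    np.random.normal(5, 1, int(0.7 * N))))

kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(X[:, np.newaxis])
gmm_pdf = lambda x: np.exp(kde.score(np.array([[x]])))  # log-density of one point

# (almost) all of the estimated mass lies in [-5, 10]
total, _ = quad(gmm_pdf, -5, 10, limit=200)
assert abs(total - 1.0) < 0.01
```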