Numeric Data
- things to consider
	- magnitude and sign of the data (strictly positive, strictly negative, or spanning both)
- scale of the data
- do we need to normalize the data so the output stays on an expected scale
- features that are NOT sensitive to scale
- logical functions (binary)
- step functions (models based on space-partitioning)
		- for scale-sensitive models fed unbounded inputs (e.g., ever-growing counts), rescale the inputs periodically
- bin-counting
- distribution: summarizes the probability of taking on a particular value
- Linear Regression
- assumes the prediction errors are Gaussian distributed
	- transform the data to make its distribution closer to Gaussian
- features that might interact with each other
	- combine features (e.g., pairwise products) in the hope of capturing important information
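A minimal numpy sketch of the scaling options and a simple product interaction (the feature values here are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 5.0, 10.0, 100.0])

# Min-max scaling: squash values into [0, 1] (assumes x is not constant).
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: mean 0, unit variance.
z = (x - x.mean()) / x.std()

# Simple interaction feature: the pairwise product of two columns.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])  # append x1 * x2
```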
Dealing with Counts
- binarization - split data into two bins
	- ex: split the data based on whether a value is ≥ 1 or < 1
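The ≥ 1 split above can be sketched in numpy (the counts are made-up example data):

```python
import numpy as np

# Hypothetical counts; binarize to "occurred at least once" (1) vs "never" (0).
counts = np.array([0, 1, 3, 0, 7])
binary = (counts >= 1).astype(int)
```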
- Quantization or Binning
- maps a continuous number into a discrete one
	- Fixed-Width Binning
- each bin contains a specific numeric range
	- Quantile Binning
		- bin edges placed at quantiles of the data, so each bin holds roughly the same number of points; adapts to skewed distributions
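Both binning schemes can be sketched with numpy (example values and a bin width of 10 are assumptions for illustration):

```python
import numpy as np

x = np.array([0.5, 3.2, 7.7, 12.1, 45.0, 99.9])

# Fixed-width binning: every 10 units maps to one bin index.
fixed_bins = np.floor(x / 10).astype(int)

# Quantile binning: edges at the quartiles, so each bin holds
# roughly the same number of points.
edges = np.quantile(x, [0.25, 0.5, 0.75])
quantile_bins = np.digitize(x, edges)
```

Note how the fixed-width bins leave most values crowded into bin 0 while the quantile bins spread them evenly.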
Power Transforms
- variance-stabilizing transformations
- change the distribution of the variable so that the variance is no longer dependent on the mean
Log Transform
- great for positive numbers with a heavy-tail distribution
	- when comparing models with and without the log transform, use cross-validation to obtain not only an estimate of the score but also its variance, which helps gauge whether the difference between the two models is meaningful
	- if the target variable is also heavy-tailed, consider log transforming it too
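A small sketch with numpy: `log1p` computes log(1 + x), which is a common way to apply the log transform to counts that include zeros (the counts below are made up):

```python
import numpy as np

# log1p maps zero counts to zero instead of -inf.
counts = np.array([0.0, 9.0, 99.0, 999.0])
logged = np.log1p(counts)
```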
Box-Cox Transformation
$$
\tilde{x} = \begin{cases} \dfrac{x^\lambda - 1}{\lambda}, & \text{if } \lambda \neq 0 \\ \ln{x}, & \text{if } \lambda = 0 \end{cases}
$$
- setting $\lambda < 1$ compresses the higher values
- setting $\lambda > 1$ compresses the lower values
- $\lambda=0$ → log transform
- only works when the data is positive
	- if the data contains zero or negative values, shift them by adding a fixed constant before transforming
- need to determine the value for $\lambda$
- use maximum likelihood
- find $\lambda$ that maximizes the Gaussian likelihood of the resulting transformed signal
- Bayesian methods
- optimal Box-Cox transform deflates the tail more than the log transform
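A sketch of the maximum-likelihood fit using `scipy.stats.boxcox` on synthetic heavy-tailed data (the lognormal sample here is an assumption for illustration; with `lmbda=None` scipy chooses $\lambda$ by maximizing the Gaussian log-likelihood of the transformed data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Heavy-tailed, strictly positive sample.
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# With lmbda unspecified, boxcox returns the transformed data and
# the lambda that maximizes the Gaussian likelihood.
transformed, lmbda = stats.boxcox(x)
```

Since the sample is exactly lognormal, the fitted $\lambda$ should land near 0, i.e., close to a plain log transform, and the transformed data should be far less skewed than the original.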