Numeric Data
- things to consider
	- magnitude and sign of the data (strictly positive, strictly negative, or spanning both)
- scale of the data
- do we need to normalize the data so the output stays on an expected scale
- features that are NOT sensitive to scale
- logical functions (binary)
- step functions (models based on space-partitioning)
		- for scale-sensitive models fed unbounded inputs (e.g., ever-growing counts), rescale the inputs periodically
- bin-counting
- distribution: summarizes the probability of taking on a particular value
- Linear Regression
- assumes the prediction errors are Gaussian distributed
	- transform the data to make its distribution closer to Gaussian
- features that might interact with each other
	- combine features (e.g., pairwise products) in the hope of capturing important information
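A minimal numpy sketch of the scaling options and a simple product interaction (the feature values here are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 5.0, 10.0, 100.0])

# Min-max scaling: squash values into [0, 1] (assumes x is not constant).
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: mean 0, unit variance.
z = (x - x.mean()) / x.std()

# Simple interaction feature: the pairwise product of two columns.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])  # append x1 * x2
```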
Dealing with Counts
- binarization - split data into two bins
	- ex: split the data based on whether a value is ≥ 1 or < 1
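The ≥ 1 split above can be sketched in numpy (the counts are made-up example data):

```python
import numpy as np

# Hypothetical counts; binarize to "occurred at least once" (1) vs "never" (0).
counts = np.array([0, 1, 3, 0, 7])
binary = (counts >= 1).astype(int)
```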
- Quantization or Binning
- maps a continuous number into a discrete one
	- Fixed-Width Binning
- each bin contains a specific numeric range
	- Quantile Binning
		- bin edges placed at quantiles of the data, so each bin holds roughly the same number of points; adapts to skewed distributions
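Both binning schemes can be sketched with numpy (example values and a bin width of 10 are assumptions for illustration):

```python
import numpy as np

x = np.array([0.5, 3.2, 7.7, 12.1, 45.0, 99.9])

# Fixed-width binning: every 10 units maps to one bin index.
fixed_bins = np.floor(x / 10).astype(int)

# Quantile binning: edges at the quartiles, so each bin holds
# roughly the same number of points.
edges = np.quantile(x, [0.25, 0.5, 0.75])
quantile_bins = np.digitize(x, edges)
```

Note how the fixed-width bins leave most values crowded into bin 0 while the quantile bins spread them evenly.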
Power Transforms
- variance-stabilizing transformations
- change the distribution of the variable so that the variance is no longer dependent on the mean
Log Transform
- great for positive numbers with a heavy-tail distribution
	- when comparing models with and without the log transform, use cross-validation to obtain not only an estimate of the score but also its variance, which helps gauge whether the difference between the two models is meaningful
	- if the target variable is also heavy-tailed, consider log transforming it too
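A small sketch with numpy: `log1p` computes log(1 + x), which is a common way to apply the log transform to counts that include zeros (the counts below are made up):

```python
import numpy as np

# log1p maps zero counts to zero instead of -inf.
counts = np.array([0.0, 9.0, 99.0, 999.0])
logged = np.log1p(counts)
```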
Box-Cox Transformation
$$
\tilde{x} = \begin{cases} \dfrac{x^\lambda - 1}{\lambda}, & \text{if } \lambda \neq 0 \\ \ln{x}, & \text{if } \lambda = 0 \end{cases}
$$
- setting $\lambda < 1$ compresses the higher values
- setting $\lambda > 1$ compresses the lower values
- $\lambda=0$ → log transform
- only works when the data is positive
	- if the data contains zero or negative values, shift them by adding a fixed constant before transforming
- need to determine the value for $\lambda$
- use maximum likelihood
- find $\lambda$ that maximizes the Gaussian likelihood of the resulting transformed signal
- Bayesian methods
- optimal Box-Cox transform deflates the tail more than the log transform
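A sketch of the maximum-likelihood fit using `scipy.stats.boxcox` on synthetic heavy-tailed data (the lognormal sample here is an assumption for illustration; with `lmbda=None` scipy chooses $\lambda$ by maximizing the Gaussian log-likelihood of the transformed data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Heavy-tailed, strictly positive sample.
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# With lmbda unspecified, boxcox returns the transformed data and
# the lambda that maximizes the Gaussian likelihood.
transformed, lmbda = stats.boxcox(x)
```

Since the sample is exactly lognormal, the fitted $\lambda$ should land near 0, i.e., close to a plain log transform, and the transformed data should be far less skewed than the original.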