tf-idf looks at a normalized count: each word's count is divided by the number of documents in which that word appears
tf-idf(w, d) = bow(w, d) * log(N / (number of documents in which word w appears))
if a word appears in many documents, then N divided by its document count is close to 1
the log turns 1 into 0 and compresses large ratios, so common words get weights near 0 while rare words keep larger weights
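The formula above can be sketched in plain Python (the toy corpus below is invented for illustration):

```python
import math

# Toy corpus invented for illustration: each document is a bag of words.
docs = [
    ["the", "puppy", "is", "cute"],
    ["the", "cat", "is", "lazy"],
    ["a", "puppy", "is", "not", "a", "cat"],
]
N = len(docs)

def idf(word):
    # Inverse document frequency: log(N / number of docs containing the word).
    df = sum(word in doc for doc in docs)
    return math.log(N / df)

def tf_idf(word, doc):
    # Raw bag-of-words count scaled by the word's idf.
    return doc.count(word) * idf(word)

# "is" occurs in every document, so N/df = 1 and its tf-idf is 0.
print(tf_idf("is", docs[0]))  # prints 0.0
# "puppy" occurs in only 2 of 3 documents, so it keeps a positive weight.
print(tf_idf("puppy", docs[0]))
```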

Plotting sentences in the feature space of three words: “puppy”, “cat”, and “is”.
tf-idf makes rare words more prominent and effectively ignores common words

tf-idf representation of the same sentences as above. “is” is set to 0 since it occurred in all documents (in this case, sentences).
tf-idf is an example of feature scaling: it transforms each word-count feature by multiplying it with a word-specific constant (the word's idf)
data matrix: contains data points represented as fixed-length flat vectors
with BOW, the data matrix is called the document-term matrix

An example of document-term matrix of five documents (rows) and seven words (columns/features)
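A minimal sketch of building such a document-term matrix with only the standard library (corpus and vocabulary invented for illustration):

```python
from collections import Counter

# Toy corpus: each string is one document (here, one sentence).
corpus = [
    "the puppy is cute",
    "the cat is cute",
    "a puppy and a cat",
]

# Vocabulary: one column per distinct word, in sorted order.
vocab = sorted({w for doc in corpus for w in doc.split()})

# Document-term matrix: one row of fixed-length word counts per document.
dtm = [[Counter(doc.split())[w] for w in vocab] for doc in corpus]

print(vocab)  # columns (features)
print(dtm)    # rows (data points)
```

In practice a library vectorizer (e.g. scikit-learn's CountVectorizer) does the same thing and returns a sparse matrix, since most counts are 0.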
training a linear classifier boils down to finding the best linear combination of features
a large column space → little linear dependency between the features
null space → “novel” data points that cannot be expressed as linear combinations of existing data
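These notions can be checked numerically; a sketch with NumPy on a hypothetical 3×3 data matrix whose columns are linearly dependent:

```python
import numpy as np

# Hypothetical data matrix (3 data points x 3 features) where the third
# column equals the sum of the first two, so the features are linearly
# dependent and the column space is only 2-dimensional.
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [2.0, 1.0, 3.0],
])

rank = np.linalg.matrix_rank(A)
print(rank)  # prints 2: the columns span a 2D subspace, not all of R^3

# The null space is spanned by the right singular vectors whose
# singular values are (numerically) zero.
_, s, Vt = np.linalg.svd(A)
null_space = Vt[s < 1e-10]
print(null_space)  # one basis vector, proportional to (1, 1, -1)
```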