tf-idf looks at a normalized count: each word's count is divided by the number of documents in which that word appears
tf-idf(w, d) = bow(w, d) * log(N / (number of documents in which word w appears))
if a word appears in many documents, then N divided by its document count is close to 1
the log turns 1 into 0 and compresses large ratios, so common words get weights near 0 while rare words keep larger weights
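The formula above can be sketched in plain Python (the toy corpus below is invented for illustration):

```python
import math

# Toy corpus invented for illustration: each document is a bag of words.
docs = [
    ["the", "puppy", "is", "cute"],
    ["the", "cat", "is", "lazy"],
    ["a", "puppy", "is", "not", "a", "cat"],
]
N = len(docs)

def idf(word):
    # Inverse document frequency: log(N / number of docs containing the word).
    df = sum(word in doc for doc in docs)
    return math.log(N / df)

def tf_idf(word, doc):
    # Raw bag-of-words count scaled by the word's idf.
    return doc.count(word) * idf(word)

# "is" occurs in every document, so N/df = 1 and its tf-idf is 0.
print(tf_idf("is", docs[0]))  # prints 0.0
# "puppy" occurs in only 2 of 3 documents, so it keeps a positive weight.
print(tf_idf("puppy", docs[0]))
```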

Plotting sentences in the feature space of three words: “puppy”, “cat”, and “is”.
tf-idf makes rare words more prominent and effectively ignores common words

tf-idf representation of the same sentences as above. “is” is set to 0 since it occurred in all documents (in this case, sentences).
tf-idf is an example of feature scaling: it transforms each word-count feature by multiplying it with a word-specific constant (the word's idf)
data matrix: contains data points represented as fixed-length flat vectors
with BOW, the data matrix is called the document-term matrix

An example of document-term matrix of five documents (rows) and seven words (columns/features)
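A minimal sketch of building such a document-term matrix with only the standard library (corpus and vocabulary invented for illustration):

```python
from collections import Counter

# Toy corpus: each string is one document (here, one sentence).
corpus = [
    "the puppy is cute",
    "the cat is cute",
    "a puppy and a cat",
]

# Vocabulary: one column per distinct word, in sorted order.
vocab = sorted({w for doc in corpus for w in doc.split()})

# Document-term matrix: one row of fixed-length word counts per document.
dtm = [[Counter(doc.split())[w] for w in vocab] for doc in corpus]

print(vocab)  # columns (features)
print(dtm)    # rows (data points)
```

In practice a library vectorizer (e.g. scikit-learn's CountVectorizer) does the same thing and returns a sparse matrix, since most counts are 0.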
training a linear classifier boils down to finding the best linear combination of features
a large column space → little linear dependency between the features
null space → “novel” data points that cannot be expressed as linear combinations of existing data
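These notions can be checked numerically; a sketch with NumPy on a hypothetical 3×3 data matrix whose columns are linearly dependent:

```python
import numpy as np

# Hypothetical data matrix (3 data points x 3 features) where the third
# column equals the sum of the first two, so the features are linearly
# dependent and the column space is only 2-dimensional.
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [2.0, 1.0, 3.0],
])

rank = np.linalg.matrix_rank(A)
print(rank)  # prints 2: the columns span a 2D subspace, not all of R^3

# The null space is spanned by the right singular vectors whose
# singular values are (numerically) zero.
_, s, Vt = np.linalg.svd(A)
null_space = Vt[s < 1e-10]
print(null_space)  # one basis vector, proportional to (1, 1, -1)
```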