Before Transformers: The NLP Fundamentals You Still Need
Bag-of-words, TF-IDF, text classification using linear models (LogReg, SVM, Naive Bayes), topic modeling (LDA, NMF), information retrieval basics (BM25, TF-IDF search, ranking)
A lot of people assume modern NLP systems are powered entirely by transformers. But if you look inside many real production pipelines, you’ll still find something much simpler sitting at the core.
TF-IDF. BM25. Bag-of-words features.
Not because companies are stuck in the past, but because these representations solve certain problems incredibly well. Search engines, document ranking systems, internal knowledge bases, and recommendation pipelines often rely on them because they are fast, interpretable, and surprisingly effective.
The first time I saw this in practice was when a simple TF-IDF pipeline outperformed a much heavier neural model we were experimenting with. The neural model understood language better in theory, but the classical system was faster, easier to tune, and more reliable for the specific task. And this pattern shows up everywhere, from search systems to ML interviews.
So in this post, we’ll walk through some of the most common questions around classical NLP representations and why these ideas still matter today.
How does TF-IDF improve on bag-of-words?
A bag-of-words model represents a document by simply counting how many times each word appears. It ignores grammar, order, and context. If the word “contract” appears five times, the vector stores a 5. If “the” appears ten times, it stores a 10. Every word is treated the same way, and the model has no sense of which words are informative versus which ones are just filler.
TF-IDF adds weighting on top of this. TF, or term frequency, still measures how often a word appears in a document. IDF, or inverse document frequency, measures how rare that word is across the entire collection of documents. Words that show up in almost every document get a low IDF, and words that appear only in specific contexts get a higher IDF.
This weighting is what improves on bag-of-words. In a plain bag-of-words model, very common words like “the,” “and,” “of,” and “is” dominate the representation simply because they appear everywhere. They drown out more meaningful terms. TF-IDF fixes this by pushing down the influence of these high-frequency, low-information words. At the same time, it boosts words that are frequent in one document but uncommon across the corpus. For example, in a set of medical notes, a word like “metastasis” gets a high IDF because it only appears in certain cases, so it becomes a strong indicator of topic. A plain bag-of-words model would treat it no differently than any other word of the same count.
In practical tasks like search or classification, this makes a big difference. Documents about similar topics share high-TF-IDF words, not common filler words. Documents about different topics no longer look artificially similar because they both contain “the” and “is.” So TF-IDF keeps the simplicity of bag-of-words but adds a relevance signal that helps the model focus on what actually distinguishes one document from another.
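The weighting is easy to see in a minimal pure-Python sketch. This assumes whitespace tokenization and the simple unsmoothed IDF variant, log(N/df); real libraries use smoothed variants, but the effect is the same: a ubiquitous word like "the" collapses to zero weight while a rare term keeps its full weight.

```python
import math
from collections import Counter

docs = [
    "the patient shows signs of metastasis",
    "the weather is mild and the sun is out",
    "the market closed higher and the dollar rose",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def idf(term):
    # unsmoothed variant: log(N / df), where df = number of docs containing the term
    df = sum(term in doc for doc in tokenized)
    return math.log(N / df)

def tfidf(doc):
    counts = Counter(doc)
    return {term: count * idf(term) for term, count in counts.items()}

weights = tfidf(tokenized[0])
# "the" appears in all three documents, so its idf is log(3/3) = 0
# "metastasis" appears in only one, so it keeps a weight of log(3/1)
```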
Why do BoW and TF-IDF often lead to very sparse vectors?
Bag-of-Words and TF-IDF almost always produce very sparse vectors because of how they represent language.
Both methods build a vocabulary that contains every unique word across the entire corpus. In many real datasets this vocabulary can easily reach tens or hundreds of thousands of words. But any individual document only contains a tiny fraction of those words. If a document has 200 words and the vocabulary has 50,000 entries, then at least 49,800 of those entries will be zero for that document, and more if any words repeat. That is what gives you sparsity: most positions in the vector have no value.
This sparsity becomes even more extreme when you account for morphology, misspellings, and rare terms. The strings "computing" and "computer" are different dimensions. "contractual," "contracting," and "contract" are different dimensions. Misspellings like "behaivor" create brand-new dimensions that almost no document uses. As a result, the number of unique word forms grows faster than the number of meaningful patterns, which makes the matrix mostly zeros.
TF-IDF does not fix sparsity because it still relies on the same huge vocabulary. It only changes the weighting of the nonzero entries. A word that does not appear in a document still gets a zero, no matter what its IDF is. So you end up with high-dimensional vectors where each document activates only a very small subset of features.
In short, BoW and TF-IDF are sparse because natural language contains an enormous number of rare words, and these models allocate one dimension per unique word. Each document uses only a tiny slice of that space, which means the vectors are mostly zeros.
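A tiny illustration using plain Python and whitespace tokenization: three short documents on unrelated topics already produce vectors that are mostly zeros, and the effect only grows as the vocabulary does.

```python
from collections import Counter

docs = [
    "cats chase mice",
    "stocks fell sharply today",
    "the engine needs new oil",
]
vocab = sorted({w for d in docs for w in d.split()})

def bow(doc):
    counts = Counter(doc.split())
    return [counts.get(w, 0) for w in vocab]  # one dimension per vocabulary word

vectors = [bow(d) for d in docs]
zeros = sum(v.count(0) for v in vectors)
sparsity = zeros / (len(vocab) * len(docs))  # fraction of entries that are zero
# even with 3 tiny documents, 24 of the 36 entries (about 67%) are zero
```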
What happens to BoW features if you shuffle the word order of a sentence?
If you shuffle the words in a sentence, the Bag-of-Words features do not change at all. A BoW model only cares about which words appear and how many times they appear. It does not keep track of order, structure, or grammar. For example, the sentences: “dogs chase cats” and “cats chase dogs” and “chase dogs cats” all produce exactly the same BoW vector because they contain the same three words with the same frequencies.
This is both the strength and the weakness of BoW. It makes the representation simple and stable for tasks where order does not matter much, like very rough topic classification. But it also means the model cannot tell the difference between very different meanings if the same words show up. In a sentiment or intent task, or in anything that depends on who is doing what to whom, this loss of order can cause major errors.
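You can verify the order-invariance directly: a `Counter` over whitespace tokens is exactly a bag-of-words.

```python
from collections import Counter

def bow(sentence):
    # a bag-of-words is just a multiset of tokens; order is discarded
    return Counter(sentence.split())

same = bow("dogs chase cats") == bow("cats chase dogs") == bow("chase dogs cats")
# all three sentences map to {"dogs": 1, "chase": 1, "cats": 1}
```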
When might TF-IDF hurt performance compared to simple counts?
TF-IDF is usually an improvement over raw counts, but there are cases where it can actually hurt performance. The common thread is that TF-IDF down-weights frequent words, and sometimes those frequent words are exactly the ones that matter.
One situation is when the task depends on very common function words or structural cues. For example, in sentiment analysis, the word “not” appears in many documents and therefore gets a low IDF score. But “not” is extremely important for flipping polarity. A simple count model treats it normally, while TF-IDF can push it down so far that the model almost ignores it. You see similar problems with words like “never,” “no,” or even auxiliary verbs that help express tense or emphasis.
Another case is when the collection is small or highly homogeneous. If the documents all come from the same domain, many informative words appear in most documents, which means TF-IDF incorrectly labels them as unimportant. In a medical corpus, terms like “patient,” “treatment,” and “pain” might show up everywhere, but they still help classify subtle topics when combined with other words. A raw count model preserves their influence, while TF-IDF suppresses them too aggressively.
TF-IDF can also hurt performance in tasks where document length varies a lot. Common TF-IDF pipelines normalize each document vector using document-level statistics (scikit-learn's TfidfVectorizer applies L2 normalization by default), so shorter documents may lose signal, and longer documents with many repeated words may get over-penalized. Sometimes simple counts give more stable features when the input length has its own meaning, such as chat messages or short customer reviews.
Finally, TF-IDF can backfire when the model is already powerful enough to learn which frequent words matter. For example, a linear classifier or even a small neural model trained on raw counts can learn to down-weight unhelpful words on its own. In these cases, TF-IDF may over-correct, especially if the IDF values were computed on a limited or skewed corpus.
So although TF-IDF usually improves discrimination, it can hurt when frequent words carry important semantics, when the domain is narrow, when document lengths vary widely, or when the classifier is already capable of learning its own weights from simple counts.
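The negation problem is easy to demonstrate on a hypothetical review set (again using the unsmoothed log(N/df) variant for simplicity): "not" is nearly ubiquitous, so it ends up with a fraction of the weight of a rarer but less polarity-critical word.

```python
import math

reviews = [
    "not good at all",
    "not worth the price",
    "did not enjoy it",
    "not bad actually",
    "great sound quality",
]
N = len(reviews)

def idf(term):
    df = sum(term in r.split() for r in reviews)
    return math.log(N / df)

idf_not = idf("not")      # in 4 of 5 reviews -> log(5/4) ~ 0.22
idf_great = idf("great")  # in 1 of 5 reviews -> log(5/1) ~ 1.61
```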
How would you handle very rare words when building a BoW vocabulary?
Rare words can easily explode the feature space and introduce noise, so you want to keep the useful ones and either merge or remove the rest.
The simplest thing to do is set a minimum frequency threshold. For example, you might keep only words that appear in at least five documents. Words that appear once or twice are often typos, names, or artifacts from tokenization. Removing them shrinks the vocabulary dramatically and usually improves generalization. This works well in broad domains where extremely rare words do not add much signal.
Another option is to replace very rare words with a special “unknown” token. Instead of dropping them entirely, you treat all unseen or rare forms as one shared bucket. This helps the model learn how to handle unusual inputs without dedicating a separate feature to each. It is a good approach when you think rare words still hold some meaning but you cannot model them individually.
If the domain is noisy, I sometimes apply light normalization so that rare words collapse into more frequent ones. For example, stemming or lemmatization can merge inflected variants and reduce the number of unique surface forms. This is helpful when the rarity comes from spelling variation rather than real semantic difference.
In some settings, it makes sense to keep rare words but reduce their impact. You can down-weight them or apply feature selection using methods like chi-square or information gain. If a rare word actually helps predict a class label, these methods will keep it even if its frequency is low. This is common in specialized domains like legal or biomedical text, where rare technical terms might be important.
Frequency caps also help. You can limit the vocabulary to the top N most frequent terms to keep things stable, especially in huge corpora. If you expect many rare but irrelevant words, a frequency-based cutoff is usually the cleanest solution.
So the general rule is to remove, merge, or down-weight rare words unless the domain suggests they carry meaning. The goal is to prevent the vocabulary from becoming so large and sparse that the model struggles, while still keeping enough signal to capture important distinctions.
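A minimal sketch of the first two strategies, a document-frequency threshold plus an unknown bucket. The names `build_vocab` and `<UNK>` are illustrative, not from any particular library.

```python
from collections import Counter

def build_vocab(docs, min_df=2):
    # document frequency: in how many documents does each word appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    return {word for word, count in df.items() if count >= min_df}

def encode(doc, vocab, unk="<UNK>"):
    # rare and unseen words all collapse into one shared bucket
    return [w if w in vocab else unk for w in doc.split()]

docs = ["the model works", "the model fails", "the behaivor is odd"]
vocab = build_vocab(docs)                     # {"the", "model"}
tokens = encode("the behaivor works", vocab)  # ["the", "<UNK>", "<UNK>"]
```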
How does vocabulary size affect model performance and memory usage?
Vocabulary size affects both model performance and memory usage in very direct ways, and the impact shows up differently depending on the type of model you are using.
In classical bag-of-words or TF-IDF models, a large vocabulary means a huge number of input features. Most of them are zero for most documents, which creates extremely sparse vectors. This increases memory usage because the model has to store a giant feature space, and it slows down training and inference since the model has to process far more dimensions. It can also hurt accuracy. A very large vocabulary pulls in lots of rare or noisy words, which adds variance and makes the model overfit small quirks in the training data. Reducing vocabulary size through cutoff thresholds or normalization often improves generalization in these models.
For neural models, vocabulary size mostly affects the embedding layer, which is usually one of the largest parameter matrices in the entire model. If you double the vocabulary size, you roughly double the size of the embedding matrix. That increases model memory footprint, slows training, and can increase latency at inference time. It also means the model needs to learn good representations for many more tokens, including rare ones that barely appear in training. This usually leads to poorer quality embeddings for the long tail of words.
A smaller, well-designed vocabulary tends to help neural models. Subword tokenization is popular for exactly this reason. By breaking rare words into shared pieces, you keep the vocabulary compact while still handling new or unusual words. This improves robustness and reduces memory costs. Transformers in particular benefit from modest vocabularies, because attention layers operate on token sequences. If the vocabulary is too large, you end up with many tokens that carry very little training signal, which affects embedding quality and increases the chance of OOV-like behavior.
There is a performance trade-off, though. A vocabulary that is too small forces words to be broken into many subwords or characters. This increases sequence length, which slows down transformers because attention cost grows with the square of sequence length. So you do not want the vocabulary to be too large or too small. You want it just large enough to keep sequences reasonable, but not so large that the embedding matrix becomes massive.
In short, large vocabularies increase memory use, slow down models, and can reduce accuracy by scattering training signal across too many rare features. Small vocabularies reduce memory and often improve generalization, but can lengthen sequences. Good vocabularies strike a balance between these two extremes.
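The memory arithmetic for the embedding layer is simple enough to do directly. This assumes float32 parameters; real training also carries gradients and optimizer state on top of this.

```python
def embedding_bytes(vocab_size, embed_dim, bytes_per_param=4):
    # the embedding matrix stores one dense vector per vocabulary token
    return vocab_size * embed_dim * bytes_per_param

small = embedding_bytes(32_000, 768) / 2**20   # ~94 MB
large = embedding_bytes(250_000, 768) / 2**20  # ~732 MB, same model body otherwise
```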
How would you prevent overfitting in a linear text classifier?
I’d answer this in an interview by focusing on the practical techniques that actually matter for linear models like logistic regression, linear SVMs, or even ridge classifiers. Most of the overfitting risk comes from high-dimensional text features, so the goal is to control that complexity.
The first thing I look at is regularization. Linear text models usually respond very well to L2 regularization because it discourages large, unstable weights on rare features. If the model is still overfitting, I try increasing the regularization strength or experimenting with L1, which can prune away uninformative features entirely. In many sparse text problems, the regularization coefficient ends up being the most important hyperparameter.
Vocabulary control is another big lever. Raw text produces huge feature spaces, and most rare words do more harm than good. I prevent overfitting by trimming the vocabulary: remove words that appear too rarely, cap the vocabulary at the top N terms, or merge variants using light normalization such as lowercasing or lemmatizing. Reducing the feature space usually has a strong stabilizing effect on the classifier.
Feature weighting also helps. Using TF-IDF instead of raw counts typically reduces the dominance of very common words and spreads signal more evenly. This makes the model less sensitive to chance correlations in the training set. I have seen many cases where switching from counts to TF-IDF cuts overfitting significantly.
Cross-validation is essential. I use it not just for model selection, but to sanity-check whether the model is actually learning general patterns or just memorizing oddities of the training split. If cross-validation performance is very unstable, that usually means the feature space is too large or the regularization is too weak.
I also consider dimensionality reduction when the dataset is small. Methods like truncated SVD (LSA) can compress high-dimensional TF-IDF features into a denser representation. This often improves generalization because the model learns patterns at the concept level instead of memorizing individual words. It costs some interpretability, but it can dramatically reduce overfitting in low-data settings.
Finally, I monitor class imbalance and data leakage. If a classifier is overfitting because certain rare features correlate too perfectly with a label, balancing the data or removing leaked features can fix the issue. Simple things like stripping timestamps, IDs, or signature lines can prevent huge but meaningless coefficients.
So in practice I rely on stronger regularization, vocabulary trimming, TF-IDF weighting, cross-validation, and occasionally dimensionality reduction. These are straightforward techniques, but they go a long way because overfitting in linear text models mostly comes from the enormous and noisy feature space, not the model itself.
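As a sketch, here are the first few levers combined in one scikit-learn pipeline on a tiny invented dataset. The texts and the specific `C` and `min_df` values are illustrative; in practice you would tune them using the cross-validation shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = [
    "great product love it", "love the great battery", "great battery love it",
    "love it great value", "great sound love it", "love this great screen",
    "terrible battery hate it", "hate the terrible screen", "terrible sound hate it",
    "hate it terrible value", "terrible product hate it", "hate this terrible quality",
]
labels = [1] * 6 + [0] * 6

clf = make_pipeline(
    TfidfVectorizer(lowercase=True, min_df=2),  # vocabulary trimming + TF-IDF weighting
    LogisticRegression(C=1.0),                  # L2 regularization; lower C = stronger
)
scores = cross_val_score(clf, texts, labels, cv=3)  # sanity-check generalization
```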
How is LDA different from clustering methods like K-means?
In LDA, a “topic” is really a probability distribution over words. You can think of it as a soft pattern or theme that shows which words tend to co-occur across documents. For example, a topic about sports might heavily weight words like “team,” “score,” “season,” and “coach,” while a topic about finance might weight “market,” “equity,” “inflation,” and “growth.” A document in LDA is then modeled as a mixture of these topics. So instead of saying “this document belongs to one group,” LDA says “this document is 30 percent topic A, 50 percent topic B, and 20 percent topic C.” That ability to mix topics is exactly what makes LDA flexible for text.
K-means is built on a very different idea. It tries to assign each document to exactly one cluster based on its feature vector. The cluster centers are centroids, coordinate-wise averages of the member documents, not distributions over words, and each document belongs to one cluster only. K-means basically asks, “which single group is this document closest to?”
LDA, on the other hand, asks a generative question: “which combination of latent word distributions best explains the words in this document?” Topics overlap, documents blend multiple topics, and the algorithm is explicitly designed to capture the idea that real-world text is multi-thematic. A legal document might be mostly about contracts but still partly about employment. A news article might mix politics, economics, and international relations. K-means has no natural way to express that.
There is also a modeling difference in how they treat words. LDA has a probabilistic foundation. It assumes documents are generated by repeatedly picking a topic, then picking a word from that topic’s distribution. K-means simply sees the document as a point in high-dimensional space and groups points that are geometrically similar. It does not try to model how language is generated.
So the short version is this: in LDA, a topic is a distribution over words, and a document is a mixture of topics. In K-means, a cluster is just a centroid, and each document belongs to one and only one cluster. LDA is better when documents naturally blend multiple themes, while K-means is better when you expect clean, non-overlapping groups.
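The contrast is visible in a few lines of scikit-learn: LDA returns a topic mixture per document (rows summing to 1), while K-means returns a single cluster id. The toy corpus below is invented so that the last document blends both themes.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "team score season coach game team score",
    "market equity inflation growth rates market",
    "team score market growth game rates",  # blends both themes
]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
mixtures = lda.transform(X)  # shape (3, 2); each row is a soft topic mixture

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
hard_labels = km.labels_     # each document gets exactly one cluster
```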
When might NMF produce more interpretable topics than LDA?
NMF and LDA both try to uncover hidden themes in a document collection, but they learn topics in very different ways. That difference often shows up in how interpretable the topics feel.
NMF is essentially a matrix factorization method. You take the document-term matrix and try to factor it into two smaller matrices, one that represents topics as weighted combinations of words and another that represents documents as weighted combinations of topics. The only real constraint is non-negativity, which forces all weights to be additive. Because everything is built from positive combinations, the topics often look like clean “clusters” of words that commonly appear together. There is no underlying generative story. It is just trying to reconstruct the data using a small number of coherent parts.
LDA takes a probabilistic approach. It assumes documents are produced by sampling a mixture of topics and then sampling words from the chosen topics. Learning topics becomes a matter of inferring the word distributions and mixture weights that best explain the corpus. This gives LDA a stronger statistical foundation and helps it model uncertainty and document-level topic mixtures more explicitly. But it can also make the topics messier because the algorithm must satisfy the full probabilistic model rather than just find neat additive components.
NMF often produces more interpretable topics when the data is noisy or when you want topics that look like sparse, human-readable word lists. Because NMF pushes many word weights toward zero, each topic typically ends up with a small, sharp set of important words. For example, you might get a topic dominated by “battery,” “charge,” “voltage,” and “power,” which is very easy for a human to label. LDA topics can be blurrier because the probability distributions tend to include long tails of moderately weighted words.
NMF also tends to be more stable when the corpus is relatively small or when documents overlap heavily in content. LDA can spread probability mass across similar topics in subtle ways, which makes interpretation harder. In contrast, NMF simply decomposes the matrix into parts that best reconstruct the data, and those parts often look like intuitive themes.
So NMF is usually preferred when you want crisp, interpretable topics and do not need the full generative framework. LDA is preferred when you want a probabilistic model of how documents mix topics or when you need better handling of ambiguity and uncertainty across the corpus.
How would you evaluate whether a topic model is producing good topics?
When I evaluate a topic model, I usually think about two angles at the same time. One is whether the topics make sense to a human. The other is whether the model actually helps with whatever downstream task I care about. Topic models do not have a single “correct” answer, so you need a mix of qualitative and quantitative checks.
On the qualitative side, I look at the top words for each topic and ask whether they form a coherent theme. A good topic usually has a small set of strongly related words, like a finance topic with “market,” “equity,” “stocks,” and “rates.” If a topic mixes unrelated terms, or if all topics look almost identical, that is a red flag. I also read a handful of documents that the model assigns strongly to each topic to see if the themes show up in actual text. Humans are surprisingly good at spotting nonsense topics, so this manual check is worthwhile.
On the quantitative side, I use metrics like topic coherence. Coherence checks how often the top words in a topic appear together in real documents. Higher coherence usually means the topic is more meaningful. I also sometimes look at perplexity for probabilistic models, although perplexity does not always correlate with interpretability. In applied settings, I check whether the topic features improve performance on downstream tasks such as clustering, classification, or recommendation. If adding topic proportions does not help anything, the topics might not be capturing useful structure.
Choosing the number of topics is difficult because there is no natural definition of what counts as a “topic” in real text. If you choose too few topics, different themes get blended together and each topic becomes vague. If you choose too many, the model starts breaking a single theme into lots of tiny, overly specific topics that are hard to interpret. The best number depends on the corpus size, the domain, how noisy the documents are, and what you want to do with the topics.
Topic models also tend to be unstable as you vary the topic count. A small increase can split a clean topic into two mediocre ones, or merge distinct themes in ways that do not make sense. The optimal value is usually not a sharp point but a range. In practice, I pick a few candidate values, train models at each, compare their coherence scores, and then actually inspect the topics to see what makes the most sense for the application. It is a mix of data-driven tuning and human judgment because the model cannot know what themes are meaningful to your users or your task.
So evaluating topic models is partly about metrics, partly about human inspection, and partly about how well the topics support the goal of the project. And the number of topics is hard to pick because language does not naturally divide into clean buckets; you have to decide what level of granularity is useful for your needs.
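Coherence sounds abstract, but the UMass variant is just document co-occurrence counting: sum log((D(wi, wj) + 1) / D(wj)) over pairs of a topic's top words. A minimal sketch on invented documents:

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    # higher = the topic's top words actually co-occur in real documents
    doc_sets = [set(d.split()) for d in docs]
    score = 0.0
    for w_i, w_j in combinations(top_words, 2):
        d_j = sum(w_j in s for s in doc_sets)                 # docs containing w_j
        d_ij = sum(w_i in s and w_j in s for s in doc_sets)   # docs containing both
        score += math.log((d_ij + 1) / d_j)
    return score

docs = [
    "market equity rates",
    "market equity stocks",
    "equity market growth",
    "dog park walk leash",
]
good = umass_coherence(["market", "equity"], docs)  # always co-occur
bad = umass_coherence(["market", "dog"], docs)      # never co-occur
```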
Why is BM25 generally better than plain TF-IDF for search?
The basic idea behind BM25 is that it scores how relevant a document is to a query by looking at two things: how often the query terms appear in the document, and how rare those terms are across the whole collection. That part sounds similar to TF-IDF, but BM25 adds a couple of important adjustments that make it work better for real search.
First, BM25 does not assume that more term frequency is always better. In TF-IDF, if a word appears 50 times in a document, that document looks extremely relevant. But in practice, usefulness saturates. Seeing “battery” three times already tells you that a product review is about battery life; seeing it thirty times does not tell you ten times more. BM25 models this saturation. It increases the score as frequency rises, but it flattens out quickly so extremely long or repetitive documents do not dominate.
Second, BM25 accounts for document length. Plain TF-IDF applies no length correction, so long documents get an unfair advantage because they naturally contain more words. BM25 normalizes for this. If a document is very long, the term frequency is discounted. If it is short, the term frequency is treated more strongly. This length-normalization step matters a lot in web search, where a 50-word article and a 5,000-word article should not be compared directly without adjusting for length.
Third, BM25’s version of IDF is more stable than the classic IDF formula. It avoids huge swings in weight for terms that appear in many documents and prevents some of the odd behavior TF-IDF shows on very common words.
Putting all this together, BM25 usually outperforms plain TF-IDF because it behaves more like real search relevance. It rewards informative terms, but not blindly. It prevents long documents from dominating the ranking. It handles frequency in a more realistic way. And it tends to be more robust across noisy or uneven document collections.
In simple terms, TF-IDF notices which words matter. BM25 goes a step further and adjusts those signals so that the ranking makes sense for real user queries and real-world document lengths. That is why most production search engines use BM25 or a close variant of it instead of raw TF-IDF.
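All three adjustments are visible in the Okapi BM25 formula itself. Here is a self-contained sketch with k1 and b at their common defaults and plain whitespace tokenization; production systems use tuned parameters and real tokenizers.

```python
import math
from collections import Counter

def bm25(query, doc, corpus, k1=1.5, b=0.75):
    # Okapi BM25 score of one tokenized doc for a tokenized query
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(term in d for d in corpus)
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed, never negative
        f = tf[term]
        norm = f + k1 * (1 - b + b * len(doc) / avgdl)   # length normalization
        score += idf * f * (k1 + 1) / norm               # saturating term frequency
    return score

corpus = [
    "battery overheating issue fix".split(),                # short, focused
    ("manual " * 50).split() + ["battery", "overheating"],  # long, one passing mention
    "warranty shipping returns policy".split(),
]
query = ["battery", "overheating"]
short_score = bm25(query, corpus[0], corpus)
long_score = bm25(query, corpus[1], corpus)
# the short, focused document outranks the long one despite equal match counts
```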
What does document length normalization mean and why is it important?
Document length normalization is the idea that you should adjust a document’s score based on how long it is, because long documents naturally contain more words and therefore have a higher chance of matching a query by accident. If you do not correct for this, long documents almost always look more relevant than short ones, even when they are not.
Imagine you have two documents. One is a short product description with 100 words. The other is a long user manual with 5,000 words. If the query is “battery overheating,” the long manual is far more likely to contain the words “battery” and “overheating” somewhere, just because it has so much text. A plain TF-IDF model would give it a very high score even if the discussion is just one throwaway line in a massive document. The short product description might be entirely about battery problems but still lose out because it is too short to accumulate many hits.
Length normalization fixes this imbalance by discounting term frequency in longer documents. In models like BM25, the same number of matches is considered more meaningful in a short document than in a long one. If a query term appears three times in a 100-word document, that is a strong signal. If it appears three times in a 5,000-word document, it is a much weaker signal. The normalization adjusts the score so that relevance reflects how focused a document is on the query, not just how big it is.
This matters a lot in search and ranking systems, because without length normalization you tend to heavily favor large documents and miss short but highly relevant ones. In practice, length normalization makes search results feel much more aligned with user intent and prevents the ranking from being dominated by documents that are long simply because they cover everything.
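The BM25 term-frequency component makes this concrete. Holding everything else fixed, three matches in a 100-word document score far higher than three matches in a 5,000-word one. This is a standalone sketch of just the normalization term, with k1 and b at common defaults.

```python
def bm25_tf(f, doc_len, avg_len, k1=1.5, b=0.75):
    # saturating term frequency with document-length normalization
    return f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avg_len))

focused = bm25_tf(3, doc_len=100, avg_len=1000)   # strong signal in a short doc
diluted = bm25_tf(3, doc_len=5000, avg_len=1000)  # weak signal in a long doc
```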
When might dense embeddings perform worse than sparse IR methods like BM25?
Dense embeddings are powerful, but there are clear situations where they actually perform worse than sparse IR methods like BM25. The main pattern is that dense embeddings shine when meaning matters, and BM25 shines when exact wording matters or the query is very specific.
Dense embeddings tend to struggle when the query depends on exact lexical matches. For example, if a user searches for “Section 14B disclosure waiver,” they want documents containing those exact terms. A dense model might generalize too much and return documents about “waivers” or “disclosures” in general. BM25, on the other hand, boosts precise term matches and usually handles these legal or technical queries better.
Another case is when the domain is full of rare or highly specialized vocabulary. Embeddings sometimes smooth these distinctions away. In biomedical search, the terms “HER2,” “HER3,” and “HER4” are completely different things, but dense models can cluster them too closely because they share similar context. BM25 does not smooth anything; it treats each one as its own term, which often leads to more reliable results.
Dense models also struggle in low-data or long-tail environments. If the embedding model has not seen enough examples of a rare concept, it will not learn a precise vector for it. Sparse methods do not care about training data. If the word appears in the document, BM25 can match it. This is especially important in internal document search, where a large portion of queries involve obscure terms.
Precision-heavy retrieval is another place where BM25 outperforms. If the correctness of each term matters, as in compliance search or code search, semantic similarity can lead to misleading matches. BM25 returns documents that contain the exact tokens, which is often what the user actually wants.
Adversarial cases also hurt dense models. A single-character change like “breach” versus “beach” may not be well separated in embedding space, but BM25 treats them as completely different words. For domains sensitive to spelling or specific terminology, sparse methods hold up better.
Finally, dense retrieval is sensitive to domain shift. If the embedding model was trained on generic text and you deploy it on legal, medical, or financial documents, its similarity judgments may be off. BM25 does not depend on pretraining, so it tends to be more robust when the vocabulary changes abruptly.
So dense embeddings can underperform whenever exact terms matter, the domain contains many rare or technical words, training data is limited, or there is domain drift. BM25 remains strong in these cases because it matches on the actual words the user typed, without smoothing or semantic guessing.
How would you choose between LDA and NMF for topic modeling on short texts like tweets?
For short texts like tweets, I would choose between LDA and NMF by thinking about what each model needs to succeed and what short texts actually look like. Short messages have very few words, lots of slang, and very sparse signals. That environment does not treat all topic models equally.
LDA relies on a generative assumption: each document is a mixture of topics, and each topic is a distribution over words. That assumption works best when documents are long enough for multiple topics to appear. Tweets usually contain one idea, maybe two at most, and only a handful of words to express it. Because LDA expects richer word co-occurrence patterns than a tweet can provide, it often ends up producing noisy or unstable topics. LDA can still work on short texts, but it usually needs tricks like pooling tweets, using bigrams, or adding priors to stabilize the word distributions.
NMF behaves differently. It treats the document-term matrix as something you want to break into additive parts. There is no generative story. It just looks for groups of words that tend to show up together across the dataset. Because it does not need each document to contain many topic signals, it often handles short text better. NMF also tends to produce sparser, more sharply defined topics, which makes them easier to interpret when documents are tiny and noisy. For example, on tweets you often see clean topics like {“iphone,” “apple,” “ios”} or {“election,” “vote,” “poll”} without the probabilistic blur that LDA sometimes introduces.
Another practical difference is that NMF is more stable with extremely sparse matrices. Tweet corpora are exactly that: huge vocabularies, very few tokens per document, and tons of rare words. LDA can struggle to estimate meaningful distributions under that level of sparsity, while NMF tends to settle on clearer patterns because it is directly optimizing the matrix reconstruction.
So for short texts like tweets, I usually start with NMF because it gives crisp, interpretable topics and handles sparse data well. I consider LDA only if I have ways to enrich the documents, such as grouping tweets by user or hashtag, or if I specifically need a probabilistic model for downstream tasks. In most real applications involving short text, NMF simply fits the data structure better.
What preprocessing steps would you apply before using TF-IDF in production?
Before using TF-IDF in production, I focus on preprocessing steps that make the vocabulary cleaner, more stable, and less sparse. TF-IDF is sensitive to noise because every unique token becomes a feature, so the goal is to reduce unnecessary variation while keeping the information that matters for the task.
I usually start with basic normalization. I lowercase the text so that “Apple” and “apple” are not treated as two separate words unless there is a strong reason to keep case. I remove obvious noise like HTML tags, control characters, duplicated whitespace, or stray markup. I also normalize common Unicode quirks so that accented or stylized characters do not create useless new tokens.
Next, I handle tokenization carefully. I use a tokenizer that handles punctuation predictably, keeps numbers as their own tokens, and preserves symbols that matter for the domain. For example, in product reviews, exclamation marks can carry sentiment, so I might preserve them. In legal or medical text, I keep hyphenated terms intact because they often represent meaningful concepts.
I clean up vocabulary outliers. Very rare words, typos, and IDs often explode the feature space without adding value. I remove extremely low-frequency terms or collapse them into an “unknown” bucket. Light spelling correction can help if the domain has lots of predictable typos, but I avoid aggressive correction because it can distort meaning.
Depending on the task, I sometimes remove stopwords. TF-IDF already down-weights common words, but removing stopwords can still shrink the vocabulary and reduce noise. I do not rely on a generic stopword list. Instead, I check which high-frequency words actually add no predictive value for the specific domain. If the task depends on function words, like in sentiment or intent classification, I keep more of them.
I often apply light normalization to word forms. Lemmatization can reduce variation and make TF-IDF vectors more compact, especially in classical ML settings. Stemming is more aggressive and can distort meaning, so I use it only if the domain is tolerant of rough normalization.
I also decide how to handle numbers, URLs, usernames, and other non-lexical elements. In many production settings, I replace URLs with a special token, mask usernames, and standardize numbers so that “123” and “456” are treated similarly. This keeps the vocabulary stable without losing the signal that a number or link was present.
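A minimal masking pass can be done with a few regular expressions. The placeholder names below are arbitrary, and note that angle-bracket tokens like `<url>` only survive if your tokenizer keeps them (scikit-learn's default token pattern would split them, so plain alphanumeric placeholders can be safer there):

```python
import re

def mask_tokens(text):
    """Replace URLs, @usernames, and digit runs with stable placeholder tokens."""
    text = re.sub(r"https?://\S+", " <url> ", text)
    text = re.sub(r"@\w+", " <user> ", text)
    text = re.sub(r"\d+", " <num> ", text)
    return re.sub(r"\s+", " ", text).strip()

print(mask_tokens("Thanks @alice, order 12345 shipped: https://example.com/track?id=9"))
```

The URL order matters: masking URLs first prevents the digit rule from mangling query strings like `id=9` before the whole link is replaced.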
Finally, I monitor the vocabulary over time. In production, language drifts. New slang, product names, or formatting patterns appear. I periodically rebuild or prune the vocabulary to keep TF-IDF features relevant and avoid unbounded growth.
So the core idea is simple: clean what creates noise or unnecessary variation, preserve what carries meaning for the task, and keep the vocabulary controlled. TF-IDF works best when the text is predictable and the feature space is manageable, so preprocessing is mostly about shaping the input to meet those conditions.
Final Takeaways
It’s easy to think of classical NLP methods as relics from the pre-transformer era. After all, modern models can capture context, syntax, and semantics in ways that bag-of-words representations never could.
But in practice, the story is much more balanced.
Sparse representations like TF-IDF and ranking methods like BM25 still power huge parts of the real world. Search engines rely on them for fast lexical matching. Internal document systems use them to retrieve knowledge instantly. Even modern AI pipelines often start with a BM25 stage before handing results to dense retrieval or reranking models.
And the reason is simple: these methods solve problems that neural models don’t always solve well. They are fast, interpretable, memory-efficient, and extremely reliable when exact wording matters.
Understanding them also builds intuition for many things that show up later in modern NLP. Vocabulary design, sparsity, feature weighting, document length normalization, and overfitting in high-dimensional spaces are not just “classical NLP problems.” They are fundamental representation problems that still exist in today’s systems.
That’s why these topics appear so often in interviews and production pipelines alike.
The models may change, but the underlying ideas about how we represent language—and how we retrieve useful information from it—continue to matter.