ML Natural Language Processing Basics
Natural Language Processing (NLP) is the branch of Machine Learning that enables computers to understand, interpret, and generate human language — text and speech. Every time a search engine understands a query, a chatbot answers a question, or a translator converts between languages, NLP is at work.
Why Language is Hard for Machines
Human language is ambiguous, context-dependent, and informal.
Machines work with numbers — not words, grammar, or meaning.
Challenges:
Ambiguity: "I saw the man with the telescope."
(Did I use the telescope to see him? Or did he have it?)
Context: "It's not my cup of tea."
(Not literally about a cup or tea — an idiom)
Sarcasm: "Oh great, another Monday." (not actually great)
Spelling: "teh cat sat on teh mat" (human understands, machine struggles)
Multilinguality: Hindi, Tamil, Swahili, Japanese — NLP must handle them all.
Core NLP Pipeline
Raw Text
│
▼
Text Cleaning (remove special chars, lowercase, fix encoding)
│
▼
Tokenization (split text into words or subwords)
│
▼
Stop Word Removal (remove "the", "is", "and"...)
│
▼
Stemming / Lemmatization (reduce words to base form)
│
▼
Feature Extraction (convert words to numbers)
│
▼
Model Training or Inference
│
▼
Output (sentiment, translation, entity, classification...)
Step 1: Tokenization
Tokenization splits raw text into individual units (tokens).

Word Tokenization:
  Input:  "Machine Learning is amazing!"
  Output: ["Machine", "Learning", "is", "amazing", "!"]

Sentence Tokenization:
  Input:  "He came. She left. It rained."
  Output: ["He came.", "She left.", "It rained."]

Subword Tokenization (used in modern models like BERT):
  Input:  "unbelievable"
  Output: ["un", "##believ", "##able"]
  Handles unknown or rare words by splitting them into known pieces:
  "ChatGPT" → ["Chat", "##G", "##PT"]
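The word and sentence tokenizers above can be sketched with Python's `re` module. This is a toy approximation — production systems use dedicated tokenizers (e.g. NLTK's, or a trained subword tokenizer for BERT-style splitting):

```python
import re

def word_tokenize(text):
    """Split text into word tokens and standalone punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)

def sent_tokenize(text):
    """Naive sentence splitter: break after ., !, or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(word_tokenize("Machine Learning is amazing!"))
# ['Machine', 'Learning', 'is', 'amazing', '!']
print(sent_tokenize("He came. She left. It rained."))
# ['He came.', 'She left.', 'It rained.']
```

Note that this regex approach fails on abbreviations like "Dr." or "e.g." — one reason real tokenizers are more elaborate.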
Step 2: Stop Word Removal
Stop words are very common words that add little meaning in most tasks.
Removing them reduces noise.

Common English stop words:
  the, is, in, at, which, on, a, an, and, or, but, of, to

Example:
  Original: "The cat sat on the mat near the wall."
  After:    ["cat", "sat", "mat", "near", "wall"]

Note: Stop word removal is NOT always appropriate.
  Sentiment analysis: "not good" vs "good" — removing "not" flips the meaning.
  Machine translation: every word matters.
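A minimal filter using the stop-word list above (real libraries such as NLTK ship much longer lists):

```python
import re

STOP_WORDS = {"the", "is", "in", "at", "which", "on", "a", "an",
              "and", "or", "but", "of", "to"}

def remove_stop_words(text):
    """Lowercase, tokenize into words, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("The cat sat on the mat near the wall."))
# ['cat', 'sat', 'mat', 'near', 'wall']
```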
Step 3: Stemming and Lemmatization
Both reduce words to their base form.

Stemming (crude — cuts word endings):
  "running" → "run"
  "flies"   → "fli"    ← not a real word
  "easily"  → "easili" ← not a real word

Lemmatization (uses a vocabulary — returns real words):
  "running" → "run"
  "flies"   → "fly"
  "better"  → "good"
  "was"     → "be"

┌─────────────────┬──────────────────┬──────────────────────────────┐
│ Feature         │ Stemming         │ Lemmatization                │
├─────────────────┼──────────────────┼──────────────────────────────┤
│ Speed           │ Fast             │ Slower                       │
│ Result          │ May not be a word│ Always a real word           │
│ Accuracy        │ Lower            │ Higher                       │
│ Use When        │ Speed matters    │ Accuracy matters             │
└─────────────────┴──────────────────┴──────────────────────────────┘
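The contrast can be sketched in a few lines. The stemmer below is a toy suffix-stripper, not the Porter algorithm (its output differs from NLTK's PorterStemmer on some words), and the lemmatizer uses a tiny hand-made lookup table standing in for a real vocabulary like WordNet:

```python
def crude_stem(word):
    """Toy stemmer: strip a common suffix, then collapse a doubled consonant."""
    for suffix in ("ing", "es", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            break
    if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]          # "runn" -> "run"
    return word

# Stand-in vocabulary; a real lemmatizer consults WordNet plus part-of-speech tags.
LEMMAS = {"running": "run", "flies": "fly", "better": "good", "was": "be"}

def lemmatize(word):
    return LEMMAS.get(word, word)

print(crude_stem("running"), crude_stem("flies"))   # run fli  <- "fli" is not a word
print(lemmatize("flies"), lemmatize("better"))      # fly good <- always real words
```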
Feature Extraction: Converting Text to Numbers
Bag of Words (BoW)
Each document becomes a vector of word counts.
Vocabulary (from all documents): [cat, sat, mat, dog, ran]
Document 1: "The cat sat on the mat"
Vector: [1, 1, 1, 0, 0]
cat sat mat dog ran
Document 2: "The dog ran on the mat"
Vector: [0, 0, 1, 1, 1]
cat sat mat dog ran
Problem: Loses word order entirely.
"Dog bites man" and "Man bites dog" produce the same vector.
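A Bag of Words vectorizer is a few lines of Python (scikit-learn's CountVectorizer does the same job, with vocabulary building included):

```python
def bag_of_words(docs, vocab):
    """Count how often each vocabulary word appears in each document."""
    vectors = []
    for doc in docs:
        tokens = doc.lower().split()
        vectors.append([tokens.count(w) for w in vocab])
    return vectors

vocab = ["cat", "sat", "mat", "dog", "ran"]
docs = ["The cat sat on the mat", "The dog ran on the mat"]
print(bag_of_words(docs, vocab))
# [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]

# Word order is lost: both sentences get the identical vector.
print(bag_of_words(["dog bites man", "man bites dog"], ["dog", "bites", "man"]))
# [[1, 1, 1], [1, 1, 1]]
```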
TF-IDF (Term Frequency – Inverse Document Frequency)
BoW gives equal weight to all words.
TF-IDF gives higher weight to words that are rare across documents
but frequent in a specific document.
TF (Term Frequency) = Count of word in document / Total words in doc
IDF (Inverse Doc Freq) = log(Total documents / Documents containing word)
TF-IDF = TF × IDF
Example:
Word "machine" appears in 1 of 100 documents:
IDF = log(100/1) = 4.6 ← High — rare word, gets high weight
Word "the" appears in 100 of 100 documents:
IDF = log(100/100) = 0 ← Zero weight — appears everywhere, useless
TF-IDF automatically down-weights stop words without a stop word list.
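The formulas above translate directly to code. This sketch uses the natural log, matching the worked example (library implementations such as scikit-learn's TfidfVectorizer add smoothing and normalization on top):

```python
import math

def idf(word, docs):
    """docs is a list of token lists. IDF = log(N / document frequency)."""
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

def tf_idf(word, doc, docs):
    tf = doc.count(word) / len(doc)   # term frequency within one document
    return tf * idf(word, docs)

# 100 documents: "machine" appears in 1 of them, "the" in all 100.
corpus = [["the", "machine"]] + [["the", "cat"]] * 99
print(round(idf("machine", corpus), 1))   # 4.6 -- rare word, high weight
print(idf("the", corpus))                 # 0.0 -- appears everywhere
```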
Word Embeddings (Word2Vec, GloVe)
BoW and TF-IDF ignore the meaning of and relationships between words.
Word embeddings represent each word as a dense vector of numbers where
similar words have similar vectors.

Word2Vec example (simplified, 3 dimensions):
  "King"  = [0.80, 0.12, 0.65]
  "Queen" = [0.78, 0.82, 0.63]
  "Man"   = [0.72, 0.10, 0.60]
  "Woman" = [0.70, 0.80, 0.58]
  "Apple" = [0.10, 0.50, 0.20]

"King" and "Queen" have very similar vectors.
"Apple" is far from all of them — a different semantic space.

Famous relationship: King - Man + Woman ≈ Queen
  [0.80-0.72+0.70, 0.12-0.10+0.80, 0.65-0.60+0.58]
  = [0.78, 0.82, 0.63] ≈ Queen vector ✓

Word embeddings capture analogies, synonyms, and relationships.
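The analogy arithmetic can be verified directly on the toy vectors above (real Word2Vec vectors have 100–300 dimensions, and similarity is measured with cosine similarity, shown here too):

```python
vectors = {
    "king":  [0.80, 0.12, 0.65],
    "queen": [0.78, 0.82, 0.63],
    "man":   [0.72, 0.10, 0.60],
    "woman": [0.70, 0.80, 0.58],
    "apple": [0.10, 0.50, 0.20],
}

def analogy(a, b, c):
    """Component-wise a - b + c."""
    return [round(x - y + z, 2)
            for x, y, z in zip(vectors[a], vectors[b], vectors[c])]

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 unrelated."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = lambda w: sum(x * x for x in w) ** 0.5
    return dot / (norm(u) * norm(v))

print(analogy("king", "man", "woman"))   # [0.78, 0.82, 0.63] == queen's vector
print(cosine(vectors["king"], vectors["queen"]) >
      cosine(vectors["king"], vectors["apple"]))   # True: king ~ queen, not apple
```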
Common NLP Tasks
Sentiment Analysis
Input:  "The product quality is excellent but shipping was terrible."
Output: Mixed sentiment → Overall: Neutral or context-split

Use cases: social media monitoring, product reviews, customer feedback.

Simple approach: train a classifier on labeled reviews.
  Positive: 5-star reviews
  Negative: 1–2 star reviews
  Features: TF-IDF vectors or word embeddings
  Model:    Logistic Regression, Naive Bayes, or LSTM
Named Entity Recognition (NER)
Input:  "Tata Motors announced profits in Mumbai last Tuesday."
Output:
  "Tata Motors"  → ORGANIZATION
  "Mumbai"       → LOCATION
  "last Tuesday" → DATE

Use cases: information extraction, search indexing, news summarization.
Text Classification
Input:  Email text
Output: SPAM or NOT SPAM

Input:  News article
Output: Technology / Sports / Business / Entertainment / Politics

Use cases: content moderation, document routing, intent detection.
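A classic baseline for spam filtering is multinomial Naive Bayes over word counts. Below is a minimal sketch with add-one smoothing; the four training sentences are made up for illustration, and a real system would train on thousands of labeled emails:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit class priors and per-class word counts."""
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc.lower().split())
    vocab = {w for cnt in counts.values() for w in cnt}
    return priors, counts, vocab

def predict(text, priors, counts, vocab):
    """Pick the class with the highest log posterior (add-one smoothing)."""
    best, best_lp = None, -math.inf
    for c, prior in priors.items():
        total = sum(counts[c].values())
        lp = math.log(prior)
        for w in text.lower().split():
            lp += math.log((counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = ["win free money now", "claim free prize",
        "meeting at noon", "see you at lunch"]
labels = ["SPAM", "SPAM", "NOT SPAM", "NOT SPAM"]
model = train_nb(docs, labels)
print(predict("free money prize", *model))   # SPAM
```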
Machine Translation
Input:  "Thank you for your help." (English)
Output: "आपकी सहायता के लिए धन्यवाद।" (Hindi)

Modern approach: Transformer-based models (like Google Translate).
Older approach:  Encoder-Decoder LSTM with an attention mechanism.
Attention Mechanism
Problem with encoder-decoder LSTM for translation:
Entire source sentence is compressed into one fixed-length vector.
Long sentences lose information at the ends.
Attention solves this:
When translating each word, the model looks BACK at the full
source sentence and decides which source words are most relevant
for the current output word.
Example (English to Hindi):
Source: "The cat sat on the mat."
Translating "बिल्ली" (cat):
Model attends strongly to "cat" in the source.
Attends weakly to "sat", "mat", "the".
Translating "बैठी" (sat):
Model attends strongly to "sat" in the source.
Attention scores (which source word to focus on):
"The": 0.05 "cat": 0.78 "sat": 0.08 "on": 0.04 "mat": 0.05
→ Strong focus on "cat" when generating "बिल्ली"
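Attention weights like those above come from pushing raw alignment scores through a softmax, which turns them into a probability distribution. A sketch (the raw scores here are illustrative numbers; in a real model they come from learned query/key projections):

```python
import math

def softmax(scores):
    """Convert arbitrary scores into weights that are positive and sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

source = ["The", "cat", "sat", "on", "mat"]
raw_scores = [0.1, 2.9, 0.2, 0.0, 0.1]   # alignment scores for one output word

weights = softmax(raw_scores)
focus = source[weights.index(max(weights))]
print(focus)                      # cat -- gets by far the most attention
print(round(sum(weights), 6))     # 1.0 -- softmax outputs a distribution
```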
Transformers and BERT
Transformers replaced RNNs for most NLP tasks.
They use "self-attention" — every word in a sentence attends to
every other word simultaneously (not step by step).
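Self-attention itself is a short computation. This sketch uses the word vectors directly as queries, keys, and values; a real Transformer learns separate projection matrices for each (and runs many attention heads in parallel):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(X[0])
    out = []
    for q in X:   # every word attends to every word, all at once
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        # output = attention-weighted average of all word vectors
        out.append([sum(wj * X[j][i] for j, wj in enumerate(w)) for i in range(d)])
    return out

# Three toy 2-dimensional "word vectors" in a sentence
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(X))   # each row is now a context-aware mixture of all three
```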
BERT (Bidirectional Encoder Representations from Transformers):
Pre-trained on: All of Wikipedia + BooksCorpus
Training task: Predict masked words in sentences
Result: Deep contextual understanding of language
Fine-tuning for downstream tasks:
Add a classification head → Sentiment Analysis
Add a token labeling head → Named Entity Recognition
Add a span extraction head → Question Answering
GPT models (ChatGPT):
Decoder-only Transformer
Pre-trained to predict the next word
Scale: GPT-4's parameter count is undisclosed (widely estimated at around a trillion parameters)
NLP Application Summary
┌────────────────────────────┬───────────────────────────────────────┐
│ NLP Task                   │ Industry Example                      │
├────────────────────────────┼───────────────────────────────────────┤
│ Sentiment Analysis         │ Brand monitoring, review summarization│
│ Named Entity Recognition   │ Resume parsing, news event extraction │
│ Text Classification        │ Spam filters, support ticket routing  │
│ Machine Translation        │ Google Translate, localization        │
│ Question Answering         │ Chatbots, FAQ systems                 │
│ Text Summarization         │ News digest, document summarization   │
│ Text Generation            │ Content writing aids, code generation │
│ Speech to Text             │ Voice assistants, meeting transcripts │
└────────────────────────────┴───────────────────────────────────────┘
