ML Natural Language Processing Basics

Natural Language Processing (NLP) is the branch of Machine Learning that enables computers to understand, interpret, and generate human language — text and speech. Every time a search engine understands a query, a chatbot answers a question, or a translator converts between languages, NLP is at work.

Why Language is Hard for Machines

Human language is ambiguous, context-dependent, and informal.
Machines work with numbers — not words, grammar, or meaning.

Challenges:
  Ambiguity:   "I saw the man with the telescope."
               (Did I use the telescope to see him? Or did he have it?)

  Context:     "It's not my cup of tea."
               (Not literally about a cup or tea — an idiom)

  Sarcasm:     "Oh great, another Monday." (not actually great)

  Spelling:    "teh cat sat on teh mat" (human understands, machine struggles)

  Diversity:   Hindi, Tamil, Swahili, Japanese — NLP must handle
               many languages, each with its own script and grammar.

Core NLP Pipeline

Raw Text
    │
    ▼
Text Cleaning (remove special chars, lowercase, fix encoding)
    │
    ▼
Tokenization (split text into words or subwords)
    │
    ▼
Stop Word Removal (remove "the", "is", "and"...)
    │
    ▼
Stemming / Lemmatization (reduce words to base form)
    │
    ▼
Feature Extraction (convert words to numbers)
    │
    ▼
Model Training or Inference
    │
    ▼
Output (sentiment, translation, entity, classification...)
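The first stage of the pipeline, text cleaning, can be sketched in a few lines (a minimal version; real pipelines may also fix encodings, expand contractions, and handle URLs):

```python
import re

def clean_text(text):
    """Minimal cleaning sketch: lowercase, drop special characters,
    and collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # special chars -> spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```

For example, clean_text("Hello,   World!!") returns "hello world".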

Step 1: Tokenization

Tokenization splits raw text into individual units (tokens).

Word Tokenization:
  Input:  "Machine Learning is amazing!"
  Output: ["Machine", "Learning", "is", "amazing", "!"]

Sentence Tokenization:
  Input:  "He came. She left. It rained."
  Output: ["He came.", "She left.", "It rained."]

Subword Tokenization (used in modern models like BERT):
  Input:  "unbelievable"
  Output: ["un", "##believ", "##able"]
  
  Handles unknown or rare words by splitting into known pieces.
  "ChatGPT" → ["Chat", "##G", "##PT"]

Step 2: Stop Word Removal

Stop words are very common words that add little meaning
in most tasks. Removing them reduces noise.

Common English Stop Words:
  the, is, in, at, which, on, a, an, and, or, but, of, to

Example:
  Original: "The cat sat on the mat near the wall."
  After:    ["cat", "sat", "mat", "near", "wall"]

Note: Stop word removal is NOT always appropriate.
  Sentiment analysis: "not good" vs "good" — removing "not" changes meaning.
  Machine Translation: Every word matters.
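Filtering against the stop word list above is a one-liner (a sketch using the tiny list from this section; NLP libraries ship longer curated lists per language):

```python
STOP_WORDS = {"the", "is", "in", "at", "which", "on",
              "a", "an", "and", "or", "but", "of", "to"}

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop word set."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```

remove_stop_words("The cat sat on the mat near the wall".split()) reproduces the example output above.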

Step 3: Stemming and Lemmatization

Both reduce words to their base form.

Stemming (crude — cuts word endings):
  "running" → "run"
  "flies"   → "fli"   ← not a real word
  "easily"  → "easili" ← not a real word

Lemmatization (uses vocabulary — returns real words):
  "running"  → "run"
  "flies"    → "fly"
  "better"   → "good"
  "was"      → "be"

┌─────────────────┬──────────────────┬──────────────────────────────┐
│ Feature         │ Stemming         │ Lemmatization                │
├─────────────────┼──────────────────┼──────────────────────────────┤
│ Speed           │ Fast             │ Slower                       │
│ Result          │ May not be a word│ Always a real word           │
│ Accuracy        │ Lower            │ Higher                       │
│ Use When        │ Speed matters    │ Accuracy matters             │
└─────────────────┴──────────────────┴──────────────────────────────┘
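A toy stemmer shows the "crude suffix-cutting" idea. This is NOT the Porter algorithm (which applies several ordered rule stages, yielding stems like "easili"); it only illustrates why non-words such as "fli" appear:

```python
def toy_stem(word):
    """Toy suffix stripper. Real stemmers (e.g. Porter) use staged rules."""
    for suffix in ("ing", "ies", "ly", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if len(stem) > 2 and stem[-1] == stem[-2]:  # "runn" -> "run"
                stem = stem[:-1]
            return stem
    return word
```

toy_stem("running") gives "run", while toy_stem("flies") gives the non-word "fli" — exactly the trade-off in the table above.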

Feature Extraction: Converting Text to Numbers

Bag of Words (BoW)

Each document becomes a vector of word counts.

Vocabulary (from all documents, after stop word removal): [cat, sat, mat, dog, ran]

Document 1: "The cat sat on the mat"
Vector:     [1, 1, 1, 0, 0]
             cat sat mat dog ran

Document 2: "The dog ran on the mat"
Vector:     [0, 0, 1, 1, 1]
             cat sat mat dog ran

Problem: Loses word order entirely.
"Dog bites man" and "Man bites dog" produce the same vector.

TF-IDF (Term Frequency – Inverse Document Frequency)

BoW gives equal weight to all words.
TF-IDF gives higher weight to words that are rare across documents
but frequent in a specific document.

TF  (Term Frequency) = Count of word in document / Total words in doc
IDF (Inverse Doc Freq) = log(Total documents / Documents containing word)
                         (natural log in the examples below)
TF-IDF = TF × IDF

Example:
  Word "machine" appears in 1 of 100 documents:
    IDF = log(100/1) = 4.6 ← High — rare word, gets high weight

  Word "the" appears in 100 of 100 documents:
    IDF = log(100/100) = 0 ← Zero weight — appears everywhere, useless

TF-IDF automatically down-weights stop words without a stop word list.
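The formulas translate directly into code. The corpus below is hypothetical, built so that "machine" appears in 1 of 100 documents and "the" in all 100, reproducing the IDF values above:

```python
import math

def tf(term, doc):
    """Term frequency: count of term / total words in the document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(total docs / docs containing term)."""
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Hypothetical corpus: "machine" in 1 of 100 docs, "the" in all 100.
docs = [["the", "machine"]] + [["the"]] * 99
```

idf("machine", docs) is about 4.6, while idf("the", docs) is exactly 0 — so "the" gets zero TF-IDF weight in every document, with no stop word list needed.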

Word Embeddings (Word2Vec, GloVe)

BoW and TF-IDF ignore the meaning and relationships between words.
Word Embeddings represent each word as a dense vector of numbers
where similar words have similar vectors.

Word2Vec example (simplified, 3 dimensions):
  "King"    = [0.80, 0.12, 0.65]
  "Queen"   = [0.78, 0.82, 0.63]
  "Man"     = [0.72, 0.10, 0.60]
  "Woman"   = [0.70, 0.80, 0.58]
  "Apple"   = [0.10, 0.50, 0.20]

"King" and "Queen" have very similar vectors.
"Apple" is far from all of them — different semantic space.

Famous relationship:
  King - Man + Woman ≈ Queen
  [0.80-0.72+0.70, 0.12-0.10+0.80, 0.65-0.60+0.58]
  = [0.78, 0.82, 0.63] ≈ Queen vector ✓

Word embeddings capture analogies, synonyms, and relationships.
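Using the toy 3-dimensional vectors above, the King - Man + Woman analogy can be checked with cosine similarity, the same "nearest word" search real embedding libraries provide:

```python
import math

VECTORS = {
    "king":  [0.80, 0.12, 0.65],
    "queen": [0.78, 0.82, 0.63],
    "man":   [0.72, 0.10, 0.60],
    "woman": [0.70, 0.80, 0.58],
    "apple": [0.10, 0.50, 0.20],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    rest = {w: v for w, v in VECTORS.items() if w not in (a, b, c)}
    return max(rest, key=lambda w: cosine(target, rest[w]))
```

analogy("king", "man", "woman") returns "queen", confirming the arithmetic shown above.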

Common NLP Tasks

Sentiment Analysis

Input:  "The product quality is excellent but shipping was terrible."
Output: Mixed sentiment (positive about quality, negative about shipping)
        → Overall: Neutral, or reported separately per aspect

Use cases: Social media monitoring, product reviews, customer feedback.

Simple approach: Train classifier on labeled reviews.
  Positive: 5-star reviews
  Negative: 1–2 star reviews
  Features: TF-IDF vectors or word embeddings
  Model: Logistic Regression, Naive Bayes, or LSTM
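The "train a classifier on labeled reviews" recipe can be sketched as a tiny Naive Bayes over word counts. The training reviews below are made up, and real systems train on thousands of labeled examples with TF-IDF or embedding features:

```python
import math
from collections import Counter

def train(examples):
    """examples: list of (token_list, label). Count words per class."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    label_counts = Counter()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
    return word_counts, label_counts

def predict(tokens, word_counts, label_counts):
    """Pick the class with the highest log posterior (Laplace-smoothed)."""
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    total_docs = sum(label_counts.values())
    best_label, best_lp = None, -math.inf
    for label in word_counts:
        lp = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for t in tokens:
            lp += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label

# Hypothetical labeled reviews
reviews = [
    ("great product excellent quality".split(), "pos"),
    ("love this works perfectly".split(), "pos"),
    ("terrible quality broke quickly".split(), "neg"),
    ("awful waste of money".split(), "neg"),
]
```

After wc, lc = train(reviews), predict("excellent product".split(), wc, lc) returns "pos" because "excellent" and "product" were only seen in positive reviews.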

Named Entity Recognition (NER)

Input:  "Tata Motors announced profits in Mumbai last Tuesday."
Output: 
  "Tata Motors" → ORGANIZATION
  "Mumbai"      → LOCATION
  "last Tuesday"→ DATE

Use cases: Information extraction, search indexing, news summarization.
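A toy version of NER can be built from a lookup table plus a date pattern. This is purely illustrative — the entity list and regex here are made up, and real NER uses trained sequence models rather than string matching:

```python
import re

# Hypothetical gazetteer (lookup table) of known entities
GAZETTEER = {
    "Tata Motors": "ORGANIZATION",
    "Mumbai": "LOCATION",
}
DAYS = r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
DATE_RE = re.compile(r"\b(?:last |next )?" + DAYS + r"\b")

def tag_entities(text):
    """Label known phrases and weekday expressions found in the text."""
    entities = [(phrase, label) for phrase, label in GAZETTEER.items()
                if phrase in text]
    entities += [(m.group(), "DATE") for m in DATE_RE.finditer(text)]
    return entities
```

tag_entities("Tata Motors announced profits in Mumbai last Tuesday.") recovers all three entities from the example above.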

Text Classification

Input:  Email text
Output: SPAM or NOT SPAM

Input:  News article
Output: Technology / Sports / Business / Entertainment / Politics

Use cases: Content moderation, document routing, intent detection.

Machine Translation

Input:  "Thank you for your help." (English)
Output: "आपकी सहायता के लिए धन्यवाद।" (Hindi)

Modern approach: Transformer-based models (like Google Translate).
Older approach: Encoder-Decoder LSTM with attention mechanism.

Attention Mechanism

Problem with encoder-decoder LSTM for translation:
  Entire source sentence is compressed into one fixed-length vector.
  Long sentences lose information at the ends.

Attention solves this:
  When translating each word, the model looks BACK at the full
  source sentence and decides which source words are most relevant
  for the current output word.

Example (English to Hindi):
  Source: "The cat sat on the mat."
  
  Translating "बिल्ली" (cat):
    Model attends strongly to "cat" in the source.
    Attends weakly to "sat", "mat", "the".

  Translating "बैठी" (sat):
    Model attends strongly to "sat" in the source.

  Attention scores (which source word to focus on):
    "The": 0.05  "cat": 0.78  "sat": 0.08  "on": 0.04  "mat": 0.05
    → Strong focus on "cat" when generating "बिल्ली"
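Attention scores like these come from a softmax over query/key dot products. A minimal sketch — the 2-dimensional vectors below are invented for illustration, while real models use learned, high-dimensional projections:

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Dot-product attention: score each source position, then softmax."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)

# Hypothetical decoder state while generating the word for "cat",
# and hypothetical key vectors for each source word.
query = [1.0, 0.2]
source = ["The", "cat", "sat", "on", "mat"]
keys = [[0.1, 0.1], [0.9, 0.3], [0.2, 0.2], [0.1, 0.0], [0.2, 0.1]]
```

attention_weights(query, keys) returns weights that sum to 1 and peak at "cat", mirroring the score table above.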

Transformers and BERT

Transformers replaced RNNs for most NLP tasks.
They use "self-attention" — every word in a sentence attends to
every other word simultaneously (not step by step).
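"Every word attends to every other word" means computing one attention distribution per word over the whole sentence, then mixing the word vectors accordingly. A toy sketch, leaving out the learned query/key/value projections and scaling that real Transformers add:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(vectors):
    """Each output vector is a weighted mix of ALL input vectors, with
    weights from that word's dot products against every word (itself too)."""
    out = []
    for q in vectors:
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in vectors])
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(len(q))])
    return out
```

For a 3-word sentence of 2-dimensional vectors, self_attention returns 3 new vectors, all computed in one pass rather than step by step.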

BERT (Bidirectional Encoder Representations from Transformers):
  Pre-trained on: English Wikipedia + BooksCorpus
  Training task: Predict masked words in sentences
  Result: Deep contextual understanding of language

  Fine-tuning for downstream tasks:
    Add a classification head → Sentiment Analysis
    Add a token labeling head → Named Entity Recognition
    Add a span extraction head → Question Answering

GPT models (ChatGPT):
  Decoder-only Transformer
  Pre-trained to predict the next word
  Scale: parameter counts for recent GPT models are undisclosed,
         but run into the hundreds of billions

NLP Application Summary

┌────────────────────────────┬───────────────────────────────────────┐
│ NLP Task                   │ Industry Example                      │
├────────────────────────────┼───────────────────────────────────────┤
│ Sentiment Analysis         │ Brand monitoring, review summarization│
│ Named Entity Recognition   │ Resume parsing, news event extraction │
│ Text Classification        │ Spam filters, support ticket routing  │
│ Machine Translation        │ Google Translate, localization        │
│ Question Answering         │ Chatbots, FAQ systems                 │
│ Text Summarization         │ News digest, document summarization   │
│ Text Generation            │ Content writing aids, code generation │
│ Speech to Text             │ Voice assistants, meeting transcripts │
└────────────────────────────┴───────────────────────────────────────┘
