Natural Language Processing (NLP)

Natural Language Processing (NLP) is the branch of data science that enables computers to understand, process, and generate human language. Text data — customer reviews, social media posts, emails, news articles, and support tickets — is one of the richest and most abundant data sources available. NLP unlocks the information hidden inside it.

What NLP Solves

NLP Task                     Real-World Application
Sentiment Analysis           Determine if a product review is positive or negative
Text Classification          Route a support ticket to the correct department
Spam Detection               Filter unwanted emails from an inbox
Named Entity Recognition     Extract names, locations, and dates from news articles
Machine Translation          Translate text between languages automatically
Text Summarisation           Compress a long document into key points
Chatbots / Q&A               Answer customer questions in natural language

The NLP Pipeline

Raw Text Input
      │
      ▼
Text Cleaning
(lowercase, remove punctuation, remove HTML)
      │
      ▼
Tokenisation
(split text into individual words or sentences)
      │
      ▼
Stop Word Removal
(remove "the", "is", "and", "a" — low-information words)
      │
      ▼
Stemming / Lemmatisation
(reduce words to their root form)
      │
      ▼
Feature Extraction
(convert text into numbers: Bag of Words, TF-IDF, Word Embeddings)
      │
      ▼
Model Training
(classification, clustering, regression on text features)
      │
      ▼
Prediction / Output
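
The whole pipeline can be sketched end-to-end before diving into each step. The version below is a deliberately simplified stand-in using only the standard library (a tiny hand-picked stop-word list and a naive suffix stripper instead of NLTK), just to make the data flow concrete:

```python
import re

# tiny illustrative stop-word list (a real pipeline would use NLTK's)
STOP_WORDS = {"the", "is", "and", "a", "for", "was", "are"}

def naive_stem(word):
    # crude suffix stripping, a stand-in for a real stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = re.sub(r"[^a-z\s]", "", text.lower())          # clean
    tokens = text.split()                                  # tokenise
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop words
    return [naive_stem(t) for t in tokens]                 # stem

print(preprocess("The cameras are amazing and the battery is lasting!"))
# → ['camera', 'amaz', 'battery', 'last']
```

Each stage is covered properly, with real libraries, in the steps below.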

Step 1 – Text Cleaning

import re
import string

def clean_text(text):
    # Lowercase
    text = text.lower()
    
    # Remove URLs
    text = re.sub(r"http\S+|www\S+", "", text)
    
    # Remove HTML tags
    text = re.sub(r"<.*?>", "", text)
    
    # Remove punctuation and numbers
    text = re.sub(r"[^a-z\s]", "", text)
    
    # Remove extra whitespace
    text = " ".join(text.split())
    
    return text

# Test
raw = "Amazing product! 5/5 stars ⭐ Visit https://shop.com for more. #BestBuy"
clean = clean_text(raw)
print("Raw   :", raw)
print("Cleaned:", clean)

Output:

Raw   : Amazing product! 5/5 stars ⭐ Visit https://shop.com for more. #BestBuy
Cleaned: amazing product stars visit for more bestbuy

Step 2 – Tokenisation

Tokenisation splits text into individual units called tokens. Word tokenisation produces one token per word. Sentence tokenisation produces one token per sentence.

import nltk
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # required by newer NLTK releases
from nltk.tokenize import word_tokenize, sent_tokenize

text = "The camera quality is excellent. Battery life could be better. Overall a great phone!"

# Word tokens
word_tokens = word_tokenize(text)
print("Word Tokens:", word_tokens)

# Sentence tokens
sent_tokens = sent_tokenize(text)
print("\nSentence Tokens:")
for i, s in enumerate(sent_tokens, 1):
    print(f"  Sentence {i}: {s}")

Output:

Word Tokens: ['The', 'camera', 'quality', 'is', 'excellent', '.', 'Battery',
              'life', 'could', 'be', 'better', '.', 'Overall', 'a', 'great', 'phone', '!']

Sentence 1: The camera quality is excellent.
Sentence 2: Battery life could be better.
Sentence 3: Overall a great phone!

Step 3 – Stop Word Removal

Stop words are common words that appear in almost every sentence but carry very little meaning — "the", "is", "a", "and". Removing them reduces noise and speeds up model training.

from nltk.corpus import stopwords
nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))

# Sample tokens
tokens = ["the", "camera", "quality", "is", "excellent", "a", "very", "good", "phone"]

# Remove stop words
filtered = [word for word in tokens if word not in stop_words]

print("Before:", tokens)
print("After :", filtered)
print("Removed:", set(tokens) - set(filtered))

Output:

Before: ['the', 'camera', 'quality', 'is', 'excellent', 'a', 'very', 'good', 'phone']
After : ['camera', 'quality', 'excellent', 'good', 'phone']
Removed: {'is', 'the', 'a', 'very'}

Step 4 – Stemming and Lemmatisation

Stemming and Lemmatisation both reduce words to a base form — so "running", "runs", and "ran" all map to the same root. This reduces vocabulary size and prevents the model from treating related words as unrelated.

Diagram – Stemming vs Lemmatisation

Word            Stemming (Porter)    Lemmatisation
─────────────   ──────────────────   ─────────────
"running"    →  "run"                "run"
"better"     →  "better"             "good"     ← uses context
"studies"    →  "studi"              "study"
"caring"     →  "care"               "care"
"feet"       →  "feet"               "foot"     ← uses dictionary

Stemming:  Fast, crude — chops off endings (may not be real words)
Lemmatisation: Slower, accurate — uses dictionary to find root

from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "studies", "better", "caring", "feet", "wolves"]

print(f"{'Word':<12} {'Stemmed':<12} {'Lemmatised'}")
print("-" * 36)
for word in words:
    stemmed     = stemmer.stem(word)
    lemmatised  = lemmatizer.lemmatize(word)
    print(f"{word:<12} {stemmed:<12} {lemmatised}")

Output:

Word         Stemmed      Lemmatised
------------------------------------
running      run          running
studies      studi        study
better       better       better
caring       care         caring
feet         feet         foot
wolves       wolv         wolf

Step 5 – Feature Extraction

Machine learning models require numbers as input, not raw text. Feature extraction converts text into numeric vectors that capture the information in the words.

Bag of Words (BoW)

Bag of Words counts how many times each word in the vocabulary appears in a document. The word order and grammar are ignored — only the word counts matter.

Corpus (3 documents):
  Doc 1: "the food was great"
  Doc 2: "the service was bad"
  Doc 3: "great food great service"

Vocabulary: [bad, food, great, service, the, was]

BoW Matrix:
         bad  food  great  service  the  was
Doc 1:    0     1     1       0      1    1
Doc 2:    1     0     0       1      1    1
Doc 3:    0     1     2       1      0    0

Each row is a document, each column is a word count.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the food was great",
    "the service was bad",
    "great food great service"
]

vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("\nBoW Matrix:\n", X_bow.toarray())
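
The same matrix can be rebuilt by hand with collections.Counter, which makes the "only counts matter" idea concrete:

```python
from collections import Counter

corpus = [
    "the food was great",
    "the service was bad",
    "great food great service"
]

# vocabulary = every distinct word, sorted (as CountVectorizer does)
vocab = sorted({word for doc in corpus for word in doc.split()})

# one row per document, one count per vocabulary word
matrix = [[Counter(doc.split())[word] for word in vocab] for doc in corpus]

print(vocab)   # ['bad', 'food', 'great', 'service', 'the', 'was']
print(matrix)  # [[0, 1, 1, 0, 1, 1], [1, 0, 0, 1, 1, 1], [0, 1, 2, 1, 0, 0]]
```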

TF-IDF – Term Frequency–Inverse Document Frequency

TF-IDF improves on Bag of Words by down-weighting words that appear in many documents (like "the", "was") and up-weighting words that are rare but important in a specific document. This better reflects what makes each document unique.

TF  = Count of word in document / Total words in document
IDF = log(Total documents / Documents containing the word)
TF-IDF = TF × IDF

Word "great" in Doc 3:
TF   = 2/4 = 0.50          (appears twice in 4-word doc)
IDF  = log(3/2) = 0.405    (appears in 2 of 3 documents)
Score = 0.50 × 0.405 = 0.20

Word "bad" in Doc 2:
TF   = 1/4 = 0.25          (appears once)
IDF  = log(3/1) = 1.099    (appears in only 1 document → rarer → higher IDF)
Score = 0.25 × 1.099 = 0.27

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

import pandas as pd
df_tfidf = pd.DataFrame(
    X_tfidf.toarray(),
    columns=tfidf.get_feature_names_out(),
    index=["Doc1", "Doc2", "Doc3"]
)
print(df_tfidf.round(3))

Sentiment Analysis – Full Example

Sentiment analysis classifies text as positive, negative, or neutral. This complete example builds a sentiment classifier for product reviews using TF-IDF and Logistic Regression.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline

# Sample labelled reviews (1 = positive, 0 = negative)
reviews = [
    "This product is absolutely amazing and works perfectly",
    "Very disappointed with the quality, completely broken",
    "Excellent value for money, highly recommend",
    "Terrible experience, will never buy again",
    "Outstanding product, exceeded all expectations",
    "Complete waste of money, horrible quality",
    "Great build quality and fast delivery",
    "Does not work as described, very frustrating",
    "Love this product! Best purchase this year",
    "Poor quality, stopped working after 2 days",
    "Fantastic item, exactly as described",
    "Awful experience, customer service unhelpful",
    "Superb quality and arrived on time",
    "Broken on arrival, very disappointed",
    "Brilliant product, highly recommended to everyone"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Split
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.3, random_state=42
)

# Build pipeline: TF-IDF → Logistic Regression
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("clf",   LogisticRegression(max_iter=1000, random_state=42))
])

# Train
pipeline.fit(X_train, y_train)

# Evaluate
y_pred = pipeline.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))

# Predict on new reviews
new_reviews = [
    "The battery life is incredible and the display is sharp",
    "Stopped working after one week, very annoyed"
]

predictions = pipeline.predict(new_reviews)
for review, pred in zip(new_reviews, predictions):
    label = "Positive 😊" if pred == 1 else "Negative 😞"
    print(f"\n{label}: {review}")

Output:

Accuracy: 1.0

Positive 😊: The battery life is incredible and the display is sharp
Negative 😞: Stopped working after one week, very annoyed

(Perfect accuracy here reflects the handful of held-out reviews, not real-world performance; a production sentiment model needs thousands of labelled examples and proper cross-validation.)

N-Grams – Capturing Word Sequences

A unigram is a single word. A bigram is a pair of consecutive words. N-grams capture context that single words miss — "not good" is very different from "good" and "not" treated separately.

from sklearn.feature_extraction.text import CountVectorizer

text = ["the product is not good at all"]

# Unigrams (single words)
cv1 = CountVectorizer(ngram_range=(1, 1))
print("Unigrams:", cv1.fit_transform(text).toarray())
print("Vocab:", cv1.get_feature_names_out())

# Bigrams (word pairs)
cv2 = CountVectorizer(ngram_range=(2, 2))
print("\nBigrams:", cv2.fit_transform(text).toarray())
print("Vocab:", cv2.get_feature_names_out())

Output (Bigrams):

Vocab: ['at all', 'good at', 'is not', 'not good', 'product is', 'the product']

→ "not good" is captured as a bigram → model understands negation
→ "is not" signals negation even before seeing the adjective

Named Entity Recognition with spaCy

Named Entity Recognition (NER) identifies real-world entities in text — people, organisations, locations, dates, money, and more. spaCy is a fast, production-grade NLP library and one of the most popular choices for NER in Python.

# Installation: pip install spacy
# python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")

text = """Apple announced a new iPhone at its event in California on September 12th.
          The device costs $999 and will be available from Tim Cook's team next month."""

doc = nlp(text)

print("Named Entities Found:")
print(f"{'Entity Text':<30} {'Label':<15} {'Meaning'}")
print("-" * 65)

entity_meanings = {
    "ORG": "Organisation",
    "PRODUCT": "Product",
    "GPE": "Country/City/State",
    "DATE": "Date or Period",
    "MONEY": "Monetary Value",
    "PERSON": "Person Name"
}

for ent in doc.ents:
    meaning = entity_meanings.get(ent.label_, ent.label_)
    print(f"{ent.text:<30} {ent.label_:<15} {meaning}")

Output (entities detected can vary slightly between model versions):

Named Entities Found:
Entity Text                    Label           Meaning
-----------------------------------------------------------------
Apple                          ORG             Organisation
iPhone                         PRODUCT         Product
California                     GPE             Country/City/State
September 12th                 DATE            Date or Period
$999                           MONEY           Monetary Value
Tim Cook                       PERSON          Person Name

Word Frequency Analysis

Counting word frequencies is a fast first look at a corpus: it shows which terms dominate before any modelling.

from collections import Counter
import matplotlib.pyplot as plt

sample_text = """data science machine learning artificial intelligence
                 python data analysis machine learning deep learning
                 data science python programming statistics machine"""

# Tokenise and count
words  = sample_text.lower().split()
counts = Counter(words)
top10  = counts.most_common(10)

words_list, freq_list = zip(*top10)

plt.figure(figsize=(9, 4))
plt.bar(words_list, freq_list, color="steelblue", edgecolor="black")
plt.title("Top 10 Most Frequent Words")
plt.xlabel("Word")
plt.ylabel("Frequency")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.savefig("word_frequency.png")
plt.show()

Key NLP Libraries in Python

Library                      Best For                                                    Install
NLTK                         Tokenisation, stemming, stop words, learning NLP concepts   pip install nltk
spaCy                        Fast NER, dependency parsing, production NLP pipelines      pip install spacy
Scikit-learn                 BoW, TF-IDF, text classification models                     pip install scikit-learn
Transformers (HuggingFace)   State-of-the-art BERT, GPT models, semantic search          pip install transformers
Gensim                       Word2Vec embeddings, topic modelling (LDA)                  pip install gensim

Summary

  • NLP converts unstructured text into structured numeric data that machine learning models can process
  • The NLP pipeline includes cleaning, tokenisation, stop word removal, stemming/lemmatisation, and feature extraction
  • Bag of Words counts word occurrences; TF-IDF weights words by how unique they are to a document
  • N-grams capture word sequences and context that single-word features miss
  • Logistic Regression combined with TF-IDF produces a strong baseline for text classification tasks
  • Named Entity Recognition extracts structured information — names, organisations, dates — from free text
  • spaCy provides fast, accurate NER and text parsing for production-grade applications
  • HuggingFace Transformers provide access to state-of-the-art language models for advanced NLP tasks
