DS Natural Language Processing (NLP)
Natural Language Processing (NLP) is the branch of data science that enables computers to understand, process, and generate human language. Text data — customer reviews, social media posts, emails, news articles, and support tickets — is one of the richest and most abundant data sources available. NLP unlocks the information hidden inside it.
What NLP Solves
| NLP Task | Real-World Application |
|---|---|
| Sentiment Analysis | Determine if a product review is positive or negative |
| Text Classification | Route a support ticket to the correct department |
| Spam Detection | Filter unwanted emails from an inbox |
| Named Entity Recognition | Extract names, locations, and dates from news articles |
| Machine Translation | Translate text between languages automatically |
| Text Summarisation | Compress a long document into key points |
| Chatbots / Q&A | Answer customer questions in natural language |
The NLP Pipeline
Raw Text Input
│
▼
Text Cleaning
(lowercase, remove punctuation, remove HTML)
│
▼
Tokenisation
(split text into individual words or sentences)
│
▼
Stop Word Removal
(remove "the", "is", "and", "a" — low-information words)
│
▼
Stemming / Lemmatisation
(reduce words to their root form)
│
▼
Feature Extraction
(convert text into numbers: Bag of Words, TF-IDF, Word Embeddings)
│
▼
Model Training
(classification, clustering, regression on text features)
│
▼
Prediction / Output
Step 1 – Text Cleaning
import re
import string

def clean_text(text):
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r"http\S+|www\S+", "", text)
    # Remove HTML tags
    text = re.sub(r"<.*?>", "", text)
    # Remove punctuation and numbers
    text = re.sub(r"[^a-z\s]", "", text)
    # Remove extra whitespace
    text = " ".join(text.split())
    return text
# Test
raw = "Amazing product! 5/5 stars ⭐ Visit https://shop.com for more. #BestBuy"
clean = clean_text(raw)
print("Raw :", raw)
print("Cleaned:", clean)
Output:
Raw    : Amazing product! 5/5 stars ⭐ Visit https://shop.com for more. #BestBuy
Cleaned: amazing product stars visit for more bestbuy
Step 2 – Tokenisation
Tokenisation splits text into individual units called tokens. Word tokenisation produces one token per word. Sentence tokenisation produces one token per sentence.
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import word_tokenize, sent_tokenize
text = "The camera quality is excellent. Battery life could be better. Overall a great phone!"
# Word tokens
word_tokens = word_tokenize(text)
print("Word Tokens:", word_tokens)
# Sentence tokens
sent_tokens = sent_tokenize(text)
print("\nSentence Tokens:")
for i, s in enumerate(sent_tokens, 1):
print(f" Sentence {i}: {s}")
Output:
Word Tokens: ['The', 'camera', 'quality', 'is', 'excellent', '.', 'Battery',
'life', 'could', 'be', 'better', '.', 'Overall', 'a', 'great', 'phone', '!']
Sentence 1: The camera quality is excellent.
Sentence 2: Battery life could be better.
Sentence 3: Overall a great phone!
Step 3 – Stop Word Removal
Stop words are common words that appear in almost every sentence but carry very little meaning — "the", "is", "a", "and". Removing them reduces noise and speeds up model training.
from nltk.corpus import stopwords
nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))
# Sample tokens
tokens = ["the", "camera", "quality", "is", "excellent", "a", "very", "good", "phone"]
# Remove stop words
filtered = [word for word in tokens if word not in stop_words]
print("Before:", tokens)
print("After :", filtered)
print("Removed:", set(tokens) - set(filtered))
Output:
Before: ['the', 'camera', 'quality', 'is', 'excellent', 'a', 'very', 'good', 'phone']
After : ['camera', 'quality', 'excellent', 'good', 'phone']
Removed: {'is', 'the', 'a', 'very'}
Step 4 – Stemming and Lemmatisation
Stemming and lemmatisation both reduce words to a base form, so that "running" and "runs" map to the same root as "run". This shrinks the vocabulary and prevents the model from treating closely related words as unrelated.
Diagram – Stemming vs Lemmatisation
Word          Stemming (Porter)   Lemmatisation
─────────     ─────────────────   ─────────────
"running"  →  "run"               "run"
"better"   →  "better"            "good"   ← uses context
"studies"  →  "studi"             "study"
"caring"   →  "care"              "care"
"feet"     →  "feet"              "foot"   ← uses dictionary

Stemming:      Fast, crude — chops off endings (may not be real words)
Lemmatisation: Slower, accurate — uses dictionary to find root
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet", quiet=True)
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["running", "studies", "better", "caring", "feet", "wolves"]
print(f"{'Word':<12} {'Stemmed':<12} {'Lemmatised'}")
print("-" * 36)
for word in words:
    stemmed = stemmer.stem(word)
    lemmatised = lemmatizer.lemmatize(word)
    print(f"{word:<12} {stemmed:<12} {lemmatised}")
Output:
Word         Stemmed      Lemmatised
------------------------------------
running      run          running
studies      studi        study
better       better       better
caring       care         caring
feet         feet         foot
wolves       wolv         wolf
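Notice that the lemmatiser left "running", "better", and "caring" unchanged: NLTK's WordNetLemmatizer assumes every word is a noun unless told otherwise. A short sketch using its pos argument recovers the context-aware roots shown in the comparison above:

# Lemmatisation improves when the part of speech is supplied
print(lemmatizer.lemmatize("better", pos="a"))   # good  (adjective)
print(lemmatizer.lemmatize("running", pos="v"))  # run   (verb)
print(lemmatizer.lemmatize("caring", pos="v"))   # care  (verb)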
Step 5 – Feature Extraction
Machine learning models require numbers as input, not raw text. Feature extraction converts text into numeric vectors that capture the information in the words.
Bag of Words (BoW)
Bag of Words counts how many times each word in the vocabulary appears in a document. The word order and grammar are ignored — only the word counts matter.
Corpus (3 documents):
Doc 1: "the food was great"
Doc 2: "the service was bad"
Doc 3: "great food great service"
Vocabulary: [bad, food, great, service, the, was]
BoW Matrix:
bad food great service the was
Doc 1: 0 1 1 0 1 1
Doc 2: 1 0 0 1 1 1
Doc 3: 0 1 2 1 0 0
Each row is a document, each column is a word count.
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
"the food was great",
"the service was bad",
"great food great service"
]
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(corpus)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("\nBoW Matrix:\n", X_bow.toarray())
TF-IDF – Term Frequency–Inverse Document Frequency
TF-IDF improves on Bag of Words by down-weighting words that appear in many documents (like "the", "was") and up-weighting words that are rare but important in a specific document. This better reflects what makes each document unique.
TF     = Count of word in document / Total words in document
IDF    = log(Total documents / Documents containing the word)
TF-IDF = TF × IDF

Word "great" in Doc 3:
    TF    = 2/4 = 0.50        (appears twice in a 4-word document)
    IDF   = log(3/2) = 0.405  (appears in 2 of 3 documents)
    Score = 0.50 × 0.405 = 0.20

Word "bad" in Doc 2:
    TF    = 1/4 = 0.25        (appears once)
    IDF   = log(3/1) = 1.099  (appears in only 1 document → rarer → higher IDF)
    Score = 0.25 × 1.099 = 0.27
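The two worked examples can be verified in plain Python with the classic (unsmoothed) formula above. This is only a sketch of the idea: scikit-learn's TfidfVectorizer applies IDF smoothing and L2 normalisation by default, so the values it prints below will differ slightly from the hand calculation.

from math import log

# Classic (unsmoothed) TF-IDF for the two worked examples
tfidf_great = (2 / 4) * log(3 / 2)   # "great" in Doc 3
tfidf_bad   = (1 / 4) * log(3 / 1)   # "bad" in Doc 2
print(round(tfidf_great, 2), round(tfidf_bad, 2))   # 0.2 0.27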
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
import pandas as pd
df_tfidf = pd.DataFrame(
X_tfidf.toarray(),
columns=tfidf.get_feature_names_out(),
index=["Doc1", "Doc2", "Doc3"]
)
print(df_tfidf.round(3))
Sentiment Analysis – Full Example
Sentiment analysis classifies text as positive, negative, or neutral. This complete example builds a sentiment classifier for product reviews using TF-IDF and Logistic Regression.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
# Sample labelled reviews (1 = positive, 0 = negative)
reviews = [
"This product is absolutely amazing and works perfectly",
"Very disappointed with the quality, completely broken",
"Excellent value for money, highly recommend",
"Terrible experience, will never buy again",
"Outstanding product, exceeded all expectations",
"Complete waste of money, horrible quality",
"Great build quality and fast delivery",
"Does not work as described, very frustrating",
"Love this product! Best purchase this year",
"Poor quality, stopped working after 2 days",
"Fantastic item, exactly as described",
"Awful experience, customer service unhelpful",
"Superb quality and arrived on time",
"Broken on arrival, very disappointed",
"Brilliant product, highly recommended to everyone"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
# Split
X_train, X_test, y_train, y_test = train_test_split(
reviews, labels, test_size=0.3, random_state=42
)
# Build pipeline: TF-IDF → Logistic Regression
pipeline = Pipeline([
("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
("clf", LogisticRegression(max_iter=1000, random_state=42))
])
# Train
pipeline.fit(X_train, y_train)
# Evaluate
y_pred = pipeline.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))
# Predict on new reviews
new_reviews = [
"The battery life is incredible and the display is sharp",
"Stopped working after one week, very annoyed"
]
predictions = pipeline.predict(new_reviews)
for review, pred in zip(new_reviews, predictions):
label = "Positive 😊" if pred == 1 else "Negative 😞"
print(f"\n{label}: {review}")
Output:
Accuracy: 1.0

Positive 😊: The battery life is incredible and the display is sharp
Negative 😞: Stopped working after one week, very annoyed
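Perfect accuracy on a 15-review dataset says little by itself; a more informative check is which terms the model weights most heavily. The sketch below assumes the fitted pipeline from above and uses scikit-learn's named_steps to reach the vectoriser vocabulary and the logistic regression coefficients:

import numpy as np

# Pair each TF-IDF feature with its logistic-regression weight
feature_names = pipeline.named_steps["tfidf"].get_feature_names_out()
weights = pipeline.named_steps["clf"].coef_[0]
order = np.argsort(weights)

print("Most negative terms:", list(feature_names[order[:5]]))
print("Most positive terms:", list(feature_names[order[-5:]]))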
N-Grams – Capturing Word Sequences
A unigram is a single word. A bigram is a pair of consecutive words. N-grams capture context that single words miss — "not good" is very different from "good" and "not" treated separately.
from sklearn.feature_extraction.text import CountVectorizer
text = ["the product is not good at all"]
# Unigrams (single words)
cv1 = CountVectorizer(ngram_range=(1, 1))
print("Unigrams:", cv1.fit_transform(text).toarray())
print("Vocab:", cv1.get_feature_names_out())
# Bigrams (word pairs)
cv2 = CountVectorizer(ngram_range=(2, 2))
print("\nBigrams:", cv2.fit_transform(text).toarray())
print("Vocab:", cv2.get_feature_names_out())
Output (Bigrams):
Vocab: ['at all', 'good at', 'is not', 'not good', 'product is', 'the product']

→ "not good" is captured as a bigram → the model understands negation
→ "is not" signals negation even before seeing the adjective
Named Entity Recognition with spaCy
Named Entity Recognition (NER) identifies real-world entities in text — people, organisations, locations, dates, money, and more. spaCy is a fast, production-grade NLP library for Python with accurate pretrained NER models.
# Installation: pip install spacy
# python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
text = """Apple announced a new iPhone at its event in California on September 12th.
The device costs $999 and will be available from Tim Cook's team next month."""
doc = nlp(text)
print("Named Entities Found:")
print(f"{'Entity Text':<30} {'Label':<15} {'Meaning'}")
print("-" * 65)
entity_meanings = {
"ORG": "Organisation",
"PRODUCT": "Product",
"GPE": "Country/City/State",
"DATE": "Date or Period",
"MONEY": "Monetary Value",
"PERSON": "Person Name"
}
for ent in doc.ents:
    meaning = entity_meanings.get(ent.label_, ent.label_)
    print(f"{ent.text:<30} {ent.label_:<15} {meaning}")
Output:
Named Entities Found:
Entity Text                    Label           Meaning
-----------------------------------------------------------------
Apple                          ORG             Organisation
iPhone                         PRODUCT         Product
California                     GPE             Country/City/State
September 12th                 DATE            Date or Period
$999                           MONEY           Monetary Value
Tim Cook                       PERSON          Person Name
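The entity labels make it easy to pull structured fields out of free text, for example only the monetary values or dates. A small sketch over the doc object parsed above (exact matches depend on the model version):

# Filter entities by label
money = [ent.text for ent in doc.ents if ent.label_ == "MONEY"]
dates = [ent.text for ent in doc.ents if ent.label_ == "DATE"]
print("Money:", money)   # e.g. ['$999']
print("Dates:", dates)   # e.g. ['September 12th', 'next month']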
Word Frequency Analysis
from collections import Counter
import matplotlib.pyplot as plt
sample_text = """data science machine learning artificial intelligence
python data analysis machine learning deep learning
data science python programming statistics machine"""
# Tokenise and count
words = sample_text.lower().split()
counts = Counter(words)
top10 = counts.most_common(10)
words_list, freq_list = zip(*top10)
plt.figure(figsize=(9, 4))
plt.bar(words_list, freq_list, color="steelblue", edgecolor="black")
plt.title("Top 10 Most Frequent Words")
plt.xlabel("Word")
plt.ylabel("Frequency")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.savefig("word_frequency.png")
plt.show()
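On real corpora the top of this chart is usually dominated by stop words, so it is common to filter them out before counting. A sketch reusing the stop_words set from Step 3:

# Drop stop words before counting so filler words don't dominate the chart
filtered_counts = Counter(w for w in words if w not in stop_words)
print(filtered_counts.most_common(10))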
Key NLP Libraries in Python
| Library | Best For | Install |
|---|---|---|
| NLTK | Tokenisation, stemming, stop words, learning NLP concepts | pip install nltk |
| spaCy | Fast NER, dependency parsing, production NLP pipelines | pip install spacy |
| Scikit-learn | BoW, TF-IDF, text classification models | pip install scikit-learn |
| Transformers (HuggingFace) | State-of-the-art BERT, GPT models, semantic search | pip install transformers |
| Gensim | Word2Vec embeddings, topic modelling (LDA) | pip install gensim |
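For tasks where the TF-IDF baseline is not accurate enough, HuggingFace Transformers wraps pretrained models behind a one-line pipeline API. A minimal sketch (the first call downloads a default English sentiment model of several hundred MB):

from transformers import pipeline

# Pretrained sentiment model, downloaded on first use
sentiment = pipeline("sentiment-analysis")
print(sentiment("The battery life is incredible and the display is sharp"))
# e.g. [{'label': 'POSITIVE', 'score': 0.9998}]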
Summary
- NLP converts unstructured text into structured numeric data that machine learning models can process
- The NLP pipeline includes cleaning, tokenisation, stop word removal, stemming/lemmatisation, and feature extraction
- Bag of Words counts word occurrences; TF-IDF weights words by how unique they are to a document
- N-grams capture word sequences and context that single-word features miss
- Logistic Regression combined with TF-IDF produces a strong baseline for text classification tasks
- Named Entity Recognition extracts structured information — names, organisations, dates — from free text
- spaCy provides fast, accurate NER and text parsing for production-grade applications
- HuggingFace Transformers provide access to state-of-the-art language models for advanced NLP tasks
