Natural Language Processing (NLP) enables computers to understand, interpret, and generate human language. Python offers powerful NLP libraries that make it accessible to developers and data scientists. This guide covers the fundamentals using NLTK and spaCy.
Setting Up
pip install nltk spacy textblob
python -m spacy download en_core_web_sm
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('averaged_perceptron_tagger')"
Tokenization
import nltk
import spacy
text = "Natural Language Processing is fascinating. It helps computers understand human language."
# NLTK tokenization
from nltk.tokenize import word_tokenize, sent_tokenize
words = word_tokenize(text)
sentences = sent_tokenize(text)
print(f"Words: {words}")
print(f"Sentences: {sentences}")
# spaCy tokenization
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
tokens = [token.text for token in doc]
print(f"spaCy tokens: {tokens}")
Stop Words and Text Cleaning
from nltk.corpus import stopwords
import re
stop_words = set(stopwords.words("english"))
def clean_text(text):
    """Lowercase, strip punctuation and digits, tokenize, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    words = word_tokenize(text)
    filtered = [w for w in words if w not in stop_words]
    return " ".join(filtered)
Part-of-Speech Tagging
# spaCy POS tagging
doc = nlp("The quick brown fox jumps over the lazy dog")
for token in doc:
    print(f"{token.text}: {token.pos_} ({token.tag_})")
# Output:
# The: DET (DT)
# quick: ADJ (JJ)
# brown: ADJ (JJ)
# fox: NOUN (NN)
# jumps: VERB (VBZ)
# ...
Named Entity Recognition
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976.")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# Output:
# Apple Inc.: ORG
# Steve Jobs: PERSON
# Cupertino: GPE
# California: GPE
# 1976: DATE
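Downstream code typically groups entities by label rather than printing them; a plain dict is enough. The pairs below are copied from the sample output above.

```python
from collections import defaultdict

# (text, label) pairs, as produced by iterating doc.ents above
entities = [("Apple Inc.", "ORG"), ("Steve Jobs", "PERSON"),
            ("Cupertino", "GPE"), ("California", "GPE"), ("1976", "DATE")]

by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

print(dict(by_label))
# {'ORG': ['Apple Inc.'], 'PERSON': ['Steve Jobs'],
#  'GPE': ['Cupertino', 'California'], 'DATE': ['1976']}
```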
Sentiment Analysis
from textblob import TextBlob
texts = [
    "This product is amazing and works perfectly!",
    "Terrible experience, I want a refund.",
    "The service was okay, nothing special."
]
for text in texts:
    blob = TextBlob(text)
    sentiment = blob.sentiment
    print(f"Text: {text}")
    print(f"  Polarity: {sentiment.polarity:.2f} (-1 to 1)")
    print(f"  Subjectivity: {sentiment.subjectivity:.2f} (0 to 1)")
Text Similarity
# Note: en_core_web_sm ships without real word vectors, so similarity scores
# from it are approximate (and spaCy emits a warning). Use en_core_web_md or
# en_core_web_lg for meaningful similarity comparisons.
doc1 = nlp("I love programming in Python")
doc2 = nlp("Python programming is my favorite hobby")
doc3 = nlp("The weather is sunny today")
print(f"doc1 vs doc2: {doc1.similarity(doc2):.4f}")
print(f"doc1 vs doc3: {doc1.similarity(doc3):.4f}")
TF-IDF for Text Classification
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
texts = ["great product love it", "terrible waste of money", ...]
labels = [1, 0, ...] # 1 = positive, 0 = negative
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
accuracy = classifier.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
Next Steps
- Explore Hugging Face Transformers for state-of-the-art NLP
- Learn about word embeddings (Word2Vec, GloVe, BERT)
- Practice with real datasets from Kaggle
- Build projects: chatbot, text summarizer, sentiment dashboard
NLP is one of the most exciting and practical areas of artificial intelligence. Start with these fundamentals, build small projects, and gradually explore more advanced techniques with transformers and large language models.