Web scraping is a powerful technique for extracting data from websites. Python is the go-to language for scraping projects thanks to the combination of the Requests library for HTTP and Beautiful Soup for HTML parsing.
Setting Up Your Environment
```shell
pip install requests beautifulsoup4 lxml pandas
```
Basic Scraping Pattern
```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"
response = requests.get(url)
response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses

soup = BeautifulSoup(response.text, "lxml")
products = soup.find_all("div", class_="product-card")

for product in products:
    name = product.find("h2").text.strip()
    price = product.find("span", class_="price").text.strip()
    print(f"{name}: {price}")
```
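One caveat with this pattern: `find()` returns `None` when nothing matches, so chaining `.text` onto it raises `AttributeError` on pages with missing markup. A minimal, more defensive sketch (the HTML snippet here is hypothetical, purely for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second card is missing its price element
html = """
<div class="product-card"><h2>Widget</h2><span class="price">$9.99</span></div>
<div class="product-card"><h2>Gadget</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for card in soup.find_all("div", class_="product-card"):
    name_tag = card.find("h2")
    price_tag = card.find("span", class_="price")
    # Fall back to a placeholder instead of crashing on a missing tag
    rows.append((
        name_tag.get_text(strip=True) if name_tag else "N/A",
        price_tag.get_text(strip=True) if price_tag else "N/A",
    ))

print(rows)
```

`get_text(strip=True)` is equivalent to `.text.strip()` but reads more clearly in guarded expressions.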
Navigating the DOM
```python
# Find by ID
header = soup.find(id="main-header")

# Find by CSS class
items = soup.find_all("div", class_="item")

# Find by attribute
links = soup.find_all("a", href=True)

# CSS selectors
articles = soup.select("article.post > h2 > a")
sidebar = soup.select_one("#sidebar .widget")

# Navigate the tree (tag is any Tag object, e.g. a result of find())
parent = tag.parent
siblings = tag.find_next_siblings("li")
children = tag.find_all(recursive=False)
```
Handling Headers and Sessions
```python
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml",
})

response = session.get(url)

# Handle cookies
session.cookies.set("session_id", "abc123")

# Handle authentication
session.post("https://example.com/login", data={
    "username": "user",
    "password": "pass",
})
Pagination
```python
import time

all_data = []
page = 1

while True:
    url = f"https://example.com/products?page={page}"
    response = session.get(url)
    soup = BeautifulSoup(response.text, "lxml")

    items = soup.find_all("div", class_="product")
    if not items:
        break  # an empty page means we've run out of results

    for item in items:
        all_data.append({
            "name": item.find("h3").text.strip(),
            # find() takes tag names, not CSS selectors; use select_one() for selectors
            "price": item.select_one(".price").text.strip(),
            "url": item.find("a")["href"],
        })

    page += 1
    time.sleep(1)  # Be respectful - wait between requests
```
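Note that scraped `href` values are often relative (e.g. `/items/42` rather than a full URL). The standard library's `urljoin` resolves them against the page URL; a small sketch with made-up URLs:

```python
from urllib.parse import urljoin

base = "https://example.com/products?page=2"

print(urljoin(base, "/items/42"))     # root-relative path
print(urljoin(base, "details/42"))    # relative to the current path
print(urljoin(base, "https://cdn.example.com/img.png"))  # already absolute, unchanged
```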
Saving Data
```python
import pandas as pd
import json

# Save to CSV
df = pd.DataFrame(all_data)
df.to_csv("products.csv", index=False)

# Save to JSON
with open("products.json", "w") as f:
    json.dump(all_data, f, indent=2)
```
Error Handling
```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def scrape_page(url, retries=3):
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()
            return BeautifulSoup(response.text, "lxml")
        except requests.RequestException as e:
            logger.warning(f"Attempt {attempt + 1} failed for {url}: {e}")
            time.sleep(2 ** attempt)  # Exponential backoff
    logger.error(f"All retries failed for {url}")
    return None
```
Ethical Scraping Guidelines
- Check robots.txt: Respect the website's crawling rules
- Rate limiting: Add delays between requests (1-2 seconds minimum)
- Identify yourself: Use a descriptive User-Agent string
- Cache responses: Avoid re-scraping the same pages
- Check terms of service: Some sites explicitly prohibit scraping
- Use APIs when available: APIs are more reliable and respectful
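The robots.txt check from the list above can be automated with the standard library's `urllib.robotparser`. The rules below are an inline example for illustration; in practice you would point `set_url()` at the site's real `/robots.txt` and call `read()`:

```python
from urllib.robotparser import RobotFileParser

# Example rules; normally fetched over the network via
# parser.set_url("https://example.com/robots.txt") followed by parser.read()
rules = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("MyScraperBot/1.0", "https://example.com/products"))
print(parser.can_fetch("MyScraperBot/1.0", "https://example.com/admin/users"))
```

`crawl_delay(user_agent)` also exposes any `Crawl-delay` directive, which you can feed into your `time.sleep()` calls instead of a hard-coded value.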
Web scraping is a valuable skill for data collection and analysis. Always practice responsible scraping and consider the impact on the websites you interact with.