Web scraping is a powerful technique for extracting data from websites. Python is the go-to language for scraping projects thanks to the combination of the Requests library for HTTP and Beautiful Soup for HTML parsing.
Setting Up Your Environment
```shell
pip install requests beautifulsoup4 lxml pandas
```
Basic Scraping Pattern
```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"
response = requests.get(url)
response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses

soup = BeautifulSoup(response.text, "lxml")
products = soup.find_all("div", class_="product-card")

for product in products:
    name = product.find("h2").text.strip()
    price = product.find("span", class_="price").text.strip()
    print(f"{name}: {price}")
```
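One caveat with this pattern: `find()` returns `None` when nothing matches, so chaining `.text` onto it raises `AttributeError` on pages with missing markup. A minimal, more defensive sketch (the HTML snippet here is hypothetical, purely for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second card is missing its price element
html = """
<div class="product-card"><h2>Widget</h2><span class="price">$9.99</span></div>
<div class="product-card"><h2>Gadget</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for card in soup.find_all("div", class_="product-card"):
    name_tag = card.find("h2")
    price_tag = card.find("span", class_="price")
    # Fall back to a placeholder instead of crashing on a missing tag
    rows.append((
        name_tag.get_text(strip=True) if name_tag else "N/A",
        price_tag.get_text(strip=True) if price_tag else "N/A",
    ))

print(rows)
```

`get_text(strip=True)` is equivalent to `.text.strip()` but reads more clearly in guarded expressions.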
Navigating the DOM
```python
# Find by ID
header = soup.find(id="main-header")

# Find by CSS class
items = soup.find_all("div", class_="item")

# Find by attribute
links = soup.find_all("a", href=True)

# CSS selectors
articles = soup.select("article.post > h2 > a")
sidebar = soup.select_one("#sidebar .widget")

# Navigate the tree (tag is any Tag object, e.g. a result of find())
parent = tag.parent
siblings = tag.find_next_siblings("li")
children = tag.find_all(recursive=False)
```
Handling Headers and Sessions
```python
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml",
})

response = session.get(url)

# Handle cookies
session.cookies.set("session_id", "abc123")

# Handle authentication
session.post("https://example.com/login", data={
    "username": "user",
    "password": "pass",
})
Pagination
```python
import time

all_data = []
page = 1

while True:
    url = f"https://example.com/products?page={page}"
    response = session.get(url)
    soup = BeautifulSoup(response.text, "lxml")

    items = soup.find_all("div", class_="product")
    if not items:
        break  # an empty page means we've run out of results

    for item in items:
        all_data.append({
            "name": item.find("h3").text.strip(),
            # find() takes tag names, not CSS selectors; use select_one() for selectors
            "price": item.select_one(".price").text.strip(),
            "url": item.find("a")["href"],
        })

    page += 1
    time.sleep(1)  # Be respectful - wait between requests
```
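Note that scraped `href` values are often relative (e.g. `/items/42` rather than a full URL). The standard library's `urljoin` resolves them against the page URL; a small sketch with made-up URLs:

```python
from urllib.parse import urljoin

base = "https://example.com/products?page=2"

print(urljoin(base, "/items/42"))     # root-relative path
print(urljoin(base, "details/42"))    # relative to the current path
print(urljoin(base, "https://cdn.example.com/img.png"))  # already absolute, unchanged
```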
Saving Data
```python
import pandas as pd
import json

# Save to CSV
df = pd.DataFrame(all_data)
df.to_csv("products.csv", index=False)

# Save to JSON
with open("products.json", "w") as f:
    json.dump(all_data, f, indent=2)
```
Error Handling
```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def scrape_page(url, retries=3):
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()
            return BeautifulSoup(response.text, "lxml")
        except requests.RequestException as e:
            logger.warning(f"Attempt {attempt + 1} failed for {url}: {e}")
            time.sleep(2 ** attempt)  # Exponential backoff
    logger.error(f"All retries failed for {url}")
    return None
```
Ethical Scraping Guidelines
- Check robots.txt: Respect the website's crawling rules
- Rate limiting: Add delays between requests (1-2 seconds minimum)
- Identify yourself: Use a descriptive User-Agent string
- Cache responses: Avoid re-scraping the same pages
- Check terms of service: Some sites explicitly prohibit scraping
- Use APIs when available: APIs are more reliable and respectful
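The robots.txt check from the list above can be automated with the standard library's `urllib.robotparser`. The rules below are an inline example for illustration; in practice you would point `set_url()` at the site's real `/robots.txt` and call `read()`:

```python
from urllib.robotparser import RobotFileParser

# Example rules; normally fetched over the network via
# parser.set_url("https://example.com/robots.txt") followed by parser.read()
rules = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("MyScraperBot/1.0", "https://example.com/products"))
print(parser.can_fetch("MyScraperBot/1.0", "https://example.com/admin/users"))
```

`crawl_delay(user_agent)` also exposes any `Crawl-delay` directive, which you can feed into your `time.sleep()` calls instead of a hard-coded value.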
Web scraping is a valuable skill for data collection and analysis. Always practice responsible scraping and consider the impact on the websites you interact with.