Python Web Scraping Tutorial: Beautiful Soup and Requests Complete Guide

Web scraping is a powerful technique for extracting data from websites. Python is a go-to language for scraping projects thanks to the combination of the Requests library for HTTP and Beautiful Soup for HTML parsing.

Setting Up Your Environment

pip install requests beautifulsoup4 lxml pandas

Basic Scraping Pattern

import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"
response = requests.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "lxml")

products = soup.find_all("div", class_="product-card")
for product in products:
    name = product.find("h2").text.strip()
    price = product.find("span", class_="price").text.strip()
    print(f"{name}: {price}")
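One caveat with the pattern above: if a card is missing a field, `find()` returns `None` and the subsequent `.text` raises `AttributeError`. A small defensive variant, sketched against an inline HTML snippet rather than a live URL so it runs standalone:

```python
from bs4 import BeautifulSoup

# Inline sample markup standing in for a fetched page (assumption for the sketch)
html = """
<div class="product-card"><h2>Widget</h2><span class="price">$9.99</span></div>
<div class="product-card"><h2>Gadget</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

results = []
for card in soup.find_all("div", class_="product-card"):
    name_tag = card.find("h2")
    price_tag = card.find("span", class_="price")
    if name_tag is None or price_tag is None:
        continue  # skip incomplete cards instead of crashing
    results.append((name_tag.text.strip(), price_tag.text.strip()))

print(results)  # [('Widget', '$9.99')]
```

The second card lacks a price, so it is skipped rather than raising an exception.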

Navigating the DOM

# Find by ID
header = soup.find(id="main-header")

# Find by CSS class
items = soup.find_all("div", class_="item")

# Find by attribute
links = soup.find_all("a", href=True)

# CSS selectors
articles = soup.select("article.post > h2 > a")
sidebar = soup.select_one("#sidebar .widget")

# Navigate the tree (assuming `tag` is an element located earlier)
parent = tag.parent
siblings = tag.find_next_siblings("li")
children = tag.find_all(recursive=False)
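These navigation calls can be exercised against a small inline document; the markup below is invented purely to demonstrate `select`, `select_one`, and `find_next_siblings`:

```python
from bs4 import BeautifulSoup

# Minimal sample document (assumption for the sketch)
html = """
<article class="post"><h2><a href="/a">First</a></h2></article>
<div id="sidebar"><div class="widget">Search</div></div>
<ul><li id="x">one</li><li>two</li><li>three</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.select("article.post > h2 > a")   # CSS child combinator
sidebar = soup.select_one("#sidebar .widget")  # first match only
first_li = soup.find("li", id="x")
siblings = first_li.find_next_siblings("li")   # the two <li> after it

print(links[0]["href"])  # /a
print(sidebar.text)      # Search
print(len(siblings))     # 2
```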

Handling Headers and Sessions

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml"
})

response = session.get(url)

# Handle cookies
session.cookies.set("session_id", "abc123")

# Handle authentication
session.post("https://example.com/login", data={
    "username": "user",
    "password": "pass"
})
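Transient failures (429, 5xx) can also be retried at the transport level rather than in your own loop. A hedged sketch using Requests' `HTTPAdapter` with urllib3's `Retry`; the status codes listed are a common choice, not a requirement:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Up to 3 attempts with exponential backoff on common transient errors
retry = Retry(total=3, backoff_factor=1,
              status_forcelist=[429, 500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)
session.mount("http://", adapter)
```

Every request made through this session now retries automatically, so per-call retry loops become unnecessary for those status codes.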

Pagination

import time

all_data = []
page = 1

while True:
    url = f"https://example.com/products?page={page}"
    response = session.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    
    items = soup.find_all("div", class_="product")
    if not items:
        break
    
    for item in items:
        all_data.append({
            "name": item.find("h3").text.strip(),
            "price": item.select_one(".price").text.strip(),  # find() does not accept CSS selectors; use select_one()
            "url": item.find("a")["href"]
        })
    
    page += 1
    time.sleep(1)  # Be respectful - wait between requests
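Incrementing a page counter assumes the URL scheme stays stable; where the markup exposes a "next" link, following it is often more robust. A minimal offline sketch of the stopping logic, using hard-coded HTML strings (an assumption for the example) in place of live responses:

```python
from bs4 import BeautifulSoup

# Simulated pages standing in for fetched responses (assumption for the sketch)
pages = [
    '<div class="product">A</div><a class="next" href="?page=2">Next</a>',
    '<div class="product">B</div>',  # no "next" link: last page
]

names = []
for html in pages:
    soup = BeautifulSoup(html, "html.parser")
    names.extend(p.text for p in soup.find_all("div", class_="product"))
    if soup.find("a", class_="next") is None:
        break  # stop when the site itself offers no further page

print(names)  # ['A', 'B']
```

In a live scraper you would fetch the `href` of the "next" anchor instead of iterating a list, but the termination condition is the same.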

Saving Data

import pandas as pd
import json

# Save to CSV
df = pd.DataFrame(all_data)
df.to_csv("products.csv", index=False)

# Save to JSON
with open("products.json", "w") as f:
    json.dump(all_data, f, indent=2)

Error Handling

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def scrape_page(url, retries=3):
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()
            return BeautifulSoup(response.text, "lxml")
        except requests.RequestException as e:
            logger.warning(f"Attempt {attempt + 1} failed for {url}: {e}")
            time.sleep(2 ** attempt)  # Exponential backoff
    
    logger.error(f"All retries failed for {url}")
    return None

Ethical Scraping Guidelines

  1. Check robots.txt: Respect the website's crawling rules
  2. Rate limiting: Add delays between requests (1-2 seconds minimum)
  3. Identify yourself: Use a descriptive User-Agent string
  4. Cache responses: Avoid re-scraping the same pages
  5. Check terms of service: Some sites explicitly prohibit scraping
  6. Use APIs when available: APIs are more reliable and respectful
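The robots.txt check in point 1 can be automated with the standard library's `urllib.robotparser`. The sketch below parses an assumed robots.txt body inline so it runs offline; against a real site you would call `set_url("https://example.com/robots.txt")` followed by `read()`:

```python
from urllib import robotparser

# Example robots.txt content (invented for illustration)
rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("MyScraper/1.0", "https://example.com/products"))   # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/x"))  # False
print(rp.crawl_delay("MyScraper/1.0"))  # 2
```

Checking `crawl_delay` lets you honor the site's requested pacing instead of a hard-coded `time.sleep(1)`.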

Web scraping is a valuable skill for data collection and analysis. Always practice responsible scraping and consider the impact on the websites you interact with.

About the Author

Bas van den Berg

IT Administrator, Security Architect, Infrastructure Security Specialist, Technical Author

Bas van den Berg is an experienced IT Administrator and Security Architect specializing in the design, protection, and long-term operation of secure IT infrastructures.

With a strong background in system administration and cybersecurity, he has worked extensively with enterprise environments, focusing on access control...

