# Python Web Scraping Tutorial

Extract data from websites using Python, with ethical guidelines.

## 1. Why Web Scraping?
Automate data collection for:
- Market research
- Price monitoring
- Machine learning datasets
- Aggregation platforms
⚠️ Always:

- Check `robots.txt` (e.g., https://example.com/robots.txt); a sketch of an automated check follows this list
- Respect website terms of service
- Limit request rate (add delays)
- Use APIs if available
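You can automate the `robots.txt` check with Python's standard-library `urllib.robotparser`; a minimal sketch (the books.toscrape.com URLs are just examples):

```python
from urllib import robotparser

# Download and parse the site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("http://books.toscrape.com/robots.txt")
rp.read()

# True if the given user agent is allowed to fetch the URL
print(rp.can_fetch("*", "http://books.toscrape.com/catalogue/page-1.html"))
```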
## 2. Tools You’ll Need

```bash
pip install requests beautifulsoup4 pandas
```

- `requests`: Fetch web pages.
- `BeautifulSoup`: Parse HTML/XML.
- `pandas`: Export data (CSV/Excel).
## 3. Basic Scraping Example

Extract book titles from books.toscrape.com:
```python
import requests
from bs4 import BeautifulSoup

# Fetch the webpage
url = "http://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Find all <h3> tags (which contain book titles)
titles = soup.find_all("h3")

# The full title is stored on each <h3>'s <a> tag as a title attribute
for title in titles:
    print(title.a["title"])
```
Output:

```text
A Light in the Attic
Tipping the Velvet
Soumission
...
```
## 4. Advanced Techniques

### a) Pagination Handling
Loop through multiple pages:

```python
base_url = "http://books.toscrape.com/catalogue/page-{}.html"

for page in range(1, 51):  # 50 pages
    url = base_url.format(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # ...extract data here...
```
### b) CSS Selectors

Get book prices using classes:

```python
prices = soup.select("p.price_color")

for price in prices:
    print(price.get_text())
```
### c) Data Storage

Save to CSV with pandas:

```python
import pandas as pd

# titles_list and prices_list are the lists of values collected above
data = {"Title": titles_list, "Price": prices_list}
df = pd.DataFrame(data)
df.to_csv("books.csv", index=False)
```
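Putting a), b), and c) together, here is a minimal end-to-end sketch; the 2-second delay and the list names (`titles_list`, `prices_list`) are choices made for this example, not requirements:

```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = "http://books.toscrape.com/catalogue/page-{}.html"
titles_list, prices_list = [], []

for page in range(1, 51):  # 50 pages
    response = requests.get(base_url.format(page))
    soup = BeautifulSoup(response.text, "html.parser")

    # Titles sit in the <a> inside each <h3>; prices use the price_color class
    titles_list += [h3.a["title"] for h3 in soup.find_all("h3")]
    prices_list += [p.get_text() for p in soup.select("p.price_color")]

    time.sleep(2)  # polite delay between requests

df = pd.DataFrame({"Title": titles_list, "Price": prices_list})
df.to_csv("books.csv", index=False)
```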
## 5. Dynamic Websites (JavaScript-Rendered Content)

Use `selenium` when data loads via JavaScript:

```bash
pip install selenium webdriver-manager
```
```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 expects the driver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://dynamic-website.com")

# Wait for content to load
driver.implicitly_wait(10)

# Extract the rendered HTML
soup = BeautifulSoup(driver.page_source, "html.parser")
# ...parse as before...

driver.quit()
```
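`implicitly_wait` sets a blanket timeout for element lookups; when you know which element signals that the JavaScript has finished, an explicit wait is more precise. A sketch reusing the `driver` from the snippet above and assuming the data appears in a hypothetical `div.content` element:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Block for up to 10 seconds until the element shows up in the DOM,
# after which it is safe to read driver.page_source
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.content"))
)
```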
## 6. Avoid Blocking: Best Practices

- Rotate User-Agents (a rotation sketch follows this list):

  ```python
  headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
  requests.get(url, headers=headers)
  ```

- Add delays:

  ```python
  import time

  time.sleep(2)  # Wait 2 seconds between requests
  ```

- Use proxies: for large-scale scraping.
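What rotation can look like in practice: a small sketch that picks a random User-Agent per request and pauses between calls (the strings in `USER_AGENTS` are illustrative; real jobs would use a larger pool):

```python
import random
import time

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def polite_get(url):
    # Choose a different User-Agent each call and pause afterwards
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers)
    time.sleep(2)
    return response
```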
## 7. Ethical Scraping Checklist

- ✅ Check `robots.txt`.
- ✅ Do not overload servers (≥3 s between requests).
- ✅ Scrape during off-peak hours.
- ✅ Credit the data source.
- ❌ Never scrape personal data.
## 8. When to Stop?
Switch to an API if:
- The site offers one (e.g., Twitter, Reddit).
- You need real-time data.
- Scraping violates terms.
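For comparison, an official API usually returns structured JSON in one request, with no HTML parsing; a sketch against a hypothetical endpoint (`api.example.com` and its parameters are placeholders, not a real service):

```python
import requests

# Hypothetical endpoint; substitute the site's documented API
response = requests.get(
    "https://api.example.com/v1/books",
    params={"page": 1},
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # if a key is required
)
books = response.json()  # already-structured data
```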
## Final Output Example

`books.csv`:
| Title                | Price  |
|----------------------|--------|
| A Light in the Attic | £51.77 |
| Tipping the Velvet   | £53.74 |
## Resources

- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- Scraping Sandbox: [books.toscrape.com](http://books.toscrape.com/)
- Legal Guide to Web Scraping
Practice responsibly and happy scraping! 🚀