# Python Web Scraping Tutorial

Extract data from websites using Python, with ethical guidelines.

## 1. Why Web Scraping?
Automate data collection for:
- Market research
- Price monitoring
- Machine learning datasets
- Aggregation platforms
⚠️ Always:

- Check `robots.txt` (e.g., https://example.com/robots.txt); a sketch of an automated check follows this list
- Respect website terms of service
- Limit request rate (add delays)
- Use APIs if available
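You can automate the `robots.txt` check with Python's standard-library `urllib.robotparser`; a minimal sketch (the books.toscrape.com URLs are just examples):

```python
from urllib import robotparser

# Download and parse the site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("http://books.toscrape.com/robots.txt")
rp.read()

# True if the given user agent is allowed to fetch the URL
print(rp.can_fetch("*", "http://books.toscrape.com/catalogue/page-1.html"))
```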
## 2. Tools You’ll Need

```bash
pip install requests beautifulsoup4 pandas
```

- `requests`: Fetch web pages.
- `BeautifulSoup`: Parse HTML/XML.
- `pandas`: Export data (CSV/Excel).
## 3. Basic Scraping Example

Extract book titles from books.toscrape.com:
```python
import requests
from bs4 import BeautifulSoup

# Fetch the webpage
url = "http://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Find all <h3> tags (which contain book titles)
titles = soup.find_all("h3")

# The full title is stored on each <h3>'s <a> tag as a title attribute
for title in titles:
    print(title.a["title"])
```
Output:

```text
A Light in the Attic
Tipping the Velvet
Soumission
...
```
## 4. Advanced Techniques

### a) Pagination Handling
Loop through multiple pages:

```python
base_url = "http://books.toscrape.com/catalogue/page-{}.html"

for page in range(1, 51):  # 50 pages
    url = base_url.format(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # ...extract data here...
```
### b) CSS Selectors

Get book prices using classes:

```python
prices = soup.select("p.price_color")

for price in prices:
    print(price.get_text())
```
### c) Data Storage

Save to CSV with pandas:

```python
import pandas as pd

# titles_list and prices_list are the lists of values collected above
data = {"Title": titles_list, "Price": prices_list}
df = pd.DataFrame(data)
df.to_csv("books.csv", index=False)
```
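Putting a), b), and c) together, here is a minimal end-to-end sketch; the 2-second delay and the list names (`titles_list`, `prices_list`) are choices made for this example, not requirements:

```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = "http://books.toscrape.com/catalogue/page-{}.html"
titles_list, prices_list = [], []

for page in range(1, 51):  # 50 pages
    response = requests.get(base_url.format(page))
    soup = BeautifulSoup(response.text, "html.parser")

    # Titles sit in the <a> inside each <h3>; prices use the price_color class
    titles_list += [h3.a["title"] for h3 in soup.find_all("h3")]
    prices_list += [p.get_text() for p in soup.select("p.price_color")]

    time.sleep(2)  # polite delay between requests

df = pd.DataFrame({"Title": titles_list, "Price": prices_list})
df.to_csv("books.csv", index=False)
```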
## 5. Dynamic Websites (JavaScript-Rendered Content)

Use `selenium` when data loads via JavaScript:

```bash
pip install selenium webdriver-manager
```
```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 expects the driver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://dynamic-website.com")

# Wait for content to load
driver.implicitly_wait(10)

# Extract the rendered HTML
soup = BeautifulSoup(driver.page_source, "html.parser")
# ...parse as before...

driver.quit()
```
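`implicitly_wait` sets a blanket timeout for element lookups; when you know which element signals that the JavaScript has finished, an explicit wait is more precise. A sketch reusing the `driver` from the snippet above and assuming the data appears in a hypothetical `div.content` element:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Block for up to 10 seconds until the element shows up in the DOM,
# after which it is safe to read driver.page_source
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.content"))
)
```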
## 6. Avoid Blocking: Best Practices

- Rotate User-Agents (a rotation sketch follows this list):

  ```python
  headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
  requests.get(url, headers=headers)
  ```

- Add delays:

  ```python
  import time

  time.sleep(2)  # Wait 2 seconds between requests
  ```

- Use proxies: for large-scale scraping.
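What rotation can look like in practice: a small sketch that picks a random User-Agent per request and pauses between calls (the strings in `USER_AGENTS` are illustrative; real jobs would use a larger pool):

```python
import random
import time

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def polite_get(url):
    # Choose a different User-Agent each call and pause afterwards
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers)
    time.sleep(2)
    return response
```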
## 7. Ethical Scraping Checklist

- ✅ Check `robots.txt`.
- ✅ Do not overload servers (≥3 s between requests).
- ✅ Scrape during off-peak hours.
- ✅ Credit the data source.
- ❌ Never scrape personal data.
## 8. When to Stop?
Switch to an API if:
- The site offers one (e.g., Twitter, Reddit).
- You need real-time data.
- Scraping violates terms.
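For comparison, an official API usually returns structured JSON in one request, with no HTML parsing; a sketch against a hypothetical endpoint (`api.example.com` and its parameters are placeholders, not a real service):

```python
import requests

# Hypothetical endpoint; substitute the site's documented API
response = requests.get(
    "https://api.example.com/v1/books",
    params={"page": 1},
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # if a key is required
)
books = response.json()  # already-structured data
```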
## Final Output Example

`books.csv`:
| Title                | Price  |
|----------------------|--------|
| A Light in the Attic | £51.77 |
| Tipping the Velvet   | £53.74 |
## Resources

- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- Scraping Sandbox: [books.toscrape.com](http://books.toscrape.com/)
- Legal Guide to Web Scraping
Practice responsibly and happy scraping! 🚀