Affiliate disclosure: We earn a commission if you sign up through our links. This does not influence our test results or editorial scores. Full disclosure →

Web Scraping with Python (2026): from 'I'll write it' to 'I'll buy an API'

Last reviewed: 2026-01-20 · 14 min read · WebScrapingTool.net

The query “web scraping with python” gets 201,000 searches per month in 2026. That’s down 63% year-on-year, and the reason for the drop is the same reason you should read the second half of this article before writing a single line of code.

This guide covers:

  1. Raw requests + BeautifulSoup — when it works, and it does work for a lot of use cases
  2. When you need Playwright — JavaScript rendering, SPAs, login flows
  3. Where the DIY wall is — the specific point at which maintaining a custom scraper costs more than buying an API
  4. What to buy instead — named tools with named prices

If you’re here because you Googled “web scraping python 2026” and want the tutorial: here it is. If you’re here because your scraper broke and Cloudflare is blocking you: skip to Where DIY hits the wall.

Part 1: requests + BeautifulSoup — the working baseline

For non-protected HTML pages — static marketing sites, public government datasets, Wikipedia, blog posts — requests + BeautifulSoup still works reliably in 2026. Here’s the full working pattern:

BeautifulSoup — basic scraping pattern
import time

import requests
from bs4 import BeautifulSoup

# Browser-like headers; some servers reject the default python-requests User-Agent
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}

def text_or_none(node):
    """Return the stripped text of a tag, or None when the tag is missing."""
    return node.get_text(strip=True) if node else None

def scrape_product_page(url: str) -> dict:
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error page

    soup = BeautifulSoup(response.text, 'html.parser')

    return {
        'title': text_or_none(soup.find('h1')),
        'price': text_or_none(soup.find(class_='price')),
        'description': text_or_none(soup.find(class_='description')),
    }

# Placeholder list: replace with the pages you want to scrape
product_urls = ['https://example.com/product']

# Polite crawling: sleep between requests
for url in product_urls:
    data = scrape_product_page(url)
    print(data)
    time.sleep(1)  # Respect the server

Install (lxml is an optional faster parser; the code above uses the built-in html.parser):

Install dependencies
pip install requests beautifulsoup4 lxml

This pattern works well for:

  • Static HTML pages (no JavaScript rendering required)
  • Sites without aggressive anti-bot detection
  • Low-volume scraping (hundreds to low thousands of requests)
  • Internal tools, one-off datasets, research scraping

It fails — reliably and predictably — when:

  • The page requires JavaScript to render content
  • The site uses Cloudflare, DataDome, PerimeterX, or similar anti-bot detection
  • You’re hitting the site faster than it allows per IP
  • The target site checks for residential IP origin
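
Of these failure modes, per-IP rate limiting is the one you can often handle yourself. The sketch below retries with exponential backoff when the server answers 429 or 503. The names backoff_delay and fetch_with_backoff are hypothetical, and fetch is assumed to be any callable you supply that returns a (status_code, body) pair, for example a thin wrapper around requests.get. Jitter is left out to keep the delays predictable; adding some is usually a good idea in production.

```python
import time
from typing import Callable, Tuple

# Status codes that usually mean "slow down", not "go away"
RETRYABLE = {429, 503}

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2**attempt seconds, never more than cap."""
    return min(cap, base * (2 ** attempt))

def fetch_with_backoff(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch(url), waiting longer after each rate-limited response."""
    status, body = 0, ''
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    # Still rate-limited after max_attempts: hand the last response back
    return status, body
```

Backoff only helps with rate limits. A 403 from an anti-bot system will not clear up by waiting, which is why 403 is not in the retryable set.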

Part 2: when you need Playwright (or Puppeteer/Selenium)

JavaScript-heavy sites render content client-side. The HTML returned by requests is only the shell: the product data, prices, and reviews are injected by JavaScript after page load. Run BeautifulSoup on that shell and the selectors for those elements match nothing.
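
A quick way to tell whether you are looking at a shell is to check the raw HTML for the element you want before reaching for a browser. The sketch below does that with only the standard library. The helper names ClassFinder and needs_js_rendering are hypothetical, and the class name product-price is just an illustration; treat the result as a heuristic, not a guarantee.

```python
from html.parser import HTMLParser

class ClassFinder(HTMLParser):
    """Records whether any start tag carries the given CSS class."""

    def __init__(self, target_class: str):
        super().__init__()
        self.target_class = target_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get('class') or '').split()
        if self.target_class in classes:
            self.found = True

def needs_js_rendering(raw_html: str, target_class: str) -> bool:
    """True when the target element is absent from the un-rendered HTML."""
    finder = ClassFinder(target_class)
    finder.feed(raw_html)
    return not finder.found

shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
print(needs_js_rendering(shell, 'product-price'))
```

If the check comes back True for a page where the element is visible in your browser, the content is almost certainly rendered client-side.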

The fix is headless browser automation: running a real browser engine (Chromium, in the example below) that executes JavaScript before you extract the HTML.

Playwright — JavaScript rendering
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def scrape_spa_page(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)

            # Wait for the content element to appear (raises a timeout error after 5 s)
            page.wait_for_selector('.product-price', timeout=5000)

            return page.content()
        finally:
            browser.close()  # Close the browser even if the wait times out

# Now parse with BeautifulSoup
html = scrape_spa_page('https://example.com/product')
soup = BeautifulSoup(html, 'html.parser')

Install Playwright:

Install Playwright
pip install playwright
playwright install chromium

Playwright handles SPAs, infinite scroll, and login flows. For many use cases — internal tools, competitive research on non-protected sites — this is sufficient.

The problem: Playwright in a headless environment has a detectable fingerprint. Headless Chromium behaves subtly differently from real Chrome — in WebGL rendering, in canvas output, in the presence of certain JavaScript APIs. Anti-bot systems detect it.

Where DIY hits the wall

If you used to Google "node js web scraper", you're in the right place. The market shifted. DIY-tutorial keywords are down 63–88% YoY — not because demand died, but because ChatGPT writes the scraper and Cloudflare Turnstile breaks it three days later. The solo developer buying ScraperAPI at $49/mo is not replacing a Scrapy pipeline — they're replacing a weekend they don't have. Here's where the wall is, and what to buy instead.

Find the right API in 60 seconds →

The specific point at which DIY scraping breaks down:

The target site deploys Cloudflare Turnstile or DataDome. These systems run JavaScript continuously during the session, analyzing mouse movement, scroll speed, canvas rendering, TLS fingerprint, and hundreds of other signals. A headless Playwright instance, even with playwright-stealth, achieves 30–60% success on well-deployed DataDome. The remaining 40–70% of requests get blocked with a 403 or a CAPTCHA that playwright-stealth can’t solve.

The maintenance cycle. Anti-bot vendors update their detection models. When they push an update, your scraper breaks. Typical recovery time: 2–5 days of debugging. For a solo developer or small team, this happens 4–8 times per year on hard targets. That’s 8–40 engineer-days per year maintaining a scraper — at $400–800/day, that’s $3,200–$32,000 per year in opportunity cost.
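
To make that range concrete, here is the arithmetic as a small sketch. The function name maintenance_cost is hypothetical; the inputs are the figures quoted in the paragraph above, and the $49/mo API price is the one used throughout this article.

```python
def maintenance_cost(days_per_incident: int, incidents_per_year: int, day_rate: int) -> int:
    """Yearly opportunity cost of fixing a broken scraper, in dollars."""
    return days_per_incident * incidents_per_year * day_rate

# Best case: 2 days per break, 4 breaks a year, $400/day
low = maintenance_cost(2, 4, 400)
# Worst case: 5 days per break, 8 breaks a year, $800/day
high = maintenance_cost(5, 8, 800)
# A $49/mo API plan over the same year
api_yearly = 49 * 12

print(f"DIY maintenance: ${low:,} to ${high:,} per year")
print(f"API plan:        ${api_yearly:,} per year")
```

Swap in your own day rate and breakage frequency; the comparison is only as good as those estimates.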

The math at $49/mo: ScraperAPI Hobby tier at $49/mo with 20K credits covers approximately 4,000 protected-site requests (5 credits each). If your workload is under 4K protected requests per month, ScraperAPI at $49/mo is cheaper than the engineer-time to maintain a custom solution.

At 50K protected requests per month: Zyte at ~$240/mo with 94.3% success vs a custom Playwright stack at ~$300/mo in proxy costs + maintenance time, at 50–70% success. The API wins on both cost and reliability.
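
One way to compare the two options is cost per successful request instead of sticker price. The sketch below uses only the figures quoted above; cost_per_success is a hypothetical helper, and the DIY number covers proxy costs alone, so it understates the real cost by leaving out maintenance time.

```python
def cost_per_success(monthly_cost: float, requests_sent: int, success_rate: float) -> float:
    """Dollars paid per request that actually returned usable data."""
    return monthly_cost / (requests_sent * success_rate)

REQUESTS = 50_000

zyte = cost_per_success(240, REQUESTS, 0.943)
diy_best = cost_per_success(300, REQUESTS, 0.70)   # proxies only, best-case success
diy_worst = cost_per_success(300, REQUESTS, 0.50)  # proxies only, worst-case success

for label, cost in [('Zyte', zyte), ('DIY best', diy_best), ('DIY worst', diy_worst)]:
    print(f"{label:<10} ${cost * 1000:.2f} per 1,000 successful requests")
```

Failed requests still cost proxy bandwidth, which is why the gap widens as the success rate drops.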

The migration path: from DIY to API

If you have a working requests scraper:

Migration: requests → ScraperAPI (one function change)
import requests
from bs4 import BeautifulSoup

API_KEY = 'YOUR_SCRAPERAPI_KEY'

def scrape_with_api(url: str) -> str:
    """Replaces your existing requests.get() call."""
    params = {
        'api_key': API_KEY,
        'url': url,
        # Add render=true if the page needs JavaScript
        # Add premium=true for Cloudflare/DataDome protected targets
    }
    # Generous timeout: proxied and rendered requests can be slow
    response = requests.get('https://api.scraperapi.com', params=params, timeout=70)
    response.raise_for_status()  # Fail loudly instead of parsing an error page
    return response.text

# Your parsing code stays exactly the same
html = scrape_with_api('https://example.com/product')
soup = BeautifulSoup(html, 'html.parser')

One function change. Your BeautifulSoup parsing code is unchanged. The API handles the proxy rotation, CAPTCHA solving, and anti-bot bypass.
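
If you switch the optional flags on and off per target, it can help to build the query parameters in one place. This is a minimal sketch: build_api_params is a hypothetical helper, and the render and premium parameter names are taken from the comments in the migration example above, so check them against the provider's current documentation before relying on them.

```python
from urllib.parse import urlencode

API_ENDPOINT = 'https://api.scraperapi.com'

def build_api_params(url: str, api_key: str, render: bool = False, premium: bool = False) -> dict:
    """Build the query parameters for one proxied request."""
    params = {'api_key': api_key, 'url': url}
    if render:
        params['render'] = 'true'    # page needs JavaScript
    if premium:
        params['premium'] = 'true'   # Cloudflare/DataDome protected target
    return params

params = build_api_params('https://example.com/product', 'YOUR_SCRAPERAPI_KEY', render=True)
# The target URL is percent-encoded so it survives as a single query value
print(f"{API_ENDPOINT}?{urlencode(params)}")
```

Passing the same dict as params= to requests.get produces the same encoding, so you rarely need to build the URL string by hand.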

If you have a Scrapy pipeline, the migration to Zyte Cloud is a deploy command — not a rewrite.

Tool recommendations by use case

Use case | Tool | Price | Why
Non-protected HTML, dev/testing | requests + BeautifulSoup | Free | No anti-bot to bypass
JavaScript rendering, non-protected | Playwright | Free | Headless browser handles JS
Protected targets under $100/mo | ScraperAPI | $49/mo | Proxy + CAPTCHA managed, simple integration
No-code, any target | Apify | $49/mo | Actor marketplace, no code needed
Enterprise / compliance | Zyte | $450/mo+ | 94% success, DPA available
SERP data | Bright Data SERP API | $3/1K | Cheapest SERP option tested

Part 3: Scrapy — for large-scale pipelines

For teams scraping millions of pages across dozens of domains, Scrapy is the production-grade Python scraping framework:

Scrapy spider — basic structure
import scrapy

class ProductSpider(scrapy.Spider):
  name = 'products'
  start_urls = ['https://example.com/products']

  def parse(self, response):
      for product in response.css('.product'):
          yield {
              'title': product.css('h2::text').get(),
              'price': product.css('.price::text').get(),
              'url': product.css('a::attr(href)').get(),
          }
      
      # Follow pagination
      next_page = response.css('a.next-page::attr(href)').get()
      if next_page:
          yield response.follow(next_page, self.parse)

Scrapy handles concurrency, request queuing, retry logic, and output pipelines (CSV, JSON, database) out of the box. For teams that are Scrapy-native, Zyte’s Scrapy Cloud is the natural hosting layer — deploy your existing spider with one command and get managed scheduling, anti-bot bypass, and monitoring.
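
Most of that built-in behavior is driven by settings. As a rough illustration, the dictionary below collects a few standard Scrapy settings you might assign to a spider's custom_settings attribute. The values are assumptions picked for a polite crawl, not recommendations from this article's tests, and products.json is a placeholder output path.

```python
# Assumed starting values for a polite crawl; tune per target site
CUSTOM_SETTINGS = {
    'ROBOTSTXT_OBEY': True,          # honor robots.txt
    'CONCURRENT_REQUESTS': 16,       # requests in flight across all domains
    'DOWNLOAD_DELAY': 1.0,           # seconds between requests to the same site
    'AUTOTHROTTLE_ENABLED': True,    # adapt the delay to server response times
    'RETRY_TIMES': 3,                # retries per failed request
    'FEEDS': {
        'products.json': {'format': 'json'},  # export scraped items as JSON
    },
}

for name, value in CUSTOM_SETTINGS.items():
    print(f"{name} = {value!r}")
```

In the spider above this would become custom_settings = CUSTOM_SETTINGS inside the ProductSpider class body.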

Summary

  1. requests + BeautifulSoup works for non-protected HTML at low-to-medium volume. Start here.
  2. Playwright/Puppeteer works for JavaScript-heavy sites that aren’t aggressively protected. Required for SPAs.
  3. DIY breaks down on Cloudflare Turnstile, DataDome, and Akamai Bot Manager. Once maintenance eats more than ~$50/mo of engineering time, the API is the cheaper option.
  4. The API migration is one function call if you’re on requests. Your parsing code is unchanged.
  5. Choose the API based on your budget, target type, and compliance requirements — the decision wizard does this in 60 seconds.

The 63% decline in “web scraping with python” searches isn’t demand dying. It’s buyers migrating from DIY scrapers to managed APIs. This article is the bridge.

Recommended APIs — skip the DIY wall

ScraperAPI · 8.5/10 · Indie devs with a deadline · From $49/mo

Apify · 8.8/10 · No-code teams · From $49/mo

Zyte · 9/10 · Enterprise + compliance · From $450/mo

Go deeper