Affiliate disclosure: We earn a commission if you sign up through our links. This does not influence our test results or editorial scores.

Web Scraping with Python (2026): from 'I'll write it' to 'I'll buy an API'

Last reviewed: 2026-01-20 · 14 min read · WebScrapingTool.net

The search term “web scraping with python” gets 201,000 searches per month in 2026. That’s down 63% year-on-year, and the reason for the drop is the same reason you should read the second half of this article before writing a single line of code.

This guide covers:

Raw requests + BeautifulSoup — when it works, and it does work for a lot of use cases
When you need Playwright — JavaScript rendering, SPAs, login flows
Where the DIY wall is — the specific point at which maintaining a custom scraper costs more than buying an API
What to buy instead — named tools with named prices

If you’re here because you Googled “web scraping python 2026” and want the tutorial: here it is. If you’re here because your scraper broke and Cloudflare is blocking you: skip to Where DIY hits the wall.

Part 1: requests + BeautifulSoup — the working baseline

For non-protected HTML pages — static marketing sites, public government datasets, Wikipedia, blog posts — requests + BeautifulSoup still works reliably in 2026. Here’s the full working pattern:

BeautifulSoup — basic scraping pattern

import time

import requests
from bs4 import BeautifulSoup

def text_or_none(node):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return node.get_text(strip=True) if node else None

def scrape_product_page(url: str) -> dict:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
    }

    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error page

    soup = BeautifulSoup(response.text, 'html.parser')

    # Each lookup runs once; missing elements become None
    return {
        'title': text_or_none(soup.find('h1')),
        'price': text_or_none(soup.find(class_='price')),
        'description': text_or_none(soup.find(class_='description')),
    }

# Placeholder list: replace with the pages you want to scrape
product_urls = ['https://example.com/product']

# Polite crawling: sleep between requests
for url in product_urls:
    data = scrape_product_page(url)
    print(data)
    time.sleep(1)  # Respect the server

Install (lxml is an optional, faster parser; the example above uses the built-in html.parser):

Install dependencies

pip install requests beautifulsoup4 lxml

This pattern works well for:

Static HTML pages (no JavaScript rendering required)
Sites without aggressive anti-bot detection
Low-volume scraping (hundreds to low thousands of requests)
Internal tools, one-off datasets, research scraping

It fails — reliably and predictably — when:

The page requires JavaScript to render content
The site uses Cloudflare, DataDome, PerimeterX, or similar anti-bot detection
You’re hitting the site faster than it allows per IP
The target site checks for residential IP origin
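Of the failure modes listed above, per-IP rate limiting is the one that can often be handled in plain Python. The following is a minimal sketch, assuming the server signals throttling with HTTP 429 and, optionally, a Retry-After header given in whole seconds. The function name `backoff_delay` and its default values are hypothetical, not part of requests.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retrying; `attempt` starts at 0."""
    # Honor the server's Retry-After value when it is given as whole seconds.
    # (The HTTP-date form of Retry-After is not handled in this sketch.)
    if retry_after is not None and retry_after.strip().isdigit():
        return min(float(retry_after), cap)
    # Otherwise back off exponentially: base, 2*base, 4*base, ... up to cap.
    delay = min(cap, base * (2 ** attempt))
    # Add up to 10% random jitter so parallel workers do not retry in lockstep.
    return delay + random.uniform(0, delay * 0.1)
```

In a crawl loop, this could replace a fixed sleep after a 429 response, for example `time.sleep(backoff_delay(attempt, response.headers.get('Retry-After')))`. It does nothing against fingerprint-based blocking; it only keeps a polite scraper under the site's rate limit.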

Part 2: when you need Playwright (or Puppeteer/Selenium)

JavaScript-heavy sites render content client-side. The HTML returned by requests is only the shell; the product data, prices, and reviews are injected by JavaScript after page load. BeautifulSoup can only parse what requests received, so on such a page it finds none of that data.
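One way to tell whether a page falls into this category is to check the raw HTML for the element you expect before reaching for a browser. The sketch below uses the standard-library HTML parser to keep it dependency-free; `needs_js_rendering` and `_ClassFinder` are hypothetical names, and the same check could be written with BeautifulSoup's `find(class_=...)`.

```python
from html.parser import HTMLParser


class _ClassFinder(HTMLParser):
    """Records whether any start tag carries the given CSS class."""

    def __init__(self, class_name: str):
        super().__init__()
        self.class_name = class_name
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get('class') or '').split()
        if self.class_name in classes:
            self.found = True


def needs_js_rendering(raw_html: str, class_name: str) -> bool:
    """True if the expected element is absent from the un-rendered HTML."""
    finder = _ClassFinder(class_name)
    finder.feed(raw_html)
    return not finder.found


shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
static = '<html><body><span class="product-price sale">$9.99</span></body></html>'
print(needs_js_rendering(shell, 'product-price'))   # True
print(needs_js_rendering(static, 'product-price'))  # False
```

If the check returns True for HTML fetched with requests, the content is most likely injected client-side and a headless browser is the next step.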

The fix is headless browser automation: running a real browser engine (Chromium, in the example below) that executes JavaScript before you extract the HTML.

Playwright — JavaScript rendering

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def scrape_spa_page(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            # Wait for the content element to appear
            # (raises a Playwright TimeoutError after 5 seconds)
            page.wait_for_selector('.product-price', timeout=5000)
            return page.content()
        finally:
            # Close the browser even if navigation or the wait fails
            browser.close()

# Now parse with BeautifulSoup
html = scrape_spa_page('https://example.com/product')
soup = BeautifulSoup(html, 'html.parser')

Install Playwright:

Install Playwright

pip install playwright
playwright install chromium

Playwright handles SPAs, infinite scroll, and login flows. For many use cases — internal tools, competitive research on non-protected sites — this is sufficient.

The problem: Playwright in a headless environment has a detectable fingerprint. Headless Chromium behaves subtly differently from real Chrome — in WebGL rendering, in canvas output, in the presence of certain JavaScript APIs. Anti-bot systems detect it.

Where DIY hits the wall

If you used to Google "node js web scraper", you're in the right place. The market shifted. DIY-tutorial keywords are down 63–88% YoY — not because demand died, but because ChatGPT writes the scraper and Cloudflare Turnstile breaks it three days later. The solo developer buying ScraperAPI at $49/mo is not replacing a Scrapy pipeline — they're replacing a weekend they don't have. Here's where the wall is, and what to buy instead.

The specific point at which DIY scraping breaks down:

Target site deploys Cloudflare Turnstile or DataDome. These systems run JavaScript continuously during the session, analyzing mouse movement, scroll speed, canvas rendering, TLS fingerprint, and hundreds of other signals. A headless Playwright instance, even with playwright-stealth, achieves 30–60% success on well-deployed DataDome. The remaining 40–70% of requests get blocked with a 403 or a CAPTCHA that playwright-stealth can’t solve.

The maintenance cycle. Anti-bot vendors update their detection models. When they push an update, your scraper breaks. Typical recovery time: 2–5 days of debugging. For a solo developer or small team, this happens 4–8 times per year on hard targets. That’s 8–40 engineer-days per year maintaining a scraper — at $400–800/day, that’s $3,200–$32,000 per year in opportunity cost.

The math at $49/mo: ScraperAPI Hobby tier at $49/mo with 20K credits covers approximately 4,000 protected-site requests (5 credits each). If your workload is under 4K protected requests per month, ScraperAPI at $49/mo is cheaper than the engineer-time to maintain a custom solution.

At 50K protected requests per month: Zyte at ~$240/mo with 94.3% success vs a custom Playwright stack at ~$300/mo in proxy costs + maintenance time, at 50–70% success. The API wins on both cost and reliability.
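The arithmetic behind the figures above can be reproduced in a few lines. This is a sketch using only the numbers quoted in this section; the function names are hypothetical, and the DIY figure counts proxy spend only, so it understates the real cost by leaving out maintenance time.

```python
def maintenance_cost_range(days_per_break=(2, 5), breaks_per_year=(4, 8),
                           day_rate=(400, 800)):
    """Return ((min_days, max_days), (min_cost, max_cost)) per year."""
    min_days = days_per_break[0] * breaks_per_year[0]
    max_days = days_per_break[1] * breaks_per_year[1]
    return (min_days, max_days), (min_days * day_rate[0], max_days * day_rate[1])


def cost_per_1k_successes(monthly_cost: float, requests_per_month: int,
                          success_rate: float) -> float:
    """Monthly spend divided by successful requests, per 1,000 successes."""
    return monthly_cost / (requests_per_month * success_rate) * 1000


days, dollars = maintenance_cost_range()
print(days, dollars)  # (8, 40) (3200, 32000)

# Hobby tier: 20K credits at 5 credits per protected-site request
protected_requests = 20_000 // 5
print(protected_requests)  # 4000

# 50K protected requests per month, cost per 1,000 successful responses
print(round(cost_per_1k_successes(240, 50_000, 0.943), 2))  # Zyte: 5.09
print(round(cost_per_1k_successes(300, 50_000, 0.50), 2))   # DIY, 50% success: 12.0
print(round(cost_per_1k_successes(300, 50_000, 0.70), 2))   # DIY, 70% success: 8.57
```

Dividing by successful requests rather than total requests is the design choice that matters here: a cheaper stack with a lower success rate can still cost more per usable page.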

The migration path: from DIY to API

If you have a working requests scraper:

Migration: requests → ScraperAPI (one line change)

import requests
from bs4 import BeautifulSoup

API_KEY = 'YOUR_SCRAPERAPI_KEY'

def scrape_with_api(url: str) -> str:
    """Replaces your existing requests.get() call."""
    params = {
        'api_key': API_KEY,
        'url': url,
        # Add 'render': 'true' if the page needs JavaScript
        # Add 'premium': 'true' for Cloudflare/DataDome protected targets
    }
    # Proxied requests are slower than direct ones, so allow a generous
    # timeout (70 seconds is an assumed value; tune it for your targets)
    response = requests.get('https://api.scraperapi.com', params=params, timeout=70)
    response.raise_for_status()  # Surface API errors instead of parsing an error body
    return response.text

# Your parsing code stays exactly the same
html = scrape_with_api('https://example.com/product')
soup = BeautifulSoup(html, 'html.parser')

One function change. Your BeautifulSoup parsing code is unchanged. The API handles the proxy rotation, CAPTCHA solving, and anti-bot bypass.
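If some targets need JavaScript rendering or the premium proxy pool and others do not, the query parameters can be built per request. A small sketch, assuming the `render` and `premium` flags behave as described in the comments of the migration example; `build_api_params` is a hypothetical helper, not part of any SDK.

```python
from urllib.parse import urlencode

API_ENDPOINT = 'https://api.scraperapi.com'


def build_api_params(api_key: str, url: str, render: bool = False,
                     premium: bool = False) -> dict:
    """Build the query parameters for one API request."""
    params = {'api_key': api_key, 'url': url}
    if render:
        params['render'] = 'true'    # Page needs JavaScript
    if premium:
        params['premium'] = 'true'   # Cloudflare/DataDome protected target
    return params


params = build_api_params('YOUR_SCRAPERAPI_KEY',
                          'https://example.com/product', render=True)
# The target URL is percent-encoded into the query string
print(f'{API_ENDPOINT}?{urlencode(params)}')
```

The resulting dict can be passed directly as `params=` to `requests.get`, which performs the same encoding. Keeping the flags off by default means the extra options are only requested for the pages that need them.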

If you have a Scrapy pipeline, the migration to Zyte Cloud is a deploy command — not a rewrite.

Tool recommendations by use case

Use case	Tool	Price	Why
Non-protected HTML, dev/testing	requests + BeautifulSoup	Free	No anti-bot to bypass
JavaScript rendering, non-protected	Playwright	Free	Headless browser handles JS
Protected targets, budget under $100/mo	ScraperAPI	$49/mo	Proxy + CAPTCHA managed, simple integration
No-code, any target	Apify	$49/mo	Actor marketplace, no code needed
Enterprise / compliance	Zyte	$450/mo+	94% success, DPA available
SERP data	Bright Data SERP API	$3/1K	Cheapest SERP option tested

Part 3: Scrapy — for large-scale pipelines

For teams scraping millions of pages across dozens of domains, Scrapy is the production-grade Python scraping framework:

Scrapy spider — basic structure

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'title': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'url': product.css('a::attr(href)').get(),
            }

        # Follow pagination
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Scrapy handles concurrency, request queuing, retry logic, and output pipelines (CSV, JSON, database) out of the box. For teams that are Scrapy-native, Zyte’s Scrapy Cloud is the natural hosting layer — deploy your existing spider with one command and get managed scheduling, anti-bot bypass, and monitoring.
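Most of the built-in behavior listed above is controlled through project settings. The fragment below is a sketch of a settings.py using standard Scrapy setting names; the values are illustrative assumptions, not tuned recommendations.

```python
# settings.py (fragment)
ROBOTSTXT_OBEY = True                 # Respect robots.txt rules
CONCURRENT_REQUESTS_PER_DOMAIN = 4    # Parallel requests allowed per domain
DOWNLOAD_DELAY = 1.0                  # Seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True           # Adapt the delay to observed server latency
RETRY_ENABLED = True
RETRY_TIMES = 3                       # Retries per failed request
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
FEEDS = {
    'products.json': {'format': 'json'},  # Write yielded items to a JSON file
}
```

With a project configured this way, `scrapy crawl products` would run the spider above and write the yielded items to products.json.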

Summary

requests + BeautifulSoup works for non-protected HTML at low-to-medium volume. Start here.
Playwright/Puppeteer works for JavaScript-heavy sites that aren’t aggressively protected. Required for SPAs.
DIY breaks down on Cloudflare Turnstile, DataDome, and Akamai Bot Manager. Once maintaining the scraper costs you more than about $50/mo in engineering time, the API is the cheaper option.
The API migration is one function call if you’re on requests. Your parsing code is unchanged.
Choose the API based on your budget, target type, and compliance requirements — the decision wizard does this in 60 seconds.

The 63% decline in “web scraping with python” searches isn’t demand dying. It’s buyers migrating from DIY scrapers to managed APIs. This article is the bridge.
