Scrapy Playwright: A Powerful Web Scraping and Automation Tool

Scrapy Playwright is the fastest way I know to keep Scrapy’s crawl speed while still extracting data from JavaScript-heavy pages, without rewriting your whole spider into a browser automation script. It’s not magic, though: you’re paying a CPU/RAM tax for a real browser, so the win is selective rendering, not “turn Playwright on everywhere.”

Scrapy Playwright review

I use Scrapy Playwright when plain Scrapy returns “perfectly valid HTML” that’s basically empty, because the real content shows up only after JavaScript runs, a selector appears, or a button/scroll event fires. The big upside is it plugs into Scrapy as a download handler, so my scheduling, pipelines, and item processing stay Scrapy-native while Playwright renders only the requests I explicitly flag. The downside is predictable: browsers are heavier than HTTP, and misuse (especially forgetting to close pages) can freeze a crawl in ways that feel like ghost bugs.

Features and benefits of Scrapy Playwright

Scrapy Playwright is a Scrapy download handler that performs certain requests using Playwright for Python, so I can handle JS-required pages “as seen by the browser” while keeping Scrapy’s normal workflow intact. I enable it per-request with meta={"playwright": True}, which prevents the “browser everywhere” slowdown and keeps my static pages on the normal Scrapy downloader.

Key benefits I actually care about in production:

  • Selective JavaScript rendering using the playwright meta key, instead of rewriting spiders.
  • Browser selection via PLAYWRIGHT_BROWSER_TYPE (chromium/firefox/webkit), which matters when sites behave differently across engines (see the one-liner after this list).
  • Page interactions before parsing using playwright_page_methods and PageMethod(…) (click, wait, evaluate, screenshot, etc.).
  • Multi-session isolation using browser contexts (PLAYWRIGHT_CONTEXTS + playwright_context meta) when I need separate cookies/storage.
  • Safety knobs like PLAYWRIGHT_ABORT_REQUEST to block wasteful resources (images/media) and speed up dynamic web scraping.
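
For instance, switching engines is a one-line setting (a minimal sketch; chromium is the default):

    # settings.py
    PLAYWRIGHT_BROWSER_TYPE = "firefox"  # or "chromium" (default) / "webkit"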

How Scrapy Playwright works

Below is the “real-world” flow I follow when wiring Scrapy Playwright into a spider.

Install the integration and browser binaries.

  • pip install scrapy-playwright (package install)
  • playwright install (download browser engines)

Activate Scrapy Playwright as a download handler in Scrapy settings.

  • Set DOWNLOAD_HANDLERS["http"] and DOWNLOAD_HANDLERS["https"] to scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler.

Ensure the asyncio-based Twisted reactor is enabled.

  • Use TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor" (this is the default in new Scrapy projects since 2.7).
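
Together, those two steps look like this in settings.py (the handler path and reactor string come straight from the official README):

    # settings.py
    DOWNLOAD_HANDLERS = {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    }
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"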

Mark only the requests that truly need browser rendering.

  • Use scrapy.Request(url, meta={"playwright": True}) so Scrapy Playwright only runs where needed.
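
A minimal sketch of the split (the example.com URLs are placeholders):

    import scrapy

    class MixedSpider(scrapy.Spider):
        name = "mixed"

        def start_requests(self):
            # Static page: stays on the plain Scrapy downloader
            yield scrapy.Request("https://example.com/static")
            # JS-heavy page: rendered by Playwright
            yield scrapy.Request(
                "https://example.com/dynamic",
                meta={"playwright": True},
            )

        def parse(self, response):
            yield {"title": response.css("title::text").get()}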

If the page is dynamic, add deterministic waits/actions before parsing.

  • Add playwright_page_methods with PageMethod("wait_for_selector", "…") or PageMethod("evaluate", "…") to scroll/click/trigger loading.
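
A sketch of a deterministic wait plus a scroll trigger (the div.product-card selector is an assumption; use whatever your target actually renders):

    from scrapy_playwright.page import PageMethod

    meta = {
        "playwright": True,
        "playwright_page_methods": [
            # Block until the content we parse actually exists in the DOM
            PageMethod("wait_for_selector", "div.product-card"),
            # Scroll to the bottom to trigger lazy-loaded content
            PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
        ],
    }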

Decide if you need the live Page object in callbacks.

  • If yes, set playwright_include_page=True and then access response.meta["playwright_page"].
  • If no, skip it: pages close automatically and life is simpler.
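
If I do need the live page, the callback has to be async and I close the page myself; a minimal sketch (the URL is a placeholder):

    import scrapy

    class PageSpider(scrapy.Spider):
        name = "page_spider"

        def start_requests(self):
            yield scrapy.Request(
                "https://example.com",  # placeholder URL
                meta={"playwright": True, "playwright_include_page": True},
            )

        async def parse(self, response):
            page = response.meta["playwright_page"]
            title = await page.title()
            await page.close()  # close as soon as you're done with it
            yield {"title": title}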

If you do include the Page, close it aggressively (or your crawl can stall).

  • Open pages count toward PLAYWRIGHT_MAX_PAGES_PER_CONTEXT, and leaving pages unclosed can make the spider job “get stuck.”
  • Use an errback that closes the page even on failures.
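
The errback pattern from the README, which closes the page even when the request fails:

    import scrapy

    class SafeSpider(scrapy.Spider):
        name = "safe"

        def start_requests(self):
            yield scrapy.Request(
                "https://example.com",  # placeholder URL
                meta={"playwright": True, "playwright_include_page": True},
                errback=self.errback_close_page,
            )

        async def parse(self, response):
            page = response.meta["playwright_page"]
            await page.close()

        async def errback_close_page(self, failure):
            # A failed request still opened a page; close it so it
            # doesn't count against PLAYWRIGHT_MAX_PAGES_PER_CONTEXT
            page = failure.request.meta["playwright_page"]
            await page.close()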

Use contexts to separate login and session.

  • Predefine contexts with PLAYWRIGHT_CONTEXTS and pick one using meta["playwright_context"].
  • Or create contexts on the fly using playwright_context_kwargs if the named context doesn’t exist yet.
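
A sketch of two isolated sessions (the context names and storage-state file are assumptions):

    # settings.py: each value is passed as kwargs to Playwright's browser.new_context()
    PLAYWRIGHT_CONTEXTS = {
        "anonymous": {},
        "logged_in": {
            "storage_state": "auth_state.json",  # assumed pre-saved login state
        },
    }

    # In the spider's start_requests(): pick the context per request
    yield scrapy.Request(
        "https://example.com/account",  # placeholder URL
        meta={"playwright": True, "playwright_context": "logged_in"},
    )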

Tune performance instead of guessing.

  • Limit concurrency per context via PLAYWRIGHT_MAX_PAGES_PER_CONTEXT.
  • Abort heavy resources via PLAYWRIGHT_ABORT_REQUEST to reduce bandwidth and speed up dynamic web scraping.
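
Both knobs live in settings.py; the abort predicate mirrors the README's pattern, and which resource types you block is your call:

    # settings.py
    PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4  # assumed budget for this example

    def should_abort_request(request):
        # Skip resource types the spider never parses
        return request.resource_type in ("image", "media")

    PLAYWRIGHT_ABORT_REQUEST = should_abort_request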

Debug like an adult: capture state when selectors don’t match.

  • Use a PageMethod("screenshot", …) action or an explicit screenshot flow to prove what the browser rendered. For example, in your scrapy.Request meta:

    meta = {
        "playwright": True,
        "playwright_page_methods": [
            PageMethod("screenshot", path="debug_render.png", full_page=True),
        ],
    }
Full working example in a spider:
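
Below is a minimal sketch stitching the pieces above together, pointed at quotes.toscrape.com's JavaScript demo page (swap in your own target and selectors):

    import scrapy
    from scrapy_playwright.page import PageMethod

    class QuotesJsSpider(scrapy.Spider):
        name = "quotes_js"
        custom_settings = {
            "DOWNLOAD_HANDLERS": {
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        }

        def start_requests(self):
            yield scrapy.Request(
                "http://quotes.toscrape.com/js/",  # JS-rendered demo page
                meta={
                    "playwright": True,
                    "playwright_page_methods": [
                        # Don't parse until the quotes exist in the DOM
                        PageMethod("wait_for_selector", "div.quote"),
                    ],
                },
            )

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }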
If you want the canonical docs (the thing I trust most when APIs shift), use the official README.

Where I got stuck (limitations)

My most common failure mode with Scrapy Playwright is thinking “I just need the Page object” and enabling playwright_include_page=True, then forgetting that unclosed pages count against PLAYWRIGHT_MAX_PAGES_PER_CONTEXT, so the crawl can freeze once the limit is reached. Another messy limitation is proxies: there’s explicitly “no per-request proxy support,” so I have to think in terms of browser- or context-level proxy configuration instead of Scrapy-style per-request proxy rotation. On Windows, I’ve also had to adopt the separate-thread event loop approach, because there Playwright can’t run in the same asyncio loop as Scrapy’s Twisted reactor, which adds operational complexity.
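
In practice that means setting the proxy where the browser or context is created, along these lines (a sketch; the proxy URL is a placeholder):

    # settings.py: browser-level proxy, inherited by every context
    PLAYWRIGHT_LAUNCH_OPTIONS = {
        "proxy": {"server": "http://proxy.example.com:8080"},  # placeholder
    }

    # ...or per named context
    PLAYWRIGHT_CONTEXTS = {
        "proxied": {
            "proxy": {"server": "http://proxy.example.com:8080"},  # placeholder
        },
    }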

Final remarks on Scrapy Playwright

When I’m doing dynamic web scraping at scale, Scrapy Playwright is worth it only when I treat Playwright as a scalpel: render just the hard pages, wait for deterministic selectors, and close pages/contexts like my crawl depends on it, because it does. If the target data comes from an API call, I still prefer to hit the API directly and keep Scrapy Playwright as a fallback for truly browser-only flows.
