Scrapy Playwright is the fastest way I know to keep Scrapy’s crawl speed while still extracting data from JavaScript-heavy pages—without rewriting your whole spider into a browser automation script. It’s not magic, though: you’re paying a CPU/RAM tax for a real browser, so the win is selective rendering, not “turn Playwright on everywhere.”
Scrapy Playwright review
I use Scrapy Playwright when plain Scrapy returns “perfectly valid HTML” that’s basically empty—because the real content shows up only after JavaScript runs, a selector appears, or a button/scroll event fires. The big upside is it plugs into Scrapy as a download handler, so my scheduling, pipelines, and item processing stay Scrapy-native while Playwright renders only the requests I explicitly flag. The downside is predictable: browsers are heavier than HTTP, and misuse (especially forgetting to close pages) can freeze a crawl in ways that feel like ghost bugs.
Features and benefits of Scrapy Playwright
Scrapy Playwright behaves like a Scrapy download handler that performs certain requests using Playwright for Python, so I can handle JS-required pages “as seen by the browser” while keeping Scrapy’s normal workflow intact. I enable it per-request with meta={"playwright": True}, which prevents the “browser everywhere” slowdown and keeps my static pages on the normal Scrapy downloader.
Key benefits I actually care about in production:
- Selective JavaScript rendering using the playwright meta key, instead of rewriting spiders.
- Browser selection via PLAYWRIGHT_BROWSER_TYPE (chromium/firefox/webkit), which matters when sites behave differently across engines.
- Page interactions before parsing using playwright_page_methods and PageMethod(…) (click, wait, evaluate, screenshot, etc.).
- Multi-session isolation using browser contexts (PLAYWRIGHT_CONTEXTS + playwright_context meta) when I need separate cookies/storage.
- Safety knobs like PLAYWRIGHT_ABORT_REQUEST to block wasteful resources (images/media) to speed up dynamic web scraping.
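For example, switching engines is a one-line settings change (a minimal sketch; "firefox" here is just an illustration):

PLAYWRIGHT_BROWSER_TYPE = "firefox"  # "chromium" is the default; "webkit" is the third option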
How Scrapy Playwright works
Below is the “real-world” flow I follow when wiring Scrapy Playwright into a spider.
Install the integration and browser binaries.
- pip install scrapy-playwright (package install)
- playwright install (download browser engines)
Activate Scrapy Playwright as a download handler in Scrapy settings.
- Set DOWNLOAD_HANDLERS["http"] and DOWNLOAD_HANDLERS["https"] to scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler.
Ensure the asyncio-based Twisted reactor is enabled.
- Use TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor" (this is the default in projects generated with Scrapy 2.7+); the settings sketch below covers both pieces.
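Together, those two steps amount to a few lines in settings.py (a minimal sketch of the standard setup):

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# Playwright runs on asyncio, so Scrapy needs the asyncio-based Twisted reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"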
Mark only the requests that truly need browser rendering.
- Use scrapy.Request(url, meta={"playwright": True}) so Scrapy Playwright only runs where needed.
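In a spider that looks roughly like this (a sketch; the URLs are placeholders):

import scrapy


class MixedSpider(scrapy.Spider):
    name = "mixed"

    def start_requests(self):
        # Static page: stays on the regular HTTP downloader, no browser cost
        yield scrapy.Request("https://example.com/", callback=self.parse)
        # JS-heavy page: explicitly routed through Playwright
        yield scrapy.Request(
            "https://example.com/js-page",
            meta={"playwright": True},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Got %d bytes from %s", len(response.body), response.url)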
If the page is dynamic, add deterministic waits/actions before parsing.
- Add playwright_page_methods with PageMethod(“wait_for_selector”, “…”) or PageMethod(“evaluate”, “…”) to scroll/click/trigger loading.
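A typical combination waits for the real content, scrolls to trigger lazy loading, then waits again (a sketch; the selectors and scroll script are illustrative):

from scrapy_playwright.page import PageMethod

meta = {
    "playwright": True,
    "playwright_page_methods": [
        # Don't parse until the content container actually exists
        PageMethod("wait_for_selector", "div.listing"),
        # Scroll to the bottom to trigger lazy-loaded items
        PageMethod("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
        # Wait for the items the scroll just loaded
        PageMethod("wait_for_selector", "div.listing .item:nth-child(20)"),
    ],
}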
Decide if you need the live Page object in callbacks.
- If yes, set "playwright_include_page": True in the request meta and access response.meta["playwright_page"] (the callback must be an async def so you can await the page's methods).
- If no, skip it—pages close automatically and life is simpler.
If you do include the Page, close it aggressively (or your crawl can stall).
- Open pages count toward PLAYWRIGHT_MAX_PAGES_PER_CONTEXT, and leaving pages unclosed can make the spider job “get stuck.”
- Use an errback that closes the page even on failures.
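The pattern I follow mirrors the README: make the callback a coroutine, close the page as soon as I'm done with it, and close it in the errback too (a sketch; the URL is a placeholder):

import scrapy


class PageSpider(scrapy.Spider):
    name = "page_spider"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/js-page",
            meta={"playwright": True, "playwright_include_page": True},
            callback=self.parse,
            errback=self.errback_close_page,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        title = await page.title()  # use the live Page while you have it
        await page.close()          # close as soon as you're done
        yield {"url": response.url, "title": title}

    async def errback_close_page(self, failure):
        # Without this, failed requests leak pages and eat into PLAYWRIGHT_MAX_PAGES_PER_CONTEXT
        page = failure.request.meta["playwright_page"]
        await page.close()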
Use contexts for login/session separation.
- Predefine contexts with PLAYWRIGHT_CONTEXTS and pick one using meta[“playwright_context”].
- Or create contexts on the fly using playwright_context_kwargs if the named context doesn’t exist yet.
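Both approaches in one sketch (the context names, storage file, and URLs are placeholders):

# settings.py: predefine named browser contexts
PLAYWRIGHT_CONTEXTS = {
    "default": {},
    "logged_in": {
        "storage_state": "auth_state.json",  # previously saved cookies/localStorage
    },
}

# In the spider: pick a context per request, or create one on the fly
import scrapy


class SessionSpider(scrapy.Spider):
    name = "sessions"

    def start_requests(self):
        # Reuse the predefined "logged_in" context
        yield scrapy.Request(
            "https://example.com/account",
            meta={"playwright": True, "playwright_context": "logged_in"},
        )
        # Create a "fresh" context on the fly, since it isn't predefined
        yield scrapy.Request(
            "https://example.com/other",
            meta={
                "playwright": True,
                "playwright_context": "fresh",
                "playwright_context_kwargs": {"java_script_enabled": False},
            },
        )

    def parse(self, response):
        self.logger.info("Parsed %s", response.url)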
Tune performance instead of guessing.
- Limit concurrency per context via PLAYWRIGHT_MAX_PAGES_PER_CONTEXT.
- Abort heavy resources via PLAYWRIGHT_ABORT_REQUEST to reduce bandwidth and speed up dynamic web scraping.
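A sketch of both knobs in settings.py (which resource types to abort is a per-site judgement call):

# settings.py
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4  # cap concurrent pages per browser context


def should_abort_request(request):
    # Drop images, media, and fonts the crawl never looks at
    return request.resource_type in ("image", "media", "font")


PLAYWRIGHT_ABORT_REQUEST = should_abort_request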
Debug like an adult: capture state when selectors don’t match.
- Use a PageMethod(“screenshot”, …) action or an explicit screenshot flow to prove what the browser rendered.
In your scrapy.Request meta:

meta={
    "playwright": True,
    "playwright_page_methods": [
        PageMethod("screenshot", path="debug_render.png", full_page=True)
    ]
}

Full working example in a spider:
import scrapy
from scrapy_playwright.page import PageMethod


class DebugSpider(scrapy.Spider):
    name = "debug_render"

    def start_requests(self):
        yield scrapy.Request(
            url="https://quotes.toscrape.com/js/",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    # Wait for JS to load quotes
                    PageMethod("wait_for_selector", "div.quote"),
                    # Screenshot *after* content loads (this is the key)
                    PageMethod("screenshot",
                               path="debug_quotes_rendered.png",
                               full_page=True),
                ],
            },
            callback=self.parse,
        )

    def parse(self, response):
        # The screenshot is now saved in the project root as debug_quotes_rendered.png
        self.logger.info("Debug screenshot saved: debug_quotes_rendered.png")
        # Now extract data normally
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

If you want the canonical docs (the thing I trust most when APIs shift), use the official README.
Where I got stuck (limitations)
My most common failure mode with Scrapy Playwright is thinking “I just need the Page object” and enabling playwright_include_page=True, then forgetting that unclosed pages count against PLAYWRIGHT_MAX_PAGES_PER_CONTEXT—and the crawl can freeze once the limit is reached. Another messy limitation is proxies: there’s explicitly “no per-request proxy support,” so I have to think in terms of browser/context-level proxy configuration instead of Scrapy-style per-request proxy rotation. On Windows, I’ve also had to respect the separate-thread event loop approach because Playwright can’t run in the same asyncio loop as Scrapy’s Twisted reactor there, which adds operational complexity.
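The workaround is to put the proxy where Playwright expects it, at browser launch or per context (a sketch; the proxy endpoints and credentials are placeholders):

# settings.py: one proxy for the whole browser
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "proxy": {
        "server": "http://myproxy.example.com:3128",
        "username": "user",
        "password": "pass",
    },
}

# or per context, when different sessions need different exits
PLAYWRIGHT_CONTEXTS = {
    "via_proxy_a": {
        "proxy": {"server": "http://proxy-a.example.com:3128"},
    },
}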
Final remarks on Scrapy Playwright
When I’m doing dynamic web scraping at scale, Scrapy Playwright is worth it only when I treat Playwright as a scalpel: render just the hard pages, wait for deterministic selectors, and close pages/contexts like my crawl depends on it—because it does. If the target data comes from an API call, I still prefer hitting the API directly and keeping Scrapy Playwright as a fallback for truly browser-only flows.