<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Crawlee for JavaScript · Build reliable crawlers. Fast. Blog</title>
        <link>https://crawlee.dev/blog</link>
        <description>Crawlee for JavaScript · Build reliable crawlers. Fast. Blog</description>
        <lastBuildDate>Fri, 06 Feb 2026 00:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <item>
            <title><![CDATA[Crawlee v3.16: AI-Powered Crawling with StagehandCrawler]]></title>
            <link>https://crawlee.dev/blog/crawlee-v3-16</link>
            <guid>https://crawlee.dev/blog/crawlee-v3-16</guid>
            <pubDate>Fri, 06 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Crawlee v3.16 introduces StagehandCrawler for AI-powered browser automation, async iterators for Dataset and KeyValueStore, sitemap discovery, and improved Cloudflare handling.]]></description>
            <content:encoded><![CDATA[<p>Crawlee v3.16 is here, and the headline feature is the new <code>StagehandCrawler</code> — an AI-powered crawler that lets you interact with web pages using natural language instead of CSS selectors. On top of that, we've added async iterators for <code>Dataset</code> and <code>KeyValueStore</code>, a new <code>discoverValidSitemaps</code> utility, and made <code>handleCloudflareChallenge</code> more configurable.</p>
<p>Here's what's new:</p>
<ul>
<li class=""><a href="https://crawlee.dev/blog/crawlee-v3-16#stagehandcrawler--ai-powered-browser-automation">StagehandCrawler — AI-powered browser automation</a></li>
<li class=""><a href="https://crawlee.dev/blog/crawlee-v3-16#async-iterators-for-dataset-and-keyvaluestore">Async iterators for Dataset and KeyValueStore</a></li>
<li class=""><a href="https://crawlee.dev/blog/crawlee-v3-16#discovervalidsitemaps-utility">discoverValidSitemaps utility</a></li>
<li class=""><a href="https://crawlee.dev/blog/crawlee-v3-16#improved-cloudflare-challenge-handling">Improved Cloudflare challenge handling</a></li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="stagehandcrawler--ai-powered-browser-automation">StagehandCrawler — AI-powered browser automation</h2>
<p>The new <a href="https://crawlee.dev/js/api/stagehand-crawler"><code>@crawlee/stagehand</code></a> package integrates <a href="https://github.com/browserbase/stagehand" target="_blank" rel="noopener noreferrer">Browserbase's Stagehand</a> with Crawlee's crawling infrastructure. Instead of writing brittle CSS selectors or XPath expressions, you describe what you want in plain English and let the AI figure out the rest.</p>
<p>The enhanced page object provides four AI methods:</p>
<ul>
<li class=""><strong><code>page.act(instruction)</code></strong> — perform actions described in natural language (e.g., "Click the 'Load More' button")</li>
<li class=""><strong><code>page.extract(instruction, schema)</code></strong> — extract structured data from the page using Zod schemas for type safety</li>
<li class=""><strong><code>page.observe()</code></strong> — discover available actions on the current page</li>
<li class=""><strong><code>page.agent(config)</code></strong> — create an autonomous agent for complex multi-step workflows</li>
</ul>
<p>Since <a href="https://crawlee.dev/js/api/stagehand-crawler/class/StagehandCrawler"><code>StagehandCrawler</code></a> extends <a href="https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler"><code>BrowserCrawler</code></a>, you get the standard Crawlee features out of the box: <a href="https://crawlee.dev/js/docs/guides/request-storage">request queues</a>, <a href="https://crawlee.dev/js/docs/guides/proxy-management">proxy rotation</a>, <a href="https://crawlee.dev/js/api/core/class/AutoscaledPool">autoscaling</a>, <a href="https://crawlee.dev/js/docs/guides/session-management">session management</a>, and <a href="https://crawlee.dev/js/docs/guides/avoid-blocking">browser fingerprinting</a>. It isn't a separate tool you have to wire up manually; it's a regular Crawlee crawler whose page object gains the AI methods listed above.</p>
<p>Here's a basic example showing how to interact with a page and extract structured data:</p>
<div class="language-typescript codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-typescript codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> StagehandCrawler </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'@crawlee/stagehand'</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> z </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'zod'</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span 
class="token keyword" style="color:#00009f">new</span><span class="token plain"> </span><span class="token class-name">StagehandCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    stagehandOptions</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        model</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'openai/gpt-4.1-mini'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        apiKey</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'your-api-key'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token comment" style="color:#999988;font-style:italic">// Your OpenAI API key (or use OPENAI_API_KEY env var)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token function" 
style="color:#d73a49">requestHandler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> page</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> request</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> log </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token template-string string" style="color:#e3116c">Processing </span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">${</span><span class="token template-string interpolation">request</span><span class="token template-string interpolation punctuation" style="color:#393A34">.</span><span class="token template-string interpolation">url</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">}</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic">// Use natural language to interact with the page</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">act</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'Click the "Load More" button'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic">// Extract structured data with AI</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">extract</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'Get all product names and 
prices'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            z</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">object</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                products</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> z</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">array</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">z</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">object</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    name</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> z</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">string</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    price</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> z</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token function" style="color:#d73a49">number</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token template-string string" style="color:#e3116c">Found </span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">${</span><span class="token 
template-string interpolation">data</span><span class="token template-string interpolation punctuation" style="color:#393A34">.</span><span class="token template-string interpolation">products</span><span class="token template-string interpolation punctuation" style="color:#393A34">.</span><span class="token template-string interpolation">length</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">}</span><span class="token template-string string" style="color:#e3116c"> products</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" 
style="color:#e3116c">'https://example.com'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
<p>The <code>StagehandCrawler</code> is especially useful for websites with complex or frequently changing layouts where traditional selectors are hard to maintain. If the target website has a stable structure, <a href="https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler"><code>PlaywrightCrawler</code></a> remains the better choice — it's faster and doesn't require AI API keys.</p>
<p><strong>Installation:</strong></p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token function" style="color:#d73a49">npm</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> @crawlee/stagehand @browserbasehq/stagehand</span><br></div></code></pre></div></div>
<p>For a deeper dive into the architecture, all four AI methods, configuration options, and more examples, check out the <a href="https://crawlee.dev/js/docs/guides/stagehand-crawler-guide">StagehandCrawler guide</a>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="async-iterators-for-dataset-and-keyvaluestore">Async iterators for Dataset and KeyValueStore</h2>
<p>Previously, iterating over all items in a <a href="https://crawlee.dev/js/api/core/class/Dataset"><code>Dataset</code></a> or all keys in a <a href="https://crawlee.dev/js/api/core/class/KeyValueStore"><code>KeyValueStore</code></a> required manual pagination with <code>getData()</code> or <code>forEachKey()</code>. This release adds <code>for await...of</code> support, making iteration straightforward and memory-efficient.</p>
<p>Both <code>Dataset</code> and <code>KeyValueStore</code> now support direct iteration as well as <code>values()</code>, <code>entries()</code>, and <code>keys()</code> methods:</p>
<div class="language-typescript codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-typescript codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> Dataset</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> KeyValueStore </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'crawlee'</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic">// Dataset — iterate over all items</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> dataset </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> Dataset</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">open</span><span class="token punctuation" style="color:#393A34">(</span><span class="token 
punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> item </span><span class="token keyword" style="color:#00009f">of</span><span class="token plain"> dataset</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token builtin">console</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">log</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">item</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic">// Or use values()/entries() for more 
control</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">index</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> item</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">of</span><span class="token plain"> dataset</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">entries</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token builtin">console</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">log</span><span class="token punctuation" style="color:#393A34">(</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token template-string string" style="color:#e3116c">Item #</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">${</span><span class="token template-string 
interpolation">index</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">}</span><span class="token template-string string" style="color:#e3116c">:</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> item</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic">// KeyValueStore — iterate over entries</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> kvs </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> KeyValueStore</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">open</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">key</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> value</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">of</span><span class="token plain"> kvs</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token builtin">console</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">log</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">key</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> value</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span 
class="token comment" style="color:#999988;font-style:italic">// Or iterate over just keys or values</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> key </span><span class="token keyword" style="color:#00009f">of</span><span class="token plain"> kvs</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">keys</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token builtin">console</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">log</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">key</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain"></span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> value </span><span class="token keyword" style="color:#00009f">of</span><span class="token plain"> kvs</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">values</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token builtin">console</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">log</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">value</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<p>The iteration handles pagination internally, so you don't have to manage offsets or cursors. Existing code that uses <code>await</code> on <code>listItems()</code> or <code>listKeys()</code> continues to work unchanged, because these methods now return hybrid objects that support both <code>await</code> and <code>for await...of</code>.</p>
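<p>To see how one object can support both <code>await</code> and <code>for await...of</code>, here is a minimal, self-contained sketch. It is not Crawlee's implementation: <code>HybridList</code>, <code>fetchPage</code>, and the in-memory data are hypothetical stand-ins, and the sketch assumes that awaiting the object resolves to a single batch while iterating walks every page.</p>

```typescript
// Hypothetical sketch for illustration; not Crawlee's actual code.
const ALL_ITEMS = ['a', 'b', 'c', 'd', 'e'];

// Stand-in for a paginated storage call that returns one page of results.
async function fetchPage(offset: number, limit: number) {
    return ALL_ITEMS.slice(offset, offset + limit);
}

class HybridList {
    pageSize: number;

    constructor(pageSize: number) {
        this.pageSize = pageSize;
    }

    // A `then` method makes the object awaitable.
    // In this sketch, `await list` resolves to the first page only.
    then(resolve: (items: string[]) => void, reject: (err: unknown) => void) {
        fetchPage(0, this.pageSize).then(resolve, reject);
    }

    // An async iterator makes the object usable in `for await...of`.
    // Pages are fetched lazily and the offset is tracked internally.
    async *[Symbol.asyncIterator]() {
        let offset = 0;
        while (true) {
            const page = await fetchPage(offset, this.pageSize);
            if (page.length === 0) return;
            yield* page;
            offset += page.length;
        }
    }
}

async function demo() {
    // Old style: await the object and get one batch back.
    const firstPage = await new HybridList(2);

    // New style: iterate, and pagination happens behind the scenes.
    const everything: string[] = [];
    for await (const item of new HybridList(2)) everything.push(item);

    return { firstPage, everything };
}
```

<p>The point of the design is backwards compatibility: callers that already <code>await</code> the result keep working, while new code can opt into iteration on the same return value.</p>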
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="discovervalidsitemaps-utility">discoverValidSitemaps utility<a href="https://crawlee.dev/blog/crawlee-v3-16#discovervalidsitemaps-utility" class="hash-link" aria-label="Direct link to discoverValidSitemaps utility" title="Direct link to discoverValidSitemaps utility" translate="no">​</a></h2>
<p>The new <a href="https://crawlee.dev/js/api/utils/function/discoverValidSitemaps"><code>discoverValidSitemaps</code></a> async generator in <code>@crawlee/utils</code> takes a list of URLs and automatically discovers sitemap files for those domains. It checks <code>robots.txt</code> for sitemap declarations, then tries common paths like <code>/sitemap.xml</code>, <code>/sitemap.txt</code>, and <code>/sitemap_index.xml</code>.</p>
<div class="language-typescript codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-typescript codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> discoverValidSitemaps </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'@crawlee/utils'</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> sitemapUrl </span><span class="token keyword" style="color:#00009f">of</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">discoverValidSitemaps</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://example.com'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token builtin">console</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">log</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'Found sitemap:'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> sitemapUrl</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<p>This is handy when you want to seed a crawl from sitemaps without knowing the exact sitemap URL upfront.</p>
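<p>The lookup order described above (sitemaps declared in <code>robots.txt</code> first, then the common paths) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code inside <code>@crawlee/utils</code>: <code>sitemapCandidates</code> is a hypothetical helper that only builds the list of candidate URLs from an already-downloaded <code>robots.txt</code> body, and it skips the network checks that would confirm each sitemap actually exists.</p>

```typescript
// Hypothetical helper for illustration only; not part of @crawlee/utils.
const COMMON_SITEMAP_PATHS = ['/sitemap.xml', '/sitemap.txt', '/sitemap_index.xml'];

function sitemapCandidates(origin: string, robotsTxt: string): string[] {
    // Assumes `origin` looks like "https://example.com"; drop trailing slashes.
    const base = origin.replace(/\/+$/, '');

    // robots.txt declares sitemaps with lines such as "Sitemap: https://...".
    const declared = robotsTxt
        .split('\n')
        .map((line) => line.trim())
        .filter((line) => /^sitemap:/i.test(line))
        .map((line) => line.slice('sitemap:'.length).trim());

    // Fallback guesses at well-known locations.
    const guessed = COMMON_SITEMAP_PATHS.map((path) => base + path);

    // A Set keeps insertion order and drops duplicates,
    // so declared sitemaps stay ahead of the guesses.
    return Array.from(new Set([...declared, ...guessed]));
}
```

<p>A real discovery routine would still need to request each candidate and keep only the ones that respond with a usable sitemap.</p>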
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="improved-cloudflare-challenge-handling">Improved Cloudflare challenge handling<a href="https://crawlee.dev/blog/crawlee-v3-16#improved-cloudflare-challenge-handling" class="hash-link" aria-label="Direct link to Improved Cloudflare challenge handling" title="Direct link to Improved Cloudflare challenge handling" translate="no">​</a></h2>
<p>The <a href="https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils"><code>handleCloudflareChallenge</code></a> helper now accepts configuration callbacks for more control over how Cloudflare challenges are detected and solved. The new options include:</p>
<ul>
<li class=""><strong><code>clickPositionCallback</code></strong> — override how the checkbox click position is calculated</li>
<li class=""><strong><code>clickCallback</code></strong> — override the actual checkbox clicking logic</li>
<li class=""><strong><code>isChallengeCallback</code></strong> — customize detection of Cloudflare challenge pages</li>
<li class=""><strong><code>isBlockedCallback</code></strong> — customize detection of Cloudflare block pages</li>
<li class=""><strong><code>preChallengeSleepSecs</code></strong> — add a delay before the first click attempt (defaults to 1s)</li>
</ul>
<div class="language-typescript codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-typescript codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> PlaywrightCrawler </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'crawlee'</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">new</span><span class="token plain"> </span><span class="token class-name">PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    postNavigationHooks</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> handleCloudflareChallenge </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">handleCloudflareChallenge</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token comment" style="color:#999988;font-style:italic">// Custom click position for environments where the</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token comment" style="color:#999988;font-style:italic">// default detection doesn't work</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token function-variable function" style="color:#d73a49">clickPositionCallback</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" 
style="color:#00009f">async</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> box </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">locator</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'iframe'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">first</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">boundingBox</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> box </span><span class="token 
operator" style="color:#393A34">?</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> x</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> box</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">x </span><span class="token operator" style="color:#393A34">+</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">25</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> y</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> box</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">y </span><span class="token operator" style="color:#393A34">+</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">25</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">null</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                preChallengeSleepSecs</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">2</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic">// ...</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
<p>These options are particularly useful when running in environments where the default checkbox detection needs adjustment.</p>
<hr>
<p>That's a wrap for Crawlee v3.16! For the full list of changes, check out the <a href="https://github.com/apify/crawlee/blob/master/CHANGELOG.md" target="_blank" rel="noopener noreferrer">changelog on GitHub</a>. If you have questions or feedback, <a href="https://github.com/apify/crawlee/discussions" target="_blank" rel="noopener noreferrer">open a GitHub discussion</a> or <a href="https://apify.com/discord" target="_blank" rel="noopener noreferrer">join our Discord community</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Crawlee for Python v1]]></title>
            <link>https://crawlee.dev/blog/crawlee-for-python-v1</link>
            <guid>https://crawlee.dev/blog/crawlee-for-python-v1</guid>
            <pubDate>Mon, 15 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Announcing the Crawlee for Python v1.0 release.]]></description>
            <content:encoded><![CDATA[<p>We launched Crawlee for Python in beta mode in <a href="https://www.crawlee.dev/blog/launching-crawlee-python" target="_blank" rel="noopener noreferrer">July 2024</a>. Over the past year, the library has attracted many early adopters, tremendous interest from the Python community, more than 6000 stars on GitHub, a dozen contributors, and many feature requests.</p>
<p>After months of development, polishing, and community feedback, the library is leaving beta and moving to production/stable development status.</p>
<p><strong>We are happy to announce Crawlee for Python v1.0.</strong></p>
<p>From now on, Crawlee for Python will strictly follow <a href="https://www.semver.org/" target="_blank" rel="noopener noreferrer">semantic versioning</a>. You can now rely on it as a stable foundation for your crawling and scraping projects, knowing that breaking changes will only occur in major releases.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="whats-new-in-crawlee-for-python-v1">What's new in Crawlee for Python v1<a href="https://crawlee.dev/blog/crawlee-for-python-v1#whats-new-in-crawlee-for-python-v1" class="hash-link" aria-label="Direct link to What's new in Crawlee for Python v1" title="Direct link to What's new in Crawlee for Python v1" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://crawlee.dev/blog/crawlee-for-python-v1#new-storage-client-system">New storage client system</a></li>
<li class=""><a href="https://crawlee.dev/blog/crawlee-for-python-v1#adaptive-playwright-crawler">Adaptive Playwright crawler</a></li>
<li class=""><a href="https://crawlee.dev/blog/crawlee-for-python-v1#impit-http-client">Impit HTTP client</a></li>
<li class=""><a href="https://crawlee.dev/blog/crawlee-for-python-v1#sitemap-request-loader">Sitemap request loader</a></li>
<li class=""><a href="https://crawlee.dev/blog/crawlee-for-python-v1#robots-exclusion-standard">Robots exclusion standard</a></li>
<li class=""><a href="https://crawlee.dev/blog/crawlee-for-python-v1#fingerprinting">Fingerprinting</a></li>
<li class=""><a href="https://crawlee.dev/blog/crawlee-for-python-v1#open-telemetry">Open telemetry</a></li>
</ul>
<p><img decoding="async" loading="lazy" alt="Crawlee for Python v1.0" src="https://crawlee.dev/assets/images/crawlee_v100-d491a6c5406c55e0bfcdc9b39b81b7ae.webp" width="1578" height="840" class="img_ev3q"></p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="getting-started">Getting started<a href="https://crawlee.dev/blog/crawlee-for-python-v1#getting-started" class="hash-link" aria-label="Direct link to Getting started" title="Direct link to Getting started" translate="no">​</a></h2>
<p>You can upgrade to the latest version straight from <a href="https://www.pypi.org/project/crawlee/" target="_blank" rel="noopener noreferrer">PyPI</a>:</p>
<div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pip </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">--upgrade</span><span class="token plain"> crawlee</span><br></div></code></pre></div></div>
<p>Check out the full changelog on our <a href="https://www.crawlee.dev/python/docs/changelog#100-2025-09-15" target="_blank" rel="noopener noreferrer">website</a> to see all the details. If you are updating from an older version, make sure to follow our <a href="https://www.crawlee.dev/python/docs/upgrading/upgrading-to-v1" target="_blank" rel="noopener noreferrer">Upgrading to v1</a> guide.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="new-storage-client-system">New storage client system<a href="https://crawlee.dev/blog/crawlee-for-python-v1#new-storage-client-system" class="hash-link" aria-label="Direct link to New storage client system" title="Direct link to New storage client system" translate="no">​</a></h2>
<p>One of the biggest architectural changes in Crawlee v1 is the introduction of a new storage client system. Until now, datasets, key–value stores, and request queues were handled in slightly different ways depending on where they were stored. With v1, this has been unified under a single, consistent interface.</p>
<p>This means that whether you're storing data in memory, on the local file system, in a database, on the Apify platform, or even using a custom backend, the API remains the same. The result is less duplication, better extensibility, and a cleaner developer experience. It also opens the door for the community to build and share their own storage client implementations.</p>
<p>For example, here's how to set up a crawler with a file-system–backed storage client, which persists data locally:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">configuration </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Configuration</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ParselCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">storage_clients </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> FileSystemStorageClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Create a new instance of storage client.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">storage_client </span><span 
class="token operator" style="color:#393A34">=</span><span class="token plain"> FileSystemStorageClient</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Create a configuration with custom settings.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">configuration </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Configuration</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    storage_dir</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'./my_storage'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    purge_on_start</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">False</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token 
comment" style="color:#999988;font-style:italic"># And pass them to the crawler.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    storage_client</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">storage_client</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    configuration</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">configuration</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>And here's an example of using a memory-only storage client, useful for testing or short-lived crawls:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ParselCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">storage_clients </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> MemoryStorageClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Create a new instance of storage client.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">storage_client </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> MemoryStorageClient</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># And pass it to the crawler.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">storage_client</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">storage_client</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>With this new design, switching between storage backends is as simple as swapping out a client, without changing your crawling logic. To dive deeper into configuration, advanced usage (e.g. using different storage clients for specific storage instances), and even how to write your own storage client, see the <a href="https://www.crawlee.dev/python/docs/guides/storages" target="_blank" rel="noopener noreferrer">Storages</a> and <a href="https://www.crawlee.dev/python/docs/guides/storage-clients" target="_blank" rel="noopener noreferrer">Storage clients</a> guides.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="new-experimental-sql-storage-client">New experimental SQL storage client<a href="https://crawlee.dev/blog/crawlee-for-python-v1#new-experimental-sql-storage-client" class="hash-link" aria-label="Direct link to New experimental SQL storage client" title="Direct link to New experimental SQL storage client" translate="no">​</a></h3>
<p>Crawlee v1 introduces an experimental <a href="https://www.crawlee.dev/python/api/class/SqlStorageClient" target="_blank" rel="noopener noreferrer"><code>SqlStorageClient</code></a> that enables persistent storage using SQL databases. Currently, SQLite and PostgreSQL are supported. This storage backend supports concurrent access from multiple crawler processes, enabling distributed crawling scenarios.</p>
<p>The SQL storage client uses <a href="https://www.sqlalchemy.org/" target="_blank" rel="noopener noreferrer">SQLAlchemy 2+</a> under the hood, providing automatic schema creation, connection pooling, and database-specific optimizations. It maintains the same interface as other storage clients, making it easy to switch between different storage backends without changing your crawling logic.</p>
<p>The client uses a context manager to ensure proper connection handling:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ParselCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">storage_clients </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> SqlStorageClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Create SQL storage client (defaults to SQLite).</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> SqlStorageClient</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> storage_client</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Pass it to the crawler.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">storage_client</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">storage_client</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># ... define your handlers and crawling logic</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://crawlee.dev'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span 
class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>For PostgreSQL, simply provide a connection string:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ParselCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">storage_clients </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> SqlStorageClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> SqlStorageClient</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        connection_string</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'postgresql+asyncpg://user:pass@localhost/crawlee_db'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> storage_client</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">storage_client</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">storage_client</span><span class="token punctuation" style="color:#393A34">)</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># ... define your handlers and crawling logic</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://crawlee.dev'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Since this is an experimental feature, the implementation may evolve in future releases as we gather feedback from the community.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="adaptive-playwright-crawler">Adaptive Playwright crawler<a href="https://crawlee.dev/blog/crawlee-for-python-v1#adaptive-playwright-crawler" class="hash-link" aria-label="Direct link to Adaptive Playwright crawler" title="Direct link to Adaptive Playwright crawler" translate="no">​</a></h2>
<p>Some websites can be scraped quickly with plain HTTP requests, while others require the full power of a browser to render dynamic content. Traditionally, you had to decide upfront whether to use one of the lightweight HTTP-based crawlers (<a href="https://www.crawlee.dev/python/api/class/ParselCrawler" target="_blank" rel="noopener noreferrer"><code>ParselCrawler</code></a> or <a href="https://www.crawlee.dev/python/api/class/BeautifulSoupCrawler" target="_blank" rel="noopener noreferrer"><code>BeautifulSoupCrawler</code></a>) or a browser-based <a href="https://www.crawlee.dev/python/api/class/PlaywrightCrawler" target="_blank" rel="noopener noreferrer"><code>PlaywrightCrawler</code></a>. Crawlee v1 introduces the <a href="https://www.crawlee.dev/python/api/class/AdaptivePlaywrightCrawler" target="_blank" rel="noopener noreferrer"><code>AdaptivePlaywrightCrawler</code></a>, which automatically chooses the right approach for each page.</p>
<p>The adaptive crawler uses a detection mechanism: it compares the results of plain HTTP requests with those of a browser-rendered version of the same page. If both match, it can continue with the faster HTTP approach; if differences appear, it falls back to browser-based crawling. Over time, it builds confidence about which rendering type is needed for different pages, occasionally re-checking with the browser to ensure its predictions stay correct.</p>
<p>This makes your crawls faster and cheaper, while still allowing you to reliably handle complex, dynamic websites. In practice, you get the best of both worlds: speed on simple pages and robustness on modern, JavaScript-heavy sites. For advanced options, such as customizing the detection strategy, see the <a href="https://www.crawlee.dev/python/docs/guides/adaptive-playwright-crawler" target="_blank" rel="noopener noreferrer">Adaptive Playwright crawler guide</a>.</p>
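<p>To make the detection idea more concrete, here is a minimal, library-free sketch of the decision loop described above. It is not Crawlee's actual implementation: <code>RenderingPredictor</code>, <code>recheck_ratio</code>, and the string labels are hypothetical names chosen for illustration, and results are compared with plain equality as a simplifying assumption.</p>

```python
import random
from dataclasses import dataclass, field


@dataclass
class RenderingPredictor:
    """Toy predictor (hypothetical, not a Crawlee class).

    Remembers, per group of similar pages, whether a browser was needed.
    """

    # Assumed knob: fraction of requests that are re-checked with the browser.
    recheck_ratio: float = 0.1
    needs_browser: dict[str, bool] = field(default_factory=dict)

    def choose(self, group: str, rng: random.Random) -> str:
        # Unknown groups, and an occasional random sample, get the full comparison.
        if group not in self.needs_browser or rng.random() < self.recheck_ratio:
            return 'compare'
        return 'browser' if self.needs_browser[group] else 'http'

    def record(self, group: str, http_result: dict, browser_result: dict) -> None:
        # Matching results mean plain HTTP was enough for this group.
        self.needs_browser[group] = http_result != browser_result


# Re-checks are disabled here so the walkthrough is deterministic.
predictor = RenderingPredictor(recheck_ratio=0.0)
rng = random.Random(0)

first_choice = predictor.choose('blog', rng)  # nothing known yet
predictor.record('blog', {'title': 'Hello'}, {'title': 'Hello'})
static_choice = predictor.choose('blog', rng)  # results matched earlier

predictor.record('app', {'title': ''}, {'title': 'Dashboard'})
dynamic_choice = predictor.choose('app', rng)  # results differed earlier
```

<p>In this sketch the first request for a group is always compared both ways, and later requests reuse the recorded outcome. Raising the assumed <code>recheck_ratio</code> trades some speed for more frequent verification.</p>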
<p>Here's a simplified example that uses the static <a href="https://www.github.com/scrapy/parsel" target="_blank" rel="noopener noreferrer">Parsel</a> parser for HTTP responses and falls back to <a href="https://www.playwright.dev/" target="_blank" rel="noopener noreferrer">Playwright</a> only when needed:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> timedelta</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> AdaptivePlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> AdaptivePlaywrightCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" 
style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> AdaptivePlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">with_parsel_static_parser</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" 
style="color:#d73a49">request_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> AdaptivePlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Locate element h2 within 5 seconds</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        h2 </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">query_selector_one</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'h2'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> timedelta</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">milliseconds</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">5000</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Do stuff with element found by the selector</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">h2</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://crawlee.dev/'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> 
</span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>In this example, pages that don't need JavaScript rendering are processed through the fast HTTP client, while the rest are handled automatically with Playwright. You don't need to write two different crawlers or guess in advance which method to use; Crawlee adapts dynamically.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="impit-http-client">Impit HTTP client<a href="https://crawlee.dev/blog/crawlee-for-python-v1#impit-http-client" class="hash-link" aria-label="Direct link to Impit HTTP client" title="Direct link to Impit HTTP client" translate="no">​</a></h2>
<p>Crawlee v1 introduces a brand-new default HTTP client: <a href="https://www.crawlee.dev/python/api/class/ImpitHttpClient" target="_blank" rel="noopener noreferrer"><code>ImpitHttpClient</code></a>, powered by the <a href="https://www.github.com/apify/impit" target="_blank" rel="noopener noreferrer">Impit</a> library. Written in Rust and exposed to Python through bindings, it offers better performance, an async-first design, HTTP/3 support, and browser impersonation. Because it can impersonate real browsers out of the box, your crawlers are harder for common anti-bot systems to detect and block. This means fewer false positives, more resilient crawls, and less need for complicated workarounds. Impit is also developed as an open-source project by Apify, so you can dive into the internals or contribute improvements yourself.</p>
<p>By default, Crawlee now uses <a href="https://www.crawlee.dev/python/api/class/ImpitHttpClient" target="_blank" rel="noopener noreferrer"><code>ImpitHttpClient</code></a> under the hood. But you can also create your own instance, configure it to your needs (e.g. enable HTTP/3 or choose a specific browser profile), and pass it into your crawler.</p>
<p>Here's an example of explicitly using <a href="https://www.crawlee.dev/python/api/class/ImpitHttpClient" target="_blank" rel="noopener noreferrer"><code>ImpitHttpClient</code></a> with a <a href="https://www.crawlee.dev/python/api/class/ParselCrawler" target="_blank" rel="noopener noreferrer"><code>ParselCrawler</code></a>:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> ParselCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_clients </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ImpitHttpClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" 
style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    http_client </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ImpitHttpClient</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Optional additional keyword arguments for `impit.AsyncClient`.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        http3</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        browser</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'firefox'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        verify</span><span class="token operator" style="color:#393A34">=</span><span class="token 
boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        http_client</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">http_client</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Limit the crawl to max requests. 
Remove or increase it for crawling all links.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">10</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Define the default request handler, which will be called for every request.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">request_handler</span><span class="token punctuation" style="color:#393A34">(</span><span 
class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ParselCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span 
class="token comment" style="color:#999988;font-style:italic"># Enqueue all links from the page.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">enqueue_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Extract data from the page.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'url'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'title'</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">selector</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">css</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'title::text'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Push the extracted data to the default dataset.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Run the crawler with the initial list of URLs.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://crawlee.dev'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token 
punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>With the <a href="https://www.crawlee.dev/python/api/class/ImpitHttpClient" target="_blank" rel="noopener noreferrer"><code>ImpitHttpClient</code></a>, you get stealth without extra dependencies or plugins. Check out the <a href="https://www.crawlee.dev/python/docs/guides/http-clients" target="_blank" rel="noopener noreferrer">HTTP clients</a> guide for more details and advanced configuration options.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="sitemap-request-loader">Sitemap request loader<a href="https://crawlee.dev/blog/crawlee-for-python-v1#sitemap-request-loader" class="hash-link" aria-label="Direct link to Sitemap request loader" title="Direct link to Sitemap request loader" translate="no">​</a></h2>
<p>Many websites expose their structure through sitemaps. These files provide a clear list of all available URLs, and are often the most efficient way to discover content on a site. In previous Crawlee versions, you had to fetch and parse these XML files manually before feeding them into your crawler. With Crawlee v1, that's no longer necessary.</p>
<p>The new <a href="https://www.crawlee.dev/python/api/class/SitemapRequestLoader" target="_blank" rel="noopener noreferrer"><code>SitemapRequestLoader</code></a> lets you load URLs directly from a sitemap into your request queue, with options for filtering and batching. This makes it much easier to start large-scale crawls where sitemaps already provide full coverage of the site.</p>
<p>Here's an example that loads a sitemap, keeps only the documentation pages, and processes them with a <a href="https://www.crawlee.dev/python/api/class/ParselCrawler" target="_blank" rel="noopener noreferrer"><code>ParselCrawler</code></a>:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> re</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> ParselCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_clients </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ImpitHttpClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token 
plain">request_loaders </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> SitemapRequestLoader</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Create an HTTP client for fetching the sitemap.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    http_client </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ImpitHttpClient</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Create a sitemap request loader with filtering rules.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    sitemap_loader </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> SitemapRequestLoader</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        sitemap_urls</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://crawlee.dev/sitemap.xml'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        http_client</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">http_client</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        include</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">re</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">compile</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">r'.*docs.*'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain">  
</span><span class="token comment" style="color:#999988;font-style:italic"># Only include URLs containing 'docs'.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_buffer_size</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">500</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># Keep up to 500 URLs in memory before processing.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Convert the sitemap loader into a request manager linked</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># to the default request queue.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    request_manager </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> sitemap_loader</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">to_tandem</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Create a crawler and pass the request manager to it.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        request_manager</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">request_manager</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">10</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># Limit the max requests per crawl.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator 
annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ParselCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation 
interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># New links will be enqueued directly to the queue.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">enqueue_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Extract data using Parsel's XPath and CSS selectors.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token 
punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'url'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'title'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">selector</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">xpath</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'//title/text()'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Push extracted data to the dataset.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>By connecting the <a href="https://www.crawlee.dev/python/api/class/SitemapRequestLoader" target="_blank" rel="noopener noreferrer"><code>SitemapRequestLoader</code></a> directly with a crawler, you can skip the boilerplate of parsing XML and just focus on extracting data. For more details, see the <a href="https://www.crawlee.dev/python/docs/guides/request-loaders" target="_blank" rel="noopener noreferrer">Request loaders</a> guide.</p>
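<p>For comparison, here is a rough sketch of the manual step the loader replaces: parsing a sitemap and keeping only the URLs that match an include pattern. It uses only the Python standard library and is not how Crawlee implements it internally. The helper names <code>build_sample_sitemap</code> and <code>extract_urls</code> and the sample URLs are assumptions made for illustration, and the sitemap is built in memory so the snippet runs without network access.</p>

```python
import re
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org 0.9 protocol.
SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9'


def build_sample_sitemap(urls: list[str]) -> str:
    """Build a tiny sitemap document (hypothetical stand-in for a fetched sitemap.xml)."""
    urlset = ET.Element(f'{{{SITEMAP_NS}}}urlset')
    for url in urls:
        entry = ET.SubElement(urlset, f'{{{SITEMAP_NS}}}url')
        ET.SubElement(entry, f'{{{SITEMAP_NS}}}loc').text = url
    return ET.tostring(urlset, encoding='unicode')


def extract_urls(sitemap_xml: str, include: list[re.Pattern[str]]) -> list[str]:
    """Return every loc entry that matches at least one include pattern."""
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.iter(f'{{{SITEMAP_NS}}}loc') if el.text]
    return [url for url in locs if any(p.search(url) for p in include)]


# Illustrative URLs only; a real sitemap would be downloaded over HTTP.
sample = build_sample_sitemap([
    'https://crawlee.dev/python/docs/quick-start',
    'https://crawlee.dev/blog/crawlee-for-python-v1',
    'https://crawlee.dev/python/docs/guides/request-loaders',
])

# Same include rule as in the example above: keep URLs containing 'docs'.
docs_urls = extract_urls(sample, include=[re.compile(r'.*docs.*')])
print(docs_urls)
```

<p>A real implementation would also need to handle sitemap index files, compressed sitemaps, and large files that should be streamed rather than loaded at once, which is the kind of work the loader is meant to take off your hands.</p>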
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="robots-exclusion-standard">Robots exclusion standard<a href="https://crawlee.dev/blog/crawlee-for-python-v1#robots-exclusion-standard" class="hash-link" aria-label="Direct link to Robots exclusion standard" title="Direct link to Robots exclusion standard" translate="no">​</a></h2>
<p>Respecting <a href="https://en.wikipedia.org/wiki/Robots.txt" target="_blank" rel="noopener noreferrer"><code>robots.txt</code></a> is an important part of responsible web crawling. This simple file lets website owners declare which parts of their site should not be crawled by automated agents. Crawlee v1 makes it trivial to follow these rules: just set <code>respect_robots_txt_file=True</code> on your crawler, and Crawlee will automatically check the file before issuing requests.</p>
<p>This not only helps you build ethical crawlers, but can also save time and bandwidth by skipping disallowed or irrelevant pages. For example, login pages, search results, or admin sections are often excluded in <a href="https://en.wikipedia.org/wiki/Robots.txt" target="_blank" rel="noopener noreferrer"><code>robots.txt</code></a>, and Crawlee will handle that for you automatically.</p>
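<p>To make the idea concrete, the following sketch shows the kind of check that happens before a request is issued. It uses Python's standard <code>urllib.robotparser</code> module rather than Crawlee's own implementation, and the rules, the <code>example.com</code> URLs, and the <code>is_allowed</code> helper are all assumptions made for illustration.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /login
"""

parser = RobotFileParser()
# parse() accepts the file as an iterable of lines, so no network access is needed.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def is_allowed(url: str, user_agent: str = '*') -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


for url in (
    'https://example.com/docs/intro',
    'https://example.com/admin/users',
    'https://example.com/login',
):
    print(url, 'allowed' if is_allowed(url) else 'disallowed')
```

<p>With the robots option enabled, you don't have to write this check yourself; the sketch is only meant to show what "respecting" the file means in practice.</p>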
<p>Here's a minimal example showing how a <a href="https://www.crawlee.dev/python/api/class/ParselCrawler" target="_blank" rel="noopener noreferrer"><code>ParselCrawler</code></a> obeys the robots exclusion standard:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> ParselCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token 
plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Create a new crawler instance with robots.txt compliance enabled.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        respect_robots_txt_file</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Define the default request handler.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token 
decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">request_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ParselCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token 
string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Extract the data from website.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'url'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'title'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">selector</span><span 
class="token punctuation" style="color:#393A34">.</span><span class="token plain">xpath</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'//title/text()'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Push extracted data to the dataset.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Run the crawler with the list of start URLs.</span><span 
class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># The crawler will check the robots.txt file before making requests.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># In this example, "https://news.ycombinator.com/login" will be skipped</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># because it's disallowed in the site's robots.txt file.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://news.ycombinator.com/'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'https://news.ycombinator.com/login'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>With this option enabled, you don't need to manually check which URLs are allowed. Crawlee will handle it, letting you focus on the crawling logic and data extraction. For more information, see the <a href="https://www.crawlee.dev/python/docs/examples/respect-robots-txt-file" target="_blank" rel="noopener noreferrer">Respect robots.txt file</a> documentation page.</p>
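<p>To get a feel for the kind of check this option saves you from writing, here's a rough sketch that uses only Python's standard-library <code>urllib.robotparser</code>. It is not how Crawlee implements the feature internally; the <code>ROBOTS_TXT</code> rules, the <code>example.com</code> URLs, and the helper names <code>build_parser</code> and <code>filter_allowed</code> are all made up for illustration.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
Disallow: /admin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_allowed(
    urls: list[str],
    parser: RobotFileParser,
    user_agent: str = '*',
) -> list[str]:
    """Keep only the URLs that the parsed rules allow for the user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == '__main__':
    parser = build_parser(ROBOTS_TXT)
    urls = ['https://example.com/', 'https://example.com/login']
    # Only the URLs that pass the robots.txt check are printed.
    print(filter_allowed(urls, parser))
```

<p>A crawler that respects <code>robots.txt</code> has to run a check along these lines for every URL before requesting it, which is the bookkeeping that <code>respect_robots_txt_file</code> takes off your hands.</p>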
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="fingerprinting">Fingerprinting<a href="https://crawlee.dev/blog/crawlee-for-python-v1#fingerprinting" class="hash-link" aria-label="Direct link to Fingerprinting" title="Direct link to Fingerprinting" translate="no">​</a></h2>
<p>Modern websites often rely on browser fingerprinting to distinguish real users from automated traffic. Instead of just checking the <a href="https://www.developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/User-Agent" target="_blank" rel="noopener noreferrer">User-Agent</a> header, they combine dozens of subtle signals - supported fonts, canvas rendering, WebGL features, media devices, screen resolution, and more. Together, these form a unique <a href="https://www.en.wikipedia.org/wiki/Device_fingerprint" target="_blank" rel="noopener noreferrer">device fingerprint</a> that can easily expose headless browsers or automation frameworks.</p>
<p>Without fingerprinting, Playwright sessions tend to look identical and are more likely to be flagged by anti-bot systems. Crawlee v1 integrates with the <a href="https://www.crawlee.dev/python/api/class/FingerprintGenerator" target="_blank" rel="noopener noreferrer"><code>FingerprintGenerator</code></a> to automatically inject realistic, randomized fingerprints into every <a href="https://www.crawlee.dev/python/api/class/PlaywrightCrawler" target="_blank" rel="noopener noreferrer"><code>PlaywrightCrawler</code></a> session. This modifies HTTP headers, browser APIs, and other low-level signals so that each crawler run looks like a real browser on a real device.</p>
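<p>To illustrate why identical sessions stand out, here's a deliberately simplified sketch of how a handful of signals could be collapsed into a single identifier. Real anti-bot systems are far more sophisticated; the signal names and values below and the <code>fingerprint_id</code> helper are hypothetical and not part of Crawlee or any detection product.</p>

```python
import hashlib
import json


def fingerprint_id(signals: dict[str, object]) -> str:
    """Hash a set of browser signals into one short identifier."""
    # Sorting the keys makes the result independent of insertion order.
    canonical = json.dumps(signals, sort_keys=True)
    return hashlib.sha256(canonical.encode('utf-8')).hexdigest()[:16]


# Hypothetical signal values, invented for this illustration.
headless_default = {
    'user_agent': 'HeadlessChrome/120.0',
    'screen': [800, 600],
    'fonts': ['DejaVu Sans'],
    'webgl_vendor': 'Google Inc. (SwiftShader)',
}

if __name__ == '__main__':
    session_a = fingerprint_id(headless_default)
    session_b = fingerprint_id(dict(headless_default))
    # Two sessions with identical signals share one identifier.
    print(session_a == session_b)

    # Changing a single signal yields a different identifier.
    varied = {**headless_default, 'screen': [1920, 1080]}
    print(fingerprint_id(varied) == session_a)
```

<p>If every session reports the same signals, they all collapse into one identifier that is easy to flag. Varying those signals in a realistic way is what keeps sessions from looking like copies of each other.</p>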
<p>Using fingerprinting in Crawlee is straightforward: create a fingerprint generator with your desired options and pass it to the crawler.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> PlaywrightCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">fingerprint_suite </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    DefaultFingerprintGenerator</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    HeaderGeneratorOptions</span><span class="token punctuation" 
style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ScreenOptions</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Use the default fingerprint generator with the desired fingerprint options.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># The generator produces a realistic-looking browser fingerprint based on the 
options.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Unspecified fingerprint options will be automatically selected by the generator.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    fingerprint_generator </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> DefaultFingerprintGenerator</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        header_options</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">HeaderGeneratorOptions</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">browsers</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'chrome'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        screen_options</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">ScreenOptions</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">min_width</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">400</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Limit the crawl to max requests. Remove or increase it for crawling all links.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">10</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Headless mode, set to False to see the browser in action.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        headless</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">False</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># 
Browser types supported by Playwright.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        browser_type</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'chromium'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Fingerprint generator to be used. By default no fingerprint generation is done.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        fingerprint_generator</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">fingerprint_generator</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Define the default request handler, which will be called for every request.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation 
punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">request_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation 
punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Find a link to the next page and enqueue it if it exists.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">enqueue_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">selector</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'.morelink'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Run the crawler with the initial list of URLs.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://news.ycombinator.com/'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>In this example, each Playwright instance starts with a unique, realistic fingerprint. From the website’s perspective, the crawler behaves like a real browser session, reducing the chance of detection or blocking. For more details and examples, see the <a href="https://www.crawlee.dev/python/docs/guides/avoid-blocking" target="_blank" rel="noopener noreferrer">Avoid getting blocked</a> guide and the <a href="https://www.crawlee.dev/python/docs/examples/playwright-crawler-with-fingeprint-generator" target="_blank" rel="noopener noreferrer">Playwright crawler with fingerprint generator</a> documentation page.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="open-telemetry">Open telemetry<a href="https://crawlee.dev/blog/crawlee-for-python-v1#open-telemetry" class="hash-link" aria-label="Direct link to Open telemetry" title="Direct link to Open telemetry" translate="no">​</a></h2>
<p>Running crawlers in production means you often want more than just logs - you need visibility into what the crawler is doing, how it's performing, and where bottlenecks occur. Crawlee v1 adds basic <a href="https://www.opentelemetry.io/" target="_blank" rel="noopener noreferrer">OpenTelemetry</a> instrumentation via <a href="https://www.crawlee.dev/python/api/class/CrawlerInstrumentor" target="_blank" rel="noopener noreferrer"><code>CrawlerInstrumentor</code></a>, giving you a standardized way to collect traces and metrics from your crawlers.</p>
<p>With <a href="https://www.opentelemetry.io/" target="_blank" rel="noopener noreferrer">OpenTelemetry</a> enabled, Crawlee automatically records information such as:</p>
<ul>
<li class="">Requests and responses (including timings, retries, and errors).</li>
<li class="">Resource usage events (memory, concurrency, system snapshots).</li>
<li class="">Lifecycle events from crawlers, routers, and handlers.</li>
</ul>
<p>These signals can be exported to any OpenTelemetry-compatible backend (e.g. <a href="https://www.jaegertracing.io/" target="_blank" rel="noopener noreferrer">Jaeger</a>, <a href="https://www.prometheus.io/" target="_blank" rel="noopener noreferrer">Prometheus</a>, or <a href="https://www.grafana.com/" target="_blank" rel="noopener noreferrer">Grafana</a>), where you can monitor real-time dashboards or analyze traces to understand crawler performance.</p>
<p>Here's a minimal example:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> opentelemetry</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">exporter</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">otlp</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">proto</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">grpc</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">trace_exporter </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> OTLPSpanExporter</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> opentelemetry</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sdk</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">resources </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Resource</span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> opentelemetry</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sdk</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">trace </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> TracerProvider</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> opentelemetry</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sdk</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">trace</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">export </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> SimpleSpanProcessor</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> opentelemetry</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">trace </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> set_tracer_provider</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> 
BasicCrawlingContext</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> ParselCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">otel </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> CrawlerInstrumentor</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">storages </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Dataset</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> KeyValueStore</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> RequestQueue</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">instrument_crawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span 
class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    resource </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Resource</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">create</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'service.name'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'ExampleCrawler'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'service.version'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'1.0.0'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'environment'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">'development'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Set up the OpenTelemetry tracer provider and exporter</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    provider </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> TracerProvider</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">resource</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">resource</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    otlp_exporter </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> OTLPSpanExporter</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">endpoint</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'localhost:4317'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> insecure</span><span class="token operator" 
style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    provider</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_span_processor</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">SimpleSpanProcessor</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">otlp_exporter</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    set_tracer_provider</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">provider</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Instrument the crawler with OpenTelemetry</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    CrawlerInstrumentor</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        instrument_classes</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">RequestQueue</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> KeyValueStore</span><span class="token punctuation" style="color:#393A34">,</span><span 
class="token plain"> Dataset</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">instrument</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    instrument_crawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">100</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    kvs </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> KeyValueStore</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">open</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">pre_navigation_hook</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">pre_nav_hook</span><span class="token punctuation" 
style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">_</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> BasicCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Simulate some pre-navigation processing</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sleep</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">0.01</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" 
style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ParselCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'url'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> kvs</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">set_value</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">key</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'url'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> value</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">enqueue_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" 
style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://crawlee.dev/'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Once configured, your traces and metrics can be exported using standard OpenTelemetry exporters (e.g. OTLP, console, or custom backends). This makes it much easier to integrate Crawlee into existing monitoring pipelines. For more details on available options and examples of exporting traces, see the <a href="https://www.crawlee.dev/python/docs/guides/trace-and-monitor-crawlers" target="_blank" rel="noopener noreferrer">Trace and monitor crawlers</a> guide.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="a-message-from-the-crawlee-team">A message from the Crawlee team<a href="https://crawlee.dev/blog/crawlee-for-python-v1#a-message-from-the-crawlee-team" class="hash-link" aria-label="Direct link to A message from the Crawlee team" title="Direct link to A message from the Crawlee team" translate="no">​</a></h2>
<p>Last but not least, we want to thank our open-source community members who tried Crawlee for Python in its beta version and helped us improve it for the scraping and automation community.</p>
<p>We would appreciate it if you could check out the latest version and <a href="https://www.github.com/apify/crawlee-python/" target="_blank" rel="noopener noreferrer">give us a star on GitHub</a> if you like the new features.</p>
<p>If you have any questions or feedback, please open a <a href="https://www.github.com/apify/crawlee-python/discussions" target="_blank" rel="noopener noreferrer">GitHub discussion</a> or <a href="https://www.apify.com/discord/" target="_blank" rel="noopener noreferrer">join our Discord community</a> to get support or talk to fellow Crawlee users. If you encounter any bugs or have an idea for a new feature, please open a <a href="https://www.github.com/apify/crawlee-python/issues" target="_blank" rel="noopener noreferrer">GitHub issue</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to scrape YouTube using Python [2025 guide]]]></title>
            <link>https://crawlee.dev/blog/scrape-youtube-python</link>
            <guid>https://crawlee.dev/blog/scrape-youtube-python</guid>
            <pubDate>Mon, 14 Jul 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to scrape YouTube using Crawlee for Python]]></description>
            <content:encoded><![CDATA[<p>In this guide, we'll explore how to efficiently collect data from YouTube using <a href="https://github.com/apify/crawlee-python" target="_blank" rel="noopener noreferrer">Crawlee for Python</a>. The scraper will extract video metadata, video statistics, and transcripts - giving you structured YouTube data perfect for content analysis, ML training, or trend monitoring.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>One of our community members wrote this guide as a contribution to the Crawlee Blog. If you'd like to contribute articles like these, please reach out to us on Apify’s <a href="https://apify.com/discord" target="_blank" rel="noopener noreferrer">Discord channel</a>.</p></div></div>
<p><img decoding="async" loading="lazy" alt="How to scrape YouTube using Python" src="https://crawlee.dev/assets/images/youtube_banner-fb73d10d52bbf13a89f3c0d66d2eff5b.webp" width="1152" height="649" class="img_ev3q"></p>
<p>Key steps we'll cover:</p>
<ol>
<li class=""><a href="https://www.crawlee.dev/blog/scrape-youtube-python#1-project-setup" target="_blank" rel="noopener noreferrer">Project setup</a></li>
<li class=""><a href="https://www.crawlee.dev/blog/scrape-youtube-python#2-analyzing-youtube-and-determining-a-scraping-strategy" target="_blank" rel="noopener noreferrer">Analyzing YouTube and determining a scraping strategy</a></li>
<li class=""><a href="https://www.crawlee.dev/blog/scrape-youtube-python#3-configuring-crawlee" target="_blank" rel="noopener noreferrer">Configuring Crawlee</a></li>
<li class=""><a href="https://www.crawlee.dev/blog/scrape-youtube-python#4-extracting-youtube-data" target="_blank" rel="noopener noreferrer">Extracting YouTube data</a></li>
<li class=""><a href="https://www.crawlee.dev/blog/scrape-youtube-python#5-enhancing-the-scraper-capabilities" target="_blank" rel="noopener noreferrer">Enhancing the scraper capabilities</a></li>
<li class=""><a href="https://www.crawlee.dev/blog/scrape-youtube-python#6-creating-a-youtube-actor-on-the-apify-platform" target="_blank" rel="noopener noreferrer">Creating a YouTube Actor on the Apify platform</a></li>
<li class=""><a href="https://www.crawlee.dev/blog/scrape-youtube-python#7-deploying-to-apify" target="_blank" rel="noopener noreferrer">Deploying to Apify</a></li>
</ol>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="what-youll-need-to-get-started">What you’ll need to get started<a href="https://crawlee.dev/blog/scrape-youtube-python#what-youll-need-to-get-started" class="hash-link" aria-label="Direct link to What you’ll need to get started" title="Direct link to What you’ll need to get started" translate="no">​</a></h2>
<ul>
<li class="">Python 3.10 or higher</li>
<li class="">Familiarity with web scraping concepts</li>
<li class="">Crawlee for Python <code>v0.6.0</code> or higher</li>
<li class=""><a href="https://docs.astral.sh/uv/" target="_blank" rel="noopener noreferrer">uv</a> <code>v0.7</code> or higher</li>
</ul>
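<p>If you want to verify part of this list before going further, a small standard-library sketch along these lines may help. The function names are hypothetical, and the check covers only the Python version and whether a <code>uv</code> executable is on your <code>PATH</code>. It does not verify the uv or Crawlee versions.</p>

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum Python version from the list above


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Return True when the (major, minor) pair meets the minimum version."""
    return tuple(version_info[:2]) >= tuple(minimum)


def find_uv():
    """Return the path to the `uv` executable, or None if it is not on PATH."""
    return shutil.which('uv')


if __name__ == '__main__':
    print('Python OK:', python_is_supported())
    print('uv found at:', find_uv())
```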
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-project-setup">1. Project setup<a href="https://crawlee.dev/blog/scrape-youtube-python#1-project-setup" class="hash-link" aria-label="Direct link to 1. Project setup" title="Direct link to 1. Project setup" translate="no">​</a></h2>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>Before starting the project, I'd like to ask you to star Crawlee for Python on <a href="https://github.com/apify/crawlee-python/" target="_blank" rel="noopener noreferrer">GitHub</a>. This will help us spread the word to fellow scraper developers.</p></div></div>
<p>In this project, we'll use uv both for package management and to install a specific Python version. If you don't have uv installed yet, follow the installation <a href="https://docs.astral.sh/uv/getting-started/installation/" target="_blank" rel="noopener noreferrer">guide</a> or run this command:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token function" style="color:#d73a49">curl</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-LsSf</span><span class="token plain"> https://astral.sh/uv/install.sh </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">sh</span><br></div></code></pre></div></div>
<p>To create the project, run:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">uvx </span><span class="token string" style="color:#e3116c">'crawlee[cli]'</span><span class="token plain"> create youtube-crawlee</span><br></div></code></pre></div></div>
<p>In the <code>cli</code> menu that opens, select:</p>
<ol>
<li class=""><code>Playwright</code></li>
<li class=""><code>Httpx</code></li>
<li class=""><code>uv</code></li>
<li class="">Leave the default value - <code>https://crawlee.dev</code></li>
<li class=""><code>y</code></li>
</ol>
<p>Or, just run the command:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">uvx </span><span class="token string" style="color:#e3116c">'crawlee[cli]'</span><span class="token plain"> create youtube-crawlee --crawler-type playwright --http-client httpx --package-manager uv </span><span class="token parameter variable" style="color:#36acaa">--apify</span><span class="token plain"> --start-url </span><span class="token string" style="color:#e3116c">'https://crawlee.dev'</span><br></div></code></pre></div></div>
<p>Or, if you prefer <code>pipx</code>:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pipx run </span><span class="token string" style="color:#e3116c">'crawlee[cli]'</span><span class="token plain"> create youtube-crawlee --crawler-type playwright --http-client httpx --package-manager uv </span><span class="token parameter variable" style="color:#36acaa">--apify</span><span class="token plain"> --start-url </span><span class="token string" style="color:#e3116c">'https://crawlee.dev'</span><br></div></code></pre></div></div>
<p>Creating the project may take a few minutes. After installation is complete, navigate to the project folder:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token builtin class-name">cd</span><span class="token plain"> youtube-crawlee</span><br></div></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-analyzing-youtube-and-determining-a-scraping-strategy">2. Analyzing YouTube and determining a scraping strategy<a href="https://crawlee.dev/blog/scrape-youtube-python#2-analyzing-youtube-and-determining-a-scraping-strategy" class="hash-link" aria-label="Direct link to 2. Analyzing YouTube and determining a scraping strategy" title="Direct link to 2. Analyzing YouTube and determining a scraping strategy" translate="no">​</a></h2>
<p>If you're working on a small project to extract data from YouTube, you should use the <a href="https://developers.google.com/youtube/v3/docs/search/list" target="_blank" rel="noopener noreferrer">YouTube API</a> to get your data. However, the API has very strict quotas: by default, no more than <a href="https://developers.google.com/youtube/v3/determine_quota_cost" target="_blank" rel="noopener noreferrer">10,000 units per day</a>. Since each search request costs 100 units, this allows you to get just 100 search pages per day, and raising the limit requires applying to Google for a quota extension.</p>
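<p>The quota arithmetic can be checked with a short sketch. It assumes a daily quota of 10,000 units and a cost of 100 units per search request; the constant and function names are made up for illustration.</p>

```python
# Assumed figures: a 10,000-unit daily quota and 100 units per search request
DAILY_QUOTA_UNITS = 10_000
SEARCH_COST_UNITS = 100


def max_daily_calls(daily_quota: int, cost_per_call: int) -> int:
    """Return how many calls of a given cost fit into the daily quota."""
    return daily_quota // cost_per_call


print(max_daily_calls(DAILY_QUOTA_UNITS, SEARCH_COST_UNITS))
```

<p>Any other API calls made on the same day draw from the same quota, so the real number of search pages may be lower.</p>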
<p>If your project requires more data than the API allows, you'll need to use crawling. Let's examine the site to develop an optimal crawling strategy.</p>
<p>Let's study YouTube navigation using <a href="https://www.youtube.com/@Apify" target="_blank" rel="noopener noreferrer">Apify's YouTube channel</a> as an example to better understand the features and data extraction points.</p>
<p>YouTube uses infinite scrolling to load new elements on the page, similar to what we discussed in the corresponding <a href="https://www.crawlee.dev/blog/infinite-scroll-using-python" target="_blank" rel="noopener noreferrer">article</a> from the <a href="https://apify.com/" target="_blank" rel="noopener noreferrer">Apify</a> team. Let's look at how this works using <a href="https://developer.chrome.com/docs/devtools" target="_blank" rel="noopener noreferrer">DevTools</a> and the <a href="https://developer.chrome.com/docs/devtools/network/" target="_blank" rel="noopener noreferrer">Network</a> tab.</p>
<p><img decoding="async" loading="lazy" alt="Load Request" src="https://crawlee.dev/assets/images/load_request-c583830dda107ae55fb6426d7b96e569.webp" width="1917" height="1066" class="img_ev3q"></p>
<p>If we look at the response structure, we can see that YouTube uses <a href="https://www.json.org/" target="_blank" rel="noopener noreferrer">JSON</a> to transmit data, but its structure is quite complex to navigate.</p>
<p><img decoding="async" loading="lazy" alt="Load Response" src="https://crawlee.dev/assets/images/load_response-7061bb91cadc904d54073c033f3f0a20.webp" width="1919" height="1064" class="img_ev3q"></p>
<p>Therefore, we'll use <a href="https://playwright.dev/python/docs/intro" target="_blank" rel="noopener noreferrer">Playwright</a> for crawling, which will help us avoid parsing complex JSON responses. But if you want to practice crawling complex websites, try implementing a crawler based on an HTTP client, like in this <a href="https://www.crawlee.dev/blog/scraping-dynamic-websites-using-python" target="_blank" rel="noopener noreferrer">article</a>.</p>
<p>Let's analyze the selectors for getting video links using the <a href="https://developer.chrome.com/docs/devtools/elements/" target="_blank" rel="noopener noreferrer">Elements</a> tab:</p>
<p><img decoding="async" loading="lazy" alt="Selectors" src="https://crawlee.dev/assets/images/selectors-745f5daab12810cc998990e4c066afdf.webp" width="1919" height="1064" class="img_ev3q"></p>
<p>It looks like we're interested in <code>a</code> tags with the attribute <code>id="video-title-link"</code>!</p>
<p>Let's look at the video page to better understand how YouTube transmits data. As expected, we see data in JSON format.</p>
<p><img decoding="async" loading="lazy" alt="Video Response" src="https://crawlee.dev/assets/images/video_json-44affd2ba348740caa8d1bc79ba9a8a9.webp" width="1919" height="1061" class="img_ev3q"></p>
<p>Now let's get the transcript link. Click on the subtitles button in the player to trigger the transcript request.</p>
<p><img decoding="async" loading="lazy" alt="Transcript Request" src="https://crawlee.dev/assets/images/transcript_request-77c78163912afe398161b431c20cb733.webp" width="1917" height="1074" class="img_ev3q"></p>
<p>Let's verify that we can access the transcript via this link. Remove the <code>fmt=json3</code> parameter from the URL and open it in your browser. Removing the <code>fmt</code> parameter is necessary to get the data in a convenient XML format instead of the complex JSON3 format.</p>
<p><img decoding="async" loading="lazy" alt="Transcript Response" src="https://crawlee.dev/assets/images/transcript_response-06133506fa3559a10cfc43912d1af67c.webp" width="1637" height="1107" class="img_ev3q"></p>
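<p>Dropping the <code>fmt</code> parameter can also be done in code. Below is a minimal sketch using only the standard library; the helper name <code>drop_query_param</code> and the sample URL are hypothetical, and real transcript links carry more parameters.</p>

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_param(url: str, name: str) -> str:
    """Return `url` without the query parameter called `name`."""
    parts = urlsplit(url)
    # Keep every key/value pair except the one we want to remove
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical transcript URL, shortened for readability
url = 'https://www.youtube.com/api/timedtext?v=VIDEO_ID&fmt=json3&kind=asr'
print(drop_query_param(url, 'fmt'))
```

<p>Note that <code>urlencode</code> re-encodes the remaining values, so percent-escapes in the result may be normalized compared to the original link.</p>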
<p>If you live in a country where <a href="https://gdpr-info.eu/" target="_blank" rel="noopener noreferrer">GDPR</a> applies, you'll need to handle the following pop-up before you can access the data:</p>
<p><img decoding="async" loading="lazy" alt="GDPR" src="https://crawlee.dev/assets/images/GDPR-103ec4d5f927916f704ec1d4d597bd82.webp" width="1919" height="1071" class="img_ev3q"></p>
<p>After our analysis, we now understand:</p>
<ul>
<li class=""><strong>Navigation strategy</strong>: How to navigate the channel page to retrieve all videos using infinite scroll.</li>
<li class=""><strong>Video metadata extraction</strong>: How to extract video statistics, title, description, publish date, and other metadata from video pages.</li>
<li class=""><strong>Transcript access</strong>: How to obtain the correct transcript link.</li>
<li class=""><strong>Data formats</strong>: Transcript data is available in XML format, which is easier to parse than JSON3.</li>
<li class=""><strong>Regional considerations</strong>: Special handling is required for the GDPR consent page in European countries.</li>
</ul>
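<p>To illustrate why the XML format is convenient, here is a small sketch that turns a transcript payload into plain text with <code>xml.etree.ElementTree</code>. The payload is built by hand as a stand-in; the <code>transcript</code> and <code>text</code> element names and the <code>start</code> and <code>dur</code> attributes are assumptions about the response layout, so check them against the response you actually receive.</p>

```python
import xml.etree.ElementTree as ET


def transcript_to_text(xml_payload: str) -> str:
    """Join the content of every `text` element into one string."""
    root = ET.fromstring(xml_payload)
    return ' '.join(node.text.strip() for node in root.iter('text') if node.text)


# Build a tiny stand-in payload; element and attribute names are assumptions
root = ET.Element('transcript')
for start, line in [('0.0', 'Hello and welcome'), ('2.5', 'to the channel')]:
    ET.SubElement(root, 'text', start=start, dur='2.5').text = line
sample_xml = ET.tostring(root, encoding='unicode')

print(transcript_to_text(sample_xml))
```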
<p>With this knowledge, we're ready to implement the YouTube scraper using Crawlee for Python.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-configuring-crawlee">3. Configuring Crawlee<a href="https://crawlee.dev/blog/scrape-youtube-python#3-configuring-crawlee" class="hash-link" aria-label="Direct link to 3. Configuring Crawlee" title="Direct link to 3. Configuring Crawlee" translate="no">​</a></h2>
<p>Configuring Crawlee for YouTube is very similar to configuring it for <a href="https://www.crawlee.dev/blog/scrape-tiktok-python" target="_blank" rel="noopener noreferrer">TikTok</a>, but with some key differences.</p>
<p>Since pages have infinite scrolling, we need to limit the number of elements we want to get. For this, we'll add a <code>max_items</code> parameter that caps the number of videos collected per channel, and pass it in <code>user_data</code> (under the <code>limit</code> key) when forming a <a href="https://www.crawlee.dev/python/api/class/Request" target="_blank" rel="noopener noreferrer">Request</a>.</p>
<p>We'll limit the intensity of scraping by setting <code>max_tasks_per_minute</code> in <a href="https://www.crawlee.dev/python/api/class/ConcurrencySettings" target="_blank" rel="noopener noreferrer"><code>ConcurrencySettings</code></a>. This will help us reduce the likelihood of being blocked by YouTube.</p>
<p>Scrolling pages can take a long time, so we'll increase the time limit for processing a single request using <code>request_handler_timeout</code>.</p>
<p>Since we won't be saving images, videos, and similar media content during crawling, we can block requests to them by calling <a href="https://www.crawlee.dev/python/api/class/BlockRequestsFunction" target="_blank" rel="noopener noreferrer"><code>block_requests</code></a> inside a <a href="https://www.crawlee.dev/python/api/class/PlaywrightCrawler#pre_navigation_hook" target="_blank" rel="noopener noreferrer"><code>pre_navigation_hook</code></a>.</p>
<p>Also, to handle the <code>GDPR</code> page only once, we'll use <a href="https://www.crawlee.dev/python/api/class/UseStateFunction" target="_blank" rel="noopener noreferrer"><code>use_state</code></a> to pass the appropriate cookies between sessions, ensuring all requests have the necessary cookies.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># main.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> timedelta</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> apify </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Actor</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> Request</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" 
style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> PlaywrightCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">hooks </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> pre_hook</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">routes </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> router</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span 
class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""The crawler entry point."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> Actor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Create a crawler instance with the router</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Limit scraping intensity by setting a limit on requests per minute</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            concurrency_settings</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">(</span><span 
class="token plain">max_tasks_per_minute</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">50</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># We'll configure the `router` in the next step</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            request_handler</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">router</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Increase the timeout for the request handling pipeline</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            request_handler_timeout</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">timedelta</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">seconds</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">120</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Runs browser without visual interface</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            headless</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Limit requests per crawl for testing purposes</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">100</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Set the maximum number of items to scrape per youtube channel</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_items </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Set the list of channels to 
scrape</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        channels </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'Apify'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Set a hook that prepares the context before navigation on each request</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">pre_navigation_hook</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">pre_hook</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">                Request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'https://www.youtube.com/@</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">channel</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">/videos'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> user_data</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'limit'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> max_items</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> channel </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> channels</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Let's prepare the <code>pre_hook</code> function to block requests and set cookies (the cookie collection process will be explained in the extraction section):</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># hooks.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> PlaywrightPreNavCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">pre_hook</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightPreNavCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" 
style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Prepare context before navigation."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler_state </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">use_state</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Check if there are previously collected cookies in the crawler state and set them for the session</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'cookies'</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> crawler_state </span><span class="token keyword" style="color:#00009f">and</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">session</span><span 
class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        cookies </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> crawler_state</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'cookies'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Set cookies for the session</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">session</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">cookies</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">set_cookies_from_playwright_format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">cookies</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Block requests to resources that aren't needed for parsing</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># This is similar to the default value, but we don't block `css` as it is needed for Player loading</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">block_requests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        url_patterns</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'.webp'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'.jpg'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'.jpeg'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'.png'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'.svg'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'.gif'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'.woff'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'.pdf'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'.zip'</span><span class="token punctuation" style="color:#393A34">]</span><span 
class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="4-extracting-youtube-data">4. Extracting YouTube data<a href="https://crawlee.dev/blog/scrape-youtube-python#4-extracting-youtube-data" class="hash-link" aria-label="Direct link to 4. Extracting YouTube data" title="Direct link to 4. Extracting YouTube data" translate="no">​</a></h2>
<p>After configuration, let's move on to navigation and data extraction.</p>
<p>For infinite scrolling, we'll use the built-in helper function <a href="https://www.crawlee.dev/python/api/class/PlaywrightCrawlingContext#infinite_scroll" target="_blank" rel="noopener noreferrer"><code>infinite_scroll</code></a>. Instead of waiting for scrolling to complete, which in some cases can take a very long time, we'll use Python's <code>asyncio</code> to run it as a background task.</p>
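<p>The background-task idea can be sketched with plain <code>asyncio</code>. In this minimal example, <code>fake_infinite_scroll</code> is a hypothetical stand-in for the real helper, and the limit and sleep intervals are arbitrary values chosen for illustration. It shows the general pattern only, not the crawler's actual handler.</p>

```python
import asyncio


async def fake_infinite_scroll(loaded: list[int]) -> None:
    """Hypothetical stand-in for the scroll helper: keeps 'loading' items forever."""
    while True:
        loaded.append(len(loaded))
        await asyncio.sleep(0.01)


async def collect(limit: int) -> list[int]:
    loaded: list[int] = []
    # Start scrolling in the background instead of awaiting it to completion
    scroll_task = asyncio.create_task(fake_infinite_scroll(loaded))

    # Poll until enough items have been loaded
    while limit > len(loaded):
        await asyncio.sleep(0.01)

    # Stop the background task once we have what we need
    scroll_task.cancel()
    try:
        await scroll_task
    except asyncio.CancelledError:
        pass

    return loaded[:limit]


result = asyncio.run(collect(5))
print(result)
```

<p>Awaiting the task after <code>cancel()</code> gives the coroutine a chance to finish its cleanup before the caller moves on.</p>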
<p>The GDPR page that asks for cookie consent is served from the domain <code>consent.youtube.com</code>, which might cause an error when forming a <a href="https://www.crawlee.dev/python/api/class/Request" target="_blank" rel="noopener noreferrer">Request</a> for a video page. To avoid this, we pass a helper function as the <code>transform_request_function</code> parameter of <a href="https://www.crawlee.dev/python/api/class/ExtractLinksFunction" target="_blank" rel="noopener noreferrer"><code>extract_links</code></a>.</p>
<p>This function checks each extracted URL. If it contains <code>consent.youtube</code>, we replace that part with <code>www.youtube</code>, which gives us the correct URL for the video page.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># routes.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> __future__ </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> annotations</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> xml</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">etree</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">ElementTree </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> ET</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> typing </span><span class="token keyword" style="color:#00009f">import</span><span class="token 
plain"> TYPE_CHECKING</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> yarl </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> URL</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Request</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> RequestOptions</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> RequestTransformAction</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> PlaywrightCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">router </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Router</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" 
style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> TYPE_CHECKING</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> playwright</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">async_api </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Request </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> PlaywrightRequest</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> playwright</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">async_api </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Route </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> PlaywrightRoute</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">router </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Router</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">request_domain_transform</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">request_param</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> RequestOptions</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> RequestOptions </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> RequestTransformAction</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Transform request before adding it to the queue."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'consent.youtube'</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> request_param</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'url'</span><span 
class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        request_param</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'url'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> request_param</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'url'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">replace</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'consent.youtube'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'www.youtube'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> request_param</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'unchanged'</span><br></div></code></pre></div></div>
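<p>To see what the transform returns in each case, here is a small standalone sketch. It repeats the same logic but is typed with a plain <code>dict</code> so that it runs without Crawlee installed, and the two URLs are hypothetical examples used only for illustration.</p>

```python
from __future__ import annotations


def request_domain_transform(request_param: dict[str, str]) -> dict[str, str] | str:
    """Same logic as above, typed with a plain dict so it runs without Crawlee."""
    if 'consent.youtube' in request_param['url']:
        # Rewrite the consent domain to the regular YouTube domain
        request_param['url'] = request_param['url'].replace('consent.youtube', 'www.youtube')
        return request_param
    # Signal that the request options should be used as they are
    return 'unchanged'


# Hypothetical URLs, used only for illustration
rewritten = request_domain_transform({'url': 'https://consent.youtube.com/watch?v=abc123'})
untouched = request_domain_transform({'url': 'https://www.youtube.com/watch?v=abc123'})

print(rewritten)  # {'url': 'https://www.youtube.com/watch?v=abc123'}
print(untouched)  # unchanged
```

<p>Note that the substring check matches <code>consent.youtube</code> anywhere in the URL, including inside a query string, so a stricter variant might compare only the hostname.</p>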
<p>Next, let's implement a function that intercepts the page's transcript request and captures its URL, so the crawler can modify and process it later:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># routes.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">extract_transcript_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Extract the transcript URL from request intercepted by 
Playwright."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Create a Future to store the transcript URL</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    transcript_future</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">Future</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">Future</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Define a handler for the transcript request</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># This will be called when the page requests the transcript</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token 
keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">handle_transcript_request</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">route</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightRoute</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> request</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightRequest</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Set the result of the future with the transcript URL</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> transcript_future</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">done</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            transcript_future</span><span 
class="token punctuation" style="color:#393A34">.</span><span class="token plain">set_result</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> route</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">fulfill</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">status</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">200</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Set up a route to intercept requests to the transcript API</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">route</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" 
style="color:#e3116c">'**/api/timedtext**'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> handle_transcript_request</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Click the subtitles button to trigger the transcript request</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">click</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'.ytp-subtitles-button'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Wait for the transcript URL to be captured</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># The future will resolve when handle_transcript_request is called</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    
</span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> transcript_future</span><br></div></code></pre></div></div>
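<p>As written, the function above awaits the future without a time limit. If you want to bound the wait, one option is to wrap the future in <code>asyncio.wait_for</code>. The sketch below shows the pattern with plain <code>asyncio</code>: the <code>on_request</code> callback, the example URLs, and the timeout value are all assumptions made for illustration, and the browser is simulated with timers rather than Playwright.</p>

```python
from __future__ import annotations

import asyncio


async def wait_for_first_url(fire_requests: bool, timeout: float = 0.2) -> str | None:
    """Resolve with the first captured URL, or None if nothing arrives in time."""
    loop = asyncio.get_running_loop()
    url_future: asyncio.Future[str] = loop.create_future()

    def on_request(url: str) -> None:
        # Only the first matching request resolves the future
        if not url_future.done():
            url_future.set_result(url)

    if fire_requests:
        # Simulate the page firing two matching requests shortly after a click
        loop.call_later(0.01, on_request, 'https://example.com/api/timedtext?v=1')
        loop.call_later(0.02, on_request, 'https://example.com/api/timedtext?v=2')

    try:
        # Bound the wait so a page that never sends the request cannot hang the handler
        return await asyncio.wait_for(url_future, timeout)
    except asyncio.TimeoutError:
        return None


first = asyncio.run(wait_for_first_url(fire_requests=True))
missing = asyncio.run(wait_for_first_url(fire_requests=False))

print(first)    # https://example.com/api/timedtext?v=1
print(missing)  # None
```

<p>Returning <code>None</code> on timeout matches the <code>str | None</code> return type declared by <code>extract_transcript_url</code>, so the caller can skip the transcript for that video.</p>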
<p>Now, let's create the main handler that will navigate to the channel page, perform infinite scrolling, and extract links to videos.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># routes.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">default_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Handle requests that do not match any specific handler."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Get the limit from user_data, default to 10 if not set</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    limit </span><span class="token operator" 
style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">user_data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'limit'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">10</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> </span><span class="token builtin">isinstance</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">limit</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">int</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">raise</span><span class="token plain"> TypeError</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'Limit must be an integer'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" 
style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Wait for the page to load</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">locator</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'h1'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">first</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">wait_for</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">state</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'attached'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Check if there's a GDPR popup on the page requiring consent for cookie usage</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    cookies_button </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token 
punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">locator</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'button[aria-label*="Accept"]'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">first</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> cookies_button</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">is_visible</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> cookies_button</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">click</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Save cookies for later use with other sessions</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># You can learn more about `SOCS` cookies from - https://policies.google.com/technologies/cookies?hl=en-US</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        cookies_state </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">cookie </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> cookie </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">cookies</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> cookie</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'name'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'SOCS'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">       
 crawler_state </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">use_state</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        crawler_state</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'cookies'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> cookies_state</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Wait until at least one video loads</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">locator</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'a[href*="watch"]'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">first</span><span class="token 
punctuation" style="color:#393A34">.</span><span class="token plain">wait_for</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Create a background task for infinite scrolling</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    scroll_task</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">Task</span><span class="token punctuation" style="color:#393A34">[</span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">create_task</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">infinite_scroll</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Scroll the page to the end until we reach the limit or finish 
scrolling</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">while</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> scroll_task</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">done</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Extract links to videos</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        requests </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">extract_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            selector</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'a[href*="watch"]'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'video'</span><span class="token punctuation" 
style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            transform_request_function</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">request_domain_transform</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            strategy</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'same-domain'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Create a dictionary to avoid duplicates</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        requests_map </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">id</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> request </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> request </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> requests</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># If the limit is reached, cancel the scrolling task</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token builtin">len</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">requests_map</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">&gt;=</span><span class="token plain"> limit</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            scroll_task</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">cancel</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">break</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Switch the asynchronous context to allow other tasks to execute</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token 
plain">sleep</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">0.2</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">else</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># If the scroll task is done, we can safely assume that we have reached the end of the page</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        requests </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">extract_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            selector</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'a[href*="watch"]'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'video'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">            transform_request_function</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">request_domain_transform</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            strategy</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'same-domain'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        requests_map </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">id</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> request </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> request </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> requests</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    requests </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token builtin">list</span><span 
class="token punctuation" style="color:#393A34">(</span><span class="token plain">requests_map</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">values</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    requests </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> requests</span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">limit</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Add the requests to the queue</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">enqueue_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">requests</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">requests</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
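<p>The deduplicate-then-truncate step at the end of the handler can be tried in isolation. The sketch below is a minimal illustration, not Crawlee code: <code>FakeRequest</code> and <code>dedupe_and_limit</code> are hypothetical names, and the only assumption carried over from the handler is that each request exposes an <code>id</code> attribute.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeRequest:
    """Hypothetical stand-in for a crawler request; only `id` and `url` are modeled."""

    id: str
    url: str


def dedupe_and_limit(requests: list[FakeRequest], limit: int) -> list[FakeRequest]:
    """Keep one request per id, then return at most `limit` of them."""
    # Keying a dict by id collapses duplicates; dicts preserve insertion order,
    # so the surviving requests stay in the order their ids first appeared.
    requests_map = {request.id: request for request in requests}
    return list(requests_map.values())[:limit]


sample = [
    FakeRequest(id='a', url='https://www.youtube.com/watch?v=1'),
    FakeRequest(id='b', url='https://www.youtube.com/watch?v=2'),
    FakeRequest(id='a', url='https://www.youtube.com/watch?v=1'),
    FakeRequest(id='c', url='https://www.youtube.com/watch?v=3'),
]
unique_limited = dedupe_and_limit(sample, limit=2)
```

<p>Because the same link can be extracted on every pass of the scrolling loop, collapsing by <code>id</code> before comparing against <code>limit</code> keeps the count honest.</p>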
<p>Let's take a closer look at the parameters used in <a href="https://www.crawlee.dev/python/api/class/ExtractLinksFunction#Methods" target="_blank" rel="noopener noreferrer"><code>extract_links</code></a>:</p>
<ul>
<li class=""><code>selector</code> - CSS selector for extracting links to videos. We initially expected <code>id="video-title-link"</code> to work, but YouTube serves several page layouts with different markup, so the broader selector <code>a[href*="watch"]</code> is more universal.</li>
<li class=""><code>label</code> - label that tells the router which handler should process the video page.</li>
<li class=""><code>transform_request_function</code> - function to transform the request before adding it to the queue. We use it to replace the domain <code>consent.youtube</code> with <code>www.youtube</code>, which helps avoid errors when processing the video page.</li>
<li class=""><code>strategy</code> - strategy that decides which links are extracted. We use <code>same-domain</code> to extract links to any subdomain of <code>youtube.com</code>.</li>
</ul>
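<p>To make the domain replacement concrete, here is a minimal, standalone sketch of the URL rewrite on its own. The helper name <code>swap_consent_domain</code> is hypothetical and uses only the standard library; it is not the <code>request_domain_transform</code> function passed to <code>extract_links</code>, whose integration with Crawlee is not reproduced here.</p>

```python
from urllib.parse import urlsplit, urlunsplit


def swap_consent_domain(url: str) -> str:
    """Rewrite a consent.youtube.* URL so it points at www.youtube.* instead."""
    parts = urlsplit(url)
    host = parts.hostname or ''
    # Only touch URLs whose host starts with the consent subdomain;
    # everything else (path, query string, fragment) is left unchanged.
    if host.startswith('consent.youtube'):
        new_netloc = parts.netloc.replace('consent.youtube', 'www.youtube', 1)
        parts = parts._replace(netloc=new_netloc)
    return urlunsplit(parts)


rewritten = swap_consent_domain('https://consent.youtube.com/watch?v=abc')
untouched = swap_consent_domain('https://www.youtube.com/watch?v=abc')
```

<p>Rewriting only the host keeps the video ID in the query string intact, so the transformed request still points at the same video.</p>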
<p>Let's move on to the handler for video pages. In it, we'll extract video data and also look at how to get and process the video transcript link.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># routes.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'video'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">video_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token 
plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Handle video requests."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Processing video </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># extract video data from the page</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    video_data </span><span class="token operator" 
style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">evaluate</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'window.ytInitialPlayerResponse'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    main_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'url'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'title'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> video_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'videoDetails'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span 
class="token string" style="color:#e3116c">'title'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'description'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> video_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'videoDetails'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'shortDescription'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'channel'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> video_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'videoDetails'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'author'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'channel_id'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> video_data</span><span 
class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'videoDetails'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'channelId'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'video_id'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> video_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'videoDetails'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'videoId'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'duration'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> video_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'videoDetails'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'lengthSeconds'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'keywords'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> video_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'videoDetails'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'keywords'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'view_count'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> video_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'videoDetails'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'viewCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'like_count'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> video_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'microformat'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" 
style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'playerMicroformatRenderer'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'likeCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'is_shorts'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> video_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'microformat'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'playerMicroformatRenderer'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'isShortsEligible'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'publish_date'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> video_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'microformat'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" 
style="color:#e3116c">'playerMicroformatRenderer'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'publishDate'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Try to extract the transcript URL</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">try</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        transcript_url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">wait_for</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">extract_transcript_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> timeout</span><span 
class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">20</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">except</span><span class="token plain"> asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">TimeoutError</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        transcript_url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> transcript_url</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        transcript_url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">URL</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">transcript_url</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">without_query_params</span><span class="token punctuation" 
style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'fmt'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Found transcript URL: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">transcript_url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_requests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">Request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">transcript_url</span><span class="token 
punctuation" style="color:#393A34">,</span><span class="token plain"> label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'transcript'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> user_data</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'video_data'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> main_data</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">else</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main_data</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Note that to extract the video transcript, we need the link to the transcript file. When the link is found, we pass the video data to the next handler through <code>user_data</code> instead of saving it right away; otherwise, we save it straight to the <a href="https://www.crawlee.dev/python/api/class/Dataset" target="_blank" rel="noopener noreferrer"><code>Dataset</code></a>.</p>
<p>The final stage is processing the transcript. YouTube uses <a href="https://www.w3schools.com/xml/" target="_blank" rel="noopener noreferrer">XML</a> to transmit transcript data, so we need an XML parser, such as <a href="https://docs.python.org/3/library/xml.etree.elementtree.html" target="_blank" rel="noopener noreferrer"><code>xml.etree.ElementTree</code></a> from the Python standard library.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># routes.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'transcript'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">transcript_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span 
class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Handle transcript requests."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Processing transcript </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Get the main video data extracted in `video_handler`</span><span 
class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    video_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">user_data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'video_data'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">try</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Get XML data from the response</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        root </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ET</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">fromstring</span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" 
style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Extract text elements from XML</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        transcript_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">text_element</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">strip</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> text_element </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> root</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">findall</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'.//text'</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> text_element</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Enrich video data by adding the transcript</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        video_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'transcript'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'\n'</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">join</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">transcript_data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Save the data to Dataset</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" 
style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">video_data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">except</span><span class="token plain"> ET</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">ParseError</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">warning</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'Incorrect XML Response'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Save the video data without the transcript</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">video_data</span><span class="token
punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
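<p>To try the parsing step on its own, without running the crawler, here is a minimal sketch of the same logic as a plain function. The <code>SAMPLE_XML</code> payload and the <code>transcript_to_text</code> helper are assumptions made for illustration; the real transcript response may contain additional attributes or a different root element.</p>

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical payload that mimics a transcript made of <text> elements.
# The real response from YouTube may differ in structure and attributes.
SAMPLE_XML = (
    '<transcript>'
    '<text start="0.0" dur="1.5">Hello and welcome</text>'
    '<text start="1.5" dur="2.0">  to the video  </text>'
    '<text start="3.5" dur="1.0"></text>'
    '</transcript>'
)


def transcript_to_text(xml_payload: str) -> Optional[str]:
    """Join the text of all <text> elements, or return None for invalid XML."""
    try:
        root = ET.fromstring(xml_payload)
    except ET.ParseError:
        # Mirrors the handler: invalid XML means there is no transcript
        return None

    # Skip elements without text, as the handler does
    lines = [element.text.strip() for element in root.findall('.//text') if element.text]
    return '\n'.join(lines)


if __name__ == '__main__':
    print(transcript_to_text(SAMPLE_XML))
```

<p>Keeping the parsing in a small function like this makes it possible to check it against saved responses before wiring it into the <code>transcript</code> handler.</p>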
<p>After collecting the data, we need to save the results to a file. Just add the following code to the end of the <code>main</code> function in <code>main.py</code>:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># main.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Export the data from Dataset to JSON format</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">export_data_json</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'youtube.json'</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>To run the crawler, use the command:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">uv run python </span><span class="token parameter variable" style="color:#36acaa">-m</span><span class="token plain"> youtube_crawlee</span><br></div></code></pre></div></div>
<p>Example result record:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"url"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://www.youtube.com/watch?v=r-1J94tk5Fo"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Facebook Marketplace API - Scrape Data Based on LOCATION, CATEGORY and SEARCH"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"See how you can export Facebook Marketplace listings to Excel, CSV or JSON with the Facebook Marketplace API 🛍️&nbsp;Input one or more URLs to scrape price, description, images, delivery info, seller data, location, listing status, and much more 📊\n\nWith the Facebook Marketplace Downloader, you 
can:\n🛒 **Extract listings and seller details** from any public Marketplace category or search query.\n📷 **Scrape product details**, including images, prices, descriptions, locations, and timestamps.\n💰 **Get thousands of marketplace listings** quickly and efficiently.\n📦 **Export results** via API or in JSON, CSV, or Excel with all listing details.\n\n🛍️ Facebook Marketplace Search API 👉&nbsp;https://apify.it/3E5NLz4\n📱&nbsp;Explore other Facebook Scrapers 👉&nbsp;https://apify.it/43Bae1f\n\n*Why scrape Facebook Marketplace data?* 🤔\n💰 Price &amp; Demand Analysis – Track product pricing trends and demand fluctuations.\n📊 Competitor Insights – Monitor listings from competitors to adjust pricing and strategy.\n📍 Location-Based Market Trends – Identify popular products in specific regions.\n🔎 Product Availability Monitoring – Detect shortages or oversupply in certain categories.\n📈 Reselling Opportunities – Find underpriced items for profitable flips.\n🛍 Consumer Behavior Insights – Understand what products and features attract buyers.\n💡 Trend Spotting – Discover emerging products before they go mainstream.\n📝 Market Research – Gather data for academic, business, or personal research.\n\n*How to* scrape *facebook marketplace? 🧑‍🏫*&nbsp;\nStep 1. 
Find the Facebook Marketplace dataset tool on Apify Store\nStep 2: Click ‘Try for free’\nStep 3: Input a URL\nStep 4: Fine tune the input\nStep 5: Start the Actor and get your data!\n\n*Useful links 🧑‍💻*\n📚 Read more about Scraping Facebook data: https://apify.it/43wyth9\n🧑‍💻 Sign up for Apify: https://apify.it/42e8nNu\n🧩 Integrate the Actor with other tools: https://apify.it/43Ustiz\n📱 Browse other Social Media Scrapers on Apify Store: https://apify.it/4jhq7i8\n\n*Follow us 🤳*\nhttps://www.linkedin.com/company/apifytech\nhttps://twitter.com/apify\nhttps://www.tiktok.com/@apifytech\nhttps://discord.com/invite/jyEM2PRvMU\n\n*Timestamps ⌛️*\n00:00 Introduction\n01:27 Input\n02:17 Run\n02:26 Export\n02:41 Scheduling\n02:54 Integrations\n03:00 API\n03:13 Other Meta Scrapers\n03:26 Like and subscribe!\n\n#webscraping #instagram"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"channel"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Apify"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"channel_id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"UCTgwcoeGGKmZ3zzCXN2qo_A"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"video_id"</span><span class="token operator" 
style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"r-1J94tk5Fo"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"duration"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"226"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"keywords"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"web scraping platform"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"web automation"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"scrapers"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Apify"</span><span class="token punctuation" style="color:#393A34">,</span><span 
class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"web crawling"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"web scraping"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"data extraction"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"best web scraping tool"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"API"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"how to extract data from any website"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"web scraping tutorial"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" 
style="color:#e3116c">"web scrape"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"data collection tool"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"RPA"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"web integration"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"how to turn website into API"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"JSON"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"python web scraping"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"web scraping python"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"web api integration"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"how to turn website into api"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"scraping"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"apify"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"data extraction tools"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"how to web scrape"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"web scraping javascript"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"web scraping tool"</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"view_count"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"765"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"like_count"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"8"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"is_shorts"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">false</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"publish_date"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"2025-04-03T05:33:18-07:00"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token 
plain">  </span><span class="token property" style="color:#36acaa">"transcript"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Hi, Theo here. In this video, I’ll&nbsp;\nshow you how to scrape structured\ndata from Facebook Marketplace by location,&nbsp;\ncategory, or specific search query. You’ll\nbe able to extract listing details like price,&nbsp;\ndescription, images, delivery info, seller data,\nlocation, and listing status — using a&nbsp;\ntool called Facebook Marketplace Scraper.\nHere’s what you can do with it.&nbsp;\nIf you&amp;#39;re reselling, flipping,\nor deal hunting, scraping helps you track&nbsp;\nprices, spot trends, and catch underpriced\nor free items early. Looking for a rental&nbsp;\nor house? Compare listings across cities,\ncheck historical prices, and avoid wasting&nbsp;\ntime on overpriced options. Selling on\nMarketplace? Analyze top-performing listings,&nbsp;\noptimize keywords, and price competitively.\nFor businesses, scraping&nbsp;\nenables competitor tracking,\ndynamic pricing, real estate&nbsp;\nresearch, fraud detection,\nand brand protection — like spotting counterfeit&nbsp;\nor unauthorized listings before they do damage.\nThe best part is you don’t need to&nbsp;\njump through hoops to get this data:\nFacebook Marketplace Scraper makes things simple:&nbsp;\nno login, no cookies, no browser extension.\nIt runs in the cloud, and you can export&nbsp;\nresults in JSON, CSV, Excel — or use the API.\nLet’s see how it works.\nFirst, head to the link in the description,&nbsp;\nwhich’ll take you to Facebook Marketplace\nScraper’s README. Click on `try&nbsp;\nfor free`, which will send you to\nthe `Login page` and you can get started&nbsp;\nwith a free Apify account - don’t worry,\nthere’s no limit on the free plan and&nbsp;\nno credit card will ever be required.\nAfter logging in, you’ll land on the Actor’s&nbsp;\ninput page. 
While you can configure this through\neither the intuitive UI or JSON, we’ll&nbsp;\nstick with the UI option to keep it easy.\nFor scraping Facebook Marketplace, you’re gonna&nbsp;\nneed the URL from Facebook. You can use a URL of\na search term, location or an item category. For&nbsp;\nthis tutorial, we’re gonna go with an iPhone. So\nlet’s open up Facebook Marketplace, input a search&nbsp;\nterm and then copy the URL from the toolbar and\npaste it in the input. You can add more via the&nbsp;\nadd button, edit them in bulk or import the URLs\nas a text file. Next, you can limit how many&nbsp;\nposts you want to scrape. And that’s it.\nBefore running your Actor, it’s a great idea&nbsp;\nto save your configuration and create a task.\nThis will come in handy for scheduling or&nbsp;\nintegrating your Actor with other tools,\nor if you plan to work with&nbsp;\nmultiple configurations.\nNow that we have the `input`, let’s run&nbsp;\nthe Actor by hitting START. You can watch\nyour results appear in Overview or switch to&nbsp;\nthe Log tab to see more details about run.\nNow that your run is finished, we can get the&nbsp;\ndata via the Export button. You can choose your\npreffered format, and select which fields you want&nbsp;\nto include or exclude in your dataset. Then just\nhit Download and you have your dataset file. Let&nbsp;\nme show you what this looks like in JSON format.\nIf you want to automate your workflow&nbsp;\neven more, you can schedule your Facebook\nMarketplace Scraper to run at regular intervals.&nbsp;\nChoose your task and hit schedule. You can set\nthe frequency of how often you want to run&nbsp;\nthe Actor. You can even connect your Actor\nto other cloud services, such as Google&nbsp;\nDrive, Make, or any other Apify Actor.\nYou can also run this scraper locally via&nbsp;\nAPI. You can find the code in Node.js,\nPython, or curl in the API&nbsp;\ndrop down menu in the top-right\ncorner. 
To learn more about retrieving data&nbsp;\nprogramatically, check out our video on it.\nNeed more Facebook or Instagram data?&nbsp;\nCheck out our other scrapers in Apify\nStore. We have got dozens of meta&nbsp;\nscrapers, links are in the description.\nIf you prefer video tutorials, we have a playlist&nbsp;\ncovering different Instagram scraping use cases.\nAnd that’s all for today! Let us know what you&nbsp;\nthink about the Facebok Marketplace Scraper.\nRemember, if you come across any issues, make&nbsp;\nsure to report them to our team in Apify Console.\nIf you found this helpful, give us a thumbs&nbsp;\nup and subscribe. Don&amp;#39;t forget to hit the\nbell to stay updated on new tutorials. Thanks for&nbsp;\nwatching! So long, and thanks for all the likes"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="5-enhancing-the-scraper-capabilities">5. Enhancing the scraper capabilities<a href="https://crawlee.dev/blog/scrape-youtube-python#5-enhancing-the-scraper-capabilities" class="hash-link" aria-label="Direct link to 5. Enhancing the scraper capabilities" title="Direct link to 5. Enhancing the scraper capabilities" translate="no">​</a></h2>
<p>As with any project that targets a large site like YouTube, you may run into issues, such as blocking, that need to be resolved.
The Crawlee for Python documentation contains many guides and examples to help you with this.</p>
<ul>
<li class="">Use <a href="https://camoufox.com/" target="_blank" rel="noopener noreferrer"><code>Camoufox</code></a>, a project compatible with Playwright, which allows you to get a browser configuration that's more resistant to blocking, and you can easily <a href="https://www.crawlee.dev/python/docs/examples/playwright-crawler-with-camoufox" target="_blank" rel="noopener noreferrer">integrate it with Crawlee for Python</a>.</li>
<li class="">Improve error handling and logging for unusual cases so you can easily debug and maintain the project; the guide on <a href="https://www.crawlee.dev/python/docs/guides/error-handling" target="_blank" rel="noopener noreferrer">error handling</a> is a good place to start.</li>
<li class="">Add proxy support to avoid blocks from YouTube. You can use <a href="https://apify.com/proxy" target="_blank" rel="noopener noreferrer">Apify Proxy</a> and <a href="https://www.crawlee.dev/python/api/class/ProxyConfiguration" target="_blank" rel="noopener noreferrer"><code>ProxyConfiguration</code></a>; you can learn more in this guide in the <a href="https://www.crawlee.dev/python/docs/guides/proxy-management#proxy-configuration" target="_blank" rel="noopener noreferrer">documentation</a>.</li>
<li class="">Make your crawler a web service that crawls pages by user request, using <a href="https://fastapi.tiangolo.com/" target="_blank" rel="noopener noreferrer">FastAPI</a> and following this <a href="https://www.crawlee.dev/python/docs/guides/running-in-web-server" target="_blank" rel="noopener noreferrer">guide</a>.</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="6-creating-youtube-actor-on-the-apify-platform">6. Creating YouTube Actor on the Apify platform<a href="https://crawlee.dev/blog/scrape-youtube-python#6-creating-youtube-actor-on-the-apify-platform" class="hash-link" aria-label="Direct link to 6. Creating YouTube Actor on the Apify platform" title="Direct link to 6. Creating YouTube Actor on the Apify platform" translate="no">​</a></h2>
<p>For deployment, we'll use the <a href="https://apify.com/" target="_blank" rel="noopener noreferrer">Apify platform</a>, a simple and effective environment for running crawlers in the cloud. Once deployed, you can call your crawler via <a href="https://docs.apify.com/api/v2/" target="_blank" rel="noopener noreferrer">API</a>, <a href="https://docs.apify.com/platform/schedules" target="_blank" rel="noopener noreferrer">schedule tasks</a>, <a href="https://docs.apify.com/platform/integrations" target="_blank" rel="noopener noreferrer">integrate</a> with various services, and much more.</p>
<p>To deploy to the Apify platform, we need to adapt our project for the <a href="https://apify.com/actors" target="_blank" rel="noopener noreferrer">Apify Actor</a> structure.</p>
<p>Create a <code>.actor</code> folder with the necessary files.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token function" style="color:#d73a49">mkdir</span><span class="token plain"> .actor </span><span class="token operator" style="color:#393A34">&amp;&amp;</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">touch</span><span class="token plain"> .actor/</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain">actor.json,input_schema.json</span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<p>Move the <code>Dockerfile</code> from the root folder to <code>.actor</code>.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token function" style="color:#d73a49">mv</span><span class="token plain"> Dockerfile .actor</span><br></div></code></pre></div></div>
<p>Let's fill in the empty files:</p>
<p>The <code>actor.json</code> file contains project metadata for the Apify platform. Follow the <a href="https://docs.apify.com/platform/actors/development/actor-definition/actor-json" target="_blank" rel="noopener noreferrer">documentation for proper configuration</a>:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"actorSpecification"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"name"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"YouTube-Crawlee"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"YouTube - Crawlee"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"minMemoryMbytes"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token 
number" style="color:#36acaa">2048</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Scrape video stats, metadata and transcripts from videos in YouTube channels"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"version"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"0.1"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"meta"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"templateId"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"youtube-crawlee"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"input"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"./input_schema.json"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"dockerfile"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"./Dockerfile"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
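<p>Before deploying, it can help to confirm that <code>actor.json</code> points at files that actually exist. The sketch below is a minimal, stdlib-only illustration. The helper name <code>check_actor_config</code> is hypothetical and not part of the Apify SDK or CLI. It assumes, as in the layout above, that the <code>input</code> and <code>dockerfile</code> paths are relative to the <code>.actor</code> folder, and it checks only those two keys.</p>

```python
import json
from pathlib import Path


def check_actor_config(actor_dir: Path) -> list[str]:
    """Return a list of problems found in actor_dir/actor.json (hypothetical helper)."""
    problems = []
    config = json.loads((actor_dir / "actor.json").read_text(encoding="utf-8"))

    # Assumption: paths in actor.json are relative to the .actor folder,
    # which matches the "./Dockerfile" and "./input_schema.json" values above.
    for key in ("input", "dockerfile"):
        value = config.get(key)
        if value is None:
            problems.append(f"missing key: {key}")
        elif isinstance(value, str) and not (actor_dir / value).is_file():
            problems.append(f"{key} points to a missing file: {value}")
    return problems


if __name__ == "__main__":
    actor_dir = Path(".actor")
    if (actor_dir / "actor.json").is_file():
        print(check_actor_config(actor_dir) or "actor.json looks consistent")
```

<p>Run it from the project root after moving the <code>Dockerfile</code>; an empty result means both referenced files were found.</p>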
<p>Actor input parameters are defined in <code>input_schema.json</code>, whose format is described in the <a href="https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1" target="_blank" rel="noopener noreferrer">input schema specification</a>.</p>
<p>Let's define input parameters for our crawler:</p>
<ul>
<li class=""><code>maxItems</code> - the maximum number of videos to scrape per channel.</li>
<li class=""><code>channelNames</code> - the names of the YouTube channels to scrape.</li>
<li class=""><code>proxySettings</code> - proxy settings; without a proxy, you'll be using the datacenter IP that Apify uses.</li>
</ul>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"YouTube Crawlee"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"object"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"schemaVersion"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"properties"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" 
style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token property" style="color:#36acaa">"channelNames"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"List Channel Names"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"array"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Channel names for extraction video stats, metadata and transcripts."</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"editor"</span><span class="token operator" style="color:#393A34">:</span><span class="token 
plain"> </span><span class="token string" style="color:#e3116c">"stringList"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"prefill"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"Apify"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token property" style="color:#36acaa">"maxItems"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"integer"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"editor"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">"number"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Limit search results"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Limits the maximum number of results, applies to each search separately."</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"default"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">10</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token property" style="color:#36acaa">"proxySettings"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span 
class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Proxy configuration"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"object"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Select proxies to be used by your scraper."</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"prefill"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> </span><span class="token property" style="color:#36acaa">"useApifyProxy"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">true</span><span class="token plain"> </span><span class="token punctuation" 
style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"editor"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"proxy"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"required"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"channelNames"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
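<p>To make the schema's effect concrete, here is a rough sketch of how its <code>required</code> list and <code>default</code> values could be applied to a raw input dictionary. The Apify platform performs its own validation; the <code>normalize_input</code> helper and the trimmed <code>SCHEMA</code> constant are illustrative assumptions, not SDK code.</p>

```python
from typing import Any

# Trimmed copy of the schema above: only the keys this sketch needs.
SCHEMA: dict[str, Any] = {
    "properties": {
        "channelNames": {"type": "array"},
        "maxItems": {"type": "integer", "default": 10},
        "proxySettings": {"type": "object"},
    },
    "required": ["channelNames"],
}

# Map schema type names to the Python types accepted for them.
PYTHON_TYPES = {"array": list, "integer": int, "object": dict}


def normalize_input(raw: dict[str, Any], schema: dict[str, Any] = SCHEMA) -> dict[str, Any]:
    """Check required fields and types, then fill in schema defaults (hypothetical helper)."""
    missing = [name for name in schema["required"] if name not in raw]
    if missing:
        raise ValueError(f"missing required input: {', '.join(missing)}")

    result = {}
    for name, spec in schema["properties"].items():
        if name in raw:
            if not isinstance(raw[name], PYTHON_TYPES[spec["type"]]):
                raise TypeError(f"{name} must be of type {spec['type']}")
            result[name] = raw[name]
        elif "default" in spec:
            # Fall back to the schema default when the caller omitted the field.
            result[name] = spec["default"]
    return result


if __name__ == "__main__":
    print(normalize_input({"channelNames": ["Apify"]}))
```

<p>With only <code>channelNames</code> supplied, the sketch fills in <code>maxItems</code> from the schema default and leaves <code>proxySettings</code> unset, since the schema defines no default for it.</p>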
<p>Let's update the code to accept input parameters.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># main.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""The crawler entry point."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> Actor</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Get the input parameters from the Actor (fall back to an empty dict if there is no input)</span><span class="token plain"></span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token plain">        actor_input </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_input</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">or</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token plain">        max_items </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> actor_input</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'maxItems'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token plain">        channels </span><span class="token operator" 
style="color:#393A34">=</span><span class="token plain"> actor_input</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'channelNames'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token plain">        proxy </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">create_proxy_configuration</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">actor_proxy_input</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">actor_input</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'proxySettings'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> 
PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            concurrency_settings</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">max_tasks_per_minute</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">50</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            request_handler</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">router</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            request_handler_timeout</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">timedelta</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">seconds</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">120</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            headless</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">            max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">100</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            proxy_configuration</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">proxy</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
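<p>As a minimal sketch of the input handling above, the defaults can be applied in one small, testable function. The helper name <code>parse_actor_input</code> and the returned keys are hypothetical and not part of the Apify SDK. The sketch assumes that the raw input may be <code>None</code> when a run is started without input, and it treats <code>maxItems</code> of <code>0</code> as "no limit".</p>

```python
from typing import Any, Optional


def parse_actor_input(raw: Optional[dict]) -> dict:
    """Fill in defaults for the Actor input (hypothetical helper, not an SDK call)."""
    # Assumption: the raw input can be None when the run has no input at all.
    actor_input: dict = raw or {}

    # Never return a negative limit; 0 is treated as "no limit" in this sketch.
    max_items = max(int(actor_input.get('maxItems') or 0), 0)

    # Copy the list so later changes do not mutate the original input.
    channels = list(actor_input.get('channelNames') or [])

    # Passed through unchanged; None means "no proxy settings provided".
    proxy_settings: Any = actor_input.get('proxySettings')

    return {
        'max_items': max_items,
        'channels': channels,
        'proxy_settings': proxy_settings,
    }
```

<p>Keeping this logic outside <code>main</code> makes it possible to check the defaults locally without starting an Actor run.</p>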
<p>Also remove the JSON export from the <code>main</code> function, since the Apify platform stores the results in the <a href="https://docs.apify.com/platform/storage/dataset" target="_blank" rel="noopener noreferrer">Dataset</a>.</p>
<p>That's it, the project is ready for deployment.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="7-deploying-to-apify">7. Deploying to Apify<a href="https://crawlee.dev/blog/scrape-youtube-python#7-deploying-to-apify" class="hash-link" aria-label="Direct link to 7. Deploying to Apify" title="Direct link to 7. Deploying to Apify" translate="no">​</a></h2>
<p>Use the official <a href="https://docs.apify.com/cli/" target="_blank" rel="noopener noreferrer">Apify CLI</a> to upload your code.</p>
<p>First, authenticate using your API token from <a href="https://console.apify.com/settings/integrations" target="_blank" rel="noopener noreferrer">Apify Console</a>:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">apify login</span><br></div></code></pre></div></div>
<p>Choose "Enter API token manually" and paste your token.</p>
<p>Push the project to the platform:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">apify push</span><br></div></code></pre></div></div>
<p>Now you can configure runs on the Apify platform.</p>
<p>Let's perform a test run. Start by filling in the input parameters:</p>
<p><img decoding="async" loading="lazy" alt="Actor Input" src="https://crawlee.dev/assets/images/input_actor-6bdab40eb022bcb34ad63da770f4dcea.webp" width="847" height="777" class="img_ev3q"></p>
<p>View results in the dataset:</p>
<p><img decoding="async" loading="lazy" alt="Dataset Results" src="https://crawlee.dev/assets/images/actor_results-36a0c08c154c59a9fb3887222c5926f2.webp" width="1665" height="851" class="img_ev3q"></p>
<p>If you want to make your Actor public and provide access to other users, potentially to earn income from it, follow this <a href="https://docs.apify.com/platform/actors/publishing" target="_blank" rel="noopener noreferrer">publishing guide</a> for <a href="https://apify.com/store" target="_blank" rel="noopener noreferrer">Apify Store</a>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conclusion">Conclusion<a href="https://crawlee.dev/blog/scrape-youtube-python#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>We've built a solid foundation for crawling YouTube with Crawlee for Python and Playwright. If you're new to crawling, this is a good project for learning and practice, and you can use it as a basis for more complex crawlers that collect data from YouTube. If this is your first project with Crawlee for Python, check out the documentation links throughout this article; they will help you understand how Crawlee for Python works and how to use it in your own projects.</p>
<p>You can find the complete code in the <a href="https://github.com/Mantisus/youtube-crawlee" target="_blank" rel="noopener noreferrer">repository</a>.</p>
<p>If you enjoyed this blog, feel free to support Crawlee for Python by starring the <a href="https://github.com/apify/crawlee-python" target="_blank" rel="noopener noreferrer">repository</a> or joining the maintainer team.</p>
<p>Do you have questions or want to discuss the details of the implementation? Join our <a href="https://discord.com/invite/jyEM2PRvMU" target="_blank" rel="noopener noreferrer">Discord</a>—our community of 11,000+ developers is there to help.</p>]]></content:encoded>
            <category>community</category>
        </item>
        <item>
            <title><![CDATA[How to scrape TikTok using Python]]></title>
            <link>https://crawlee.dev/blog/scrape-tiktok-python</link>
            <guid>https://crawlee.dev/blog/scrape-tiktok-python</guid>
            <pubDate>Fri, 25 Apr 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to scrape TikTok using Crawlee for Python]]></description>
            <content:encoded><![CDATA[<p><a href="https://www.tiktok.com/" target="_blank" rel="noopener noreferrer">TikTok</a> users generate tons of data that are valuable for analysis.</p>
<p>Which hashtags are trending now? What is an influencer's engagement rate? What topics are important for a content creator? You can answer these and many other questions by analyzing TikTok data. For analysis, however, you first need to extract the data in a convenient format. In this blog post, we'll explore how to scrape TikTok using <a href="https://github.com/apify/crawlee-python" target="_blank" rel="noopener noreferrer">Crawlee for Python</a>.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>One of our community members wrote this blog as a contribution to the Crawlee Blog. If you'd like to contribute articles like these, please reach out to us on our <a href="https://apify.com/discord" target="_blank" rel="noopener noreferrer">Discord channel</a>.</p></div></div>
<p><img decoding="async" loading="lazy" alt="How to scrape TikTok using Python" src="https://crawlee.dev/assets/images/main_image-94d608c24b2e8970cac1d9040b8290a5.webp" width="1536" height="864" class="img_ev3q"></p>
<p>Key steps we'll cover:</p>
<ol>
<li class=""><a href="https://www.crawlee.dev/blog/scrape-tiktok-python#1-project-setup" target="_blank" rel="noopener noreferrer">Project setup</a></li>
<li class=""><a href="https://www.crawlee.dev/blog/scrape-tiktok-python#2-analyzing-tiktok-and-determining-a-scraping-strategy" target="_blank" rel="noopener noreferrer">Analyzing TikTok and determining a scraping strategy</a></li>
<li class=""><a href="https://www.crawlee.dev/blog/scrape-tiktok-python#3-configuring-crawlee" target="_blank" rel="noopener noreferrer">Configuring Crawlee</a></li>
<li class=""><a href="https://www.crawlee.dev/blog/scrape-tiktok-python#4-extracting-tiktok-data" target="_blank" rel="noopener noreferrer">Extracting TikTok data</a></li>
<li class=""><a href="https://www.crawlee.dev/blog/scrape-tiktok-python#5-creating-tiktok-actor-on-apify-platform" target="_blank" rel="noopener noreferrer">Creating TikTok Actor on the Apify platform</a></li>
<li class=""><a href="https://www.crawlee.dev/blog/scrape-tiktok-python#6-deploying-to-apify" target="_blank" rel="noopener noreferrer">Deploying to Apify</a></li>
</ol>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="prerequisites">Prerequisites<a href="https://crawlee.dev/blog/scrape-tiktok-python#prerequisites" class="hash-link" aria-label="Direct link to Prerequisites" title="Direct link to Prerequisites" translate="no">​</a></h2>
<ul>
<li class="">Python 3.9 or higher</li>
<li class="">Familiarity with web scraping concepts</li>
<li class="">Crawlee for Python <code>v0.6.0</code> or higher</li>
<li class=""><a href="https://docs.astral.sh/uv/" target="_blank" rel="noopener noreferrer">uv</a> <code>v0.6</code> or higher</li>
<li class="">An Apify account</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-project-setup">1. Project setup<a href="https://crawlee.dev/blog/scrape-tiktok-python#1-project-setup" class="hash-link" aria-label="Direct link to 1. Project setup" title="Direct link to 1. Project setup" translate="no">​</a></h2>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>Before going ahead with the project, I'd like to ask you to star Crawlee for Python on <a href="https://github.com/apify/crawlee-python/" target="_blank" rel="noopener noreferrer">GitHub</a>. It helps us spread the word to fellow scraper developers.</p></div></div>
<p>In this project, we'll use uv, a fast, modern package manager written in Rust, both to manage packages and to install a specific Python version.</p>
<p>If you don't have uv installed yet, just follow the <a href="https://docs.astral.sh/uv/getting-started/installation/" target="_blank" rel="noopener noreferrer">guide</a> or use this command:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token function" style="color:#d73a49">curl</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-LsSf</span><span class="token plain"> https://astral.sh/uv/install.sh </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">sh</span><br></div></code></pre></div></div>
<p>To create the project, run:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">uvx </span><span class="token string" style="color:#e3116c">'crawlee[cli]'</span><span class="token plain"> create tiktok-crawlee</span><br></div></code></pre></div></div>
<p>In the <code>cli</code> menu that opens, select:</p>
<ol>
<li class=""><code>Playwright</code> as the crawler type</li>
<li class=""><code>Httpx</code> as the HTTP client</li>
<li class=""><code>uv</code> as the package manager</li>
<li class="">Leave the default start URL, <code>https://crawlee.dev</code></li>
<li class=""><code>y</code> to enable Apify integration</li>
</ol>
<p>Or, just run the command:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">uvx </span><span class="token string" style="color:#e3116c">'crawlee[cli]'</span><span class="token plain"> create tiktok-crawlee --crawler-type playwright --http-client httpx --package-manager uv </span><span class="token parameter variable" style="color:#36acaa">--apify</span><span class="token plain"> --start-url </span><span class="token string" style="color:#e3116c">'https://crawlee.dev'</span><br></div></code></pre></div></div>
<p>Creating the project may take a few minutes. After installation is complete, navigate to the project folder:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token builtin class-name">cd</span><span class="token plain"> tiktok-crawlee</span><br></div></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-analyzing-tiktok-and-determining-a-scraping-strategy">2. Analyzing TikTok and determining a scraping strategy<a href="https://crawlee.dev/blog/scrape-tiktok-python#2-analyzing-tiktok-and-determining-a-scraping-strategy" class="hash-link" aria-label="Direct link to 2. Analyzing TikTok and determining a scraping strategy" title="Direct link to 2. Analyzing TikTok and determining a scraping strategy" translate="no">​</a></h2>
<p>TikTok relies heavily on JavaScript, both to display content and to analyze user behavior, which includes detecting and blocking crawlers. For this reason, we'll crawl TikTok with a headless browser driven by <a href="https://playwright.dev/python/" target="_blank" rel="noopener noreferrer">Playwright</a>.</p>
<p>To load new elements on a user's page, TikTok uses infinite scrolling. You may already be familiar with this method from this <a href="https://www.crawlee.dev/blog/infinite-scroll-using-python" target="_blank" rel="noopener noreferrer">article</a>.</p>
<p>Let's look at what happens under the hood when we scroll a TikTok page. I recommend studying network activity in <a href="https://developer.chrome.com/docs/devtools" target="_blank" rel="noopener noreferrer">DevTools</a> to understand what requests are going to the server.</p>
<p><img decoding="async" loading="lazy" alt="Backend Network" src="https://crawlee.dev/assets/images/load_elems-b739afc4d1d682c6fa2944275e1f8a9f.webp" width="1916" height="1108" class="img_ev3q"></p>
<p>Let's examine the HTML structure to see how hard it will be to locate the elements we need.</p>
<p><img decoding="async" loading="lazy" alt="Selectors" src="https://crawlee.dev/assets/images/selectors-80c3c3aa2697ef3c0f8b2422e7367d65.webp" width="1919" height="1108" class="img_ev3q"></p>
<p>This looks quite simple. With <a href="https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors" target="_blank" rel="noopener noreferrer">CSS selectors</a>, <code>[data-e2e="user-post-item"] a</code> is enough to select the links to a user's videos.</p>
<p>Let's look at what a video page response looks like to see what data we can extract.</p>
<p><img decoding="async" loading="lazy" alt="Video Response" src="https://crawlee.dev/assets/images/html_response-4344e00324cd04aa52a5a8b257d48eaf.webp" width="1916" height="989" class="img_ev3q"></p>
<p>It seems that the HTML code contains JSON with all the data we're interested in. Great!</p>
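<p>To illustrate the idea of reading JSON embedded in a page, here is a minimal sketch that uses only the Python standard library. It assumes the data sits in a <code>&lt;script type="application/json"&gt;</code> element; the class name, the function name, and the optional <code>script_id</code> filter are hypothetical, and the actual element used by TikTok should be confirmed in DevTools.</p>

```python
import json
from html.parser import HTMLParser
from typing import Any, Optional


class EmbeddedJsonParser(HTMLParser):
    """Collect the parsed content of JSON script elements (illustrative sketch)."""

    def __init__(self, script_id: Optional[str] = None) -> None:
        super().__init__()
        self._script_id = script_id  # hypothetical filter on the element id
        self._inside = False
        self._chunks: list = []
        self.payloads: list = []

    def handle_starttag(self, tag: str, attrs: list) -> None:
        attributes = dict(attrs)
        if tag != 'script' or attributes.get('type') != 'application/json':
            return
        if self._script_id is None or attributes.get('id') == self._script_id:
            self._inside = True
            self._chunks = []

    def handle_data(self, data: str) -> None:
        # Script content may arrive in several chunks, so accumulate it.
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag: str) -> None:
        if tag == 'script' and self._inside:
            self._inside = False
            self.payloads.append(json.loads(''.join(self._chunks)))


def extract_embedded_json(html: str, script_id: Optional[str] = None) -> list:
    """Return every JSON payload found in matching script elements."""
    parser = EmbeddedJsonParser(script_id)
    parser.feed(html)
    parser.close()
    return parser.payloads
```

<p>In a Playwright-based crawler, the page HTML could be passed to such a function, although reading the element's text directly through the browser is an equally reasonable choice.</p>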
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-configuring-crawlee">3. Configuring Crawlee<a href="https://crawlee.dev/blog/scrape-tiktok-python#3-configuring-crawlee" class="hash-link" aria-label="Direct link to 3. Configuring Crawlee" title="Direct link to 3. Configuring Crawlee" translate="no">​</a></h2>
<p>Now that we understand our scraping strategy, let's set up Crawlee for scraping TikTok.</p>
<p>Since pages have infinite scrolling, we need to limit the number of elements we want to get. For this, we'll add a <code>max_items</code> parameter that will limit the maximum number of elements for each search and pass it in <code>user_data</code> when forming a <a href="https://www.crawlee.dev/python/api/class/Request" target="_blank" rel="noopener noreferrer">Request</a>.</p>
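<p>The limiting logic can be sketched without any crawler at all. In the example below, <code>SketchRequest</code> is a hypothetical stand-in for a request that carries <code>user_data</code>, the <code>limit</code> key is an assumed name, and each batch models the links that become visible after one scroll. A limit of <code>0</code> is treated as "no limit", which is an assumption of this sketch.</p>

```python
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class SketchRequest:
    """Hypothetical stand-in for a crawler request that carries user data."""

    url: str
    user_data: dict = field(default_factory=dict)


def collect_until_limit(request: SketchRequest, batches: Iterable[list]) -> list:
    """Collect unique links batch by batch and stop once the limit is reached."""
    limit = request.user_data.get('limit', 0)  # assumed key; 0 means no limit
    collected: list = []
    for batch in batches:  # one batch per simulated scroll
        for link in batch:
            if link not in collected:
                collected.append(link)
            if limit and len(collected) >= limit:
                return collected
    return collected
```

<p>Because the limit travels with the request, every start URL can carry its own value and the handler needs no global state.</p>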
<p>We'll limit the intensity of scraping by setting <code>max_tasks_per_minute</code> in <a href="https://www.crawlee.dev/python/api/class/ConcurrencySettings" target="_blank" rel="noopener noreferrer"><code>ConcurrencySettings</code></a>. This will help us reduce the likelihood of being blocked by TikTok.</p>
<p>We'll set <code>browser_type</code> to <code>firefox</code>, as it performed better for TikTok in my tests.
TikTok may request permissions to access device data, so we'll explicitly limit all <a href="https://playwright.dev/python/docs/api/class-browser#browser-new-context-option-permissions" target="_blank" rel="noopener noreferrer">permissions</a> by passing the appropriate parameter to <code>browser_new_context_options</code>.</p>
<p>Scrolling pages can take a long time, so we should increase the time limit for processing a single request using <code>request_handler_timeout</code>.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># main.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> timedelta</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> apify </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Actor</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> Request</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" 
style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> PlaywrightCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">routes </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> router</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""The crawler entry point."""</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># When creating the template, we confirmed Apify integration.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># However, this isn't important for us at this stage.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> Actor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_items </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">20</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Create a crawler with the necessary settings</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Limit scraping intensity by setting a limit on requests per minute</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            concurrency_settings</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">max_tasks_per_minute</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">50</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># We'll configure the `router` in the next step</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            request_handler</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">router</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># You can use `False` during development. 
But for production, it's always `True`</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            headless</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">100</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Increase the timeout for the request handling pipeline</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            request_handler_timeout</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">timedelta</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">seconds</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">120</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            browser_type</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'firefox'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Limit any permissions to device data</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            browser_new_context_options</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'permissions'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Run the crawler to collect data from several user pages</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" 
style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                Request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'https://www.tiktok.com/@apifyoffice'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> user_data</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'limit'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> max_items</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                Request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'https://www.tiktok.com/@authorbrandonsanderson'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> user_data</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'limit'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> max_items</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" 
style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Someone might ask, "What about configuration to avoid fingerprint-based blocking?" My answer is, "Crawlee for Python has already done that for you."</p>
<p>Depending on your deployment environment, you may need to add a proxy. We'll come back to this in the last section.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="4-extracting-tiktok-data">4. Extracting TikTok data<a href="https://crawlee.dev/blog/scrape-tiktok-python#4-extracting-tiktok-data" class="hash-link" aria-label="Direct link to 4. Extracting TikTok data" title="Direct link to 4. Extracting TikTok data" translate="no">​</a></h2>
<p>After configuration, let's move on to navigation and data extraction.</p>
<p>For infinite scrolling, we'll use the built-in helper function <a href="https://www.crawlee.dev/python/api/class/PlaywrightCrawlingContext#infinite_scroll" target="_blank" rel="noopener noreferrer"><code>infinite_scroll</code></a>. Instead of waiting for scrolling to complete, which in some cases can take a really long time, we'll use Python's <code>asyncio</code> capabilities to run it as a background task.</p>
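<p>To make the background-task idea concrete before wiring it into the crawler, here is a minimal sketch that uses only <code>asyncio</code>. The names <code>slow_scroll</code> and <code>collect_until</code> are hypothetical stand-ins invented for this illustration: the first imitates a long-running scroll that "loads" one item per tick, and the second polls the loaded items and cancels the scroll once a limit is reached. It is not Crawlee code, only the pattern in isolation.</p>

```python
import asyncio


async def slow_scroll(loaded: list[int], total: int) -> None:
    """Hypothetical stand-in for a long-running scroll: 'loads' one item per tick."""
    for item in range(total):
        await asyncio.sleep(0.01)
        loaded.append(item)


async def collect_until(limit: int, total: int) -> list[int]:
    """Run the scroll in the background and stop once `limit` items are loaded."""
    loaded: list[int] = []
    scroll_task = asyncio.create_task(slow_scroll(loaded, total))

    # Poll while the background task is still running
    while not scroll_task.done():
        if len(loaded) >= limit:
            # Enough items collected: stop scrolling early
            scroll_task.cancel()
            break
        # Yield control so the background task can make progress
        await asyncio.sleep(0.005)

    # Wait for the task to finish or for the cancellation to be processed
    await asyncio.gather(scroll_task, return_exceptions=True)
    return loaded[:limit]


if __name__ == '__main__':
    print(asyncio.run(collect_until(limit=3, total=100)))
```

<p>The short <code>asyncio.sleep</code> inside the loop is what hands control back to the event loop; without it, the background task would never get a chance to run between checks.</p>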
<p>On closer investigation, you may also encounter a TikTok page that doesn't load the user's videos and shows only an error message and a button.</p>
<p><img decoding="async" loading="lazy" alt="Error Page" src="https://crawlee.dev/assets/images/went_wrong-413878d9f5a4331add12544c0a25ccd7.webp" width="991" height="1069" class="img_ev3q"></p>
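<p>It's very important to handle this case. In the handler, we'll wait for either a video or the button to appear, and click the button if it's there to initiate video loading.</p>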
<p>During testing, I also discovered that you need to interact with the page by scrolling (for example, by pressing <code>PageDown</code>) before using <code>infinite_scroll</code>; otherwise, new elements don't load. I think this is a TikTok bug.</p>
<p>Let's start with a simple function to extract video links. It will help avoid code duplication.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># routes.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> json</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> playwright</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">async_api </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Page</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Request</span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> PlaywrightCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">router </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Router</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">router </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Router</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Helper function that extracts all loaded video links</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token 
keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">extract_video_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Page</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token builtin">list</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">Request</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Extract all loaded video links from the page."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    links </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> post </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> page</span><span 
class="token punctuation" style="color:#393A34">.</span><span class="token plain">query_selector_all</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'[data-e2e="user-post-item"] a'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        post_link </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_attribute</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'href'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> post_link </span><span class="token keyword" style="color:#00009f">and</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'/video/'</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> post_link</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            links</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">append</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">Request</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">post_link</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'video'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> links</span><br></div></code></pre></div></div>
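<p>The filtering rule inside that helper can also be checked without a browser. The sketch below pulls it out into a plain function; <code>filter_video_links</code> is a hypothetical name used only for this illustration, and the duplicate removal is an extra assumption of the sketch rather than something the helper above does.</p>

```python
from typing import Iterable, Optional


def filter_video_links(hrefs: Iterable[Optional[str]]) -> list[str]:
    """Keep only video links, dropping empty values and duplicates (order preserved)."""
    seen: set[str] = set()
    result: list[str] = []
    for href in hrefs:
        # Same condition as in the helper: a non-empty href that points to a video
        if href and '/video/' in href and href not in seen:
            seen.add(href)
            result.append(href)
    return result


if __name__ == '__main__':
    sample = [
        'https://www.tiktok.com/@user/video/1',
        None,
        'https://www.tiktok.com/@user/photo/2',
        'https://www.tiktok.com/@user/video/1',
    ]
    print(filter_video_links(sample))
```

<p>Keeping the rule in a pure function like this makes it easy to test against sample <code>href</code> values if the page markup changes.</p>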
<p>Now we can move on to the main handler that will process TikTok user pages.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># routes.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Main handler used for TikTok user pages</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">default_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token 
operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Handle request without specific label."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Get the limit for 
video elements from `user_data`</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    limit </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">user_data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'limit'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">10</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> </span><span class="token builtin">isinstance</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">limit</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">int</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">raise</span><span class="token plain"> TypeError</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'Limit must be an 
integer'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Wait until the button or at least a video loads, if the connection is slow</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    check_locator </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">locator</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'[data-e2e="user-post-item"], main button'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">first</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> check_locator</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">wait_for</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># If the button loaded, 
click it to initiate video loading</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> button </span><span class="token operator" style="color:#393A34">:=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">query_selector</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'main button'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> button</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">click</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Perform interaction with scrolling</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">press</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'body'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'PageDown'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Start `infinite_scroll` as a background task</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    scroll_task</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">Task</span><span class="token punctuation" style="color:#393A34">[</span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">create_task</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">infinite_scroll</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token 
punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Wait until scrolling is completed or until the limit is reached</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">while</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> scroll_task</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">done</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        requests </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> extract_video_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># If we've already reached the limit, interrupt scrolling and exit the loop</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token builtin">len</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">requests</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">&gt;=</span><span class="token plain"> limit</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            scroll_task</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">cancel</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">break</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Switch the asynchronous context to allow other tasks to execute</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sleep</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">0.2</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">else</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        requests </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> extract_video_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Limit the number of requests to the limit value</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    requests </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> requests</span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">limit</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># If the page wasn't properly processed for some reason and didn't 
find any links,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># raise an error so the request is retried</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">raise</span><span class="token plain"> RuntimeError</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'No video links found'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_requests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">requests</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
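<p>The polling loop in this handler relies on Python's <code>while ... else</code>: the <code>else</code> branch runs only when the loop finishes without hitting <code>break</code>, that is, when scrolling completed before the limit was reached. The sketch below isolates that pattern without a browser. The <code>fake_scroll</code> and <code>collect_links</code> functions and the <code>example.com</code> URLs are hypothetical stand-ins, not part of the crawler.</p>

```python
import asyncio


async def fake_scroll(steps: int, found: list[str]) -> None:
    """Hypothetical stand-in for the scrolling task: 'loads' one link per step."""
    for i in range(steps):
        await asyncio.sleep(0.01)
        found.append(f'https://example.com/video/{i}')


async def collect_links(steps: int, limit: int) -> list[str]:
    """Poll for links while a background task runs, stopping early at the limit."""
    found: list[str] = []
    scroll_task = asyncio.create_task(fake_scroll(steps, found))

    while not scroll_task.done():
        links = list(found)
        # Limit reached: cancel the background task and leave the loop
        if len(links) >= limit:
            scroll_task.cancel()
            break
        # Yield control so the background task can make progress
        await asyncio.sleep(0.005)
    else:
        # Reached only when the loop ended without `break`,
        # so take one final snapshot after the task has finished
        links = list(found)

    return links[:limit]


if __name__ == '__main__':
    print(asyncio.run(collect_links(steps=10, limit=3)))
```

<p>Without the <code>else</code> branch, links that appear between the last poll and the end of scrolling could be missed.</p>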
<p>The final stage is handling the video page.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># routes.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'video'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">video_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" 
style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Handle request with the label 'video'."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Processing video </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" 
style="color:#999988;font-style:italic"># Extract the element containing JSON with data</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    json_element </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">query_selector</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'#__UNIVERSAL_DATA_FOR_REHYDRATION__'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> json_element</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Extract JSON and convert it to a dictionary</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        text_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> json_element</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text_content</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        json_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> json</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">loads</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">text_data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> json_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'__DEFAULT_SCOPE__'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'webapp.video-detail'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'itemInfo'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'itemStruct'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Create result item</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        result_item </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'author'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'nickname'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'author'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'nickname'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'id'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'author'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'id'</span><span 
class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'handle'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'author'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'uniqueId'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'signature'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'author'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'signature'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'followers'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" 
style="color:#e3116c">'authorStats'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'followerCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'following'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'authorStats'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'followingCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'hearts'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'authorStats'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'heart'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                
</span><span class="token string" style="color:#e3116c">'videos'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'authorStats'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'videoCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'description'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'desc'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'tags'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">item</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'hashtagName'</span><span class="token punctuation" 
style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> item </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'textExtra'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> item</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'hashtagName'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'hearts'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'stats'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'diggCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'shares'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" 
style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'stats'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'shareCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'comments'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'stats'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'commentCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'plays'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'stats'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'playCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token 
plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Save the result to the dataset</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">result_item</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">else</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># If the data wasn't received, we raise an error for retry</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">raise</span><span class="token plain"> RuntimeError</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'No JSON data found'</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
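<p>One possible refinement, shown here only as a sketch, is to move the field mapping into a plain function so it can be checked without launching a browser. The <code>build_result_item</code> name and the sample input are hypothetical, and the use of <code>.get()</code> for <code>textExtra</code> assumes that the key may be absent on some videos.</p>

```python
from typing import Any


def build_result_item(data: dict[str, Any]) -> dict[str, Any]:
    """Map the `itemStruct` dictionary to the record saved to the dataset."""
    author = data['author']
    author_stats = data['authorStats']
    stats = data['stats']
    return {
        'author': {
            'nickname': author['nickname'],
            'id': author['id'],
            'handle': author['uniqueId'],
            'signature': author['signature'],
            'followers': author_stats['followerCount'],
            'following': author_stats['followingCount'],
            'hearts': author_stats['heart'],
            'videos': author_stats['videoCount'],
        },
        'description': data['desc'],
        # Assumption: `textExtra` may be missing or hold entries without a hashtag
        'tags': [item['hashtagName'] for item in data.get('textExtra', []) if item.get('hashtagName')],
        'hearts': stats['diggCount'],
        'shares': stats['shareCount'],
        'comments': stats['commentCount'],
        'plays': stats['playCount'],
    }


# Made-up input that mirrors the keys used in the handler above
sample = {
    'author': {'nickname': 'Example', 'id': '1', 'uniqueId': 'example_user', 'signature': ''},
    'authorStats': {'followerCount': 10, 'followingCount': 2, 'heart': 5, 'videoCount': 1},
    'desc': 'A demo video #demo',
    'textExtra': [{'hashtagName': 'demo'}, {'hashtagName': ''}],
    'stats': {'diggCount': 7, 'shareCount': 1, 'commentCount': 0, 'playCount': 100},
}
record = build_result_item(sample)
```

<p>In the handler, the dictionary literal would then shrink to <code>result_item = build_result_item(data)</code>.</p>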
<p>The crawler is now ready to run locally. To start it, run:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">uv run python </span><span class="token parameter variable" style="color:#36acaa">-m</span><span class="token plain"> tiktok_crawlee</span><br></div></code></pre></div></div>
<p>You can find the saved results in the dataset folder at <code>./storage/datasets/default/</code>.</p>
<p>Example record:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"author"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"nickname"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"apifyoffice"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"7095709566285480965"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"handle"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"apifyoffice"</span><span class="token 
punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"signature"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"🤖 web scraping and AI 🤖\n\ncheck out our open positions at ✨apify.it/jobs✨"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"followers"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">118</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"following"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">3</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"hearts"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1975</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"videos"</span><span class="token operator" 
style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">33</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"\"Fun\" is the top word Apifiers used to describe our culture. Here's what else came to their minds 🎤  #workculture #teambuilding #interview #czech #ilovemyjob "</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"tags"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"workculture"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"teambuilding"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  
  </span><span class="token string" style="color:#e3116c">"interview"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"czech"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"ilovemyjob"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"hearts"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">7</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"shares"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"comments"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token 
punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"plays"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">448</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="5-creating-tiktok-actor-on-the-apify-platform">5. Creating a TikTok Actor on the <a href="https://apify.com/" target="_blank" rel="noopener noreferrer">Apify platform</a><a href="https://crawlee.dev/blog/scrape-tiktok-python#5-creating-tiktok-actor-on-the-apify-platform" class="hash-link" aria-label="Direct link to 5-creating-tiktok-actor-on-the-apify-platform" title="Direct link to 5-creating-tiktok-actor-on-the-apify-platform" translate="no">​</a></h2>
<p>For deployment, we'll use the <a href="https://apify.com/" target="_blank" rel="noopener noreferrer">Apify platform</a>, a cloud environment for running and managing crawlers. Once deployed, you can call your crawler via <a href="https://docs.apify.com/api/v2/" target="_blank" rel="noopener noreferrer">API</a>, <a href="https://docs.apify.com/platform/schedules" target="_blank" rel="noopener noreferrer">schedule tasks</a>, <a href="https://docs.apify.com/platform/integrations" target="_blank" rel="noopener noreferrer">integrate</a> it with various services, and much more.</p>
<p>To deploy to the Apify platform, we need to adapt our project for the <a href="https://apify.com/actors" target="_blank" rel="noopener noreferrer">Apify Actor</a> structure.</p>
<p>Create an <code>.actor</code> folder with the necessary files.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token function" style="color:#d73a49">mkdir</span><span class="token plain"> .actor </span><span class="token operator" style="color:#393A34">&amp;&amp;</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">touch</span><span class="token plain"> .actor/</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain">actor.json,input_schema.json</span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<p>Move the <code>Dockerfile</code> from the root folder to <code>.actor</code>.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token function" style="color:#d73a49">mv</span><span class="token plain"> Dockerfile .actor</span><br></div></code></pre></div></div>
<p>Let's fill in the empty files:</p>
<p>The <code>actor.json</code> file contains project metadata for the Apify platform. Follow the <a href="https://docs.apify.com/platform/actors/development/actor-definition/actor-json" target="_blank" rel="noopener noreferrer">documentation for proper configuration</a>:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"actorSpecification"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"name"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"TikTok-Crawlee"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"TikTok - Crawlee"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"minMemoryMbytes"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token 
number" style="color:#36acaa">2048</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Scrape video elements from TikTok user pages"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"version"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"0.1"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"meta"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"templateId"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"tiktok-crawlee"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"input"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"./input_schema.json"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"dockerfile"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"./Dockerfile"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
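<p>Before pushing the Actor, it can help to confirm that the files referenced from <code>actor.json</code> actually exist. The sketch below is a hypothetical local check, not part of the Apify tooling. The helper name <code>find_missing_files</code> is an assumption, it only looks at the <code>input</code> and <code>dockerfile</code> keys shown above, and it assumes both paths are relative to the folder that contains <code>actor.json</code>, as the layout in this article suggests.</p>

```python
import json
import tempfile
from pathlib import Path


def find_missing_files(actor_dir: Path) -> list[str]:
    """Return the actor.json keys whose referenced file does not exist."""
    spec = json.loads((actor_dir / "actor.json").read_text(encoding="utf-8"))
    missing = []
    for key in ("input", "dockerfile"):
        ref = spec.get(key)
        # Only string values are treated as paths; anything else is skipped.
        if isinstance(ref, str) and not (actor_dir / ref).is_file():
            missing.append(key)
    return missing


def demo() -> list[str]:
    """Build a throwaway .actor folder that lacks a Dockerfile and check it."""
    with tempfile.TemporaryDirectory() as tmp:
        actor_dir = Path(tmp) / ".actor"
        actor_dir.mkdir()
        (actor_dir / "input_schema.json").write_text("{}", encoding="utf-8")
        (actor_dir / "actor.json").write_text(
            json.dumps({"input": "./input_schema.json", "dockerfile": "./Dockerfile"}),
            encoding="utf-8",
        )
        return find_missing_files(actor_dir)


if __name__ == "__main__":
    print(demo())
```

<p>In a real project you would call <code>find_missing_files(Path(".actor"))</code> from the project root. An empty list means both referenced files were found.</p>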
<p>Actor input parameters are defined in <code>input_schema.json</code>, which follows the <a href="https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1" target="_blank" rel="noopener noreferrer">input schema specification</a>.</p>
<p>Let's define input parameters for our crawler:</p>
<ul>
<li class=""><code>maxItems</code> - the maximum number of results to collect for each URL. Exposing it as input lets users change the limit without touching the code.</li>
<li class=""><code>urls</code> - links to TikTok user pages, which serve as the starting points for our crawler.</li>
<li class=""><code>proxySettings</code> - the proxy configuration. Without a proxy, requests go out from the datacenter IP that Apify uses.</li>
</ul>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"TikTok Crawlee"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"object"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"schemaVersion"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"properties"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" 
style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token property" style="color:#36acaa">"urls"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"List URLs"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"array"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Direct URLs to TikTok profile pages."</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"editor"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">"stringList"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"prefill"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"https://www.tiktok.com/@apifyoffice"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token property" style="color:#36acaa">"maxItems"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"integer"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"editor"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">"number"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Limit search results"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Limits the maximum number of results, applies to each search separately."</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"default"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">10</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token property" style="color:#36acaa">"proxySettings"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span 
class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Proxy configuration"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"object"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Select proxies to be used by your scraper."</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"prefill"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> </span><span class="token property" style="color:#36acaa">"useApifyProxy"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">true</span><span class="token plain"> </span><span class="token punctuation" 
style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"editor"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"proxy"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"required"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"urls"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
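<p>To see how the <code>required</code> and <code>default</code> keys in the schema relate to the input the crawler receives, here is a minimal local sketch. The platform performs its own validation, so this is only an illustration. The helper <code>apply_schema_defaults</code> and the trimmed <code>SCHEMA</code> dictionary are assumptions made for the example.</p>

```python
from typing import Any


def apply_schema_defaults(schema: dict[str, Any], raw_input: dict[str, Any]) -> dict[str, Any]:
    """Fill in declared defaults and reject input that misses a required field."""
    result = dict(raw_input)
    for name, spec in schema.get("properties", {}).items():
        # Use the schema default only when the caller did not supply a value.
        if name not in result and "default" in spec:
            result[name] = spec["default"]
    missing = [name for name in schema.get("required", []) if name not in result]
    if missing:
        raise ValueError(f"Missing required input fields: {', '.join(missing)}")
    return result


# A trimmed copy of the schema above, keeping only the keys this sketch reads.
SCHEMA = {
    "properties": {
        "urls": {"type": "array"},
        "maxItems": {"type": "integer", "default": 10},
        "proxySettings": {"type": "object"},
    },
    "required": ["urls"],
}

resolved = apply_schema_defaults(SCHEMA, {"urls": ["https://www.tiktok.com/@apifyoffice"]})
```

<p>With only <code>urls</code> supplied, <code>maxItems</code> falls back to the default declared in the schema, while an input without <code>urls</code> raises a <code>ValueError</code>.</p>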
<p>Let's update the code to accept input parameters.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># main.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> timedelta</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> apify </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Actor</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> PlaywrightCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" style="color:#00009f">import</span><span 
class="token plain"> ConcurrencySettings</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Request</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">routes </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> router</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""The crawler entry point."""</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> Actor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Accept input parameters passed when starting the Actor</span><span class="token plain"></span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token plain">        actor_input </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_input</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token plain">        max_items </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> actor_input</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'maxItems'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token 
plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token plain">        requests </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">Request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> user_data</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'limit'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> max_items</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> url </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> actor_input</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'urls'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" 
style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token plain">        proxy </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">create_proxy_configuration</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">actor_proxy_input</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">actor_input</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'proxySettings'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            concurrency_settings</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">max_tasks_per_minute</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">50</span><span 
class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            proxy_configuration</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">proxy</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            request_handler</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">router</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            headless</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            request_handler_timeout</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">timedelta</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">seconds</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">120</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            browser_type</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'firefox'</span><span class="token punctuation" 
style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            browser_new_context_options</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'permissions'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">requests</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>That's it, the project is ready for deployment.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="6-deploying-to-apify">6. Deploying to Apify<a href="https://crawlee.dev/blog/scrape-tiktok-python#6-deploying-to-apify" class="hash-link" aria-label="Direct link to 6. Deploying to Apify" title="Direct link to 6. Deploying to Apify" translate="no">​</a></h2>
<p>Use the official <a href="https://docs.apify.com/cli/" target="_blank" rel="noopener noreferrer">Apify CLI</a> to upload your code:</p>
<p>Authenticate using your API token from <a href="https://console.apify.com/settings/integrations" target="_blank" rel="noopener noreferrer">Apify Console</a>:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">apify login</span><br></div></code></pre></div></div>
<p>Choose "Enter API token manually" and paste your token.</p>
<p>Push the project to the platform:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">apify push</span><br></div></code></pre></div></div>
<p>Now you can configure runs on the Apify platform.</p>
<p>Let's perform a test run:</p>
<p>Fill in the input parameters:</p>
<p><img decoding="async" loading="lazy" alt="Actor Input" src="https://crawlee.dev/assets/images/input_actor-33501c94f9a90c5e28c272016a7d5ec9.webp" width="831" height="809" class="img_ev3q"></p>
<p>Check that logging works correctly:</p>
<p><img decoding="async" loading="lazy" alt="Actor Log" src="https://crawlee.dev/assets/images/actor_log-4301af07fb3f21631f98802876e6b3f5.webp" width="1576" height="886" class="img_ev3q"></p>
<p>View results in the dataset:</p>
<p><img decoding="async" loading="lazy" alt="Dataset Results" src="https://crawlee.dev/assets/images/actor_results-7ab9904db12130be0317320c43070b71.webp" width="1658" height="873" class="img_ev3q"></p>
<p>If you want to make your Actor public and provide access to other users, potentially to earn income from it, follow this <a href="https://docs.apify.com/platform/actors/publishing" target="_blank" rel="noopener noreferrer">publishing guide</a> for <a href="https://apify.com/store" target="_blank" rel="noopener noreferrer">Apify Store</a>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conclusion">Conclusion<a href="https://crawlee.dev/blog/scrape-tiktok-python#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>We've created a good foundation for crawling TikTok using Crawlee for Python and Playwright. To improve the project, I would recommend adding error handling and handling the cases where you get a CAPTCHA, which reduces the likelihood of being blocked by TikTok. Even without those additions, the project already lets you collect data right now.</p>
<p>You can find the complete code in the <a href="https://github.com/Mantisus/tiktok-crawlee" target="_blank" rel="noopener noreferrer">repository</a>.</p>
<p>If you enjoyed this blog, feel free to support Crawlee for Python by starring the <a href="https://github.com/apify/crawlee-python" target="_blank" rel="noopener noreferrer">repository</a> or joining the maintainer team.</p>
<p>Have questions or want to discuss implementation details? Join our <a href="https://discord.com/invite/jyEM2PRvMU" target="_blank" rel="noopener noreferrer">Discord</a> - our community of 10,000+ developers is there to help.</p>]]></content:encoded>
            <category>community</category>
        </item>
        <item>
            <title><![CDATA[How to build a price tracker with Crawlee and Apify]]></title>
            <link>https://crawlee.dev/blog/crawlee-python-price-tracker</link>
            <guid>https://crawlee.dev/blog/crawlee-python-price-tracker</guid>
            <pubDate>Tue, 08 Apr 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to build and deploy a price tracker using Crawlee for Python and Apify.]]></description>
            <content:encoded><![CDATA[<p>Build a price tracker with Crawlee for Python to scrape product details, export data in multiple formats, and send email alerts for price drops, then deploy and schedule it as an Apify Actor.</p>
<p><img decoding="async" loading="lazy" alt="Crawlee for Python Price Tracker" src="https://crawlee.dev/assets/images/crawlee-python-price-tracker-8ffc0121eee82024852513938dd525ab.webp" width="1152" height="649" class="img_ev3q"></p>
<p>In this tutorial, we’ll build a price tracker using Crawlee for Python and Apify. By the end, you’ll have an Apify Actor that scrapes product details from a webpage, exports the data in various formats (CSV, Excel, JSON, and more), and sends an email alert when the product’s price falls below your specified threshold.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-project-setup">1. Project Setup<a href="https://crawlee.dev/blog/crawlee-python-price-tracker#1-project-setup" class="hash-link" aria-label="Direct link to 1. Project Setup" title="Direct link to 1. Project Setup" translate="no">​</a></h2>
<p>Our first step is to install the <a href="https://docs.apify.com/cli/docs" target="_blank" rel="noopener noreferrer">Apify CLI</a>. You can do this using either Homebrew or NPM with the following commands:</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="homebrew">Homebrew<a href="https://crawlee.dev/blog/crawlee-python-price-tracker#homebrew" class="hash-link" aria-label="Direct link to Homebrew" title="Direct link to Homebrew" translate="no">​</a></h3>
<div class="language-Bash language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">brew </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> apify-cli</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="via-npm">Via NPM<a href="https://crawlee.dev/blog/crawlee-python-price-tracker#via-npm" class="hash-link" aria-label="Direct link to Via NPM" title="Direct link to Via NPM" translate="no">​</a></h3>
<div class="language-Bash language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token function" style="color:#d73a49">npm</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-g</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> apify-cli</span><br></div></code></pre></div></div>
<p>Next, let’s run the following commands to use one of Apify’s pre-built templates. This will streamline the setup process and get us coding right away:</p>
<div class="language-Bash language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">apify create price-tracking-actor</span><br></div></code></pre></div></div>
<p>A dropdown list will appear. To follow along with this tutorial, select <strong><code>Python</code></strong> and the <strong><code>Crawlee + BeautifulSoup</code></strong> template. Once the template is installed, navigate to the newly created folder and open it in your preferred IDE.</p>
<p><img decoding="async" loading="lazy" alt="actor-templates" src="https://crawlee.dev/assets/images/actor-templates-88fa253dabe612261cb2fe95430c4c04.webp" width="903" height="230" class="img_ev3q"></p>
<p>Navigate to <strong><code>src/main.py</code></strong> in your project, and you’ll find that a significant amount of boilerplate code has already been generated for you. If you’re new to Apify or Crawlee, don’t worry: it’s not as complex as it might seem. This pre-written code is designed to save you time and jumpstart your development process.</p>
<p><img decoding="async" loading="lazy" alt="crawlee-bs4-template" src="https://crawlee.dev/assets/images/crawlee-bs4-template-528a9eee4ab1c859feb2ed42e3328045.webp" width="1163" height="391" class="img_ev3q"></p>
<p>In fact, this template comes with fully functional code that scrapes the Apify homepage. To test it out, simply run the command <strong><code>apify run</code></strong>. Within a few seconds, you’ll see the <strong><code>storage/datasets</code></strong> directory populate with the scraped data in JSON format.</p>
<p><img decoding="async" loading="lazy" alt="json-data" src="https://crawlee.dev/assets/images/json-data-9ec19a8958775e66dcd094d0d46faa90.webp" width="872" height="546" class="img_ev3q"></p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-customizing-the-template">2. Customizing the template<a href="https://crawlee.dev/blog/crawlee-python-price-tracker#2-customizing-the-template" class="hash-link" aria-label="Direct link to 2. Customizing the template" title="Direct link to 2. Customizing the template" translate="no">​</a></h2>
<p>Now that our project is set up, let’s customize the template to scrape our target website: <a href="https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html" target="_blank" rel="noopener noreferrer">Raspberry Pi 5 (8GB RAM) on Central Computer</a>.</p>
<p>First, in the <code>src/main.py</code> file, find the <code>crawler.run(start_urls)</code> call and replace its argument with a list containing the URL of the target website, as shown below:</p>
<div class="language-Python language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Normally, you could let users specify a URL through the Actor input, and the Actor would prioritize it. However, since we’re scraping a specific page, we’ll just use the hardcoded URL for simplicity. Keep in mind that dynamic input is still an option if you want to make the Actor more flexible later.</p>
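<p>As a rough sketch of that more flexible option, the helper below prefers URLs from the Actor input and falls back to the hardcoded one. It assumes the input is a dictionary that may carry a <code>start_urls</code> list of <code>{'url': ...}</code> objects; that shape, and the name <code>resolve_start_urls</code>, are illustrative assumptions rather than part of Crawlee or the Apify SDK.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv">DEFAULT_URL = 'https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html'


def resolve_start_urls(actor_input, default_url=DEFAULT_URL):
    """Return URLs from the Actor input, or the hardcoded default if none are given."""
    # Hypothetical input shape: {'start_urls': [{'url': '...'}, ...]}
    entries = (actor_input or {}).get('start_urls') or []
    # Keep only entries that actually carry a non-empty 'url' value.
    urls = [entry['url'] for entry in entries if isinstance(entry, dict) and entry.get('url')]
    return urls or [default_url]


# No input at all: the hardcoded product page is used.
print(resolve_start_urls({}))
# Input provided: the user's URL takes priority.
print(resolve_start_urls({'start_urls': [{'url': 'https://example.com/item'}]}))
</code></pre></div></div>
<p>Inside the Actor, the dictionary would typically come from <code>await Actor.get_input()</code>, and the returned list could be passed straight to <code>crawler.run(...)</code>.</p>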
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="extracting-the-products-name-and-price">Extracting the Product’s Name and Price<a href="https://crawlee.dev/blog/crawlee-python-price-tracker#extracting-the-products-name-and-price" class="hash-link" aria-label="Direct link to Extracting the Product’s Name and Price" title="Direct link to Extracting the Product’s Name and Price" translate="no">​</a></h3>
<p>Finally, let’s modify our template to extract key elements from the page, such as the product name and price.</p>
<p>Starting with the <strong>product name</strong>, inspect the <a href="https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html" target="_blank" rel="noopener noreferrer">target page</a> using DevTools to find suitable selectors for targeting the element.</p>
<p><img decoding="async" loading="lazy" alt="product-name" src="https://crawlee.dev/assets/images/product-name-dbaba09d2d06b4b8a6b9a340698739af.webp" width="3014" height="1804" class="img_ev3q"></p>
<p>Next, create a <code>product_name_element</code> variable to hold the element selected with the CSS selectors found on the page and update the <code>data</code> dictionary with the element’s text contents. Also, remove the line of code that previously made the Actor crawl the Apify website, as we now want it to scrape only a single page.</p>
<p>Your <code>request_handler</code> function should look similar to the example below:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">request_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> BeautifulSoupCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    url </span><span class="token operator" style="color:#393A34">=</span><span 
class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Scraping </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Select the product name and price elements.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    product_name_element </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">soup</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">find</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" 
style="color:#e3116c">'div'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> class_</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'productname'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Extract the desired data.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'url'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'product_name'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> product_name_element</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text</span><span class="token punctuation" style="color:#393A34">.</span><span 
class="token plain">strip</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> product_name_element </span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Store the extracted data to the default dataset.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Enqueue additional links found on the current page.</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># await context.enqueue_links() -&gt; REMOVE THIS LINE</span><br></div></code></pre></div></div>
<p>It’s a good practice to test our code after every significant change to ensure it works as expected.</p>
<p>Run <code>apify run</code> again, but this time, add the <code>--purge</code> flag to prevent the newly scraped data from mixing with previous runs:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">apify run </span><span class="token parameter variable" style="color:#36acaa">--purge</span><br></div></code></pre></div></div>
<p>Navigate to <code>storage/datasets</code>, and you should find a file with the scraped content:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"url"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"product_name"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Raspberry Pi 5 8GB RAM Board"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<p>Now that you’ve got the hang of it, let’s do the same thing for the price: <code>79.99</code>.</p>
<p><img decoding="async" loading="lazy" alt="product_price.png" src="https://crawlee.dev/assets/images/product-price-fa3ab906b4a95258251defe78c19b6d3.webp" width="3024" height="1802" class="img_ev3q"></p>
<p>In the code below, you’ll notice a slight difference: instead of extracting the element’s text content, we’re retrieving the value of its <code>data-price-amount</code> attribute. This approach avoids capturing the dollar sign <code>($)</code> that would otherwise come with the text.</p>
<p>If you prefer working with text content instead, that’s perfectly fine: you can simply use <code>.replace('$', '')</code> to remove the dollar sign.</p>
<p>Also, keep in mind that the extracted price will be a <code>string</code> by default. To perform numerical comparisons, we need to convert it to a <code>float</code>. This conversion will allow us to accurately compare the price values later on.</p>
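<p>To make the conversion concrete, here is a small sketch of a hypothetical <code>parse_price</code> helper (not part of Crawlee) that accepts either the <code>data-price-amount</code> value or the visible text and returns a <code>float</code>, or <code>None</code> when nothing usable is found. Stripping thousands separators is an assumption about how larger prices might be formatted.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv">def parse_price(raw):
    """Convert a price string such as '79.99' or '$1,079.99' to a float."""
    if raw is None:
        return None
    # Drop the dollar sign, thousands separators, and surrounding whitespace.
    cleaned = raw.replace('$', '').replace(',', '').strip()
    try:
        return float(cleaned)
    except ValueError:
        # The text was not a number (for example, 'Out of stock').
        return None


# Attribute value, as read from data-price-amount.
print(parse_price('79.99'))
# Visible text content, including the dollar sign.
print(parse_price('$79.99'))
</code></pre></div></div>
<p>Returning <code>None</code> instead of raising keeps the request handler from failing on a page where the price is missing, at the cost of having to check for <code>None</code> before any comparison.</p>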
<p>Here’s how the updated code looks so far:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># main.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># ...previous code</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">request_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> BeautifulSoupCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Scraping </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span 
class="token comment" style="color:#999988;font-style:italic"># Select the product name and price elements.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    product_name_element </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">soup</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">find</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'div'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> class_</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'productname'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    product_price_element </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">soup</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">find</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'span'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">id</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'product-price-395001'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" 
style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Extract the desired data.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'url'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain">       </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'product_name'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> product_name_element</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">strip</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> product_name_element </span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" 
style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'price'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">float</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">product_price_element</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'data-price-amount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> product_price_element </span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Store the extracted data to the default dataset.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Run it again with <code>apify run --purge</code> and check that the output looks similar to the example below:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"url"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"product_name"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Raspberry Pi 5 8GB RAM Board"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"price"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">79.99</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
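<p>The handler above passes the <code>data-price-amount</code> attribute straight to <code>float()</code>, which raises a <code>ValueError</code> if the attribute ever holds something other than a plain number. As a minimal sketch, a small helper could make that conversion more forgiving. The name <code>parse_price</code> and its cleanup rules are assumptions made for illustration; they are not part of Crawlee or the Apify SDK.</p>

```python
import re
from typing import Optional


def parse_price(raw: Optional[str]) -> Optional[float]:
    """Convert a raw price string to a float, or return None if it cannot be parsed.

    Hypothetical helper for illustration only.
    """
    if raw is None:
        return None
    # Keep digits and the decimal point; drop currency symbols,
    # thousands separators, and whitespace.
    cleaned = re.sub(r"[^0-9.]", "", raw)
    try:
        return float(cleaned)
    except ValueError:
        # Nothing numeric was left (for example "N/A") or the value is malformed.
        return None
```

<p>In the handler, this could replace the bare <code>float(...)</code> call, for example <code>parse_price(product_price_element.get('data-price-amount')) if product_price_element else None</code>, so a missing or malformed attribute yields <code>None</code> instead of an exception.</p>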
<p>That’s it for the extraction part! Below is the complete code we’ve written so far.</p>
<blockquote>
<p>💡 <strong>TIP:</strong> If you’d like to get some more practice, try scraping additional elements such as the <strong><code>model</code></strong>, <strong><code>Item #</code></strong>, or <strong><code>stock availability (In stock)</code></strong>.</p>
</blockquote>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># main.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> apify </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Actor</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> BeautifulSoupCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> BeautifulSoupCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" 
style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Enter the context of the Actor.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> Actor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Create a crawler.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> BeautifulSoupCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Limit the crawl to max requests. 
Remove or increase it for crawling all links.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">50</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Define a request handler, which will be called for every request.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">request_handler</span><span class="token punctuation" 
style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> BeautifulSoupCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Scraping </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            
</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Select the product name and price elements.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            product_name_element </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">soup</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">find</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'div'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> class_</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'productname'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            product_price_element </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">soup</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">find</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'span'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">id</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'product-price-395001'</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Extract the desired data.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'url'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'product_name'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> product_name_element</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">strip</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span 
class="token plain"> product_name_element </span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'price'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">float</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">product_price_element</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'data-price-amount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> product_price_element </span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Store the extracted data to the default dataset.</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Run the crawler with the starting requests.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-sending-an-email-alert">3. Sending an Email Alert<a href="https://crawlee.dev/blog/crawlee-python-price-tracker#3-sending-an-email-alert" class="hash-link" aria-label="Direct link to 3. Sending an Email Alert" title="Direct link to 3. Sending an Email Alert" translate="no">​</a></h2>
<p>From this point forward, you’ll need an <strong>Apify account</strong>. You can create one for free <a href="https://console.apify.com/sign-up" target="_blank" rel="noopener noreferrer">here</a>.</p>
<p>The account is required because we’ll make an API call to an existing Actor from the <strong>Apify Store</strong>, the “Send Email” Actor, to handle notifications. Apify’s email system takes care of sending the alerts, so we don’t have to worry about handling <strong>2FA</strong> in our automation.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># main.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># ...previous code</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Define a price threshold</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">price_threshold </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">80</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Call the "Send Email" Actor when the price goes below the threshold            </span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" 
style="color:#00009f">if</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'price'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">&lt;</span><span class="token plain"> price_threshold</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    actor_run </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">start</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        actor_id</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"apify/send-mail"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        run_input</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"to"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"your_email@email.com"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"subject"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Python Price Alert"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"text"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:#e3116c">f"The price of '</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">data</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">[</span><span class="token string-interpolation interpolation string" style="color:#e3116c">'product_name'</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">]</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">' has dropped below $</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">price_threshold</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> and is now $</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">data</span><span class="token 
string-interpolation interpolation punctuation" style="color:#393A34">[</span><span class="token string-interpolation interpolation string" style="color:#e3116c">'price'</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">]</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">.\n\nCheck it out here: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">data</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">[</span><span class="token string-interpolation interpolation string" style="color:#e3116c">'url'</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">]</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span 
class="token string-interpolation string" style="color:#e3116c">f"Email sent with run ID: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">actor_run</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation builtin">id</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>In the code above, we’re using the <strong>Apify Python SDK</strong>, which is already included in our project, to call the “Send Email” Actor with the required input.</p>
<p>To make this API call work, you’ll need to log in to your Apify account from the terminal using your <strong><code>APIFY_API_TOKEN</code></strong>.</p>
<p>To get your <strong><code>APIFY_API_TOKEN</code></strong>, sign up for an Apify account, then navigate to <strong>Settings → API &amp; Integrations</strong>, and copy your <strong>Personal API token</strong>.</p>
<p><img decoding="async" loading="lazy" alt="apify-api-token" src="https://crawlee.dev/assets/images/apify-api-token-eb76078df32c242a7f064ab71e63c7fa.webp" width="3024" height="1710" class="img_ev3q"></p>
<p>Next, enter the following command in the terminal inside your <strong>Price Tracking Project</strong>:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">apify login</span><br></div></code></pre></div></div>
<p>Select <code>Enter API Token Manually</code>, paste the token you copied from your account, and press Enter.</p>
<p><img decoding="async" loading="lazy" alt="apify-login" src="data:image/webp;base64,UklGRkAZAABXRUJQVlA4WAoAAAAoAAAAVAIASgAASUNDUAwCAAAAAAIMYXBwbAQAAABtbnRyUkdCIFhZWiAH6QADAAIACwAxADlhY3NwQVBQTAAAAABBUFBMAAAAAAAAAAAAAAAAAAAAAAAA9tYAAQAAAADTLWFwcGyr6ljS/rLBRgz/7+fcGoQHAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAApkZXNjAAAA/AAAADJjcHJ0AAABMAAAAFB3dHB0AAABgAAAABRyWFlaAAABlAAAABRnWFlaAAABqAAAABRiWFlaAAABvAAAABRyVFJDAAAB0AAAABBjaGFkAAAB4AAAACxiVFJDAAAB0AAAABBnVFJDAAAB0AAAABBtbHVjAAAAAAAAAAEAAAAMZW5VUwAAABYAAAAcAFMAYwBlAHAAdAByAGUAIABGADIANwAAbWx1YwAAAAAAAAABAAAADGVuVVMAAAA0AAAAHABDAG8AcAB5AHIAaQBnAGgAdAAgAEEAcABwAGwAZQAgAEkAbgBjAC4ALAAgADIAMAAyADVYWVogAAAAAAAA9tYAAQAAAADTLVhZWiAAAAAAAABkpAAAMzsAAAFeWFlaIAAAAAAAAGu+AAC9KQAAD61YWVogAAAAAAAAJnMAAA+cAADCInBhcmEAAAAAAAAAAAAB9gRzZjMyAAAAAAABC7cAAAWW///zVwAABykAAP3X///7t////aYAAAPaAADA9lZQOCDOFgAA8FUAnQEqVQJLAD6RRJ1JpaQjISeXGmiwEglpbt1fNlO7Abb3nj/RLvEW8d/u3lDHkD+4dov9n/Jfzr8Mnrv9d/cv1W8i/m/755kfyT7K/m/7V5g/6D+2eIPwZ/qfUI/Mv5r/uP7j4/OwKz7/PegF7H/XP+L/i/EQ/mP8R6iflX9f/5vuAfyv+n/7b0f/yfgU/c/9L+ynwC/zr+9/9f/Gf374UP6L/1f6PzifUX/t/1PwFf0L+6dZ70pyqgJUhT7s3WLIkzwKFQUEu9wU5d8R7mDpQKrOp6uFKbtUTokSi1tPdJtfpfFJm9mbaVPGtH3OMK00mFAREnPGrnguZFj5XvIbOVpq0uftFkdl6GRv4sGz4aFPWIVlUkU5shSS0A9CcFtft1NiIOZ72Y5CDp7gF/o8mh7mbuezzbFe0kE891QHY1SUTtl9I7+fO4GKkZUWTHFvD5xpuBn/IdHa4ra7WMHS+E5ugSJJsTnlS5tVt4MqfcvKnTse9ZxeD9Xux9V2UfAyBv7uCPG1LyUpfGX7iyS0EGx80uNSG85Drosncn+kEi9202Vaz+T2Lt7dFrhlQI0vIRmtC7/puzf0NbHLJAKVYycKCI0BrLA5JgRwxfE85T5C1Jift669oSqYTIsumRcPfdbS8CNN2Ozdc5iOfyFgqHEEXSbiC3rCAxrGvFCQrzcQsWoqlJfwHlCrzG5G7l9ZBrSuGHXucGk9OcR1jTuHuHoxSaoORtXrCifa9EDvuHc26Po8CmPdiK/w+RzMJyC7tUnsinlfnkyJf3VTGzSomjTkQKM8tqfFZUsGjMhZWGgUVEUOZGGRvmSzFMnPsn9afIrMIZghOIxfC91R1+UiQmhUsylsoG3y3fbfyZCBCT6DRfxHR+DtwvYv61Yii9JB5L4Oi+KI0kfB5LEgmJ1/118URo7wAP6hJnGQRk6IbxLrsBKgulT/zYsDrQVVhQhHF8QnxsnF7U1RqCoNq212Yr4cPTfodyrd+t/0wUeMaMYlnVBRPjAf8USni9m/pAtoJgdpZUvZ/qf6uAl8FDRP+nsFgfHnbMtuvefdUWd7es6dQlATQslvm7TzWfMYv0Zjfo/R80KymPC1cyaPJn+JaZpVl27te0d0sVgc4dFzFDZtK6LD9uox3tFaW9G38l
Uk8gQuWQurw07EBFUSA5z9o1UUwuA96pmNVwmEKpwlVXR3wny+2wIP86FbM60oNTtQqUnHMahh5xDBFG7G2NV6uj11SH3ORVXZZIXbRmj93/qQEuH9mc+Zxx5J0sJ39O30UUIbU27y8lou603Eufp/2jmyGtpuSrCdWLiyMqd/sW7AE64gtTZZJOEIBsK/lLHGFKi3De7kBr4RKOT7Iw8I0mFiSEo1FtTo7qO10sWdFlyvz71zsxWv1Ew6Ai66Pn4qdal+9MjrOVpTJYd0jhzjNne37AuWyZw8i3Kvrntnh3+9EMtRNehQJFX2xktkCWaHUn4CavLNLQf/pmz+z3NCNrHsNYR2xTmH7jnYtHtQ5DwVuOkrHB2oxVCS0iDF2rv2bH9uu26B1taeL0kHrPrKj7zAF20GP4c0Y60m+BORNa4DfnTDX9KLnO6Bf/iGTuzvSQBwQUH8IPKs5NHLacRadoiCnixz+Y8z420Dtk7RL18bCX3k+TYlhFyT7IjutlG/GU/+4hc0wzGmIaAD48CtVkMzARSDbinZpO6W8I3qXHA/pFPkz7tOU3kV8Q2+YmyB2PLXGJyr3kMRyTQn/JybjoCjN+r5JaXUd8c/PYGqS71yW2jMIy67eOpoEBVkLIcE2HM3L2yjf5Opg8WZaOm6jStDWxq3xmpfz2JRFMaH//NJxMKvDwCd1zEHyLpwKc5b5vllKicngkd84XLuTv1xY81myd+K4L4WJb9+qafVDvRk6IJQ5awrYij2BHxgE7z+nog891uAHCsC/BjJcEdI6lvX6mZpfwts0XXkvCpuQbMlsgsl9RUxuLj5mqKPnRNFBZBvW0bA/SsxfeS7PisWxnL8C5Qz8xF7W480VKze3ssZazEitbMBK40rHDUOEvf22MJ2duiBuc8oj7HsXG/fSKJuQaIcHwrR52UTTK738MnwPVZk8uv0Mrk3Ak7b7eLaYtn6yGtPbUtCXTRaEgSHn0ZWIKvsT/XM2lRNiVwvVtwf1yHVXXD0Qnb9k5Loft+VOh75T0MNEZ6VJMogx7Fvyb3oSl++9tPb0Rt4eJAfyJQY564ia82b2TNqgbvLatCg5zQMIteY9PwGTdJeslv9KfwysCLilrhKRf7JTUI3veGA9dh5ewC47dgHDxuaEYN7RSelbMC6xlG5Ssfk22bCTKkIzy8gjevancjx1Whr3CsXvPnUAqYdPbwvcQNdHvcM2dfLa+EeeoUqSXJytwuQgWLbcBA2Ocs5rUeEXo/5PQ10BTO2jDyfjlDed32453mvmgsXk+MdO+X9xmw97qSZevbwXHvU6/Ai4bQIketeHZmeuMf5xX205tGlKn/dKqEVMhV72oH1back5H0yCdMKGr1bTm2pisB5IXE6q+e/7DsvjSy23Jw5M+vIyE+Enur7t8edFHKVwER69/k5jYoPoqUBYnjUlkJPQcPxaHAbEdD1uCmt4eiPZKPSDc0H3bTf8NHG4/gDNQrAv+j4j6PKeOfhMMqJC1rOvy+eIPWtYo1J5dy52AsZ88kYVopIiN0IG30vKqAzPA1LYnvYTlwpldKi5KimxzUOG/4QIsg5eOf6QnHzfRfV661w5nOUvYIIVMnZRRgJHJqp10n8D1gWrXqbB58OYzlq5xMdmKwTPnpexWV2BUcpjsy1hi5jaYQSNGhrAsJ5SrV4JAoMTDOOMdEMdbkAKgQrAVgGWTsMCQvOOoyW7+5G4hHami6MLy67O8XMclSThJYOxvpVwbfn0rtkX7V2/7q8O10QBVn95v63jLuYmPrc9mjKBPvnxr5iP4zwS51/wxP8vfNqV1EqRgrlyPTlMDexuNFPb2uHgZwaE9EuttvMlLeJY0+0uBwW7+ZKwPYR3fBMhO2AYy7wCwuKON8CnVm/U6x4bUXlrp4RvVvRGS7V6HKsyMHRkXfwJ3mWS/+9qC1zrZVQ/P2wY3tU4kv94Js+U+7G44sNZq4c+/FZRGVKFM0v6gbc7ED9YjLxRz+2VPomnX
mj1GlqSP56p8ewJL3E1VtNAvXBLDu4xOsc3dmuyMT0rxTteJLSm+/aYCOspVfCKifH6otiCLvmqezLYReNWGDJIBic+R3TXsetyYaVNtuLhQkB2WmOYINKAzyI2dfGTIkX9G54AYZkWvp36QKT1Vum9YWLdPE+0MApdUI2PjBUg98PGvjjmGgAGZvaJgR4yl35TozURJ4Yi8RK5B4Sf8HE+4iCANM3PIyNYRKgzdsKRs4QFzjTRatl0vmDUAz1hyJKYQT3WsJn20sRWG4uRJeg9wwSjRgF3LJOfTrJOolE29d4d2xadRCQIUiAdtVl86GdpZ7E70dhn+fJf2rhDAmykSq/6vA2U1PfUdOodL3Nz+H3IKSyzfTitquqQoRraMHFc/zeoeiebXKM724ZrNfsF2UN5YbLMcF+7xCVQdGwZTp975IMYRUDbJ87w2gj7sz3R0oa+g3wklRwDnnAQflPZ41N2BaNQDLX7icsf+mk2+IURt7ruZpU/JcJNNpRe5w/uOXB/hvjMD5FCHZjOQMi7v3gdRuBumPMbCCkuiWRwQjhVaeqahxJJ1L0oU9QBVsAljVrdEE2Eto4iEgBHDiw6vq/lScOjyKtK0jreuLKC1e2YryS3wrF9JgXmynz637sCqPZw678txHqc+qYq6KNSPE/lAgjKyX2p9HH+J4Ph9yUtB6X2qxo1B7XI4dqvtW6rYhhRZ3pQhQVjPc8Wlltci2gS8/8pm4caUPzVxgHW/Ou0F9i5rKFyXPyifg+n+QBPt47ImhTp/hQ//gaWzRPUVqOOrdpIJ62ZVsb4L7iYnQkxorX3nt/EvMrqinThMh9AF1I7fuuMVOldVUW6Y8aOCJfVoueqheobeXelACyo+FhM6133AC3Kqqy3INnf8YeVYvgHzNFWK+pZNa0DCgQvi0anFKXmHo6SiJ67zhYftKNXnoa10FWfb25ULTToTWDFa310/Z4mi6M361ecQY74UTUzrgC63RK1MarSQgFWWEH/ei3zjGSMk328RO8HM6hTmT7tBjKbPx5iO3JjE3fcp5bK3Fi/DOEvl3xPjBrlkMT6a2elS2EDU44HHZIHalf92dptsb2sE+8PNYkxRhL6ozKYIGKaLWLXDiKf4/5cJ/MrVBgsDILet4tCkMVbHYMJ1Lc4tHxVvSGCotcDrUA8wU0fDnaA1tDsFDkhTg4oZVpFsANg1Pi7TvykgHiYPhgeB4qLJlq1A0uCOGGh1++MleCbEyYRqxpmFeToUsoTtJtPVZ2C2BF8VHzy63l28Y0BvMy58tyK0ARb9Z0lCEm6zbB5IBMGWNANsq5zs5GJ0IfUBpOv3333VGBbwnaN6QJSSG7KF4O62g9ZJNV9IrPWpISWdqaYmh48nzdTg3btjImE6H0Cd8ZkJ5PkS/CO2gMjNaVKVPf3eECbv/S8wIeHY0uUe65wDhXPU9YyE+4bOVTWUZ8dmA+OdCjFVR5PDI/fS7hO+kxDnfeu7kXin6LHNL6KSRN+1/ZdCs+Mj4KcLsKTzNH0FFPEKct6vfEOld+JwQs0nn9BsmteWT3KdjNaEH8U5MogC3Qlertg7Go5q/WThmh+dI7Mo2TfHmH1+oto19Zb4hq4pWYC55RrK3NXXyo1T8bEMe7gzdsyfWfrVpFTZkx00ZjSP1tNCvSD+ThLZViGgWTCiYjZRWecjyX9vqy81CBb4A7Kdj5hiqawwvaMOXTG3bC78kzQinHmY5B/IJm9SHi970zDhIDNAus4gL6B0xZK7mADZ+Y4u3bhWJTME1LMzCEo2gsGDnsRRDIcsxcdIoYTa9HF4Kj4IMCG+DSnV91esm6SZXq+5g1fOCJoEM/PQSmmG4x0fi7cbVjBsyjdsaPB8JmTCUwO5uAV2S0d2O5Ldr7cEWTkUccDnHn5WS+84DnzYt3tDSEUrR8zp/EbVBhaBQdJQELQKpDQy+N/VGRecnUFmn/uMiZCmCN57wsruFu8Yl5w5nKFnuB/nWy9A9CA7+TJ4
xBlk+PmUa9ZgIrdWS8vydx6h7EgphSn0GjJK6cwzlMqHuJwdaHu19MZCbmYKTPLaTmam2RH9rIZKusJYS71t21SA9exPRMMuKP2fbPkgFBLC34TZAHMuZrr4JkK3U2KFl+PmC3ezP8zc0G3goMjiPkfm7Qx/xTXN4p5QP3NtL6M1/jR3OCGiaer3j/55oO96iQnxb0dxpQ+OEvayj68aFZ4mx9+66R01H72K61mnASlN4iMbzib6GgPSVKswmBh3soc+K1yJN92pMxMlgEyx1Xly9maAAA75CofFZ7yvpCCJIA4xCXTz7rAGwoj7guravfYIyzT1sVkP40MHhBbb8odiLoyask5JtjuJqMw7Mx6/g/XbMuSYoyvLM/qvJrlEEsOC61Jk2eD9r2GqOb0yuzaKspTEOkoFRQxQ1TzyXEMDplJrAzXBg5Is9+a4gQGuKlPDU7jFMWqyZ79S3fWUVAybicET976oo4lPj4NoQS95Wzbb4BCYJsEaFiqbe35eN6NPnaVpwoLW0fm9yU939V+dBfGhmn/dxtWzFU/fT4+lsiseZU/yOHRf2Cv/AcUL9QGwj8t0c8CWzDsML6NQVF6l9twfVVXKEhsmHTh/3H2MiGlzh1csO+6Ia7/evG0ELKt4I1e0rSjsQ9poD4P4BfGATdPhZ6e4csAWXdguQ9hYEdIQUYiOoPwfnS+LrISerYvZ/mErEh4ziy5eUSjoLqfWtTb/gTT3DDZx+gBWraTQhlPwNg1B/BZnb8DnSURPa8cRnP3Nho1wAJaQDTEYycHankCjedseLuPYBx3awwEGbxCYbmdjbtvlWoxnX4hEQ2TybALWTSmB00MiXoD4rO0Gc/rGL5UT1DEIqos+BAQ1KzmwlVNHEHvpd6CywloDi7iaP8+zKHa6AjwnpzG3eJy8FIQDxMHwwPAnchHuzDTMMlUi8N41AFivjIlftGSysuRBEjNriqXC+RBEontuOwoOAqukoPOmjAuORXeLxyD5wxEgPV9mQZNoLZaasck712yVroju5Hw+fiZF0n4moXZXx1urg3fa61Wk1kcMqfmBNYkEj4Ff+xDLNDc1bSkHqXz0OEhPj+m6b1S+4MLO3q1irclOZW3j+UlfW3YyIahH0brrM5wK6o5nlV5R6EsAp+yc4mogwnWixXtH3n0yTkNGdeEX/KPFoyH5itgJ/Yf+vI9FAuAdCFFjpuo9FPND9AlafBM8CEZ5YXsjneOTvLr5UarLsKNmoNes4Lw1blTlrd5WB2/I4XXnlLqnO2CyqwdOed9Hh2fg6kRylXIM0jqIHLbxnG2zjNsfsERmk1BJRRkXKN1thKBvNDJbMzYXsD5CnlF9CFpkYWQF2P40d1kpewONrNBfq34IQGqp5uAFBuHZxLKRinb7NkcqqFTtJIPuBCspnpydgcxqgJx04v9Ga/JdROq426VrhOlo09arduVv5XbGUIl+8zmxU5k6l9xgA0WegarqfF+gAAanpyiX/qTWqZXBZdnxLmkKLRJNnyGY/Sor9whIPC4fhy2pQiSkoScVwfIaAJcMj9LBS181UXnj8uPbud/TodIfr4/pBsiuSPsYRbiDeHM+RnNrR9N7NbBMCjnqPlzQ7UHAWsQ5CMOTlU8hpVb5DDZlfEauXJ4nMlBUWi4Rg0Rrk4ILtY+S6nRdCdBdI6YKRvgx4uZW0SgPJrZZ0d0BCT1FfFro0Qus8IKqqlXB+SXpOZITNwpVcLs+VSpHVtsQ19O6CQk5ynwPJZATjgLcseIYRGpsP6sZqm9a/DmyC7OAWfgD18GsKNTEzu0UeEtP2ZDKjAlzaaupA03RJ1N0kM+YWU3jQSaPL6hA3i8prohWJrPD02xNbuqqYBe8eWgwjOqJRUDwGfbvfO1IeSk4rKET1IaHxhR3HWgmsX3+HytP2tCR5EgCPhcykbtP8q7SAPSUelIq/PJJlEQqPqy7f411W4/fqQm+QVPkOkLtgRSUbD0S9QABNX2l3hSUYQsa
Ix1vKcKMPkIyyNTwnCQTvrmWk6y9i81os2Y7eXXsWESJ+RNtLBFdYV+v5G1vWwmwc5ZKTqlT2KOVE3HO/jJ+uKQLU2m/K1xh/7i0BMGltL/HdyHyj2ZiJuB5Cvx6dsPYSiCEqH55gAAABGYZs0snuxZgR2bOTS7zKzVdB1S5ZIhWfYKTBf/zIGhEYWsx7usIvfGk1sduusG66oapz4yAomOhlCpfBxwYCR40VxD7HTRM7Pf/pUOpCAZAhXyik0+O3f8UwP6jFavd0vFfLSk38vKR/4Rden15AAYWMFDbRdP92kLOYDCqk/ZkUT7+vZvp/rntH1zzJKKLc8ts+vdou/wAoxSheZbcHxu8disMonsc9Ezz83Q4YJYKIF/J4nKhWR07WklLgz2unYfGISpakg+XVg+VF0IPnc6ZssIPQ5Tatlr8Ixxk/wDovCXP0/7awjh52GmswYk6rKFKz/spVSMPzdv2EfMK8e81Q/8okO+uKon21uawhvCjKcmNFIAWpELoX6iwrq2wBK1gi4e5OdI4TaYPt7NmnW7OLFXrdAY2jMwxsREkcVoold+VxiQOcRxjj/bWbCV4K/QpEgaYNzkJ6BDiw9Cj/YWCaNF/FCmaQgAAAAAAAARVhJRjgAAABNTQAqAAAACAABh2kABAAAAAEAAAAaAAAAAAACoAIABAAAAAEAAAJVoAMABAAAAAEAAABLAAAAAA==" width="597" height="75" class="img_ev3q"></p>
<p>You’ll see a confirmation that you’re now logged in to your Apify account. When you run the code, the API token is picked up automatically from this login, so the script can call the <strong>Send Email Actor</strong>.</p>
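<p>If the email step later fails with an authentication error, a quick sanity check can help narrow it down. The sketch below assumes the token reaches your script as the <code>APIFY_TOKEN</code> environment variable when it is started with <code>apify run</code>. That variable name and the helper <code>has_apify_token</code> are assumptions for illustration, not part of the project template.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">import os</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">def has_apify_token(env=None):</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    """Return True if a non-empty APIFY_TOKEN is present in the given mapping."""</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    # Assumption: the token is exposed as the APIFY_TOKEN environment variable.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    env = os.environ if env is None else env</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    return bool(env.get('APIFY_TOKEN', '').strip())</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">if __name__ == '__main__':</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    if has_apify_token():</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        print('APIFY_TOKEN is set, so Actor.start() should be able to authenticate.')</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    else:</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        print('APIFY_TOKEN is missing. Try running "apify login" again.')</span><br></div></code></pre></div></div>
<p>The helper accepts an optional mapping so it can be checked without touching the real environment, for example <code>has_apify_token({'APIFY_TOKEN': 'abc'})</code>.</p>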
<p>If you run into any issues, double-check that your code matches the full version below:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> apify </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Actor</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> BeautifulSoupCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> BeautifulSoupCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> 
</span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Enter the context of the Actor.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> Actor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Create a crawler.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> BeautifulSoupCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Limit the crawl to max requests. 
Remove or increase it for crawling all links.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">50</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Define a request handler, which will be called for every request.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">request_handler</span><span class="token punctuation" 
style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> BeautifulSoupCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Scraping </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            
</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Select the product name and price elements.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            product_name_element </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">soup</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">find</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'div'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> class_</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'productname'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            product_price_element </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">soup</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">find</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'span'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">id</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'product-price-395001'</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Extract the desired data.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'url'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'product_name'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> product_name_element</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">strip</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span 
class="token plain"> product_name_element </span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'price'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">float</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">product_price_element</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'data-price-amount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> product_price_element </span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            price_threshold </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">80</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">            </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Skip the alert if the price could not be scraped.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'price'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">is</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">and</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'price'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">&lt;</span><span class="token plain"> price_threshold</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                actor_run </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">start</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    actor_id</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"apify/send-mail"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    run_input</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                        </span><span class="token string" 
style="color:#e3116c">"to"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"your_email@gmail.com"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                        </span><span class="token string" style="color:#e3116c">"subject"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Python Price Alert"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                        </span><span class="token string" style="color:#e3116c">"text"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:#e3116c">f"The price of '</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">data</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">[</span><span class="token string-interpolation interpolation string" style="color:#e3116c">'product_name'</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">]</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">' has dropped below $</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">price_threshold</span><span class="token string-interpolation interpolation 
punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> and is now $</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">data</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">[</span><span class="token string-interpolation interpolation string" style="color:#e3116c">'price'</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">]</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">.\n\nCheck it out here: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">data</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">[</span><span class="token string-interpolation interpolation string" style="color:#e3116c">'url'</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">]</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">                Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f"Email sent with run ID: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">actor_run</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation builtin">id</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Store the extracted data to the default dataset.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Run the crawler with the starting requests.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<blockquote>
<p>🔖 Replace the placeholder email address with your actual email, the one where you want to receive notifications. Make sure it matches the email you used to register your <strong>Apify account</strong>.</p>
</blockquote>
<p>Then, run the code using:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">apify run </span><span class="token parameter variable" style="color:#36acaa">--purge</span><br></div></code></pre></div></div>
<p>If everything works correctly, you should receive an email like the one below in your inbox.</p>
<p><img decoding="async" loading="lazy" alt="price-alert" src="https://crawlee.dev/assets/images/price-alet-530cccd85b681fd98e32a81e4f52e488.webp" width="1182" height="376" class="img_ev3q"></p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="4-deploying-your-actor">4. Deploying your Actor<a href="https://crawlee.dev/blog/crawlee-python-price-tracker#4-deploying-your-actor" class="hash-link" aria-label="Direct link to 4. Deploying your Actor" title="Direct link to 4. Deploying your Actor" translate="no">​</a></h2>
<p>It’s time to deploy your Actor to the cloud, allowing it to take full advantage of the Apify Platform’s features.</p>
<p>Fortunately, this process is incredibly simple. Since you’re already logged into your account, just run the following command:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">apify push</span><br></div></code></pre></div></div>
<p>In just a few seconds, you’ll find your newly created Actor in your Apify account by navigating to <strong>Actors → Development → Price Tracking Actor</strong>.</p>
<p><img decoding="async" loading="lazy" alt="price-tracking-actor" src="https://crawlee.dev/assets/images/price-tracking-actor-c91e4f5243ea20363d2621424d89985f.webp" width="1898" height="754" class="img_ev3q"></p>
<p>Note that the <strong>Start URLs</strong> input has been reset to <strong>apify.com</strong>, so be sure to replace it with our target website:</p>
<p><a href="https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html" target="_blank" rel="noopener noreferrer">https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html</a></p>
<p>Once updated, click the green <em><strong>Save &amp; Start</strong></em> button at the bottom of the page to run your Actor.</p>
<p>After the run completes, you’ll see a <strong>preview of the results</strong> in the <em><strong>Output</strong></em> tab. You can also export your data in multiple formats from the <em><strong>Storage</strong></em> tab.</p>
<p><img decoding="async" loading="lazy" alt="actor-run" src="https://crawlee.dev/assets/images/actor-run-faa6f7deb56846b88c7d446e9eb05e1d.webp" width="1675" height="450" class="img_ev3q"></p>
<p><strong>Export dataset:</strong></p>
<p><img decoding="async" loading="lazy" alt="actor-export-dataset" src="https://crawlee.dev/assets/images/export-dataset-9d56cd86006ff21fbbd695a72cd5529c.webp" width="1673" height="562" class="img_ev3q"></p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="5-schedule-your-runs">5. Schedule your runs<a href="https://crawlee.dev/blog/crawlee-python-price-tracker#5-schedule-your-runs" class="hash-link" aria-label="Direct link to 5. Schedule your runs" title="Direct link to 5. Schedule your runs" translate="no">​</a></h2>
<p>Now, a <strong>price monitoring script</strong> wouldn’t be very effective unless it ran on a schedule, automatically checking the product’s price and notifying us when it drops below the threshold.</p>
<p>Since our Actor is already deployed on <strong>Apify</strong>, scheduling it to run, say, every hour, is incredibly simple.</p>
<p>On your Actor page, click the three dots in the top-right corner of the screen and select <strong>“Schedule Actor.”</strong></p>
<p><img decoding="async" loading="lazy" alt="schedule-run" src="https://crawlee.dev/assets/images/schedule-run-3c2c1975cb23d5f4bdbe8116172a2a47.webp" width="1684" height="689" class="img_ev3q"></p>
<p>Next, choose how often you want your Actor to run, and that’s it! Your script will now run in the cloud, continuously monitoring the product’s price and sending you an email notification whenever it goes on sale.</p>
<p><img decoding="async" loading="lazy" alt="actor-schedule" src="https://crawlee.dev/assets/images/actor-schedule-2fe3df75d91fa3270776f814ed6888dc.webp" width="692" height="716" class="img_ev3q"></p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="thats-a-wrap">That’s a wrap!<a href="https://crawlee.dev/blog/crawlee-python-price-tracker#thats-a-wrap" class="hash-link" aria-label="Direct link to That’s a wrap!" title="Direct link to That’s a wrap!" translate="no">​</a></h2>
<p>Congratulations on completing this tutorial! I hope you enjoyed getting your feet wet with Crawlee and feel confident enough to tweak the code to build your own price tracker.</p>
<p>We’ve only scratched the surface of what Apify and Crawlee can do. As a next step, join our <a href="https://discord.com/invite/jyEM2PRvMU" target="_blank" rel="noopener noreferrer">Discord community</a> to connect with other web scraping developers and stay up to date with the latest news about Crawlee and Apify!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to scrape Bluesky with Python]]></title>
            <link>https://crawlee.dev/blog/scrape-bluesky-using-python</link>
            <guid>https://crawlee.dev/blog/scrape-bluesky-using-python</guid>
            <pubDate>Thu, 20 Mar 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to scrape Bluesky using Crawlee for Python]]></description>
            <content:encoded><![CDATA[<p><a href="https://bsky.app/" target="_blank" rel="noopener noreferrer">Bluesky</a> is an emerging social network developed by former members of the <a href="https://x.com/" target="_blank" rel="noopener noreferrer">Twitter</a> (now X) development team. The platform has been growing significantly, reaching 140.3 million visits according to <a href="https://www.similarweb.com/website/bsky.app/#traffic" target="_blank" rel="noopener noreferrer">SimilarWeb</a>. Like X, Bluesky generates a vast amount of data that can be used for analysis. In this article, we’ll explore how to collect this data using <a href="https://github.com/apify/crawlee-python" target="_blank" rel="noopener noreferrer">Crawlee for Python</a>.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>One of our community members wrote this post as a contribution to the Crawlee Blog. If you’d like to contribute articles like this one, please reach out to us on our <a href="https://apify.com/discord" target="_blank" rel="noopener noreferrer">Discord channel</a>.</p></div></div>
<p><img decoding="async" loading="lazy" alt="Banner article" src="https://crawlee.dev/assets/images/scrape-bluesky-using-python-723c9a74dadb375da06226b1a6a29e10.webp" width="1152" height="648" class="img_ev3q"></p>
<p>Key steps we will cover:</p>
<ol>
<li class="">Project setup</li>
<li class="">Development of the Bluesky crawler in Python</li>
<li class="">Creating an Apify Actor for the Bluesky crawler</li>
<li class="">Conclusion and repository access</li>
</ol>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="prerequisites">Prerequisites<a href="https://crawlee.dev/blog/scrape-bluesky-using-python#prerequisites" class="hash-link" aria-label="Direct link to Prerequisites" title="Direct link to Prerequisites" translate="no">​</a></h2>
<ul>
<li class="">Basic understanding of web scraping concepts</li>
<li class="">Python 3.9 or higher</li>
<li class=""><a href="https://docs.astral.sh/uv/" target="_blank" rel="noopener noreferrer">UV</a> version 0.6.0 or higher</li>
<li class="">Crawlee for Python v0.6.5 or higher</li>
<li class="">Bluesky account for API access</li>
</ul>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="project-setup">Project setup<a href="https://crawlee.dev/blog/scrape-bluesky-using-python#project-setup" class="hash-link" aria-label="Direct link to Project setup" title="Direct link to Project setup" translate="no">​</a></h3>
<p>In this project, we’ll use UV for package management and a specific Python version installed through UV. UV is a fast and modern package manager written in Rust.</p>
<ol>
<li class="">
<p>If you don’t have UV installed yet, follow the <a href="https://docs.astral.sh/uv/getting-started/installation/" target="_blank" rel="noopener noreferrer">guide</a> or use this command:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token function" style="color:#d73a49">curl</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-LsSf</span><span class="token plain"> https://astral.sh/uv/install.sh </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">sh</span><br></div></code></pre></div></div>
</li>
<li class="">
<p>Install standalone Python using UV:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">uv python </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">3.13</span><br></div></code></pre></div></div>
</li>
<li class="">
<p>Create a new project and install Crawlee for Python:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">uv init bluesky-crawlee </span><span class="token parameter variable" style="color:#36acaa">--package</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token builtin class-name">cd</span><span class="token plain"> bluesky-crawlee</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">uv </span><span class="token function" style="color:#d73a49">add</span><span class="token plain"> crawlee</span><br></div></code></pre></div></div>
</li>
</ol>
<p>We’ve created a new isolated Python project with all the necessary dependencies for Crawlee.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="development-of-the-bluesky-crawler-in-python">Development of the Bluesky crawler in Python<a href="https://crawlee.dev/blog/scrape-bluesky-using-python#development-of-the-bluesky-crawler-in-python" class="hash-link" aria-label="Direct link to Development of the Bluesky crawler in Python" title="Direct link to Development of the Bluesky crawler in Python" translate="no">​</a></h2>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>Before going ahead with the project, I'd like to ask you to star Crawlee for Python on <a href="https://github.com/apify/crawlee-python/" target="_blank" rel="noopener noreferrer">GitHub</a>. It helps us spread the word to fellow scraper developers.</p></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-identifying-the-data-source">1. Identifying the data source<a href="https://crawlee.dev/blog/scrape-bluesky-using-python#1-identifying-the-data-source" class="hash-link" aria-label="Direct link to 1. Identifying the data source" title="Direct link to 1. Identifying the data source" translate="no">​</a></h3>
<p>When accessing the <a href="https://bsky.app/search?q=apify" target="_blank" rel="noopener noreferrer">search page</a>, you'll see data displayed, but be aware of a key limitation: the site only allows viewing the first page of results, preventing access to any additional pages.</p>
<p><img decoding="async" loading="lazy" alt="Search Limit" src="https://crawlee.dev/assets/images/search_limit-c8ee1da0dc9b48fdb6fb125600519ee3.webp" width="1920" height="1032" class="img_ev3q"></p>
<p>Fortunately, Bluesky provides a well-documented <a href="https://docs.bsky.app/docs/get-started" target="_blank" rel="noopener noreferrer">API</a> that is accessible to any registered user without additional permissions. This is what we’ll use for data collection.</p>
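<p>To make the idea concrete, here is a minimal sketch of how a paginated search request URL could be assembled. It assumes the <code>app.bsky.feed.searchPosts</code> endpoint with the <code>q</code>, <code>limit</code>, and <code>cursor</code> query parameters described in the API documentation. The <code>bsky.social</code> host and the helper name <code>build_search_url</code> are illustrative assumptions, not part of the crawler built in this article.</p>

```python
from urllib.parse import urlencode

# Assumed host for illustration; an authenticated session may report a
# different service endpoint that should be used instead.
SEARCH_ENDPOINT = 'https://bsky.social/xrpc/app.bsky.feed.searchPosts'


def build_search_url(query, limit=25, cursor=None):
    """Build a search URL (hypothetical helper, not part of the article's code)."""
    params = {'q': query, 'limit': limit}
    # The cursor returned by the previous response is what unlocks the next
    # page of results, which the website's search page does not expose.
    if cursor is not None:
        params['cursor'] = cursor
    return SEARCH_ENDPOINT + '?' + urlencode(params)


if __name__ == '__main__':
    print(build_search_url('apify', limit=10))
```

<p>Each response is expected to carry a cursor for the next page, so looping until no cursor is returned would walk through all available results.</p>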
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-creating-a-session-for-api-interaction">2. Creating a session for API interaction<a href="https://crawlee.dev/blog/scrape-bluesky-using-python#2-creating-a-session-for-api-interaction" class="hash-link" aria-label="Direct link to 2. Creating a session for API interaction" title="Direct link to 2. Creating a session for API interaction" translate="no">​</a></h3>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>For secure API interaction, you need to create a dedicated app password instead of using your main account password.</p><p>Go to Settings -&gt; Privacy and Security -&gt; <a href="https://bsky.app/settings/app-passwords" target="_blank" rel="noopener noreferrer">App Passwords</a> and click <em>Add App Password</em>.
Important: Save the generated password, as it won’t be visible after creation.</p></div></div>
<p>Next, create environment variables to store your credentials:</p>
<ul>
<li class="">Your application password</li>
<li class="">Your user identifier (found in your profile and Bluesky URL, for example: <a href="https://bsky.app/profile/mantisus.bsky.social" target="_blank" rel="noopener noreferrer"><code>mantisus.bsky.social</code></a>)</li>
</ul>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token builtin class-name">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:#36acaa">BLUESKY_APP_PASSWORD</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">your_app_password</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token builtin class-name">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:#36acaa">BLUESKY_IDENTIFIER</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">your_identifier</span><br></div></code></pre></div></div>
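<p>Because a missing variable would otherwise only surface as a failed login, it can help to validate both values before sending any request. The following is a minimal sketch under that assumption. The helper name <code>load_bluesky_credentials</code> and its optional <code>env</code> argument are hypothetical and not part of the original project.</p>

```python
import os


def load_bluesky_credentials(env=None):
    """Return (identifier, app_password), or raise if either variable is unset.

    Hypothetical helper for illustration; `env` defaults to os.environ and can
    be replaced with a plain dict in tests.
    """
    env = os.environ if env is None else env
    names = ('BLUESKY_IDENTIFIER', 'BLUESKY_APP_PASSWORD')
    # Treat empty strings the same as unset variables.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return env['BLUESKY_IDENTIFIER'], env['BLUESKY_APP_PASSWORD']
```

<p>Calling the helper once at startup gives a clear error message instead of an authentication failure later in the run.</p>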
<p>Using the <a href="https://docs.bsky.app/docs/api/com-atproto-server-create-session" target="_blank" rel="noopener noreferrer">createSession</a> and <a href="https://docs.bsky.app/docs/api/com-atproto-server-delete-session" target="_blank" rel="noopener noreferrer">deleteSession</a> endpoints together with <a href="https://www.python-httpx.org/" target="_blank" rel="noopener noreferrer"><code>httpx</code></a>, we can create a session for API interaction and close it when we are done.</p>
<p>Let us create a class with the necessary methods:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> json</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> os</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> traceback</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> httpx</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> yarl </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> URL</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" 
style="color:#00009f">import</span><span class="token plain"> ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> Request</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">configuration </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Configuration</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> HttpCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> HttpCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_clients </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> HttpxHttpClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">storages </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Dataset</span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Environment variables for authentication</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># BLUESKY_APP_PASSWORD: App-specific password generated from Bluesky settings</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># BLUESKY_IDENTIFIER: Your Bluesky handle (e.g., username.bsky.social)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">BLUESKY_APP_PASSWORD </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> os</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">getenv</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'BLUESKY_APP_PASSWORD'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">BLUESKY_IDENTIFIER </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> os</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">getenv</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'BLUESKY_IDENTIFIER'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" 
style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">class</span><span class="token plain"> </span><span class="token class-name">BlueskyApiScraper</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""A scraper class for extracting data from Bluesky social network using their official API.</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="display:inline-block;color:#e3116c"></span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    This scraper manages authentication, concurrent requests, and data collection for both</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    posts and user profiles. 
It uses separate datasets for storing post and user information.</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    """</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">__init__</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_crawler</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> HttpCrawler </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_users</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Dataset </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_posts</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Dataset </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Variables for storing session data</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_service_endpoint</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token 
builtin">str</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_user_did</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_access_token</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_refresh_token</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_handle</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">create_session</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">)</span><span class="token 
plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Create credentials for the session."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'https://bsky.social/xrpc/com.atproto.server.createSession'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        headers </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'Content-Type'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'application/json'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        data </span><span class="token operator" 
style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'identifier'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> BLUESKY_IDENTIFIER</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'password'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> BLUESKY_APP_PASSWORD</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        response </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> httpx</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">post</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> headers</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">headers</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> json</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">raise_for_status</span><span class="token punctuation" style="color:#393A34">(</span><span class="token 
punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">json</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_service_endpoint </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'didDoc'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'service'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'serviceEndpoint'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token 
punctuation" style="color:#393A34">.</span><span class="token plain">_user_did </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'didDoc'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'id'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_access_token </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'accessJwt'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_refresh_token </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'refreshJwt'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_handle </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> data</span><span class="token punctuation" 
style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'handle'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">delete_session</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Delete the current session."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:#e3116c">f'</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">self</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">_service_endpoint</span><span class="token string-interpolation interpolation punctuation" 
style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">/xrpc/com.atproto.server.deleteSession'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        headers </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'Content-Type'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'application/json'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'authorization'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:#e3116c">f'Bearer </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">self</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">_refresh_token</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        response </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> httpx</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">post</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> headers</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">headers</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">raise_for_status</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>The session expires after 2 hours, so if you plan to run your crawler for longer than that, you should also add a method to <a href="https://docs.bsky.app/docs/api/com-atproto-server-refresh-session" target="_blank" rel="noopener noreferrer">refresh</a> it.</p>
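<p>A minimal sketch of such a refresh step, assuming the <code>refreshSession</code> endpoint is authorized with the refresh token and returns new <code>accessJwt</code> and <code>refreshJwt</code> values. The names <code>build_refresh_request</code> and <code>refresh_tokens</code>, as well as the injectable <code>post</code> parameter, are our own and are not part of the original class.</p>

```python
from __future__ import annotations

from typing import Any, Callable

REFRESH_PATH = '/xrpc/com.atproto.server.refreshSession'


def build_refresh_request(service_endpoint: str, refresh_token: str) -> tuple[str, dict[str, str]]:
    """Return the URL and headers for a refreshSession call."""
    url = f'{service_endpoint}{REFRESH_PATH}'
    # Assumption: the endpoint expects the refresh token, not the access token
    headers = {'Authorization': f'Bearer {refresh_token}'}
    return url, headers


def refresh_tokens(
    service_endpoint: str,
    refresh_token: str,
    post: Callable[..., Any] | None = None,
) -> tuple[str, str]:
    """Exchange the refresh token for a new (access, refresh) token pair."""
    if post is None:
        # Imported lazily so the helper can be exercised with a fake `post`
        import httpx

        post = httpx.post

    url, headers = build_refresh_request(service_endpoint, refresh_token)
    response = post(url, headers=headers)
    response.raise_for_status()

    data = response.json()
    return data['accessJwt'], data['refreshJwt']
```

<p>Inside the class, this could be called as <code>self._access_token, self._refresh_token = refresh_tokens(self._service_endpoint, self._refresh_token)</code>. Any client that was already configured with the old access token in its headers would also need to be updated.</p>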
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-configuring-crawlee-for-python-for-data-collection">3. Configuring Crawlee for Python for data collection<a href="https://crawlee.dev/blog/scrape-bluesky-using-python#3-configuring-crawlee-for-python-for-data-collection" class="hash-link" aria-label="Direct link to 3. Configuring Crawlee for Python for data collection" title="Direct link to 3. Configuring Crawlee for Python for data collection" translate="no">​</a></h3>
<p>Since we’ll be using the official API, we do not need to worry about being blocked by Bluesky. However, we should still limit the number of requests to avoid overloading Bluesky's servers, so we will configure <a href="https://www.crawlee.dev/python/api/class/ConcurrencySettings" target="_blank" rel="noopener noreferrer"><code>ConcurrencySettings</code></a>. We’ll also configure <a href="https://www.crawlee.dev/python/api/class/HttpxHttpClient" target="_blank" rel="noopener noreferrer"><code>HttpxHttpClient</code></a> to send custom headers, including the <code>Authorization</code> header for the current session.</p>
<p>We’ll use 2 endpoints for data collection: <a href="https://docs.bsky.app/docs/api/app-bsky-feed-search-posts" target="_blank" rel="noopener noreferrer">searchPosts</a> for posts and <a href="https://docs.bsky.app/docs/api/app-bsky-actor-get-profile" target="_blank" rel="noopener noreferrer">getProfile</a> for user data. With getProfile, each user has a unique request URL, so Crawlee for Python deduplicates the requests for you. If you plan to scale the crawler, you can fetch users in batches with <a href="https://docs.bsky.app/docs/api/app-bsky-actor-get-profiles" target="_blank" rel="noopener noreferrer">getProfiles</a>, but in that case the URLs no longer map one-to-one to users, so you’ll need to implement the deduplication logic yourself.</p>
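<p>If you go the getProfiles route, the deduplication and batching could look roughly like the sketch below. It is not part of the crawler built in this article. The helper name <code>build_profile_urls</code> is hypothetical, and the batch size of 25 is our assumption about the per-request limit, so check it against the getProfiles documentation.</p>

```python
from __future__ import annotations

from urllib.parse import urlencode

# Assumed maximum number of actors per getProfiles request
MAX_ACTORS_PER_REQUEST = 25


def build_profile_urls(service_endpoint: str, dids: list[str]) -> list[str]:
    """Build getProfiles URLs for a list of user DIDs, skipping duplicates."""
    # dict.fromkeys drops repeated DIDs while keeping first-seen order
    unique = list(dict.fromkeys(dids))

    urls = []
    for start in range(0, len(unique), MAX_ACTORS_PER_REQUEST):
        batch = unique[start:start + MAX_ACTORS_PER_REQUEST]
        # The `actors` parameter is repeated once per user
        query = urlencode([('actors', did) for did in batch])
        urls.append(f'{service_endpoint}/xrpc/app.bsky.actor.getProfiles?{query}')
    return urls
```

<p>This only removes duplicates within a single list. For a crawl that spans many search pages, you would also need to remember which users you have already requested, for example in a <code>set</code> kept on the crawler class.</p>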
<p>I’d like to store user and post data separately, so we’ll use different <a href="https://www.crawlee.dev/python/api/class/Dataset" target="_blank" rel="noopener noreferrer"><code>Dataset</code></a> instances for storage.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">init_crawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Initialize the crawler."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_user_did</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" 
style="color:#00009f">raise</span><span class="token plain"> ValueError</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'Session not created.'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Initialize the datasets and purge any existing data</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_users </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> Dataset</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">open</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">name</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'users'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> configuration</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">Configuration</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">purge_on_start</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_posts </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> Dataset</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">open</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">name</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'posts'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> configuration</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">Configuration</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">purge_on_start</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Initialize the crawler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> HttpCrawler</span><span class="token 
punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">100</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        http_client</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">HttpxHttpClient</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Set headers for API requests</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            headers</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'Content-Type'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'application/json'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'Authorization'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string-interpolation string" 
style="color:#e3116c">f'Bearer </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">self</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">_access_token</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'Connection'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'Keep-Alive'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'accept-encoding'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'gzip, deflate, br, zstd'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Configuring concurrency of crawling requests</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        concurrency_settings</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            min_concurrency</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">10</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            desired_concurrency</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">10</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            max_concurrency</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">30</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            max_tasks_per_minute</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">200</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token 
punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">router</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">default_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_search_handler</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># Handler for search requests</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">router</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'user'</span><span class="token punctuation" style="color:#393A34">)</span><span 
class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_user_handler</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># Handler for user requests</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="4-implementing-handlers-for-data-collection">4. Implementing handlers for data collection<a href="https://crawlee.dev/blog/scrape-bluesky-using-python#4-implementing-handlers-for-data-collection" class="hash-link" aria-label="Direct link to 4. Implementing handlers for data collection" title="Direct link to 4. Implementing handlers for data collection" translate="no">​</a></h3>
<p>Now we can implement the handler for searching posts. We’ll save the retrieved posts in <code>self._posts</code> and create requests for user data, placing them in the crawler's queue. We also need to handle pagination by forming the link to the next search page.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">_search_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> HttpCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Processing search </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span 
class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> json</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">loads</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'posts'</span><span class="token plain"> </span><span class="token keyword" 
style="color:#00009f">not</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">warning</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'No posts found in response: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    user_requests </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> 
</span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    posts </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    profile_url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> URL</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">self</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">_service_endpoint</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">/xrpc/app.bsky.actor.getProfile'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> post </span><span class="token keyword" style="color:#00009f">in</span><span 
class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'posts'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Add user request if not already added in current context</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'author'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'did'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> user_requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            user_requests</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'author'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" 
style="color:#e3116c">'did'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                url</span><span class="token operator" style="color:#393A34">=</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">profile_url</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">with_query</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">actor</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'author'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'did'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                user_data</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'label'</span><span 
class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'user'</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        posts</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">append</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'uri'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'uri'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'cid'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span 
class="token string" style="color:#e3116c">'cid'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'author_did'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'author'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'did'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'created'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'record'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'createdAt'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'indexed'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" 
style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'indexedAt'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'reply_count'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'replyCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'repost_count'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'repostCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'like_count'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'likeCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">                </span><span class="token string" style="color:#e3116c">'quote_count'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'quoteCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'text'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'record'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'text'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'langs'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'; '</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">join</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'record'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'langs'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'reply_parent'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'record'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'reply'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'parent'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token 
punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'uri'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'reply_root'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'record'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'reply'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'root'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'uri'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_posts</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">posts</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># Push a batch of posts to the dataset</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_requests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token 
builtin">list</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">user_requests</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">values</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> cursor </span><span class="token operator" style="color:#393A34">:=</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'cursor'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        next_url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> URL</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">update_query</span><span class="token 
punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'cursor'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> cursor</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># Use yarl to update the query string</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_requests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">next_url</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
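<p>To make the pagination step easier to follow, here is a minimal standalone sketch of what setting the <code>cursor</code> query parameter amounts to. It uses only the standard library's <code>urllib.parse</code> rather than yarl, and the helper name <code>with_cursor</code> and the example URL are hypothetical, chosen for illustration only. The handler above keeps using yarl's <code>update_query</code>.</p>

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_cursor(url: str, cursor: str) -> str:
    """Return `url` with its `cursor` query parameter set to `cursor`.

    Hypothetical helper for illustration; not part of Crawlee or yarl.
    """
    parts = urlsplit(url)
    # Keep every existing query parameter except a stale cursor.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != 'cursor'
    ]
    # Append the cursor returned by the previous page of results.
    query.append(('cursor', cursor))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == '__main__':
    # The search parameters are preserved; only the cursor is added or replaced.
    print(with_cursor('https://example.com/search?q=python', 'abc'))
```

<p>Because any previous <code>cursor</code> value is dropped before the new one is appended, each follow-up request differs from the last only in its cursor, so the search query itself stays unchanged from page to page.</p>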
<p>When we receive user data, we'll store it in the corresponding dataset, <code>self._users</code>.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">_user_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> HttpCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Processing user </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span 
class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> json</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">loads</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    user_item </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'did'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'did'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'created'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'createdAt'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'avatar'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'avatar'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'description'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'description'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'display_name'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'displayName'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'handle'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'handle'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'indexed'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" 
style="color:#e3116c">'indexedAt'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'posts_count'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'postsCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'followers_count'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'followersCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'follows_count'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'followsCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" 
style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_users</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">user_item</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
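<p>The handler above reads some fields with <code>data['...']</code> and others with <code>data.get(...)</code>. In Python, the first form raises <code>KeyError</code> when the key is missing, while the second returns <code>None</code>. Here is a minimal sketch of that difference on a trimmed subset of the fields. The <code>build_user_item</code> helper and the sample payload are hypothetical and exist only for illustration:</p>

```python
import json


def build_user_item(data: dict) -> dict:
    """Map a raw profile payload to a flat item (trimmed to four fields)."""
    return {
        # Fields read with [] are treated as required: a missing key raises KeyError.
        'did': data['did'],
        'handle': data['handle'],
        # Fields read with .get() are treated as optional: a missing key becomes None.
        'avatar': data.get('avatar'),
        'display_name': data.get('displayName'),
    }


# Hypothetical payload, not a real API response.
sample = json.loads('{"did": "did:plc:example", "handle": "example.bsky.social"}')
user_item = build_user_item(sample)
print(user_item)
```

<p>Whether a field should be treated as required or optional depends on what the API actually guarantees, so it is worth checking that before copying this split.</p>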
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="5-saving-data-to-files">5. Saving data to files<a href="https://crawlee.dev/blog/scrape-bluesky-using-python#5-saving-data-to-files" class="hash-link" aria-label="Direct link to 5. Saving data to files" title="Direct link to 5. Saving data to files" translate="no">​</a></h3>
<p>To save the results, we will use the dataset's <a href="https://www.crawlee.dev/python/api/class/Dataset#write_to_json" target="_blank" rel="noopener noreferrer"><code>write_to_json</code></a> method.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">save_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Save the data."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_users </span><span class="token keyword" style="color:#00009f">or</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> self</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">_posts</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">raise</span><span class="token plain"> ValueError</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'Datasets not initialized.'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> </span><span class="token builtin">open</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'users.json'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'w'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> f</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_users</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">write_to_json</span><span class="token punctuation" style="color:#393A34">(</span><span 
class="token plain">f</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> indent</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">4</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> </span><span class="token builtin">open</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'posts.json'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'w'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> f</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_posts</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">write_to_json</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">f</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> indent</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">4</span><span class="token 
punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
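<p>To get a feel for what a file written with <code>indent=4</code> looks like, here is a rough stand-in that uses only the standard library. It is not Crawlee's implementation. It assumes that <code>indent</code> behaves like the parameter of the same name in <code>json.dump</code>, and the items are made up:</p>

```python
import io
import json

# Hypothetical items standing in for the contents of a dataset.
items = [{'handle': 'example.bsky.social', 'posts_count': 3}]

# StringIO stands in for the file object returned by open('users.json', 'w').
buffer = io.StringIO()
json.dump(items, buffer, indent=4)

output = buffer.getvalue()
print(output)
```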
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="6-running-the-crawler">6. Running the crawler<a href="https://crawlee.dev/blog/scrape-bluesky-using-python#6-running-the-crawler" class="hash-link" aria-label="Direct link to 6. Running the crawler" title="Direct link to 6. Running the crawler" translate="no">​</a></h3>
<p>We now have everything needed to complete the crawler. All that is left is a method that starts the crawl. Let us call it <code>crawl</code>:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">crawl</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> queries</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">list</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Crawl the given URL."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token keyword" 
style="color:#00009f">not</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_crawler</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">raise</span><span class="token plain"> ValueError</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'Crawler not initialized.'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    search_url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> URL</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">self</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">_service_endpoint</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">/xrpc/app.bsky.feed.searchPosts'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">search_url</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">with_query</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">q</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">query</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> query </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> queries</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
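<p>The <code>crawl</code> method turns each search query into its own start URL by setting the <code>q</code> query parameter. The sketch below shows the same idea with the standard library's <code>urlencode</code> instead of <code>URL.with_query</code>. The <code>build_search_urls</code> helper and the endpoint value are placeholders for illustration only:</p>

```python
from urllib.parse import urlencode


def build_search_urls(search_url: str, queries: list[str]) -> list[str]:
    """Return one search URL per query, with the query passed as `q`."""
    return [search_url + '?' + urlencode({'q': query}) for query in queries]


# Placeholder endpoint; the scraper uses its own service endpoint here.
service_endpoint = 'https://example.invalid'
search_url = f'{service_endpoint}/xrpc/app.bsky.feed.searchPosts'

start_urls = build_search_urls(search_url, ['python', 'web scraping'])
for url in start_urls:
    print(url)
```

<p>Note that <code>urlencode</code> escapes characters that are not URL-safe, so a query containing a space still produces a valid URL.</p>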
<p>Let's finalize the code:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Main execution function that orchestrates the crawling process.</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="display:inline-block;color:#e3116c"></span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    Creates a scraper instance, manages the session, and handles the complete</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    crawling lifecycle including proper cleanup on completion or error.</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string 
string" style="color:#e3116c">    """</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    scraper </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> BlueskyApiScraper</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    scraper</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">create_session</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">try</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> scraper</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">init_crawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> scraper</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawl</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" 
style="color:#e3116c">'python'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'apify'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'crawlee'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> scraper</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">save_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">except</span><span class="token plain"> Exception</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        traceback</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">print_exc</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">finally</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        scraper</span><span 
class="token punctuation" style="color:#393A34">.</span><span class="token plain">delete_session</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Entry point for the crawler application."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
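<p>The <code>try</code>/<code>except</code>/<code>finally</code> structure in <code>run</code> matters because the session should be deleted even when crawling fails. The following sketch demonstrates that ordering with a hypothetical <code>StubScraper</code> that records its calls and always fails during the crawl. It is not the real <code>BlueskyApiScraper</code>:</p>

```python
import asyncio


class StubScraper:
    """Hypothetical stand-in for the scraper that only records method calls."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def create_session(self) -> None:
        self.calls.append('create_session')

    async def crawl(self) -> None:
        self.calls.append('crawl')
        raise RuntimeError('simulated crawl failure')

    def delete_session(self) -> None:
        self.calls.append('delete_session')


async def run_stub() -> list[str]:
    scraper = StubScraper()
    scraper.create_session()
    try:
        await scraper.crawl()
    except Exception as exc:
        # The error is handled here instead of propagating to the caller.
        scraper.calls.append(f'handled: {exc}')
    finally:
        # Runs whether or not the crawl raised.
        scraper.delete_session()
    return scraper.calls


calls = asyncio.run(run_stub())
print(calls)
```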
<p>If you check your <code>pyproject.toml</code>, you will see that UV created an entry point, <code>bluesky-crawlee = "bluesky_crawlee:main"</code>, so we can run our crawler simply by executing:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">uv run bluesky-crawlee</span><br></div></code></pre></div></div>
<p>Let's look at sample results:</p>
<p>Posts</p>
<p><img decoding="async" loading="lazy" alt="Posts Example" src="https://crawlee.dev/assets/images/posts-9156686b24a69b73efbc3915f1c8d18e.webp" width="1411" height="646" class="img_ev3q"></p>
<p>Users</p>
<p><img decoding="async" loading="lazy" alt="Users Example" src="https://crawlee.dev/assets/images/users-d896c9f24165a0e970d2b26c54def9eb.webp" width="1398" height="517" class="img_ev3q"></p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="create-apify-actor-for-bluesky-crawler">Create Apify Actor for Bluesky crawler<a href="https://crawlee.dev/blog/scrape-bluesky-using-python#create-apify-actor-for-bluesky-crawler" class="hash-link" aria-label="Direct link to Create Apify Actor for Bluesky crawler" title="Direct link to Create Apify Actor for Bluesky crawler" translate="no">​</a></h2>
<p>We already have a fully functional implementation for local execution. Let us explore how to adapt it for running on the <a href="https://apify.com/" target="_blank" rel="noopener noreferrer">Apify Platform</a> and turn it into an <a href="https://docs.apify.com/platform/actors" target="_blank" rel="noopener noreferrer">Apify Actor</a>.</p>
<p>An Actor is a simple and efficient way to deploy your code in the cloud infrastructure on the Apify Platform. You can flexibly interact with the Actor, <a href="https://docs.apify.com/platform/schedules" target="_blank" rel="noopener noreferrer">schedule regular runs</a> for monitoring data, or <a href="https://docs.apify.com/platform/integrations" target="_blank" rel="noopener noreferrer">integrate</a> with other tools to build data processing flows.</p>
<p>First, create an <code>.actor</code> directory with platform configuration files:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token function" style="color:#d73a49">mkdir</span><span class="token plain"> .actor </span><span class="token operator" style="color:#393A34">&amp;&amp;</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">touch</span><span class="token plain"> .actor/</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain">actor.json,Dockerfile,input_schema.json</span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<p>Then add the <a href="https://docs.apify.com/sdk/python/" target="_blank" rel="noopener noreferrer">Apify SDK for Python</a> as a project dependency:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">uv </span><span class="token function" style="color:#d73a49">add</span><span class="token plain"> apify</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="configure-dockerfile">Configure Dockerfile<a href="https://crawlee.dev/blog/scrape-bluesky-using-python#configure-dockerfile" class="hash-link" aria-label="Direct link to Configure Dockerfile" title="Direct link to Configure Dockerfile" translate="no">​</a></h3>
<p>We’ll use the official <a href="https://docs.apify.com/academy/deploying-your-code/docker-file" target="_blank" rel="noopener noreferrer">Apify Docker image</a> along with recommended <a href="https://docs.astral.sh/uv/guides/integration/docker/" target="_blank" rel="noopener noreferrer">UV practices for Docker</a>:</p>
<div class="language-dockerfile codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-dockerfile codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token instruction keyword" style="color:#00009f">FROM</span><span class="token instruction"> apify/actor-python:3.13</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token instruction keyword" style="color:#00009f">ENV</span><span class="token instruction"> PATH=</span><span class="token instruction string" style="color:#e3116c">'/app/.venv/bin:$PATH'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token instruction keyword" style="color:#00009f">WORKDIR</span><span class="token instruction"> /app</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token instruction keyword" style="color:#00009f">COPY</span><span class="token instruction"> </span><span class="token instruction options property" style="color:#36acaa">--from</span><span class="token instruction options punctuation" style="color:#393A34">=</span><span class="token instruction options string" style="color:#e3116c">ghcr.io/astral-sh/uv:latest</span><span class="token instruction"> /uv /uvx 
/bin/</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token instruction keyword" style="color:#00009f">COPY</span><span class="token instruction"> pyproject.toml uv.lock ./</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token instruction keyword" style="color:#00009f">RUN</span><span class="token instruction"> uv sync --frozen --no-install-project --no-editable -q --no-dev</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token instruction keyword" style="color:#00009f">COPY</span><span class="token instruction"> . 
.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token instruction keyword" style="color:#00009f">RUN</span><span class="token instruction"> uv sync --frozen --no-editable -q --no-dev</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token instruction keyword" style="color:#00009f">CMD</span><span class="token instruction"> [</span><span class="token instruction string" style="color:#e3116c">"bluesky-crawlee"</span><span class="token instruction">]</span><br></div></code></pre></div></div>
<p>Here, <code>bluesky-crawlee</code> refers to the entrypoint specified in <code>pyproject.toml</code>.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="define-project-metadata-in-actorjson">Define project metadata in actor.json<a href="https://crawlee.dev/blog/scrape-bluesky-using-python#define-project-metadata-in-actorjson" class="hash-link" aria-label="Direct link to Define project metadata in actor.json" title="Direct link to Define project metadata in actor.json" translate="no">​</a></h3>
<p>The <code>actor.json</code> file contains project metadata for the Apify platform. Follow the <a href="https://docs.apify.com/platform/actors/development/actor-definition/actor-json" target="_blank" rel="noopener noreferrer">documentation for proper configuration</a>:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"actorSpecification"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"name"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Bluesky-Crawlee"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Bluesky - Crawlee"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"minMemoryMbytes"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token 
number" style="color:#36acaa">128</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"maxMemoryMbytes"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">2048</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Scrape data products from bluesky"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"version"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"0.1"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"meta"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"templateId"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> 
</span><span class="token string" style="color:#e3116c">"bluesky-crawlee"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"input"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"./input_schema.json"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"dockerfile"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"./Dockerfile"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="define-actor-input-parameters">Define Actor input parameters<a href="https://crawlee.dev/blog/scrape-bluesky-using-python#define-actor-input-parameters" class="hash-link" aria-label="Direct link to Define Actor input parameters" title="Direct link to Define Actor input parameters" translate="no">​</a></h3>
<p>Our crawler requires several external parameters. Let’s define them:</p>
<ul>
<li class=""><code>identifier</code>: The user's Bluesky identifier (stored encrypted as a secret input)</li>
<li class=""><code>appPassword</code>: The Bluesky app password (stored encrypted as a secret input)</li>
<li class=""><code>queries</code>: A list of search queries to crawl</li>
<li class=""><code>maxRequestsPerCrawl</code>: An optional request limit, useful for testing</li>
<li class=""><code>mode</code>: Whether to collect posts or data about the users who post on specific topics</li>
</ul>
<p>Configure the input schema following the <a href="https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1" target="_blank" rel="noopener noreferrer">specification</a>:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Bluesky - Crawlee"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"object"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"schemaVersion"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"properties"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" 
style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"identifier"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Bluesky identifier"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Bluesky identifier for API login"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"string"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"editor"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">"textfield"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"isSecret"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">true</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"appPassword"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Bluesky app password"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Bluesky app password for API"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"string"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"editor"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"textfield"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"isSecret"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">true</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"maxRequestsPerCrawl"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> 
</span><span class="token string" style="color:#e3116c">"Max requests per crawl"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Maximum number of requests for crawling"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"integer"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"queries"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Queries"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"array"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Search queries"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"editor"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"stringList"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"prefill"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"apify"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" 
style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"example"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"apify"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"crawlee"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"mode"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"title"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Mode"</span><span class="token punctuation" 
style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"string"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Collect posts or users who post on a topic"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"enum"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"posts"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"users"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"default"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"posts"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"required"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"identifier"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"appPassword"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"queries"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token 
string" style="color:#e3116c">"mode"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
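<p>As a rough illustration of what this schema enforces, the sketch below checks a plain input dictionary against the same rules: the four required fields, the <code>posts</code>/<code>users</code> enum for <code>mode</code>, and an optional integer <code>maxRequestsPerCrawl</code>. The <code>ActorInput</code> dataclass and the <code>parse_input</code> helper are hypothetical names introduced for this example. They are not part of the Apify SDK, and the platform performs its own validation based on <code>input_schema.json</code>.</p>

```python
from dataclasses import dataclass
from typing import Any, Optional

# Mirrors the "required" and "enum" entries of input_schema.json above.
REQUIRED_FIELDS = ('identifier', 'appPassword', 'queries', 'mode')
ALLOWED_MODES = ('posts', 'users')


@dataclass
class ActorInput:
    """Hypothetical container for validated Actor input."""

    identifier: str
    app_password: str
    queries: list[str]
    mode: str
    max_requests_per_crawl: Optional[int] = None


def parse_input(raw: dict[str, Any]) -> ActorInput:
    """Validate a raw input dict and convert it to ActorInput."""
    missing = [name for name in REQUIRED_FIELDS if name not in raw]
    if missing:
        raise ValueError(f'Missing required input fields: {missing}')

    if raw['mode'] not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {ALLOWED_MODES}, got {raw['mode']!r}")

    queries = raw['queries']
    if not isinstance(queries, list) or not all(isinstance(q, str) for q in queries):
        raise ValueError('queries must be a list of strings')

    # maxRequestsPerCrawl is optional in the schema, so None is allowed.
    max_requests = raw.get('maxRequestsPerCrawl')
    if max_requests is not None and not isinstance(max_requests, int):
        raise ValueError('maxRequestsPerCrawl must be an integer')

    return ActorInput(
        identifier=raw['identifier'],
        app_password=raw['appPassword'],
        queries=queries,
        mode=raw['mode'],
        max_requests_per_crawl=max_requests,
    )
```

<p>A helper along these lines can be handy for local runs, where nothing validates the input before your code receives it.</p>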
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="update-project-code">Update project code<a href="https://crawlee.dev/blog/scrape-bluesky-using-python#update-project-code" class="hash-link" aria-label="Direct link to Update project code" title="Direct link to Update project code" translate="no">​</a></h3>
<p>Remove environment variables and parameterize the code according to the Actor input parameters. Replace named datasets with the default dataset.</p>
<p>Add Actor logging:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># __init__.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> logging</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> apify</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ActorLogFormatter</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">handler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> logging</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">StreamHandler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token 
plain">handler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">setFormatter</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">ActorLogFormatter</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">apify_client_logger </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> logging</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">getLogger</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'apify_client'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">apify_client_logger</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">setLevel</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">logging</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">INFO</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">apify_client_logger</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">addHandler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">handler</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">apify_logger </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> logging</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">getLogger</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'apify'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">apify_logger</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">setLevel</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">logging</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">DEBUG</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">apify_logger</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">addHandler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">handler</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
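<p>To see what this kind of configuration does, here is a minimal standard-library sketch. It assumes a plain <code>logging.Formatter</code> in place of <code>ActorLogFormatter</code> and writes to an in-memory stream so the result can be inspected. The logger names <code>demo_client</code> and <code>demo_sdk</code> are placeholders, not names used by the Apify SDK. Two loggers share one handler, but each one filters records by its own level.</p>

```python
import io
import logging

# In-memory stream so the emitted records can be inspected afterwards.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
# Assumption: a plain Formatter stands in for ActorLogFormatter.
handler.setFormatter(logging.Formatter('%(levelname)s %(name)s: %(message)s'))

# Placeholder logger at INFO level: DEBUG records are dropped.
client_logger = logging.getLogger('demo_client')
client_logger.setLevel(logging.INFO)
client_logger.addHandler(handler)
client_logger.propagate = False  # keep the demo isolated from the root logger

# Placeholder logger at DEBUG level: DEBUG records pass through.
sdk_logger = logging.getLogger('demo_sdk')
sdk_logger.setLevel(logging.DEBUG)
sdk_logger.addHandler(handler)
sdk_logger.propagate = False

client_logger.debug('dropped: below INFO')
sdk_logger.debug('kept: DEBUG is enabled')

output = stream.getvalue()
print(output)
```

<p>Because the level check happens on the logger before the shared handler is reached, only the message sent through the DEBUG-level logger ends up in the stream.</p>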
<p>Update imports and entry point code:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> json</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> traceback</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> dataclasses </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> dataclass</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> httpx</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> apify </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Actor</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> yarl </span><span class="token keyword" 
style="color:#00009f">import</span><span class="token plain"> URL</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> Request</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> HttpCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> HttpCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_clients </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> HttpxHttpClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@dataclass</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">class</span><span class="token plain"> </span><span class="token class-name">ActorInput</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Actor input schema."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    identifier</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    app_password</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    queries</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">list</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    mode</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    max_requests_per_crawl</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 
</span><span class="token builtin">int</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Main execution function that orchestrates the crawling process.</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="display:inline-block;color:#e3116c"></span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    Creates a scraper instance, manages 
the session, and handles the complete</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    crawling lifecycle including proper cleanup on completion or error.</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    """</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> Actor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token builtin">raw_input</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_input</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        actor_input </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ActorInput</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            identifier</span><span class="token operator" style="color:#393A34">=</span><span class="token builtin">raw_input</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'identifier'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">''</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            app_password</span><span class="token operator" style="color:#393A34">=</span><span class="token builtin">raw_input</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'appPassword'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">''</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            queries</span><span class="token operator" style="color:#393A34">=</span><span class="token builtin">raw_input</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'queries'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span 
class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            mode</span><span class="token operator" style="color:#393A34">=</span><span class="token builtin">raw_input</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'mode'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'posts'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token builtin">raw_input</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'maxRequestsPerCrawl'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        scraper </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> BlueskyApiScraper</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">actor_input</span><span class="token punctuation" style="color:#393A34">.</span><span class="token 
plain">mode</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> actor_input</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">max_requests_per_crawl</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">try</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            scraper</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">create_session</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">actor_input</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">identifier</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> actor_input</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">app_password</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> scraper</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">init_crawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      
      </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> scraper</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawl</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">actor_input</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">queries</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">except</span><span class="token plain"> httpx</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">HTTPError </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> e</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">error</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'HTTP error occurred: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">e</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">          
  </span><span class="token keyword" style="color:#00009f">raise</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">except</span><span class="token plain"> Exception </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> e</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">error</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Unexpected error: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">e</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            traceback</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">print_exc</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">finally</span><span class="token punctuation" style="color:#393A34">:</span><span 
class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            scraper</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">delete_session</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Entry point for the scraper application."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
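<p>The input-mapping step inside <code>run()</code> can also be pulled into a small helper so it can be exercised without the Apify platform. The sketch below is an illustration under stated assumptions, not part of the original project: <code>parse_actor_input</code> is a hypothetical name, and it assumes that a missing input arrives as <code>None</code> and should be treated as an empty dictionary. It uses only the standard library.</p>

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ActorInput:
    """Actor input schema (same fields as in the article, with defaults)."""

    identifier: str = ''
    app_password: str = ''
    queries: list = field(default_factory=list)
    mode: str = 'posts'
    max_requests_per_crawl: Optional[int] = None


def parse_actor_input(raw_input: Optional[dict]) -> ActorInput:
    """Map camelCase Actor input keys onto the snake_case dataclass fields.

    Hypothetical helper: a missing input (None) is treated as an empty
    dict so that every field falls back to its default.
    """
    data: dict[str, Any] = raw_input or {}
    return ActorInput(
        identifier=data.get('identifier', ''),
        app_password=data.get('appPassword', ''),
        queries=list(data.get('queries', [])),
        mode=data.get('mode', 'posts'),
        max_requests_per_crawl=data.get('maxRequestsPerCrawl'),
    )


# Example usage with a made-up input payload.
example = parse_actor_input({'identifier': 'user.bsky.social', 'queries': ['crawlee']})
print(example)
```

<p>Keeping the mapping in one function makes a misspelled input key easier to catch, since the helper can be checked with plain dictionaries before the Actor is deployed.</p>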
<p>Update the scraper's methods to accept the Actor input parameters:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">class</span><span class="token plain"> </span><span class="token class-name">BlueskyApiScraper</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""A scraper class for extracting data from Bluesky social network using their official API.</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="display:inline-block;color:#e3116c"></span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    This scraper manages authentication, concurrent requests, and data collection for both</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    posts and user profiles. 
It uses separate datasets for storing post and user information.</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    """</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">__init__</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> mode</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> max_request</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">int</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_crawler</span><span 
class="token punctuation" style="color:#393A34">:</span><span class="token plain"> HttpCrawler </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">mode </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> mode</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">max_request </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> max_request</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Variables for storing session data</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_service_endpoint</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token plain"> </span><span class="token operator" 
style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_user_did</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_access_token</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">_refresh_token</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_handle</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">create_session</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> identifier</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> password</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Create credentials for the session."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'https://bsky.social/xrpc/com.atproto.server.createSession'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        headers </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'Content-Type'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'application/json'</span><span class="token 
punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'identifier'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> identifier</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'password'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> password</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        response </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> httpx</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">post</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> headers</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">headers</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> json</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">data</span><span class="token 
punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">raise_for_status</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">json</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_service_endpoint </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'didDoc'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'service'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" 
style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'serviceEndpoint'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_user_did </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'didDoc'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'id'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_access_token </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'accessJwt'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_refresh_token </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'refreshJwt'</span><span class="token punctuation" 
style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_handle </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'handle'</span><span class="token punctuation" style="color:#393A34">]</span><br></div></code></pre></div></div>
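<p>The session code above reads the endpoint from the first entry of <code>didDoc.service</code>. If a DID document ever lists more than one service, that index may not point at the data server. Below is a minimal sketch of a more defensive lookup. It assumes the PDS entry carries the type <code>AtprotoPersonalDataServer</code>; <code>find_pds_endpoint</code> is a hypothetical helper for illustration and is not part of Crawlee or the original scraper.</p>

```python
from typing import Any


def find_pds_endpoint(did_doc: dict[str, Any]) -> str:
    """Return the PDS endpoint from a DID document (hypothetical helper)."""
    services = did_doc.get('service', [])

    for service in services:
        # Assumption: the PDS entry is marked with this type value.
        if service.get('type') == 'AtprotoPersonalDataServer':
            return service['serviceEndpoint']

    # Fall back to the first entry, which mirrors the session code above.
    if services:
        return services[0]['serviceEndpoint']

    raise ValueError('DID document lists no services')
```

<p>Inside <code>create_session</code>, this could replace the direct index with <code>find_pds_endpoint(data['didDoc'])</code>, so a missing service list fails with a clear error rather than an <code>IndexError</code>.</p>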
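<p>Implement the mode-aware data collection logic. In <code>posts</code> mode the handler pushes each post to the dataset; in <code>users</code> mode it enqueues one profile request per unique author:</p>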
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">_search_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> HttpCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Handle search requests based on mode."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token 
string-interpolation string" style="color:#e3116c">f'Processing search </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> json</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">loads</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    
</span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'posts'</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">warning</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'No posts found in response: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" 
style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    user_requests </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    posts </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    profile_url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> URL</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">self</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">_service_endpoint</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">/xrpc/app.bsky.actor.getProfile'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> post </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'posts'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">mode </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'users'</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">and</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'author'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'did'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> user_requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            user_requests</span><span class="token punctuation" 
style="color:#393A34">[</span><span class="token plain">post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'author'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'did'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                url</span><span class="token operator" style="color:#393A34">=</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">profile_url</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">with_query</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">actor</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'author'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'did'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" 
style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                user_data</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'label'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'user'</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">elif</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">mode </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'posts'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            posts</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">append</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token 
plain">                    </span><span class="token string" style="color:#e3116c">'uri'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'uri'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token string" style="color:#e3116c">'cid'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'cid'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token string" style="color:#e3116c">'author_did'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'author'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'did'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token string" style="color:#e3116c">'created'</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'record'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'createdAt'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token string" style="color:#e3116c">'indexed'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'indexedAt'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token string" style="color:#e3116c">'reply_count'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'replyCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token string" style="color:#e3116c">'repost_count'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" 
style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'repostCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token string" style="color:#e3116c">'like_count'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'likeCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token string" style="color:#e3116c">'quote_count'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'quoteCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token string" style="color:#e3116c">'text'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'record'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'text'</span><span class="token punctuation" 
style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token string" style="color:#e3116c">'langs'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'; '</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">join</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'record'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'langs'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token string" style="color:#e3116c">'reply_parent'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'record'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'reply'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'parent'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'uri'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token string" style="color:#e3116c">'reply_root'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> post</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'record'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span 
class="token string" style="color:#e3116c">'reply'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'root'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'uri'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span 
class="token plain">mode </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'posts'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">posts</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">else</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_requests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token builtin">list</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">user_requests</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">values</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> cursor </span><span class="token operator" style="color:#393A34">:=</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'cursor'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        next_url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> URL</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">update_query</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'cursor'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> cursor</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" 
style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_requests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">next_url</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Update the user handler for the default dataset:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">_user_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> HttpCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Handle user profile requests."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation 
string" style="color:#e3116c">f'Processing user </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> json</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">loads</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    user_item </span><span 
class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'did'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'did'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'created'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'createdAt'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'avatar'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'avatar'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        
</span><span class="token string" style="color:#e3116c">'description'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'description'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'display_name'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'displayName'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'handle'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'handle'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'indexed'</span><span class="token punctuation" style="color:#393A34">:</span><span 
class="token plain"> data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'indexedAt'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'posts_count'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'postsCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'followers_count'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'followersCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'follows_count'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'followsCount'</span><span class="token punctuation" style="color:#393A34">]</span><span 
class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">user_item</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="deploy">Deploy<a href="https://crawlee.dev/blog/scrape-bluesky-using-python#deploy" class="hash-link" aria-label="Direct link to Deploy" title="Direct link to Deploy" translate="no">​</a></h3>
<p>Use the official <a href="https://docs.apify.com/cli/" target="_blank" rel="noopener noreferrer">Apify CLI</a> to upload your code.</p>
<p>Authenticate using your API token from <a href="https://console.apify.com/settings/integrations" target="_blank" rel="noopener noreferrer">Apify Console</a>:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">apify login</span><br></div></code></pre></div></div>
<p>Choose "Enter API token manually" and paste your token.</p>
<p>Push the project to the platform:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">apify push</span><br></div></code></pre></div></div>
<p>Now you can configure runs on the Apify platform.</p>
<p>Let’s perform a test run:</p>
<p>Fill in the input parameters:</p>
<p><img decoding="async" loading="lazy" alt="Actor Input" src="https://crawlee.dev/assets/images/input_actor-20bb99df05dea1b2e799d92d6e3750f5.webp" width="918" height="899" class="img_ev3q"></p>
<p>Check that logging works correctly:</p>
<p><img decoding="async" loading="lazy" alt="Actor Log" src="https://crawlee.dev/assets/images/actor_log-c74fa12a02ea0ff9ec3f77cfcb02bc52.webp" width="1526" height="927" class="img_ev3q"></p>
<p>View results in the dataset:</p>
<p><img decoding="async" loading="lazy" alt="Dataset Results" src="https://crawlee.dev/assets/images/actor_results-dca44d296e6897737ef338a19b7b2177.webp" width="1650" height="880" class="img_ev3q"></p>
<p>If you want to make your Actor public and provide access to other users, potentially to earn income from it, follow this <a href="https://docs.apify.com/platform/actors/publishing" target="_blank" rel="noopener noreferrer">publishing guide</a> for <a href="https://apify.com/store" target="_blank" rel="noopener noreferrer">Apify Store</a>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conclusion-and-repository-access">Conclusion and repository access<a href="https://crawlee.dev/blog/scrape-bluesky-using-python#conclusion-and-repository-access" class="hash-link" aria-label="Direct link to Conclusion and repository access" title="Direct link to Conclusion and repository access" translate="no">​</a></h2>
<p>We’ve created an efficient crawler for Bluesky using the official API. If you want to learn more about regular data extraction from Bluesky, I recommend exploring <a href="https://docs.bsky.app/docs/starter-templates/custom-feeds" target="_blank" rel="noopener noreferrer">custom feed generation</a> - I think it opens up some interesting possibilities.</p>
<p>And if you need to quickly create a crawler that can retrieve data for various queries, you now have everything you need.</p>
<p>You can find the complete code in the <a href="https://github.com/Mantisus/bluesky-crawlee" target="_blank" rel="noopener noreferrer">repository</a>.</p>
<p>If you enjoyed this blog, feel free to support Crawlee for Python by starring the <a href="https://github.com/apify/crawlee-python" target="_blank" rel="noopener noreferrer">repository</a> or joining the maintainer team.</p>
<p>Have questions or want to discuss implementation details? Join our <a href="https://discord.com/invite/jyEM2PRvMU" target="_blank" rel="noopener noreferrer">Discord</a> - our community of 10,000+ developers is there to help.</p>]]></content:encoded>
            <category>community</category>
        </item>
        <item>
            <title><![CDATA[Crawlee for Python v0.6]]></title>
            <link>https://crawlee.dev/blog/crawlee-for-python-v06</link>
            <guid>https://crawlee.dev/blog/crawlee-for-python-v06</guid>
            <pubDate>Thu, 06 Mar 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Announcing the Crawlee for Python v0.6 release.]]></description>
            <content:encoded><![CDATA[<p>Crawlee for Python v0.6 is here, and it's packed with new features and important bug fixes. If you're upgrading from a previous version, please take a moment to review the breaking changes detailed below to ensure a smooth transition.</p>
<p><img decoding="async" loading="lazy" alt="Crawlee for Python v0.6.0" src="https://crawlee.dev/assets/images/crawlee_v060-5cdf895baf62d5ab5beea47ce6502dec.webp" width="1536" height="865" class="img_ev3q"></p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="getting-started">Getting started<a href="https://crawlee.dev/blog/crawlee-for-python-v06#getting-started" class="hash-link" aria-label="Direct link to Getting started" title="Direct link to Getting started" translate="no">​</a></h2>
<p>You can upgrade to the latest version straight from <a href="https://www.pypi.org/project/crawlee/" target="_blank" rel="noopener noreferrer">PyPI</a>:</p>
<div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pip </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">--upgrade</span><span class="token plain"> crawlee</span><br></div></code></pre></div></div>
<p>Check out the full changelog on our <a href="https://www.crawlee.dev/python/docs/changelog#060-2025-03-03" target="_blank" rel="noopener noreferrer">website</a> to see all the details. If you are updating from an older version, make sure to follow our <a href="https://www.crawlee.dev/python/docs/upgrading/upgrading-to-v0x#upgrading-to-v06" target="_blank" rel="noopener noreferrer">Upgrading to v0.6</a> guide.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="adaptive-playwright-crawler">Adaptive Playwright crawler<a href="https://crawlee.dev/blog/crawlee-for-python-v06#adaptive-playwright-crawler" class="hash-link" aria-label="Direct link to Adaptive Playwright crawler" title="Direct link to Adaptive Playwright crawler" translate="no">​</a></h2>
<p>The new <a href="https://www.crawlee.dev/python/api/class/AdaptivePlaywrightCrawler" target="_blank" rel="noopener noreferrer"><code>AdaptivePlaywrightCrawler</code></a> is a hybrid solution that combines the best of two worlds: full browser rendering with <a href="https://www.playwright.dev/" target="_blank" rel="noopener noreferrer">Playwright</a> and lightweight HTTP-based crawling (using, for example, <a href="https://www.crawlee.dev/python/api/class/BeautifulSoupCrawler" target="_blank" rel="noopener noreferrer"><code>BeautifulSoupCrawler</code></a> or <a href="https://www.crawlee.dev/python/api/class/ParselCrawler" target="_blank" rel="noopener noreferrer"><code>ParselCrawler</code></a>). It automatically switches between the two methods based on real-time analysis of the target page, helping you achieve lower crawl costs and improved performance when crawling a variety of websites.</p>
<p>The example below demonstrates how the <code>AdaptivePlaywrightCrawler</code> can handle both static and dynamic content.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> timedelta</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> AdaptivePlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> AdaptivePlaywrightCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" 
style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> AdaptivePlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">with_beautifulsoup_static_parser</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">5</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        playwright_crawler_specific_kwargs</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'browser_type'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'chromium'</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    
</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">request_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> AdaptivePlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Do some processing using 
`parsed_content`</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">parsed_content</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">title</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Locate element h2 within 5 seconds</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        h2 </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">query_selector_one</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'h2'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> timedelta</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">seconds</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">5</span><span class="token 
punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Do stuff with element found by the selector</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">h2</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Find more links and enqueue them.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">enqueue_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Save some data.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'Visited url'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://www.crawlee.dev/'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token 
plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Check out our <a href="https://www.crawlee.dev/python/docs/guides/adaptive-playwright-crawler" target="_blank" rel="noopener noreferrer">Adaptive Playwright crawler guide</a> for more details on how to use this new crawler.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="browserforge-fingerprints">Browserforge fingerprints<a href="https://crawlee.dev/blog/crawlee-for-python-v06#browserforge-fingerprints" class="hash-link" aria-label="Direct link to Browserforge fingerprints" title="Direct link to Browserforge fingerprints" translate="no">​</a></h2>
<p>To help you avoid detection and blocking, Crawlee now integrates the <a href="https://www.github.com/daijro/browserforge" target="_blank" rel="noopener noreferrer">browserforge</a> library, an intelligent browser header &amp; fingerprint generator. This feature simulates real browser behavior by automatically randomizing HTTP headers and fingerprints, which makes your crawling sessions more resilient against anti-bot measures.</p>
<p>With <a href="https://www.github.com/daijro/browserforge" target="_blank" rel="noopener noreferrer">browserforge</a> fingerprints enabled by default, your crawler sends realistic HTTP headers and user-agent strings. HTTP-based crawlers, which use <a href="https://www.crawlee.dev/python/api/class/HttpxHttpClient" target="_blank" rel="noopener noreferrer"><code>HttpxHttpClient</code></a> by default, benefit from these adjustments, while the <a href="https://www.crawlee.dev/python/api/class/CurlImpersonateHttpClient" target="_blank" rel="noopener noreferrer"><code>CurlImpersonateHttpClient</code></a> employs its own stealthy techniques. The <a href="https://www.crawlee.dev/python/docs/guides/playwright-crawler" target="_blank" rel="noopener noreferrer"><code>PlaywrightCrawler</code></a> adjusts HTTP headers and browser fingerprints accordingly. Together, these improvements make your crawlers much harder to detect.</p>
<p>Below is an example of using <code>PlaywrightCrawler</code>, which now benefits from the <a href="https://www.github.com/daijro/browserforge" target="_blank" rel="noopener noreferrer">browserforge</a> library:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> PlaywrightCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span 
class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># The browserforge fingerprints and headers are used by default.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Crawling URL: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        
</span><span class="token comment" style="color:#999988;font-style:italic"># Decode and log the response body, which contains the headers we sent.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        headers </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">body</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">decode</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Response headers: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">headers</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" 
style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Extract and log the User-Agent and UA data used in the browser context.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ua </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">evaluate</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'() =&gt; window.navigator.userAgent'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ua_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">evaluate</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'() =&gt; window.navigator.userAgentData'</span><span class="token punctuation" style="color:#393A34">)</span><span 
class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Navigator user-agent: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">ua</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Navigator user-agent data: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">ua_data</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># The endpoint httpbin.org/headers returns the request headers in the response body.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://www.httpbin.org/headers'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>For further details on utilizing <a href="https://www.github.com/daijro/browserforge" target="_blank" rel="noopener noreferrer">browserforge</a> to avoid blocking, please refer to our <a href="https://www.crawlee.dev/python/docs/guides/avoid-blocking" target="_blank" rel="noopener noreferrer">Avoid getting blocked guide</a>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="cli-dependencies">CLI dependencies<a href="https://crawlee.dev/blog/crawlee-for-python-v06#cli-dependencies" class="hash-link" aria-label="Direct link to CLI dependencies" title="Direct link to CLI dependencies" translate="no">​</a></h2>
<p>In v0.6, we've reduced the size of the core package by moving the CLI (template creation) dependencies to optional extras, which keeps the base installation lightweight. To use Crawlee's CLI for creating new projects, install the package with the <code>cli</code> extra.</p>
<p>For example, to create a new project from a template using <code>pipx</code>, run:</p>
<div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pipx run </span><span class="token string" style="color:#e3116c">'crawlee[cli]'</span><span class="token plain"> create my-crawler</span><br></div></code></pre></div></div>
<p>Or with <code>uvx</code>:</p>
<div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">uvx </span><span class="token string" style="color:#e3116c">'crawlee[cli]'</span><span class="token plain"> create my-crawler</span><br></div></code></pre></div></div>
<p>This change ensures that while the core package remains lean, you can still opt in to CLI functionality when bootstrapping new projects.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conclusion">Conclusion<a href="https://crawlee.dev/blog/crawlee-for-python-v06#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>We are excited to share that Crawlee for Python v0.6 is here. If you have any questions or feedback, please open a <a href="https://www.github.com/apify/crawlee-python/discussions" target="_blank" rel="noopener noreferrer">GitHub discussion</a>. If you encounter any bugs or have an idea for a new feature, please open a <a href="https://www.github.com/apify/crawlee-python/issues" target="_blank" rel="noopener noreferrer">GitHub issue</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Inside implementing SuperScraper with Crawlee]]></title>
            <link>https://crawlee.dev/blog/superscraper-with-crawlee</link>
            <guid>https://crawlee.dev/blog/superscraper-with-crawlee</guid>
            <pubDate>Wed, 05 Mar 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This article explains how SuperScraper works, highlights its implementation details, and provides code snippets to demonstrate its core functionality.]]></description>
            <content:encoded><![CDATA[<p><a href="https://github.com/apify/super-scraper" target="_blank" rel="noopener noreferrer">SuperScraper</a> is an open-source <a href="https://docs.apify.com/platform/actors" target="_blank" rel="noopener noreferrer">Actor</a> that combines features from various web scraping services, including <a href="https://www.scrapingbee.com/" target="_blank" rel="noopener noreferrer">ScrapingBee</a>, <a href="https://scrapingant.com/" target="_blank" rel="noopener noreferrer">ScrapingAnt</a>, and <a href="https://www.scraperapi.com/" target="_blank" rel="noopener noreferrer">ScraperAPI</a>.</p>
<p>A key capability is its standby mode, which runs the Actor as a persistent API server. This removes the usual start-up delay, a common pain point in many systems, and lets users interact with the system immediately through direct API calls.</p>
<p>This blog post explains how SuperScraper works, highlights its implementation details, and provides code snippets to demonstrate its core functionality.</p>
<p><img decoding="async" loading="lazy" alt="Google Maps Data Screenshot" src="https://crawlee.dev/assets/images/superscraper-8d24da63227f97df70998e8900b3a901.webp" width="1152" height="649" class="img_ev3q"></p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="what-is-superscraper">What is SuperScraper?<a href="https://crawlee.dev/blog/superscraper-with-crawlee#what-is-superscraper" class="hash-link" aria-label="Direct link to What is SuperScraper?" title="Direct link to What is SuperScraper?" translate="no">​</a></h3>
<p>SuperScraper transforms a traditional scraper into an API server. Instead of running with static inputs and waiting for completion, it starts only once, stays active, and listens for incoming requests.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="how-to-enable-standby-mode">How to enable standby mode<a href="https://crawlee.dev/blog/superscraper-with-crawlee#how-to-enable-standby-mode" class="hash-link" aria-label="Direct link to How to enable standby mode" title="Direct link to How to enable standby mode" translate="no">​</a></h3>
<p>To activate standby mode, enable it in the Actor's settings so that the Actor listens for incoming requests.</p>
<p><img decoding="async" loading="lazy" alt="Activating Actor standby mode" src="https://crawlee.dev/assets/images/actor-standby-9b094dde2615b70afb82685d56c8d74e.webp" width="2504" height="984" class="img_ev3q"></p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="server-setup">Server setup<a href="https://crawlee.dev/blog/superscraper-with-crawlee#server-setup" class="hash-link" aria-label="Direct link to Server setup" title="Direct link to Server setup" translate="no">​</a></h3>
<p>The project uses the Node.js <code>http</code> module to create a server that listens on the desired port. After the server starts, a check ensures that users interact with it correctly, by sending requests rather than running it the traditional way. This keeps SuperScraper operating as a persistent server.</p>
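<p>As a rough illustration of this setup, the sketch below builds a small server with the Node.js <code>http</code> module. It is not SuperScraper's actual code: the <code>route</code> and <code>startServer</code> functions, the <code>Reply</code> shape, the <code>url</code> query parameter, and the status messages are all hypothetical names chosen for this example.</p>

```typescript
import { createServer } from 'node:http';

// Hypothetical response shape and routing rule, for illustration only.
interface Reply {
    status: number;
    body: string;
}

// Pure function: decides how to answer a request without touching the network.
function route(method: string | undefined, targetUrl: string | null): Reply {
    if (method !== 'GET') return { status: 405, body: 'Only GET is supported' };
    if (!targetUrl) return { status: 400, body: 'Missing "url" query parameter' };
    return { status: 200, body: `Would scrape ${targetUrl}` };
}

// Hypothetical helper: starts a persistent server on the given port.
function startServer(port: number) {
    const server = createServer((req, res) => {
        // req.url holds only the path and query, so a base is needed to parse it.
        const { searchParams } = new URL(req.url ?? '/', 'http://localhost');
        const reply = route(req.method, searchParams.get('url'));
        res.writeHead(reply.status, { 'Content-Type': 'text/plain' });
        res.end(reply.body);
    });
    return server.listen(port);
}
```

<p>Calling <code>startServer(8080)</code> would keep the process alive and answer each incoming request; the port number here is an assumption, and a real standby Actor would take it from its environment. Keeping the routing logic in a pure function makes it easy to test without opening a socket.</p>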
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="handling-multiple-crawlers">Handling multiple crawlers<a href="https://crawlee.dev/blog/superscraper-with-crawlee#handling-multiple-crawlers" class="hash-link" aria-label="Direct link to Handling multiple crawlers" title="Direct link to Handling multiple crawlers" translate="no">​</a></h3>
<p>SuperScraper processes user requests using multiple instances of Crawlee’s <a href="https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler"><code>PlaywrightCrawler</code></a>. Since each <code>PlaywrightCrawler</code> instance can only handle one proxy configuration, a separate crawler is created for each unique proxy setting.</p>
<p>For example, if a user sends one request that uses “normal” proxies and another that uses residential US proxies, each request needs its own crawler. To keep track of them, we store the crawlers in a key-value map, where the key is the stringified proxy configuration.</p>
<div class="language-ts codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-ts codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> crawlers </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">new</span><span class="token plain"> </span><span class="token class-name">Map</span><span class="token class-name operator" style="color:#393A34">&lt;</span><span class="token class-name builtin">string</span><span class="token class-name punctuation" style="color:#393A34">,</span><span class="token class-name"> PlaywrightCrawler</span><span class="token class-name operator" style="color:#393A34">&gt;</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
<p>The following code runs when a new user request arrives. If a crawler for the requested proxy configuration already exists in the map, it is reused; otherwise, a new crawler is created. The request is then added to the crawler’s queue so it can be processed.</p>
<div class="language-ts codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-ts codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> key </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token constant" style="color:#36acaa">JSON</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">stringify</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">crawlerOptions</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> crawlers</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">has</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">key</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">?</span><span class="token plain"> crawlers</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">key</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token operator" style="color:#393A34">!</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">createAndStartCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">crawlerOptions</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">addRequests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
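<p>The lookup above can be sketched in isolation. This is a minimal, hypothetical example rather than SuperScraper's actual code: <code>FakeCrawler</code>, <code>createFakeCrawler</code>, <code>stableKey</code>, <code>getOrCreateCrawler</code> and the option shape are all assumptions made for illustration. It shows two things worth keeping in mind with this pattern: <code>JSON.stringify</code> is sensitive to property order, and checking the map before an awaited creation could let two concurrent requests create two crawlers for the same options. Caching the pending promise is one possible way to avoid that.</p>

```typescript
// Hypothetical option shape; the real CrawlerOptions type is not shown in this post.
type CrawlerOptions = { proxyConfigurationOptions: { groups: string[]; countryCode?: string } };
// Stand-in for a real crawler instance.
type FakeCrawler = { id: number };

// Serialize with sorted object keys so that property order does not change the key.
const stableKey = (value: unknown): string => {
    if (Array.isArray(value)) return `[${value.map((item) => stableKey(item)).join(',')}]`;
    if (value !== null && typeof value === 'object') {
        const record = value as { [key: string]: unknown };
        const entries = Object.keys(record)
            .filter((k) => record[k] !== undefined) // mirror JSON.stringify, which skips undefined
            .sort()
            .map((k) => `${JSON.stringify(k)}:${stableKey(record[k])}`);
        return `{${entries.join(',')}}`;
    }
    return JSON.stringify(value) ?? 'null';
};

let createdCount = 0;

// Simulates an async factory such as createAndStartCrawler().
const createFakeCrawler = async (_options: CrawlerOptions): Promise<FakeCrawler> => {
    await Promise.resolve(); // stands in for async setup (queue, proxy configuration)
    createdCount += 1;
    return { id: createdCount };
};

// Cache the pending promise rather than the finished crawler,
// so concurrent callers with equal options share a single creation.
const pendingCrawlers = new Map<string, Promise<FakeCrawler>>();

const getOrCreateCrawler = (options: CrawlerOptions): Promise<FakeCrawler> => {
    const key = stableKey(options);
    let pending = pendingCrawlers.get(key);
    if (!pending) {
        pending = createFakeCrawler(options);
        pendingCrawlers.set(key, pending);
        // Drop failed creations so that a later request can retry.
        pending.catch(() => { pendingCrawlers.delete(key); });
    }
    return pending;
};
```

<p>With this sketch, two calls made back to back with equal options (even with a different property order) return the same promise, so only one crawler is created for them.</p>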
<p>The function below initializes new crawlers with predefined settings and behaviors. Each crawler utilizes its own in-memory queue created with the <code>MemoryStorage</code> client. This approach is used for two key reasons:</p>
<ol>
<li class=""><strong>Performance</strong>: In-memory queues are faster, and there's no need to persist them when SuperScraper migrates.</li>
<li class=""><strong>Isolation</strong>: Using a separate queue prevents interference with the shared default queue of the SuperScraper Actor, avoiding potential bugs when multiple crawlers use it simultaneously.</li>
</ol>
<div class="language-ts codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-ts codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">export</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token function-variable function" style="color:#d73a49">createAndStartCrawler</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">crawlerOptions</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> CrawlerOptions </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token constant" style="color:#36acaa">DEFAULT_CRAWLER_OPTIONS</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> client </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">new</span><span class="token plain"> </span><span class="token 
class-name">MemoryStorage</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> persistStorage</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">false</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> queue </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> RequestQueue</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">open</span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">undefined</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> storageClient</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> client </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    
</span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> proxyConfig </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">createProxyConfiguration</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">crawlerOptions</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">proxyConfigurationOptions</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">new</span><span class="token plain"> </span><span class="token class-name">PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        keepAlive</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">true</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        
proxyConfiguration</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> proxyConfig</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        maxRequestRetries</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">4</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        requestQueue</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> queue</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
<p>At the end of the function, we start the crawler without awaiting it and log a warning when the run ends. The empty second callback means a rejected run is ignored silently. Next, we add the newly created crawler to the key-value map containing all crawlers, and finally, we return the crawler.</p>
<div class="language-ts codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-ts codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">then</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=&gt;</span><span class="token plain"> log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">warning</span><span class="token punctuation" style="color:#393A34">(</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token template-string string" style="color:#e3116c">Crawler ended</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> crawlerOptions</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">crawlers</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">set</span><span class="token punctuation" style="color:#393A34">(</span><span class="token constant" style="color:#36acaa">JSON</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">stringify</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">crawlerOptions</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">log</span><span 
class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'Crawler ready 🚀'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> crawlerOptions</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
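<p>The fire-and-forget pattern above passes two callbacks to <code>then</code> so that the un-awaited promise never produces an unhandled rejection. As a hedged variation, the sketch below records a message for both outcomes instead of ignoring failures. The names <code>watchRun</code> and <code>logEntries</code> are hypothetical, and the array stands in for a real logger.</p>

```typescript
// Hypothetical in-memory log; real code would call a logger instead.
type LogEntry = { level: 'warning' | 'error'; message: string };
const logEntries: LogEntry[] = [];

// Attach both handlers up front: the returned promise always fulfills,
// and a failed run is recorded rather than dropped.
const watchRun = (run: Promise<unknown>, label: string): Promise<void> =>
    run.then(
        () => {
            logEntries.push({ level: 'warning', message: `Crawler ended (${label})` });
        },
        (error: unknown) => {
            const reason = error instanceof Error ? error.message : String(error);
            logEntries.push({ level: 'error', message: `Crawler failed (${label}): ${reason}` });
        },
    );
```

<p>A caller would pass the promise returned by the crawler's run together with a label that identifies it, for example the serialized options, and would not need to await the result.</p>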
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="mapping-standby-http-requests-to-crawlee-requests">Mapping standby HTTP requests to Crawlee requests<a href="https://crawlee.dev/blog/superscraper-with-crawlee#mapping-standby-http-requests-to-crawlee-requests" class="hash-link" aria-label="Direct link to Mapping standby HTTP requests to Crawlee requests" title="Direct link to Mapping standby HTTP requests to Crawlee requests" translate="no">​</a></h3>
<p>The server is created with a request listener function that takes two arguments: the user’s request and a response object. The response object is used to send scraped data back to the user. These response objects are stored in a key-value map so they can be accessed later in the code. The key is a randomly generated string shared between the request and its corresponding response object; it is used as <code>request.uniqueKey</code>.</p>
<div class="language-ts codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-ts codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> responses </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">new</span><span class="token plain"> </span><span class="token class-name">Map</span><span class="token class-name operator" style="color:#393A34">&lt;</span><span class="token class-name builtin">string</span><span class="token class-name punctuation" style="color:#393A34">,</span><span class="token class-name"> ServerResponse</span><span class="token class-name operator" style="color:#393A34">&gt;</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
<p><strong>Saving response objects</strong></p>
<p>The following function stores a response object in the key-value map:</p>
<div class="language-ts codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-ts codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">export</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">function</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">addResponse</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">responseId</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">string</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> response</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> ServerResponse</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    responses</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">set</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">responseId</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> response</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" 
style="color:#393A34">}</span><br></div></code></pre></div></div>
<p><strong>Updating crawler logic to store responses</strong></p>
<p>Here’s the updated logic for fetching or creating the crawler for a given proxy configuration, now with a call that stores the response object before the request is added to the queue:</p>
<div class="language-ts codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-ts codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> key </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token constant" style="color:#36acaa">JSON</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">stringify</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">crawlerOptions</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> crawlers</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">has</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">key</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">?</span><span class="token plain"> crawlers</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">key</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token operator" style="color:#393A34">!</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">createAndStartCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">crawlerOptions</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token function" style="color:#d73a49">addResponse</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">uniqueKey</span><span class="token operator" style="color:#393A34">!</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> res</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">requestQueue</span><span class="token operator" style="color:#393A34">!</span><span class="token punctuation" style="color:#393A34">.</span><span 
class="token function" style="color:#d73a49">addRequest</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
<p><strong>Sending scraped data back</strong></p>
<p>Once a crawler finishes processing a request, it retrieves the corresponding response object using the key and sends the scraped data back to the user:</p>
<div class="language-ts codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-ts codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">export</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token function-variable function" style="color:#d73a49">sendSuccResponseById</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">responseId</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">string</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> result</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">unknown</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> contentType</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">string</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> res </span><span 
class="token operator" style="color:#393A34">=</span><span class="token plain"> responses</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">responseId</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token operator" style="color:#393A34">!</span><span class="token plain">res</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token template-string string" style="color:#e3116c">Response for request </span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">${</span><span class="token template-string interpolation">responseId</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">}</span><span class="token template-string string" style="color:#e3116c"> not found</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">return</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    res</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">writeHead</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">200</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> </span><span class="token string-property property" style="color:#36acaa">'Content-Type'</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> contentType </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    res</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">end</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">result</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    responses</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">delete</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">responseId</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
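<p>The register-then-respond flow described above can be tried without a real server. The following is a minimal sketch under stated assumptions: <code>ResponseLike</code> is a hypothetical stand-in for the two <code>ServerResponse</code> methods used here, and <code>sendById</code> and <code>createRecordingResponse</code> are illustrative helpers, not functions from SuperScraper. Unlike the function above, this version removes the map entry before writing, which is one way to make sure an id is answered at most once.</p>

```typescript
// Minimal stand-in for the parts of Node's http.ServerResponse used here.
interface ResponseLike {
    writeHead(statusCode: number, headers: Record<string, string>): void;
    end(body: string): void;
}

const responses = new Map<string, ResponseLike>();

const addResponse = (id: string, res: ResponseLike): void => {
    responses.set(id, res);
};

// Returns true when a response was found and answered, false otherwise.
const sendById = (id: string, statusCode: number, contentType: string, body: string): boolean => {
    const res = responses.get(id);
    if (!res) return false;
    responses.delete(id); // remove first so the same id cannot be answered twice
    res.writeHead(statusCode, { 'Content-Type': contentType });
    res.end(body);
    return true;
};

// Recording fake that captures what would have been sent to the user.
const createRecordingResponse = () => {
    const record = { statusCode: 0, headers: {} as Record<string, string>, body: '', ended: false };
    const res: ResponseLike = {
        writeHead: (statusCode, headers) => {
            record.statusCode = statusCode;
            record.headers = headers;
        },
        end: (body) => {
            record.body = body;
            record.ended = true;
        },
    };
    return { res, record };
};

// Usage: register the response before enqueueing, then answer when the crawl finishes.
const { res, record } = createRecordingResponse();
const uniqueKey = 'req-1'; // hypothetical id; the post uses a randomly generated string
addResponse(uniqueKey, res);
const firstSend = sendById(uniqueKey, 200, 'application/json', JSON.stringify({ title: 'Example' }));
const secondSend = sendById(uniqueKey, 200, 'application/json', '{}'); // already answered
```

<p>Registering the response before the request is enqueued matters in this design: if the crawler finished first, the lookup by id would find nothing and the user would never receive an answer.</p>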
<p><strong>Error handling</strong></p>
<p>There is similar logic to send a response back if an error occurs during scraping:</p>
<div class="language-ts codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-ts codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">export</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token function-variable function" style="color:#d73a49">sendErrorResponseById</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">responseId</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">string</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> result</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">string</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> statusCode</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">number</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">500</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> res </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> responses</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">responseId</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token operator" style="color:#393A34">!</span><span class="token plain">res</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token template-string string" style="color:#e3116c">Response for request </span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">${</span><span class="token template-string interpolation">responseId</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">}</span><span class="token 
template-string string" style="color:#e3116c"> not found</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">return</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    res</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">writeHead</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">statusCode</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> </span><span class="token string-property property" style="color:#36acaa">'Content-Type'</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'application/json'</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    
res</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">end</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">result</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    responses</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">delete</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">responseId</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
<p><strong>Adding timeouts during migrations</strong></p>
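<p>When a migration starts, SuperScraper schedules a timeout for every pending response. If a response is still open when its timeout fires (60 seconds by default), the client receives an error asking it to retry the request.</p>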
<div class="language-ts codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-ts codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">export</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token function-variable function" style="color:#d73a49">addTimeoutToAllResponses</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">timeoutInSeconds</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">number</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">60</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> migrationErrorMessage </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        
errorMessage</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'Actor had to migrate to another server. Please, retry your request.'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic">// responses is used as a Map (get/delete), so snapshot its keys with keys(); Object.keys() would return an empty array.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> responseKeys </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Array</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">from</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">responses</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">keys</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> key </span><span class="token keyword" style="color:#00009f">of</span><span class="token plain"> responseKeys</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token function" style="color:#d73a49">setTimeout</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token function" style="color:#d73a49">sendErrorResponseById</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">key</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token constant" style="color:#36acaa">JSON</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">stringify</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">migrationErrorMessage</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> timeoutInSeconds </span><span class="token operator" style="color:#393A34">*</span><span class="token plain"> 
</span><span class="token number" style="color:#36acaa">1000</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="managing-migrations">Managing migrations<a href="https://crawlee.dev/blog/superscraper-with-crawlee#managing-migrations" class="hash-link" aria-label="Direct link to Managing migrations" title="Direct link to Managing migrations" translate="no">​</a></h3>
<p>SuperScraper listens for the Actor's <code>migrating</code> event and gives every active response 60 seconds to complete, so no request is left hanging during the server transition.</p>
<div class="language-ts codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-ts codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Actor</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">on</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'migrating'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token operator" style="color:#393A34">=&gt;</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token function" style="color:#d73a49">addTimeoutToAllResponses</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">60</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
<p>Clients whose requests are interrupted by a migration receive a clear error message asking them to retry.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="build-your-own">Build your own<a href="https://crawlee.dev/blog/superscraper-with-crawlee#build-your-own" class="hash-link" aria-label="Direct link to Build your own" title="Direct link to Build your own" translate="no">​</a></h3>
<p>This guide showed how to build and manage a standby web scraper using Apify’s platform and Crawlee. The implementation handles multiple proxy configurations through <code>PlaywrightCrawler</code> instances while managing request-response cycles efficiently to support diverse scraping needs.</p>
<p>Standby mode transforms SuperScraper into a persistent API server, eliminating start-up delays. The migration handling system keeps operations stable during server transitions. You can build on this foundation to create web scraping tools tailored to your requirements.</p>
<p>To get started, explore the project on <a href="https://github.com/apify/super-scraper" target="_blank" rel="noopener noreferrer">GitHub</a> or learn more about <a href="https://crawlee.dev/">Crawlee</a> to build your own scalable web scraping tools.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Crawlee for Python v0.5]]></title>
            <link>https://crawlee.dev/blog/crawlee-for-python-v05</link>
            <guid>https://crawlee.dev/blog/crawlee-for-python-v05</guid>
            <pubDate>Fri, 10 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Announcing the Crawlee for Python v0.5 release.]]></description>
            <content:encoded><![CDATA[<p>Crawlee for Python v0.5 is now available! This is our biggest release to date, bringing functionality ported from <a href="https://github.com/apify/crawlee" target="_blank" rel="noopener noreferrer">Crawlee for JavaScript</a>, brand-new features that are exclusive to the Python library (for now), a new consolidated package structure, and a bunch of bug fixes and further improvements.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="getting-started">Getting started<a href="https://crawlee.dev/blog/crawlee-for-python-v05#getting-started" class="hash-link" aria-label="Direct link to Getting started" title="Direct link to Getting started" translate="no">​</a></h2>
<p>You can upgrade to the latest version straight from <a href="https://pypi.org/project/crawlee/" target="_blank" rel="noopener noreferrer">PyPI</a>:</p>
<div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pip </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">--upgrade</span><span class="token plain"> crawlee</span><br></div></code></pre></div></div>
<p>Check out the full changelog on our <a href="https://www.crawlee.dev/python/docs/changelog#050-2025-01-02" target="_blank" rel="noopener noreferrer">website</a> to see all the details. If you are updating from an older version, make sure to follow our <a href="https://www.crawlee.dev/python/docs/upgrading/upgrading-to-v0x#upgrading-to-v05" target="_blank" rel="noopener noreferrer">Upgrading to v0.5</a> guide for a smooth upgrade.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="new-package-structure">New package structure<a href="https://crawlee.dev/blog/crawlee-for-python-v05#new-package-structure" class="hash-link" aria-label="Direct link to New package structure" title="Direct link to New package structure" translate="no">​</a></h2>
<p>We have introduced a new consolidated package structure. The goal is to streamline the development experience, help you find the crawlers you are looking for faster, and improve the IDE's code suggestions while importing.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="crawlers">Crawlers<a href="https://crawlee.dev/blog/crawlee-for-python-v05#crawlers" class="hash-link" aria-label="Direct link to Crawlers" title="Direct link to Crawlers" translate="no">​</a></h3>
<p>We have grouped all crawler classes (and their corresponding crawling context classes) into a single sub-package called <code>crawlers</code>. Here is a quick example of how the imports have changed:</p>
<div class="language-diff codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-diff codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token deleted-sign deleted prefix deleted" style="color:#d73a49">-</span><span class="token deleted-sign deleted line" style="color:#d73a49"> from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token deleted-sign deleted line" style="color:#d73a49"></span><span class="token inserted-sign inserted prefix inserted" style="color:#36acaa">+</span><span class="token inserted-sign inserted line" style="color:#36acaa"> from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext</span><br></div></code></pre></div></div>
<p>With everything in one sub-package, your IDE can list all the available crawlers at once. Isn't that cool?</p>
<p><img decoding="async" loading="lazy" alt="Import from crawlers subpackage." src="https://crawlee.dev/assets/images/import_crawlers-32dc36ba69192c5d936cbc8c05a9b946.webp" width="1892" height="804" class="img_ev3q"></p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="storage-clients">Storage clients<a href="https://crawlee.dev/blog/crawlee-for-python-v05#storage-clients" class="hash-link" aria-label="Direct link to Storage clients" title="Direct link to Storage clients" translate="no">​</a></h3>
<p>Similarly, we have moved all storage client classes under the <code>storage_clients</code> sub-package. For instance:</p>
<div class="language-diff codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-diff codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token deleted-sign deleted prefix deleted" style="color:#d73a49">-</span><span class="token deleted-sign deleted line" style="color:#d73a49"> from crawlee.memory_storage_client import MemoryStorageClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token deleted-sign deleted line" style="color:#d73a49"></span><span class="token inserted-sign inserted prefix inserted" style="color:#36acaa">+</span><span class="token inserted-sign inserted line" style="color:#36acaa"> from crawlee.storage_clients import MemoryStorageClient</span><br></div></code></pre></div></div>
<p>This consolidation makes it clearer where each class belongs and ensures that your IDE can provide better autocompletion when you are looking for the right crawler or storage client.</p>
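<p>If a codebase has to run against both the old and the new package layout for a while, one option is to try the new import path first and fall back to the old one. The sketch below shows that pattern with a small standard-library helper. The name <code>import_first_available</code> is hypothetical and not part of Crawlee, and the Crawlee usage in the trailing comment assumes the package is installed.</p>

```python
import importlib
from types import ModuleType


def import_first_available(*module_names: str) -> ModuleType:
    """Return the first module from module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This layout is not available; try the next candidate.
            continue
    raise ImportError(f'None of these modules could be imported: {module_names}')


# Hypothetical usage with Crawlee (requires the crawlee package):
# crawlers = import_first_available('crawlee.crawlers', 'crawlee.beautifulsoup_crawler')
# BeautifulSoupCrawler = crawlers.BeautifulSoupCrawler
```

<p>Once every environment runs v0.5 or later, the fallback can be dropped in favor of a plain import from <code>crawlee.crawlers</code>.</p>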
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="continued-parity-with-crawlee-js">Continued parity with Crawlee JS<a href="https://crawlee.dev/blog/crawlee-for-python-v05#continued-parity-with-crawlee-js" class="hash-link" aria-label="Direct link to Continued parity with Crawlee JS" title="Direct link to Continued parity with Crawlee JS" translate="no">​</a></h2>
<p>We are constantly working toward feature parity with our JavaScript library, <a href="https://github.com/apify/crawlee" target="_blank" rel="noopener noreferrer">Crawlee JS</a>. With v0.5, we have brought over more functionality:</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="html-to-text-context-helper">HTML to text context helper<a href="https://crawlee.dev/blog/crawlee-for-python-v05#html-to-text-context-helper" class="hash-link" aria-label="Direct link to HTML to text context helper" title="Direct link to HTML to text context helper" translate="no">​</a></h3>
<p>The <code>html_to_text</code> crawling context helper simplifies extracting text from an HTML page by automatically removing all tags and returning only the raw text content. It's available in the <a href="https://www.crawlee.dev/python/api/class/ParselCrawlingContext#html_to_text" target="_blank" rel="noopener noreferrer"><code>ParselCrawlingContext</code></a> and <a href="https://www.crawlee.dev/python/api/class/BeautifulSoupCrawlingContext#html_to_text" target="_blank" rel="noopener noreferrer"><code>BeautifulSoupCrawlingContext</code></a>.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> ParselCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token 
plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ParselCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" 
style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'Crawling: %s'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        text </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">html_to_text</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Continue with the processing...</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" 
style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://crawlee.dev'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>In this example, we use a <a href="https://www.crawlee.dev/python/api/class/ParselCrawler" target="_blank" rel="noopener noreferrer"><code>ParselCrawler</code></a> to fetch a webpage, then invoke <code>context.html_to_text()</code> to extract clean text for further processing.</p>
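<p>To make the idea behind <code>html_to_text</code> concrete, here is a minimal sketch of tag stripping built only on Python's standard <code>html.parser</code> module. It is not Crawlee's implementation, and the names <code>TextExtractor</code> and <code>strip_tags</code> are hypothetical. The real helper may treat whitespace and block elements differently.</p>

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes while skipping script and style content."""

    SKIPPED_TAGS = {'script', 'style'}

    def __init__(self) -> None:
        super().__init__()
        self._chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering a tag whose content should not appear in the output.
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return ' '.join(self._chunks)


def strip_tags(html: str) -> str:
    """Return the visible text of an HTML string, with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # Flush any text still buffered by the parser.
    return parser.text()
```

<p>Under these assumptions, <code>strip_tags('&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;')</code> returns <code>'Hello world'</code>.</p>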
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="use-state">Use state<a href="https://crawlee.dev/blog/crawlee-for-python-v05#use-state" class="hash-link" aria-label="Direct link to Use state" title="Direct link to Use state" translate="no">​</a></h3>
<p>The <a href="https://www.crawlee.dev/python/api/class/UseStateFunction" target="_blank" rel="noopener noreferrer"><code>use_state</code></a> crawling context helper makes it simple to create and manage persistent state values within your crawler. All state values are persisted automatically, so you can maintain data across crawler runs, restarts, and failures. It acts as a convenient abstraction for interacting with <a href="https://www.crawlee.dev/python/api/class/KeyValueStore" target="_blank" rel="noopener noreferrer"><code>KeyValueStore</code></a>.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Request</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">configuration </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Configuration</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> ParselCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Create a crawler with purge_on_start disabled to retain state across runs.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        configuration</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">Configuration</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">purge_on_start</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">False</span><span class="token punctuation" style="color:#393A34">)</span><span class="token 
punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ParselCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token 
plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Crawling </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Retrieve or initialize the state with a default value.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        state </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">use_state</span><span class="token punctuation" style="color:#393A34">(</span><span 
class="token string" style="color:#e3116c">'state'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> default_value</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'runs'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Increment the run count.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        state</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'runs'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">+=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Create a request with always_enqueue enabled to bypass deduplication and ensure it is processed.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">    request </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'https://crawlee.dev/'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> always_enqueue</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Run the crawler with the start request.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Fetch the persisted state from the 
key-value store.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    kvs </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_key_value_store</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    state </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> kvs</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_auto_saved_value</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'state'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Final state after run: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">state</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token 
string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Please note that <code>use_state</code> is an experimental feature. Its behavior and interface may change in future versions.</p>
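<p>To make the pattern concrete, here is a minimal, library-free sketch of what a helper like <code>use_state</code> abstracts away: load a stored value or fall back to a default, mutate it, and write it back. This is not how Crawlee implements it; <code>JsonState</code>, <code>run_once</code>, and the JSON file are hypothetical names used only for illustration.</p>

```python
import json
import tempfile
from pathlib import Path
from typing import Any


class JsonState:
    """Hypothetical stand-in for a persisted state value (not a Crawlee class)."""

    def __init__(self, path: Path) -> None:
        self._path = path

    def load(self, default_value: dict[str, Any]) -> dict[str, Any]:
        # Return the stored value if one exists, otherwise a copy of the default.
        if self._path.exists():
            return json.loads(self._path.read_text(encoding='utf-8'))
        return dict(default_value)

    def save(self, state: dict[str, Any]) -> None:
        # Write the value back so the next run can pick it up.
        self._path.write_text(json.dumps(state), encoding='utf-8')


def run_once(store: JsonState) -> dict[str, Any]:
    # Mirrors the handler above: load or initialize, increment, persist.
    state = store.load(default_value={'runs': 0})
    state['runs'] += 1
    store.save(state)
    return state


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        store = JsonState(Path(tmp) / 'state.json')
        run_once(store)
        print(run_once(store))  # {'runs': 2}
```

<p>With <code>use_state</code>, the explicit <code>save</code> step disappears: as described above, the state is persisted for you.</p>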
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="brand-new-features">Brand new features<a href="https://crawlee.dev/blog/crawlee-for-python-v05#brand-new-features" class="hash-link" aria-label="Direct link to Brand new features" title="Direct link to Brand new features" translate="no">​</a></h2>
<p>In addition to porting features from JS, we are introducing new, Python-first features that will make their way into Crawlee JS in the coming months.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="crawlers-stop-method">Crawler's stop method<a href="https://crawlee.dev/blog/crawlee-for-python-v05#crawlers-stop-method" class="hash-link" aria-label="Direct link to Crawler's stop method" title="Direct link to Crawler's stop method" translate="no">​</a></h3>
<p>The <a href="https://www.crawlee.dev/python/api/class/BasicCrawler" target="_blank" rel="noopener noreferrer"><code>BasicCrawler</code></a>, and by extension all crawlers that inherit from it, now has a <a href="https://www.crawlee.dev/python/api/class/BasicCrawler#stop" target="_blank" rel="noopener noreferrer"><code>stop</code></a> method. This makes it easy to halt crawling when a specific condition is met, for instance, once you have found the data you were looking for.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> ParselCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token 
plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ParselCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" 
style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'Crawling: %s'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Extract and enqueue links from the page.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">enqueue_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" 
style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        title </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">selector</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">css</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'title::text'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Condition when you want to stop the crawler, e.g. 
you</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># have found what you were looking for.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'Crawlee for Python'</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> title</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'Condition met, stopping the crawler.'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">stop</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    
</span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://crawlee.dev'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="request-loaders">Request loaders<a href="https://crawlee.dev/blog/crawlee-for-python-v05#request-loaders" class="hash-link" aria-label="Direct link to Request loaders" title="Direct link to Request loaders" translate="no">​</a></h3>
<p>The new classes <a href="https://www.crawlee.dev/python/api/class/RequestLoader" target="_blank" rel="noopener noreferrer"><code>RequestLoader</code></a>, <a href="https://www.crawlee.dev/python/api/class/RequestManager" target="_blank" rel="noopener noreferrer"><code>RequestManager</code></a> and <a href="https://www.crawlee.dev/python/api/class/RequestManagerTandem" target="_blank" rel="noopener noreferrer"><code>RequestManagerTandem</code></a> manage how Crawlee accesses and stores requests. They let you use another component (service) as a source of requests and, optionally, combine it with Crawlee's standard <a href="https://www.crawlee.dev/python/api/class/RequestQueue" target="_blank" rel="noopener noreferrer"><code>RequestQueue</code></a>.</p>
<p>You can learn more about these new features in the <a href="https://www.crawlee.dev/python/docs/guides/request-loaders" target="_blank" rel="noopener noreferrer">Request loaders guide</a>.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> ParselCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request_loaders </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> RequestList</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> RequestManagerTandem</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">storages </span><span class="token keyword" 
style="color:#00009f">import</span><span class="token plain"> RequestQueue</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    rl </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> RequestList</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'https://crawlee.dev'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span 
class="token string" style="color:#e3116c">'https://apify.com'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Long list of URLs...</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    rq </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> RequestQueue</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">open</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Combine them into a single request source.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    tandem </span><span class="token 
operator" style="color:#393A34">=</span><span class="token plain"> RequestManagerTandem</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">rl</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> rq</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">request_manager</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">tandem</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token 
function" style="color:#d73a49">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ParselCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Crawling </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span 
class="token comment" style="color:#999988;font-style:italic"># ...</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>In this example we combine a <a href="https://www.crawlee.dev/python/api/class/RequestList" target="_blank" rel="noopener noreferrer"><code>RequestList</code></a> with a <a href="https://www.crawlee.dev/python/api/class/RequestQueue" target="_blank" rel="noopener noreferrer"><code>RequestQueue</code></a>. However, instead of the <code>RequestList</code> you can use any other class that implements the <a href="https://www.crawlee.dev/python/api/class/RequestLoader" target="_blank" rel="noopener noreferrer"><code>RequestLoader</code></a> interface to suit your specific requirements.</p>
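<p>To illustrate why the read-only source is interchangeable, here is a minimal plain-Python sketch of the tandem idea. It does not use Crawlee and is not how <code>RequestManagerTandem</code> is implemented; the names <code>Loader</code>, <code>StaticLoader</code>, <code>WritableQueue</code>, <code>Tandem</code>, and <code>drain</code> are hypothetical and exist only for this illustration.</p>

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Optional, Protocol, Set


class Loader(Protocol):
    """Hypothetical minimal interface: anything with fetch_next() qualifies."""

    def fetch_next(self) -> Optional[str]: ...


class StaticLoader:
    """Hypothetical read-only source that hands out a fixed list of URLs once."""

    def __init__(self, urls: Iterable[str]) -> None:
        self._urls: Iterator[str] = iter(urls)

    def fetch_next(self) -> Optional[str]:
        # Returns None once the underlying list is exhausted.
        return next(self._urls, None)


class WritableQueue:
    """Hypothetical deduplicating FIFO queue that accepts URLs at runtime."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()
        self._seen: Set[str] = set()

    def add(self, url: str) -> None:
        # Each URL is accepted only the first time it is seen.
        if url not in self._seen:
            self._seen.add(url)
            self._pending.append(url)

    def is_empty(self) -> bool:
        return not self._pending

    def fetch_next(self) -> Optional[str]:
        return self._pending.popleft() if self._pending else None


class Tandem:
    """Presents a read-only loader and a writable queue as one source."""

    def __init__(self, loader: Loader, queue: WritableQueue) -> None:
        self._loader = loader
        self._queue = queue

    def add(self, url: str) -> None:
        # New URLs discovered during the crawl always go to the queue.
        self._queue.add(url)

    def fetch_next(self) -> Optional[str]:
        # Top up the queue from the loader only when nothing is pending,
        # so the long static list is consumed lazily.
        while self._queue.is_empty():
            url = self._loader.fetch_next()
            if url is None:
                break
            self._queue.add(url)
        return self._queue.fetch_next()


def drain(source: Tandem) -> List[str]:
    """Collect every URL the tandem yields, in order."""
    urls: List[str] = []
    while (url := source.fetch_next()) is not None:
        urls.append(url)
    return urls
```

<p>Because <code>Tandem</code> in this sketch only calls <code>fetch_next()</code> on the loader, replacing <code>StaticLoader</code> with any other object that offers the same method requires no change to the rest of the code.</p>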
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="service-locator">Service locator<a href="https://crawlee.dev/blog/crawlee-for-python-v05#service-locator" class="hash-link" aria-label="Direct link to Service locator" title="Direct link to Service locator" translate="no">​</a></h3>
<p>The <a href="https://www.crawlee.dev/python/api/class/ServiceLocator" target="_blank" rel="noopener noreferrer"><code>ServiceLocator</code></a> is primarily an internal mechanism for managing the services that Crawlee depends on, specifically the <a href="https://www.crawlee.dev/python/api/class/ServiceLocator" target="_blank" rel="noopener noreferrer"><code>Configuration</code></a>, <a href="https://www.crawlee.dev/python/api/class/ServiceLocator" target="_blank" rel="noopener noreferrer"><code>StorageClient</code></a>, and <a href="https://www.crawlee.dev/python/api/class/ServiceLocator" target="_blank" rel="noopener noreferrer"><code>EventManager</code></a>. By swapping out these components, you can adapt Crawlee to suit different runtime environments.</p>
<p>You can use the service locator explicitly:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> service_locator</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">configuration </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Configuration</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> ParselCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span 
class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">events </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> LocalEventManager</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">storage_clients </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> MemoryStorageClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    service_locator</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">set_configuration</span><span class="token punctuation" 
style="color:#393A34">(</span><span class="token plain">Configuration</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    service_locator</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">set_storage_client</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">MemoryStorageClient</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    service_locator</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">set_event_manager</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">LocalEventManager</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" 
style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># ...</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Or pass the services directly to the crawler instance, and they will be set under the hood:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">configuration </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Configuration</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> ParselCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">events </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> LocalEventManager</span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">storage_clients </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> MemoryStorageClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        configuration</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">Configuration</span><span class="token punctuation" 
style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        storage_client</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">MemoryStorageClient</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        event_manager</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">LocalEventManager</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># ...</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ 
</span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
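<p>For readers new to the pattern, the following plain-Python sketch shows the general idea of a service locator: a shared registry that returns a default service unless one was registered first, and that a consumer can fill in "under the hood". It is not Crawlee's implementation. <code>Config</code>, <code>Locator</code>, <code>Crawler</code>, and <code>ServiceAlreadyUsedError</code> are hypothetical names, and the rule that a service cannot be replaced after it has been retrieved is an assumption made for this illustration.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    """Hypothetical stand-in for a configuration service."""

    log_level: str = 'INFO'


class ServiceAlreadyUsedError(RuntimeError):
    """Hypothetical error: a service was replaced after someone retrieved it."""


class Locator:
    """Hypothetical registry holding one shared configuration."""

    def __init__(self) -> None:
        self._config: Optional[Config] = None
        self._retrieved = False

    def set_configuration(self, config: Config) -> None:
        # Assumption of this sketch: swapping a service that is already
        # in use would leave consumers inconsistent, so it is rejected.
        if self._retrieved:
            raise ServiceAlreadyUsedError('configuration is already in use')
        self._config = config

    def get_configuration(self) -> Config:
        # Fall back to a default when nothing was registered explicitly.
        if self._config is None:
            self._config = Config()
        self._retrieved = True
        return self._config


class Crawler:
    """Hypothetical consumer that registers passed-in services itself."""

    def __init__(self, locator: Locator, configuration: Optional[Config] = None) -> None:
        if configuration is not None:
            locator.set_configuration(configuration)
        self.configuration = locator.get_configuration()
```

<p>In this sketch, <code>Crawler(Locator(), configuration=Config(log_level='DEBUG'))</code> and calling <code>set_configuration()</code> before constructing the crawler lead to the same state, which mirrors the two styles shown above.</p>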
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conclusion">Conclusion<a href="https://crawlee.dev/blog/crawlee-for-python-v05#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>We are excited to share that Crawlee v0.5 is here. If you have any questions or feedback, please open a <a href="https://github.com/apify/crawlee-python/discussions" target="_blank" rel="noopener noreferrer">GitHub discussion</a>. If you encounter any bugs, or have an idea for a new feature, please open a <a href="https://github.com/apify/crawlee-python/issues" target="_blank" rel="noopener noreferrer">GitHub issue</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to scrape Crunchbase using Python in 2024 (Easy Guide)]]></title>
            <link>https://crawlee.dev/blog/scrape-crunchbase-python</link>
            <guid>https://crawlee.dev/blog/scrape-crunchbase-python</guid>
            <pubDate>Fri, 03 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to scrape Crunchbase using Crawlee for Python]]></description>
            <content:encoded><![CDATA[<p>Python developers know the drill: you need reliable company data, and Crunchbase has it. This guide shows you how to build an effective <a href="https://www.crunchbase.com/" target="_blank" rel="noopener noreferrer">Crunchbase</a> scraper in Python that gets you the data you need.</p>
<p>Crunchbase tracks details that matter: locations, business focus, founders, and investment histories. Manual extraction from such a large dataset isn't practical, so automation is essential for transforming this information into an analyzable format.</p>
<p>In this blog, we'll explore three different ways to extract data from Crunchbase using <a href="https://github.com/apify/crawlee-python" target="_blank" rel="noopener noreferrer"><code>Crawlee for Python</code></a>. We'll fully implement two of them and discuss the specifics and challenges of the third. This will help us better understand how important it is to <a href="https://www.crawlee.dev/blog/web-scraping-tips#1-choosing-a-data-source-for-the-project" target="_blank" rel="noopener noreferrer">choose the right data source</a>.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>This guide comes from a developer in our growing community. Have you built interesting projects with Crawlee? Join us on <a href="https://discord.com/invite/jyEM2PRvMU" target="_blank" rel="noopener noreferrer">Discord</a> to share your experiences and blog ideas - we value these contributions from developers like you.</p></div></div>
<p><img decoding="async" loading="lazy" alt="How to Scrape Crunchbase Using Python" src="https://crawlee.dev/assets/images/scrape_crunchbase-28a71b5380492fe6618bbd9c90989543.webp" width="1152" height="649" class="img_ev3q"></p>
<p>Key steps we'll cover:</p>
<ol>
<li class="">Project setup</li>
<li class="">Choosing the data source</li>
<li class="">Implementing a sitemap-based crawler</li>
<li class="">Analysis of the search-based approach and its limitations</li>
<li class="">Implementing the official API crawler</li>
<li class="">Conclusion and repository access</li>
</ol>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="prerequisites">Prerequisites<a href="https://crawlee.dev/blog/scrape-crunchbase-python#prerequisites" class="hash-link" aria-label="Direct link to Prerequisites" title="Direct link to Prerequisites" translate="no">​</a></h2>
<ul>
<li class="">Python 3.9 or higher</li>
<li class="">Familiarity with web scraping concepts</li>
<li class="">Crawlee for Python <code>v0.5.0</code></li>
<li class="">poetry <code>v2.0</code> or higher</li>
</ul>
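<p>If you're unsure whether your interpreter meets the first prerequisite, a small check like the one below may help. It is only a sketch: the helper names are hypothetical, the minimum of 3.9 comes from the list above, and the <code>&gt;=X.Y,&lt;4.0</code> range follows the format that the Poetry setup later in this guide asks for.</p>

```python
import sys
from typing import Tuple

# Minimum version taken from the prerequisites list.
MIN_PYTHON: Tuple[int, int] = (3, 9)


def current_version() -> Tuple[int, int]:
    """Return the running interpreter's (major, minor) version."""
    return (sys.version_info.major, sys.version_info.minor)


def meets_minimum(version: Tuple[int, int], minimum: Tuple[int, int] = MIN_PYTHON) -> bool:
    """Tuples compare element by element, so (3, 10) >= (3, 9) is True."""
    return version >= minimum


def poetry_constraint(version: Tuple[int, int]) -> str:
    """Build a version range in the form used for the Poetry prompt."""
    major, minor = version
    return f'>={major}.{minor},<4.0'


if __name__ == '__main__':
    version = current_version()
    print('meets minimum:', meets_minimum(version))
    print('constraint:', poetry_constraint(version))
```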
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="project-setup">Project setup<a href="https://crawlee.dev/blog/scrape-crunchbase-python#project-setup" class="hash-link" aria-label="Direct link to Project setup" title="Direct link to Project setup" translate="no">​</a></h3>
<p>Before we start scraping, we need to set up our project. In this guide, we won't be using the crawler templates (<code>Playwright</code> and <code>BeautifulSoup</code>), so we'll set up the project manually.</p>
<ol>
<li class="">
<p>Install <a href="https://python-poetry.org/" target="_blank" rel="noopener noreferrer"><code>Poetry</code></a></p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pipx </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> poetry</span><br></div></code></pre></div></div>
</li>
<li class="">
<p>Create and navigate to the project folder.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token function" style="color:#d73a49">mkdir</span><span class="token plain"> crunchbase-crawlee </span><span class="token operator" style="color:#393A34">&amp;&amp;</span><span class="token plain"> </span><span class="token builtin class-name">cd</span><span class="token plain"> crunchbase-crawlee</span><br></div></code></pre></div></div>
</li>
<li class="">
<p>Initialize the project using Poetry.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">poetry init</span><br></div></code></pre></div></div>
<p>When prompted:</p>
<ul>
<li class="">For "Compatible Python versions", enter: <code>&gt;={your Python version},&lt;4.0</code>
(For example, if you're using Python 3.10, enter: <code>&gt;=3.10,&lt;4.0</code>)</li>
<li class="">Leave all other fields empty by pressing Enter</li>
<li class="">Confirm the generation by typing "yes"</li>
</ul>
</li>
<li class="">
<p>Add Crawlee and the necessary extras to your project using <code>Poetry</code>.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">poetry </span><span class="token function" style="color:#d73a49">add</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'crawlee[parsel,curl-impersonate]'</span><br></div></code></pre></div></div>
</li>
<li class="">
<p>Complete the project setup by creating the standard file structure for <code>Crawlee for Python</code> projects.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token function" style="color:#d73a49">mkdir</span><span class="token plain"> crunchbase-crawlee </span><span class="token operator" style="color:#393A34">&amp;&amp;</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">touch</span><span class="token plain"> crunchbase-crawlee/</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain">__init__.py,__main__.py,main.py,routes.py</span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
</li>
</ol>
<p>After setting up the basic project structure, we can explore different methods of obtaining data from Crunchbase.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="choosing-the-data-source">Choosing the data source<a href="https://crawlee.dev/blog/scrape-crunchbase-python#choosing-the-data-source" class="hash-link" aria-label="Direct link to Choosing the data source" title="Direct link to Choosing the data source" translate="no">​</a></h3>
<p>While we can extract target data directly from the <a href="https://www.crunchbase.com/organization/apify" target="_blank" rel="noopener noreferrer">company page</a>, we need to choose the best way to navigate the site.</p>
<p>A careful examination of Crunchbase's structure shows that we have three main options for obtaining data:</p>
<ol>
<li class=""><a href="https://www.crunchbase.com/www-sitemaps/sitemap-index.xml" target="_blank" rel="noopener noreferrer"><code>Sitemap</code></a> - for complete site traversal.</li>
<li class=""><a href="https://www.crunchbase.com/discover/organization.companies" target="_blank" rel="noopener noreferrer"><code>Search</code></a> - for targeted data collection.</li>
<li class=""><a href="https://data.crunchbase.com/v4-legacy/docs/crunchbase-basic-getting-started" target="_blank" rel="noopener noreferrer">Official API</a> - recommended method.</li>
</ol>
<p>Let's examine each of these approaches in detail.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="scraping-crunchbase-using-sitemap-and-crawlee-for-python">Scraping Crunchbase using sitemap and Crawlee for Python<a href="https://crawlee.dev/blog/scrape-crunchbase-python#scraping-crunchbase-using-sitemap-and-crawlee-for-python" class="hash-link" aria-label="Direct link to Scraping Crunchbase using sitemap and Crawlee for Python" title="Direct link to Scraping Crunchbase using sitemap and Crawlee for Python" translate="no">​</a></h2>
<p>A <code>Sitemap</code> is a standard mechanism for site navigation, used by crawlers such as those of <a href="https://google.com/" target="_blank" rel="noopener noreferrer"><code>Google</code></a> and <a href="https://ahrefs.com/" target="_blank" rel="noopener noreferrer"><code>Ahrefs</code></a>, as well as by other search engines. Crawlers are expected to follow the rules described in <a href="https://www.crunchbase.com/robots.txt" target="_blank" rel="noopener noreferrer"><code>robots.txt</code></a>.</p>
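<p>As a rough illustration of what following <code>robots.txt</code> means in practice, here is a minimal sketch that uses Python's standard <code>urllib.robotparser</code> module. The rules in <code>SAMPLE_ROBOTS_TXT</code>, the <code>my-crawler</code> user agent, and the <code>example.com</code> URLs are assumptions made up for this example. They are not Crunchbase's actual rules, so fetch and read the real file before crawling.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Crunchbase's actual robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /

Sitemap: https://example.com/sitemap-index.xml
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# 'my-crawler' is a placeholder user agent name.
allowed = parser.can_fetch('my-crawler', 'https://example.com/organization/apify')
blocked = parser.can_fetch('my-crawler', 'https://example.com/private/report')

# Sitemap locations declared in robots.txt (None if there are none).
sitemaps = parser.site_maps()

print(allowed, blocked, sitemaps)
```

<p>Parsing already-downloaded text keeps the check independent of the HTTP client, which matters when the site only answers browser-like requests.</p>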
<p>Let's look at the structure of Crunchbase's Sitemap:</p>
<p><img decoding="async" loading="lazy" alt="Sitemap first lvl" src="https://crawlee.dev/assets/images/sitemap_lvl_one-553a6b9df5c5d3c35a8987878456fe7b.webp" width="1335" height="994" class="img_ev3q"></p>
<p>As you can see, links to organization pages are located inside second-level <code>Sitemap</code> files, which are compressed using <code>gzip</code>.</p>
<p>The structure of one of these files looks like this:</p>
<p><img decoding="async" loading="lazy" alt="Sitemap second lvl" src="https://crawlee.dev/assets/images/sitemap_lvl_two-8f3213f305713ebf8bf91b32febfa234.webp" width="1374" height="919" class="img_ev3q"></p>
<p>The <code>lastmod</code> field is particularly important here. It allows tracking which companies have updated their information since the previous data collection. This is especially useful for regular data updates.</p>
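<p>To make the <code>lastmod</code> idea concrete, here is a minimal standard-library sketch that keeps only the URLs modified after a given moment. The sample XML, the <code>example.com</code> URLs, and the helper names <code>parse_lastmod</code> and <code>urls_modified_since</code> are assumptions for illustration. The exact timestamp format Crunchbase uses is not shown in this article, so the parser below accepts a few common ISO 8601 variants.</p>

```python
from datetime import datetime, timezone
from xml.etree import ElementTree

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Hypothetical second-level sitemap content, for illustration only.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/organization/alpha</loc>
    <lastmod>2024-11-02T10:15:00Z</lastmod>
  </url>
  <url>
    <loc>https://example.com/organization/beta</loc>
    <lastmod>2024-09-20</lastmod>
  </url>
  <url>
    <loc>https://example.com/organization/gamma</loc>
    <lastmod>2024-10-05T08:00:00+00:00</lastmod>
  </url>
</urlset>
"""


def parse_lastmod(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, treating naive values as UTC."""
    parsed = datetime.fromisoformat(value.strip().replace('Z', '+00:00'))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def urls_modified_since(sitemap_xml: str, since: datetime) -> list[str]:
    """Return the <loc> values whose <lastmod> is later than `since`."""
    root = ElementTree.fromstring(sitemap_xml)
    fresh = []
    for url in root.findall('sm:url', SITEMAP_NS):
        loc = url.findtext('sm:loc', namespaces=SITEMAP_NS)
        lastmod = url.findtext('sm:lastmod', namespaces=SITEMAP_NS)
        if loc and lastmod and parse_lastmod(lastmod) > since:
            fresh.append(loc.strip())
    return fresh


previous_run = datetime(2024, 10, 1, tzinfo=timezone.utc)
fresh_urls = urls_modified_since(SAMPLE_SITEMAP, previous_run)
print(fresh_urls)
```

<p>Storing the timestamp of the previous run and passing it as <code>since</code> would let a recurring job skip companies that have not changed.</p>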
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-configuring-the-crawler-for-scraping">1. Configuring the crawler for scraping<a href="https://crawlee.dev/blog/scrape-crunchbase-python#1-configuring-the-crawler-for-scraping" class="hash-link" aria-label="Direct link to 1. Configuring the crawler for scraping" title="Direct link to 1. Configuring the crawler for scraping" translate="no">​</a></h3>
<p>To work with the site, we'll use <a href="https://www.crawlee.dev/python/api/class/CurlImpersonateHttpClient" target="_blank" rel="noopener noreferrer"><code>CurlImpersonateHttpClient</code></a>, which impersonates a <code>Safari</code> browser. While this choice might seem unexpected for working with a sitemap, it's necessitated by Crunchbase's protection features.</p>
<p>The reason is that Crunchbase uses <a href="https://www.cloudflare.com/" target="_blank" rel="noopener noreferrer">Cloudflare</a> to protect against automated access. This is clearly visible when analyzing traffic on a company page:</p>
<p><img decoding="async" loading="lazy" alt="Cloudflare Link" src="https://crawlee.dev/assets/images/cloudflare_link-bf8b6ba2c873ccb31463258e5964e39b.webp" width="1919" height="995" class="img_ev3q"></p>
<p>Interestingly, <code>challenges.cloudflare</code> is executed only after the document containing the data has loaded. This means we receive the data first, and only then does JavaScript check whether we're a bot. If our HTTP client's fingerprint is sufficiently similar to a real browser's, we'll successfully receive the data.</p>
<p>Cloudflare <a href="https://developers.cloudflare.com/waf/custom-rules/use-cases/allow-traffic-from-verified-bots/" target="_blank" rel="noopener noreferrer">also analyzes traffic at the sitemap level</a>. If our crawler doesn't look legitimate, access will be blocked. That's why we impersonate a real browser.</p>
<p>To prevent blocks due to overly aggressive crawling, we'll configure <a href="https://www.crawlee.dev/python/api/class/ConcurrencySettings" target="_blank" rel="noopener noreferrer"><code>ConcurrencySettings</code></a>.</p>
<p>When scaling this approach, you'll likely need proxies. Detailed information about proxy setup can be found in the <a href="https://www.crawlee.dev/python/docs/guides/proxy-management" target="_blank" rel="noopener noreferrer">documentation</a>.</p>
<p>We'll save our scraping results in <code>JSON</code> format. Here's how the basic crawler configuration looks:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># main.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> HttpHeaders</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ParselCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_clients </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> CurlImpersonateHttpClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">routes </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> router</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""The crawler entry point."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    concurrency_settings </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">max_concurrency</span><span 
class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> max_tasks_per_minute</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">50</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    http_client </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> CurlImpersonateHttpClient</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        impersonate</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'safari17_0'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        headers</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">HttpHeaders</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'accept-language'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token 
string" style="color:#e3116c">'en'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'accept-encoding'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'gzip, deflate, br, zstd'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ParselCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        request_handler</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">router</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_request_retries</span><span class="token operator" 
style="color:#393A34">=</span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        concurrency_settings</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">concurrency_settings</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        http_client</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">http_client</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">30</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://www.crunchbase.com/www-sitemaps/sitemap-index.xml'</span><span 
class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">export_data_json</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'crunchbase_data.json'</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-implementing-sitemap-navigation">2. Implementing sitemap navigation<a href="https://crawlee.dev/blog/scrape-crunchbase-python#2-implementing-sitemap-navigation" class="hash-link" aria-label="Direct link to 2. Implementing sitemap navigation" title="Direct link to 2. Implementing sitemap navigation" translate="no">​</a></h3>
<p>Sitemap navigation happens in two stages. In the first stage, we need to get a list of all files containing organization information:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># routes.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ParselCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">router </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Router</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Request</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">router </span><span class="token operator" style="color:#393A34">=</span><span 
class="token plain"> Router</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">ParselCrawlingContext</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">default_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ParselCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Default request handler."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'default_handler processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    requests </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        Request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token 
plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'sitemap'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> url </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">selector</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">xpath</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'//loc[contains(., "sitemap-organizations")]/text()'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">getall</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Since this is a tutorial, I don't want to upload more than one sitemap link</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    
</span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_requests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">requests</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> limit</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
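<p>The selection logic of this first stage can also be tried outside the crawler. The sketch below does the same filtering with the standard library's <code>ElementTree</code>, which is namespace-aware, so the sitemap namespace has to be spelled out. The sample index, its file names, and the helper <code>select_sitemaps</code> are hypothetical and only mirror the shape of a sitemap index.</p>

```python
from typing import Optional
from xml.etree import ElementTree

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Hypothetical sitemap index, for illustration only.
SAMPLE_INDEX = """\
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/www-sitemaps/sitemap-organizations-1.xml.gz</loc></sitemap>
  <sitemap><loc>https://example.com/www-sitemaps/sitemap-people-1.xml.gz</loc></sitemap>
  <sitemap><loc>https://example.com/www-sitemaps/sitemap-organizations-2.xml.gz</loc></sitemap>
</sitemapindex>
"""


def select_sitemaps(index_xml: str, keyword: str, limit: Optional[int] = None) -> list[str]:
    """Return sitemap URLs from an index whose <loc> contains `keyword`."""
    root = ElementTree.fromstring(index_xml.strip())
    urls = [
        loc.text.strip()
        for loc in root.iterfind('sm:sitemap/sm:loc', SITEMAP_NS)
        if loc.text and keyword in loc.text
    ]
    # `limit` plays the same role as limit=1 in the handler above.
    return urls if limit is None else urls[:limit]


all_org_sitemaps = select_sitemaps(SAMPLE_INDEX, 'sitemap-organizations')
first_only = select_sitemaps(SAMPLE_INDEX, 'sitemap-organizations', limit=1)
print(all_org_sitemaps, first_only)
```

<p>Running the filter against a saved copy of the index is a cheap way to check the keyword before sending any requests.</p>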
<p>In the second stage, we process second-level sitemap files stored in <code>gzip</code> format. This requires a special approach as the data needs to be decompressed first:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># routes.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> gzip </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> decompress</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> parsel </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Selector</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'sitemap'</span><span class="token punctuation" style="color:#393A34">)</span><span 
class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">sitemap_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ParselCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Sitemap gzip request handler."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'sitemap_handler processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token 
string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> decompress</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    selector </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Selector</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">data</span><span class="token 
punctuation" style="color:#393A34">.</span><span class="token plain">decode</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    requests </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">Request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'company'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> url </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> selector</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">xpath</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'//loc/text()'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">getall</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" 
style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_requests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">requests</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
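<p>To make the decompress-and-extract step concrete, here is a minimal standard-library sketch of the same idea. It is not the handler's actual implementation: the function name <code>extract_sitemap_urls</code> is hypothetical, and it assumes the sitemap arrives either as plain XML or as a gzip stream.</p>

```python
import gzip
import xml.etree.ElementTree as ET


def extract_sitemap_urls(raw: bytes) -> list[str]:
    """Return every <loc> value from a (possibly gzipped) sitemap."""
    # Gzip streams start with the magic bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)

    root = ET.fromstring(raw)
    urls = []
    for element in root.iter():
        # Tags are namespaced, e.g. "{http://...sitemap/0.9}loc",
        # so compare only the local part of the tag name.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            urls.append(element.text.strip())
    return urls
```

<p>Each returned URL could then be turned into a request with the <code>company</code> label, as the handler above does.</p>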
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-extracting-and-saving-data">3. Extracting and saving data<a href="https://crawlee.dev/blog/scrape-crunchbase-python#3-extracting-and-saving-data" class="hash-link" aria-label="Direct link to 3. Extracting and saving data" title="Direct link to 3. Extracting and saving data" translate="no">​</a></h3>
<p>Each company page contains a large amount of information. For demonstration purposes, we'll focus on the main fields: <code>Company Name</code>, <code>Short Description</code>, <code>Website</code>, and <code>Location</code>.</p>
<p>One of Crunchbase's advantages is that all data is stored in <code>JSON</code> format within the page:</p>
<p><img decoding="async" loading="lazy" alt="Company Data" src="https://crawlee.dev/assets/images/data_json-7c79a7387510a995f29ba5ce157f0845.webp" width="1919" height="841" class="img_ev3q"></p>
<p>This significantly simplifies data extraction: a single <code>XPath</code> selector retrieves the <code>JSON</code>, and <a href="https://jmespath.org/" target="_blank" rel="noopener noreferrer"><code>jmespath</code></a> expressions then extract the needed fields:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># routes.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'company'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">company_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ParselCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token 
plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Company request handler."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'company_handler processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    json_selector </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">selector</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">xpath</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'//*[@id="ng-state"]/text()'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'Company Name'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> json_selector</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">jmespath</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'HttpState.*.data[].properties.identifier.value'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" 
style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'Short Description'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> json_selector</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">jmespath</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'HttpState.*.data[].properties.short_description'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'Website'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> json_selector</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">jmespath</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'HttpState.*.data[].cards.company_about_fields2.website.value'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">            </span><span class="token string" style="color:#e3116c">'Location'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'; '</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">join</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                json_selector</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">jmespath</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token string" style="color:#e3116c">'HttpState.*.data[].cards.company_about_fields2.location_identifiers[].value'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">getall</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token 
plain">    </span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
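    </span><span class=&quot;token punctuation&quot;">
<p>If the <code>jmespath</code> expressions look opaque, the sketch below shows roughly what a path such as <code>HttpState.*.data[].properties.identifier.value</code> does, written in plain Python. The helper <code>project</code> and the sample JSON are hypothetical and heavily simplified. The real page state is much larger, and this is an approximation of the projection semantics, not a replacement for the library.</p>

```python
import json


def project(state: dict, path: list[str]) -> list:
    """Collect the value at `path` from every HttpState entry, skipping gaps."""
    results = []
    for entry in state.get("HttpState", {}).values():
        data = entry.get("data") if isinstance(entry, dict) else None
        if data is None:
            continue
        # `data[]` flattens one level: a list contributes each element,
        # any other value is kept as a single item.
        items = data if isinstance(data, list) else [data]
        for node in items:
            for key in path:
                node = node.get(key) if isinstance(node, dict) else None
                if node is None:
                    break
            if node is not None:
                results.append(node)
    return results


# Hypothetical, heavily trimmed stand-in for the ng-state JSON.
SAMPLE_STATE = json.loads("""
{"HttpState": {
  "request-1": {"data": {"properties": {
      "identifier": {"value": "Apify"},
      "short_description": "Web scraping platform"}}},
  "request-2": {"data": {"cards": {}}}
}}
""")

names = project(SAMPLE_STATE, ["properties", "identifier", "value"])
first_name = names[0] if names else None  # mirrors .get() on the selector
```

<p>Entries that lack the requested path are simply skipped, which is why the handler can call <code>.get()</code> without checking which cached response actually holds the field.</p>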
<p>The collected data is saved in <code>Crawlee for Python</code>'s internal storage using the <code>context.push_data</code> method. When the crawler finishes, we export all collected data to a JSON file:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># main.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">export_data_json</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'crunchbase_data.json'</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="4-running-the-project">4. Running the project<a href="https://crawlee.dev/blog/scrape-crunchbase-python#4-running-the-project" class="hash-link" aria-label="Direct link to 4. Running the project" title="Direct link to 4. Running the project" translate="no">​</a></h3>
<p>With all components in place, we need to create an entry point for our crawler:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># __main__.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">main </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> main</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" 
style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Execute the crawler using Poetry:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">poetry run python </span><span class="token parameter variable" style="color:#36acaa">-m</span><span class="token plain"> crunchbase-crawlee</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="5-finally-characteristics-of-using-the-sitemap-crawler">5. Finally, characteristics of using the sitemap crawler<a href="https://crawlee.dev/blog/scrape-crunchbase-python#5-finally-characteristics-of-using-the-sitemap-crawler" class="hash-link" aria-label="Direct link to 5. Finally, characteristics of using the sitemap crawler" title="Direct link to 5. Finally, characteristics of using the sitemap crawler" translate="no">​</a></h3>
<p>The sitemap approach has its distinct advantages and limitations. It's ideal in the following cases:</p>
<ul>
<li class="">When you need to collect data about all companies on the platform</li>
<li class="">When there are no specific company selection criteria</li>
<li class="">When you have sufficient time and computational resources</li>
</ul>
<p>However, there are significant limitations to consider:</p>
<ul>
<li class="">Almost no ability to filter data during collection</li>
<li class="">Requires constant monitoring of Cloudflare blocks</li>
<li class="">Scaling the solution requires proxy servers, which increases project costs</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="using-search-for-scraping-crunchbase">Using search for scraping Crunchbase<a href="https://crawlee.dev/blog/scrape-crunchbase-python#using-search-for-scraping-crunchbase" class="hash-link" aria-label="Direct link to Using search for scraping Crunchbase" title="Direct link to Using search for scraping Crunchbase" translate="no">​</a></h2>
<p>The limitations of the sitemap approach might point to search as the next solution. However, Crunchbase applies tighter security measures to its search functionality compared to its public pages.</p>
<p>The key difference lies in how Cloudflare protection works. When accessing a company page, we receive the data before the <code>challenges.cloudflare</code> check runs, whereas the search API only responds to requests carrying valid <code>cookies</code> issued after passing this check.</p>
<p>Let's verify this in practice. Open the following link in Incognito mode:</p>
<div class="language-plaintext codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-plaintext codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">https://www.crunchbase.com/v4/data/autocompletes?query=Ap&amp;collection_ids=organizations&amp;limit=25&amp;source=topSearch</span><br></div></code></pre></div></div>
<p>When analyzing the traffic, we'll see the following pattern:</p>
<p><img decoding="async" loading="lazy" alt="Search Protect" src="https://crawlee.dev/assets/images/search_protect-3b4a1a1934d54c12ac210217919b8b88.webp" width="1916" height="995" class="img_ev3q"></p>
<p>The sequence of events here is:</p>
<ol>
<li class="">First, the page is blocked with code <code>403</code></li>
<li class="">Then the <code>challenges.cloudflare</code> check is performed</li>
<li class="">Only after successfully passing the check do we receive data with code <code>200</code></li>
</ol>
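<p>The sequence above can be sketched as a small classifier that a crawler might use to decide whether a response is usable. This is only an illustration: the function name and the returned labels are hypothetical, and it relies on the <code>cf-mitigated: challenge</code> response header that Cloudflare documents for challenge pages.</p>

```python
def classify_response(status: int, headers: dict[str, str]) -> str:
    """Label a response as 'challenge', 'ok', 'blocked', or 'other'."""
    # Header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value.lower() for name, value in headers.items()}

    # Cloudflare marks challenge pages with this header.
    if normalized.get("cf-mitigated") == "challenge":
        return "challenge"
    if status == 200:
        return "ok"
    if status == 403:
        return "blocked"
    return "other"
```

<p>A plain HTTP client can detect the <code>challenge</code> state this way, but it cannot solve the challenge itself.</p>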
<p>Automating this process would require a <code>headless</code> browser capable of bypassing <a href="https://www.cloudflare.com/application-services/products/turnstile/" target="_blank" rel="noopener noreferrer"><code>Cloudflare Turnstile</code></a>. The current version of <code>Crawlee for Python</code> (v0.5.0) doesn't provide this functionality, although it's planned for future development.</p>
<p>You can extend the capabilities of Crawlee for Python by integrating <a href="https://camoufox.com/" target="_blank" rel="noopener noreferrer"><code>Camoufox</code></a>, as shown in this <a href="https://www.crawlee.dev/python/docs/examples/playwright-crawler-with-camoufox" target="_blank" rel="noopener noreferrer">example</a>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="working-with-the-official-crunchbase-api">Working with the official Crunchbase API<a href="https://crawlee.dev/blog/scrape-crunchbase-python#working-with-the-official-crunchbase-api" class="hash-link" aria-label="Direct link to Working with the official Crunchbase API" title="Direct link to Working with the official Crunchbase API" translate="no">​</a></h2>
<p>Crunchbase provides a <a href="https://data.crunchbase.com/v4-legacy/docs/crunchbase-basic-using-api" target="_blank" rel="noopener noreferrer">free API</a> with basic functionality. Paid subscription users get expanded data access. Complete documentation for available endpoints can be found in the <a href="https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api" target="_blank" rel="noopener noreferrer">official API specification</a>.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-setting-up-api-access">1. Setting up API access<a href="https://crawlee.dev/blog/scrape-crunchbase-python#1-setting-up-api-access" class="hash-link" aria-label="Direct link to 1. Setting up API access" title="Direct link to 1. Setting up API access" translate="no">​</a></h3>
<p>To start working with the API, follow these steps:</p>
<ol>
<li class=""><a href="https://www.crunchbase.com/register" target="_blank" rel="noopener noreferrer">Create a Crunchbase account</a></li>
<li class="">Go to the Integrations section</li>
<li class="">Create a Crunchbase Basic API key</li>
</ol>
<p>Although the documentation states that key activation may take up to an hour, it usually starts working immediately after creation.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-configuring-the-crawler-for-api-work">2. Configuring the crawler for API work<a href="https://crawlee.dev/blog/scrape-crunchbase-python#2-configuring-the-crawler-for-api-work" class="hash-link" aria-label="Direct link to 2. Configuring the crawler for API work" title="Direct link to 2. Configuring the crawler for API work" translate="no">​</a></h3>
<p>The API enforces a rate limit: no more than 200 requests per minute, and the free version allows significantly fewer. With this in mind, let's configure <a href="https://www.crawlee.dev/python/api/class/ConcurrencySettings" target="_blank" rel="noopener noreferrer"><code>ConcurrencySettings</code></a>. Since we're working with the official API, we don't need to mask our HTTP client, so we'll use the standard <a href="https://www.crawlee.dev/python/api/class/HttpxHttpClient" target="_blank" rel="noopener noreferrer"><code>HttpxHttpClient</code></a> with preset headers.</p>
<p>First, let's save the API key in an environment variable:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token builtin class-name">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:#36acaa">CRUNCHBASE_TOKEN</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain">YOUR KEY</span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
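<p>If the variable is missing, the API key header ends up empty and every request fails authentication. A fail-fast check like the one below can surface that earlier. This is an optional sketch, and <code>load_token</code> is a hypothetical helper, not part of Crawlee.</p>

```python
import os
from collections.abc import Mapping


def load_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the Crunchbase API key or raise if it is not configured."""
    token = env.get("CRUNCHBASE_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "CRUNCHBASE_TOKEN is not set; export it before running the crawler"
        )
    return token
```

<p>Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.</p>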
<p>Here's how the crawler configuration for working with the API looks:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># main.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> os</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> HttpCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_clients </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> HttpxHttpClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" style="color:#00009f">import</span><span 
class="token plain"> ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> HttpHeaders</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">routes </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> router</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">CRUNCHBASE_TOKEN </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> os</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">getenv</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'CRUNCHBASE_TOKEN'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">''</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token 
function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""The crawler entry point."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    concurrency_settings </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">max_tasks_per_minute</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">60</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    http_client </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> HttpxHttpClient</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        headers</span><span class="token operator" 
style="color:#393A34">=</span><span class="token plain">HttpHeaders</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'accept-encoding'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'gzip, deflate, br, zstd'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'X-cb-user-key'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> CRUNCHBASE_TOKEN</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> HttpCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        request_handler</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">router</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        concurrency_settings</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">concurrency_settings</span><span class="token punctuation" style="color:#393A34">,</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        http_client</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">http_client</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">30</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://api.crunchbase.com/api/v4/autocompletes?query=apify&amp;collection_ids=organizations&amp;limit=25'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">export_data_json</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'crunchbase_data.json'</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
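<p>The entry point above falls back to an empty string when <code>CRUNCHBASE_TOKEN</code> is not set, so a missing token only surfaces later as failed API requests. A minimal sketch of a fail-fast check, assuming you prefer to stop before the crawler starts; <code>require_token</code> is a hypothetical helper, not part of Crawlee:</p>

```python
import os


def require_token(env_var: str = 'CRUNCHBASE_TOKEN') -> str:
    """Return the API token from the environment, or raise if it is missing.

    Hypothetical helper for illustration; it is not part of Crawlee.
    """
    token = os.getenv(env_var, '').strip()
    if not token:
        raise RuntimeError(f'Set the {env_var} environment variable before running the crawler.')
    return token
```

<p>With this in place, <code>CRUNCHBASE_TOKEN = require_token()</code> could replace the <code>os.getenv</code> call at the top of the module.</p>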
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-processing-search-results">3. Processing search results<a href="https://crawlee.dev/blog/scrape-crunchbase-python#3-processing-search-results" class="hash-link" aria-label="Direct link to 3. Processing search results" title="Direct link to 3. Processing search results" translate="no">​</a></h3>
<p>We'll work with two API endpoints:</p>
<ol>
<li class=""><a href="https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api/1.0.3#/Autocomplete/get_autocompletes" target="_blank" rel="noopener noreferrer">get_autocompletes</a> - for searching organizations by keyword</li>
<li class=""><a href="https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api/1.0.3#/Entity/get_entities_organizations__entity_id_" target="_blank" rel="noopener noreferrer">get_entities_organizations__entity_id</a> - for retrieving details about a single organization</li>
</ol>
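<p>The URLs for these two endpoints can be assembled with the standard library instead of hand-written query strings. This is a sketch only: the helper names <code>build_autocomplete_url</code> and <code>build_organization_url</code> are made up for illustration, and the parameters mirror the ones used elsewhere in this article.</p>

```python
from urllib.parse import quote, urlencode

API_BASE = 'https://api.crunchbase.com/api/v4'


def build_autocomplete_url(query: str, limit: int = 25) -> str:
    """Build a search URL for the autocompletes endpoint (hypothetical helper)."""
    params = {'query': query, 'collection_ids': 'organizations', 'limit': limit}
    return f'{API_BASE}/autocompletes?{urlencode(params)}'


def build_organization_url(permalink: str, field_ids: list[str]) -> str:
    """Build a details URL for one organization (hypothetical helper)."""
    # urlencode percent-encodes the commas that separate the field IDs.
    params = {'field_ids': ','.join(field_ids)}
    return f'{API_BASE}/entities/organizations/{quote(permalink, safe="")}?{urlencode(params)}'
```

<p>Letting <code>urlencode</code> handle escaping keeps the query string valid when a search term contains spaces or other reserved characters.</p>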
<p>First, let's implement search results processing:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> json</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawlers </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> HttpCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">router </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Router</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Request</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">router </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> 
Router</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">HttpCrawlingContext</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">default_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> HttpCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Default request handler."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'default_handler processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> json</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">loads</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">http_response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    requests </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> entity </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'entities'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        permalink </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> entity</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'identifier'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" 
style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'permalink'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        requests</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">append</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            Request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                url</span><span class="token operator" style="color:#393A34">=</span><span class="token string-interpolation string" style="color:#e3116c">f'https://api.crunchbase.com/api/v4/entities/organizations/</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">permalink</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">?field_ids=short_description%2Clocation_identifiers%2Cwebsite_url'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'company'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">            </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_requests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">requests</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
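<p>The handler above indexes straight into the response, so one entity without a permalink would raise a <code>KeyError</code> and fail the whole request. A more defensive variant could pull the parsing into a pure function. This is a sketch under the assumption that the response has the shape used above (<code>entities</code>, then <code>identifier</code>, then <code>permalink</code>); <code>extract_permalinks</code> is a hypothetical name:</p>

```python
from typing import Any


def extract_permalinks(data: dict[str, Any]) -> list[str]:
    """Collect organization permalinks from an autocomplete response.

    Hypothetical helper: entities without an identifier or permalink are skipped.
    """
    permalinks = []
    for entity in data.get('entities', []):
        permalink = entity.get('identifier', {}).get('permalink')
        if permalink:
            permalinks.append(permalink)
    return permalinks
```

<p>Inside <code>default_handler</code>, the loop would then iterate over <code>extract_permalinks(data)</code> and build one <code>Request</code> per permalink, and the parsing logic can be unit-tested without any network access.</p>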
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="4-extracting-company-data">4. Extracting company data<a href="https://crawlee.dev/blog/scrape-crunchbase-python#4-extracting-company-data" class="hash-link" aria-label="Direct link to 4. Extracting company data" title="Direct link to 4. Extracting company data" translate="no">​</a></h3>
<p>After getting the list of companies, we extract detailed information about each one:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'company'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">company_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> HttpCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" 
style="color:#e3116c">"""Company request handler."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'company_handler processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> json</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">loads</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_response</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'Company Name'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'properties'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'identifier'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'value'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">            </span><span class="token string" style="color:#e3116c">'Short Description'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'properties'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'short_description'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'Website'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'properties'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'website_url'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'Location'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'; '</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">join</span><span class="token punctuation" 
style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">item</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'value'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> item </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'properties'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'location_identifiers'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
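<p>The same mapping can be written as a standalone function that tolerates missing fields, which makes it easy to check against a saved response. This is a sketch, assuming the response shape used in the handler above; <code>to_record</code> is a hypothetical helper, and the output keys match the ones pushed to the dataset:</p>

```python
from typing import Any


def to_record(data: dict[str, Any]) -> dict[str, Any]:
    """Map an organization response to the flat record stored in the dataset.

    Hypothetical helper: missing fields become None (or an empty string for Location).
    """
    properties = data.get('properties', {})
    locations = properties.get('location_identifiers') or []
    return {
        'Company Name': properties.get('identifier', {}).get('value'),
        'Short Description': properties.get('short_description'),
        'Website': properties.get('website_url'),
        'Location': '; '.join(item['value'] for item in locations),
    }
```

<p>The handler body would then reduce to <code>await context.push_data(to_record(data))</code>.</p>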
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="5-advanced-location-based-search">5. Advanced location-based search<a href="https://crawlee.dev/blog/scrape-crunchbase-python#5-advanced-location-based-search" class="hash-link" aria-label="Direct link to 5. Advanced location-based search" title="Direct link to 5. Advanced location-based search" translate="no">​</a></h3>
<p>If you need more flexible search capabilities, the API provides a dedicated <a href="https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api/1.0.3#/Search/post_searches_organizations" target="_blank" rel="noopener noreferrer"><code>search</code></a> endpoint. Here's an example of searching for all companies in Prague, where the UUID in <code>values</code> identifies the location:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">payload </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">'field_ids'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'identifier'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'location_identifiers'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'short_description'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'website_url'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">'limit'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">200</span><span class="token 
punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">'order'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'field_id'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'rank_org'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'sort'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'asc'</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">'query'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'field_id'</span><span class="token punctuation" style="color:#393A34">:</span><span 
class="token plain"> </span><span class="token string" style="color:#e3116c">'location_identifiers'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'operator_id'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'includes'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'type'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'predicate'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'values'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'e0b951dc-f710-8754-ddde-5ef04dddd9f8'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token 
punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'field_id'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'facet_ids'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'operator_id'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'includes'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'type'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'predicate'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'values'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'company'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">serialized_payload </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> json</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">dumps</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">payload</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        Request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            url</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'https://api.crunchbase.com/api/v4/searches/organizations'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            method</span><span class="token operator" 
style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'POST'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            payload</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">serialized_payload</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            use_extended_unique_key</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            headers</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">HttpHeaders</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'Content-Type'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'application/json'</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'search'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>For processing search results and pagination, we use the following handler:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'search'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">search_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> HttpCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" 
style="color:#e3116c">"""Search results handler with pagination support."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'search_handler processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> json</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">loads</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_response</span><span class="token 
punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    last_entity </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    results </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> entity </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'entities'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        last_entity </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> entity</span><span class="token punctuation" 
style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'uuid'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        results</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">append</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'Company Name'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> entity</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'properties'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'identifier'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'value'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'Short Description'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> entity</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" 
style="color:#e3116c">'properties'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'short_description'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'Website'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> entity</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'properties'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'website_url'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'Location'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'; '</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">join</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">item</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'value'</span><span class="token punctuation" 
style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> item </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> entity</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'properties'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'location_identifiers'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> results</span><span class="token punctuation" style="color:#393A34">:</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">results</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> last_entity</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        payload </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> json</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">loads</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">payload</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        payload</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'after_id'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" 
style="color:#393A34">=</span><span class="token plain"> last_entity</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        payload </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> json</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">dumps</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">payload</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_requests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                Request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    url</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'https://api.crunchbase.com/api/v4/searches/organizations'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">                    method</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'POST'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    payload</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">payload</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    use_extended_unique_key</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    headers</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">HttpHeaders</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'Content-Type'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'application/json'</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" 
style="color:#e3116c">'search'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
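<p>The pagination step in the handler above can be isolated into a small pure function, which makes it easy to check without calling the API. This is a minimal sketch: <code>next_page_payload</code> is a hypothetical helper (not part of Crawlee), and the sample response uses made-up UUIDs.</p>

```python
import json
from typing import Optional


def next_page_payload(serialized_payload: str, data: dict) -> Optional[str]:
    """Build the request body for the next page, or return None when the page is empty."""
    entities = data.get('entities', [])
    if not entities:
        # An empty page means there is nothing left to paginate over.
        return None

    payload = json.loads(serialized_payload)
    # The next page starts after the UUID of the last entity on the current page.
    payload['after_id'] = entities[-1]['uuid']
    return json.dumps(payload)


# Made-up data for illustration only.
first_payload = json.dumps({'order': [{'field_id': 'rank_org', 'sort': 'asc'}]})
sample_page = {'entities': [{'uuid': 'aaa'}, {'uuid': 'bbb'}]}

second_payload = next_page_payload(first_payload, sample_page)
print(second_payload)
print(next_page_payload(second_payload, {'entities': []}))
```

<p>In the handler, a <code>None</code> result would simply mean that no further request is enqueued.</p>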
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="6-finally-free-api-limitations">6. Finally, free API limitations<a href="https://crawlee.dev/blog/scrape-crunchbase-python#6-finally-free-api-limitations" class="hash-link" aria-label="Direct link to 6. Finally, free API limitations" title="Direct link to 6. Finally, free API limitations" translate="no">​</a></h3>
<p>The free version of the API has significant limitations:</p>
<ul>
<li class="">Limited set of available endpoints</li>
<li class="">The autocompletes function works only for company searches</li>
<li class="">Not all data fields are accessible</li>
<li class="">Limited search filtering capabilities</li>
</ul>
<p>Consider a paid subscription for production-level work. The API provides the most reliable way to access Crunchbase data, even with its rate constraints.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="whats-your-best-path-forward">What’s your best path forward?<a href="https://crawlee.dev/blog/scrape-crunchbase-python#whats-your-best-path-forward" class="hash-link" aria-label="Direct link to What’s your best path forward?" title="Direct link to What’s your best path forward?" translate="no">​</a></h2>
<p>We've explored three different approaches to obtaining data from Crunchbase:</p>
<ol>
<li class=""><strong>Sitemap</strong> - for large-scale data collection</li>
<li class=""><strong>Search</strong> - difficult to automate due to Cloudflare protection</li>
<li class=""><strong>Official API</strong> - the most reliable solution for commercial projects</li>
</ol>
<p>Each method has its advantages, but for most projects, I recommend using the official API despite its limitations in the free version.</p>
<p>The complete source code is available in my <a href="https://github.com/Mantisus/crunchbase-crawlee" target="_blank" rel="noopener noreferrer">repository</a>. Have questions or want to discuss implementation details? Join our <a href="https://discord.com/invite/jyEM2PRvMU" target="_blank" rel="noopener noreferrer">Discord</a> - our community of developers is there to help.</p>]]></content:encoded>
            <category>community</category>
        </item>
        <item>
            <title><![CDATA[How to scrape Google Maps data using Python]]></title>
            <link>https://crawlee.dev/blog/scrape-google-maps</link>
            <guid>https://crawlee.dev/blog/scrape-google-maps</guid>
            <pubDate>Fri, 13 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to scrape Google Maps data using Crawlee for Python.]]></description>
            <content:encoded><![CDATA[<p>Millions of people use Google Maps daily, leaving behind a goldmine of data just waiting to be analyzed. In this guide, I'll show you how to build a reliable scraper using Crawlee and Python to extract locations, ratings, and reviews from Google Maps, all while handling its dynamic content challenges.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>One of our community members wrote this post as a contribution to the Crawlee Blog. If you would like to contribute posts like this to the Crawlee Blog, please reach out to us on our <a href="https://apify.com/discord" target="_blank" rel="noopener noreferrer">Discord channel</a>.</p></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="what-data-will-we-extract-from-google-maps">What data will we extract from Google Maps?<a href="https://crawlee.dev/blog/scrape-google-maps#what-data-will-we-extract-from-google-maps" class="hash-link" aria-label="Direct link to What data will we extract from Google Maps?" title="Direct link to What data will we extract from Google Maps?" translate="no">​</a></h2>
<p>We’ll collect information about hotels in a specific city. You can also customize your search to meet your requirements. For example, you might search for "hotels near me", "5-star hotels in Bombay", or other similar queries.</p>
<p><img decoding="async" loading="lazy" alt="Google Maps Data Screenshot" src="https://crawlee.dev/assets/images/scrape-google-maps-with-crawlee-screenshot-data-to-scrape-00e7e4e3498679b8a7611eafd0a1bfbe.webp" width="1906" height="879" class="img_ev3q"></p>
<p>We’ll extract important data, including the hotel name, rating, review count, price, a link to the hotel page on Google Maps, and all available amenities. Here’s an example of what the extracted data will look like:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"name"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Vividus Hotels, Bangalore"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"rating"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"4.3"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"reviews"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"633"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"price"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">"₹3,667"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"amenities"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"Pool available"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"Free breakfast available"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"Free Wi-Fi available"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"Free parking available"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"link"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> 
</span><span class="token string" style="color:#e3116c">"https://www.google.com/maps/place/Vividus+Hotels+,+Bangalore/..."</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
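<p>If you'd like to hold each result in a typed structure before saving it, the record above can be modeled as a small dataclass. This is a minimal sketch rather than part of the scraper itself: the <code>HotelRecord</code> class and the <code>to_json</code> helper are hypothetical names introduced for illustration, and the field names simply mirror the JSON example.</p>

```python
from dataclasses import asdict, dataclass, field
import json


@dataclass
class HotelRecord:
    """One scraped hotel; field names mirror the JSON example (hypothetical class)."""

    name: str
    rating: str
    reviews: str
    price: str
    link: str
    # default_factory gives every record its own list instead of a shared one
    amenities: list = field(default_factory=list)


def to_json(record: HotelRecord) -> str:
    # ensure_ascii=False keeps currency symbols such as the rupee sign readable
    return json.dumps(asdict(record), ensure_ascii=False, indent=4)


record = HotelRecord(
    name="Vividus Hotels, Bangalore",
    rating="4.3",
    reviews="633",
    price="₹3,667",
    link="https://www.google.com/maps/place/Vividus+Hotels+,+Bangalore/...",
    amenities=["Pool available", "Free Wi-Fi available"],
)
print(to_json(record))
```

<p>Keeping <code>rating</code>, <code>reviews</code>, and <code>price</code> as strings matches the example output; converting them to numbers is a separate design decision.</p>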
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="building-a-google-maps-scraper">Building a Google Maps scraper<a href="https://crawlee.dev/blog/scrape-google-maps#building-a-google-maps-scraper" class="hash-link" aria-label="Direct link to Building a Google Maps scraper" title="Direct link to Building a Google Maps scraper" translate="no">​</a></h2>
<p>Let's build a Google Maps scraper step-by-step.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>Crawlee requires Python 3.9 or later.</p></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-setting-up-your-environment">1. Setting up your environment<a href="https://crawlee.dev/blog/scrape-google-maps#1-setting-up-your-environment" class="hash-link" aria-label="Direct link to 1. Setting up your environment" title="Direct link to 1. Setting up your environment" translate="no">​</a></h3>
<p>First, let's set up everything you’ll need to run the scraper. Open your terminal and run these commands:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Create and activate a virtual environment</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">python </span><span class="token parameter variable" style="color:#36acaa">-m</span><span class="token plain"> venv google-maps-scraper</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Windows:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">.</span><span class="token punctuation" style="color:#393A34">\</span><span class="token plain">google-maps-scraper</span><span class="token punctuation" style="color:#393A34">\</span><span class="token plain">Scripts</span><span class="token punctuation" style="color:#393A34">\</span><span class="token plain">activate</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Mac/Linux:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token builtin class-name">source</span><span class="token plain"> 
google-maps-scraper/bin/activate</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># We plan to use Playwright with Crawlee, so we need to install both:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">pip </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> crawlee </span><span class="token string" style="color:#e3116c">"crawlee[playwright]"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">playwright </span><span class="token function" style="color:#d73a49">install</span><br></div></code></pre></div></div>
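<p>To confirm the setup worked before writing any scraping code, you can run a quick check from inside the activated virtual environment. The sketch below uses only the standard library; <code>check_environment</code> and <code>MIN_PYTHON</code> are hypothetical names, and the check only tells you whether the packages are importable, not whether the Playwright browsers were downloaded.</p>

```python
import importlib.util
import sys

MIN_PYTHON = (3, 9)  # minimum version taken from the note earlier in this post


def check_environment(packages=("crawlee", "playwright")):
    """Report whether the Python version and each package look usable."""
    report = {"python_ok": sys.version_info[:2] >= MIN_PYTHON}
    for name in packages:
        # find_spec returns None when a top-level package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'OK' if ok else 'MISSING'}")
```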
<p><em>If you're new to <strong>Crawlee</strong>, check out its easy-to-follow documentation. It’s available for both <a href="https://www.crawlee.dev/js/docs/quick-start" target="_blank" rel="noopener noreferrer">Node.js</a> and <a href="https://www.crawlee.dev/python/docs/quick-start" target="_blank" rel="noopener noreferrer">Python</a>.</em></p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>Before going ahead with the project, I'd like to ask you to star Crawlee for Python on <a href="https://github.com/apify/crawlee-python/" target="_blank" rel="noopener noreferrer">GitHub</a>. It helps us spread the word to fellow scraper developers.</p></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-connecting-to-google-maps">2. Connecting to Google Maps<a href="https://crawlee.dev/blog/scrape-google-maps#2-connecting-to-google-maps" class="hash-link" aria-label="Direct link to 2. Connecting to Google Maps" title="Direct link to 2. Connecting to Google Maps" translate="no">​</a></h3>
<p>Connecting to Google Maps takes three steps: setting up the crawler, defining how each page is handled, and launching the crawler.</p>
<p><strong>Step 1: Setting up the crawler</strong></p>
<p>The first step is to configure the crawler. We're using <a href="https://www.crawlee.dev/python/api/class/PlaywrightCrawler" target="_blank" rel="noopener noreferrer"><code>PlaywrightCrawler</code></a> from Crawlee, which gives us powerful tools for automated browsing. We set <code>headless=False</code> so the browser window stays visible during scraping, and we raise <code>request_handler_timeout</code> to 5 minutes so the handler has enough time to load and process each page.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">playwright_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> PlaywrightCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> timedelta</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Initialize crawler with browser visibility and timeout settings</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    headless</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">False</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> 
 </span><span class="token comment" style="color:#999988;font-style:italic"># Shows the browser window while scraping</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    request_handler_timeout</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">timedelta</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        minutes</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">5</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># Allows plenty of time for page loading</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p><strong>Step 2: Handling each page</strong></p>
<p>This function defines how each page is handled when the crawler visits it. It uses <code>context.page</code> to navigate to the target URL.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">scrape_google_maps</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    Establishes connection to Google Maps and handles the initial page load</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    """</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    page </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> page</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">goto</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f"Processing: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p><strong>Step 3: Launching the crawler</strong></p>
<p>Finally, the main function brings everything together. It creates a search URL, sets up the crawler, and starts the scraping process.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Prepare the search URL</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    search_query </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"hotels in bengaluru"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    start_url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string-interpolation string" 
style="color:#e3116c">f"https://www.google.com/maps/search/</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">search_query</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">replace</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">(</span><span class="token string-interpolation interpolation string" style="color:#e3116c">' '</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">,</span><span class="token string-interpolation interpolation"> </span><span class="token string-interpolation interpolation string" style="color:#e3116c">'+'</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">)</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Tell the crawler how to handle each page it visits</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">router</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">default_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">scrape_google_maps</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Start the scraping process</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">start_url</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"__main__"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
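<p>The <code>replace(' ', '+')</code> call above works for simple queries like this one. For queries that contain accented letters or reserved characters, one more robust option is to let the standard library's <code>urllib.parse.quote_plus</code> do the encoding. Here is a minimal sketch; <code>build_search_url</code> and <code>BASE_URL</code> are illustrative names and not part of Crawlee.</p>

```python
from urllib.parse import quote_plus

BASE_URL = "https://www.google.com/maps/search/"


def build_search_url(query: str) -> str:
    """Build a Google Maps search URL from a free-text query (hypothetical helper)."""
    # quote_plus turns spaces into "+" and percent-encodes everything else
    # that is not safe inside a URL, including non-ASCII letters
    return BASE_URL + quote_plus(query.strip())


print(build_search_url("hotels in bengaluru"))
print(build_search_url("cafés in São Paulo"))
```

<p>For the query used in this tutorial, the result is identical to the <code>replace</code> version, so you can swap it in without changing anything else.</p>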
<p>Let’s combine the above code snippets and save them in a file named <code>gmap_scraper.py</code>:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">playwright_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> PlaywrightCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> timedelta</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">scrape_google_maps</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    Establishes connection to Google Maps and handles the initial page load</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    """</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    page </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">goto</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f"Processing: </span><span class="token string-interpolation interpolation punctuation" 
style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    Configures and launches the crawler with custom settings</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    """</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    
</span><span class="token comment" style="color:#999988;font-style:italic"># Initialize crawler with browser visibility and timeout settings</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        headless</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">False</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># Shows the browser window while scraping</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        request_handler_timeout</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">timedelta</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            minutes</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">5</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># Allows plenty of time for page loading</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Tell the crawler how to handle each page it visits</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">router</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">default_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">scrape_google_maps</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Prepare the search URL</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    search_query </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"hotels in bengaluru"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    start_url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:#e3116c">f"https://www.google.com/maps/search/</span><span class="token 
string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">search_query</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">replace</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">(</span><span class="token string-interpolation interpolation string" style="color:#e3116c">' '</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">,</span><span class="token string-interpolation interpolation"> </span><span class="token string-interpolation interpolation string" style="color:#e3116c">'+'</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">)</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Start the scraping process</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">start_url</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"__main__"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
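<p>The start URL above is built by replacing spaces with <code>+</code>, which works for a simple query like "hotels in bengaluru". For queries containing other characters (ampersands, accented letters), a minimal sketch using the standard library's <code>urllib.parse.quote_plus</code> might look like the following. The helper name <code>build_search_url</code> and the constant <code>MAPS_SEARCH_BASE</code> are illustrative assumptions, not part of the original script.</p>

```python
from urllib.parse import quote_plus

# Base of the search URL used in the script above.
MAPS_SEARCH_BASE = "https://www.google.com/maps/search/"


def build_search_url(query: str) -> str:
    """Return a Google Maps search URL for a free-text query.

    Hypothetical helper: quote_plus turns spaces into '+' and
    percent-encodes everything else that is unsafe in a URL path.
    """
    return MAPS_SEARCH_BASE + quote_plus(query.strip())


if __name__ == "__main__":
    print(build_search_url("hotels in bengaluru"))
```

<p>For the query used in this tutorial, the result is the same URL that the <code>replace(' ', '+')</code> version produces, so the helper can be swapped in without changing behavior.</p>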
<p>Run the code using:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">$ python3 gmap_scraper.py</span><br></div></code></pre></div></div>
<p>When everything works correctly, you'll see output like this:</p>
<p><img decoding="async" loading="lazy" alt="Connect to page" src="https://crawlee.dev/assets/images/scrape-google-maps-with-crawlee-screenshot-connect-to-page-6d6391022d64446a161825935a307d8d.png" width="1280" height="720" class="img_ev3q"></p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-import-dependencies-and-defining-scraper-class">3. Importing dependencies and defining the scraper class<a href="https://crawlee.dev/blog/scrape-google-maps#3-import-dependencies-and-defining-scraper-class" class="hash-link" aria-label="Direct link to 3. Importing dependencies and defining the scraper class" title="Direct link to 3. Importing dependencies and defining the scraper class" translate="no">​</a></h3>
<p>Let's start with the basic structure and necessary imports:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> timedelta</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> typing </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Dict</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> Optional</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> Set</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">playwright_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> PlaywrightCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> playwright</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">async_api </span><span 
class="token keyword" style="color:#00009f">import</span><span class="token plain"> Page</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> ElementHandle</span><br></div></code></pre></div></div>
<p>The <code>GoogleMapsScraper</code> class serves as the main scraper engine:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">class</span><span class="token plain"> </span><span class="token class-name">GoogleMapsScraper</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">__init__</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> headless</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">bool</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> timeout_minutes</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">int</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">5</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            headless</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">headless</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            request_handler_timeout</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">timedelta</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">minutes</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">timeout_minutes</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">processed_names</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Set</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">]</span><span 
class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token builtin">set</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">setup_crawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">router</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">default_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_scrape_listings</span><span class="token punctuation" 
style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>This initialization code sets up two crucial components:</p>
<ol>
<li class="">A <code>PlaywrightCrawler</code> instance that runs either headlessly (without a visible browser window) or with a visible browser, and that gives each request handler up to <code>timeout_minutes</code> (five by default) to finish</li>
<li class="">A set to track processed business names, preventing duplicate entries</li>
</ol>
<p>The <code>setup_crawler</code> method registers <code>_scrape_listings</code>, our main scraping function, as the default handler, so it runs for every request the crawler processes.</p>
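<p>To make the role of the <code>processed_names</code> set concrete, here is a small standalone sketch of the same idea, with no browser involved. The <code>NameDeduplicator</code> class and its <code>accept</code> method are hypothetical names used only for illustration; the scraper itself keeps the set directly on <code>GoogleMapsScraper</code>.</p>

```python
from typing import List, Set


class NameDeduplicator:
    """Hypothetical helper that mirrors the scraper's processed_names set."""

    def __init__(self) -> None:
        self.processed_names: Set[str] = set()

    def accept(self, name: str) -> bool:
        """Return True the first time a name is seen, False afterwards."""
        if name in self.processed_names:
            return False
        self.processed_names.add(name)
        return True


def unique_names(names: List[str]) -> List[str]:
    """Keep the first occurrence of each name, preserving order."""
    dedup = NameDeduplicator()
    return [name for name in names if dedup.accept(name)]


if __name__ == "__main__":
    print(unique_names(["Hotel A", "Hotel B", "Hotel A"]))
```

<p>A set gives constant-time membership checks on average, which matters when the same listings are encountered more than once during a crawl.</p>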
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="4-understanding-google-maps-internal-code-structure">4. Understanding Google Maps' internal code structure<a href="https://crawlee.dev/blog/scrape-google-maps#4-understanding-google-maps-internal-code-structure" class="hash-link" aria-label="Direct link to 4. Understanding Google Maps' internal code structure" title="Direct link to 4. Understanding Google Maps' internal code structure" translate="no">​</a></h3>
<p>Before we dive into scraping, let's identify exactly which elements we need to target. When you search for hotels in Bengaluru, Google Maps renders each hotel's details in a consistent structure. The screenshots below show where each piece of information sits in the page's HTML.</p>
<p><strong>Hotel name:</strong></p>
<p><img decoding="async" loading="lazy" alt="Hotel name" src="https://crawlee.dev/assets/images/scrape-google-maps-with-crawlee-screenshot-name-d1fcc59eb4e3eec109fcbf5be0237fbc.webp" width="1906" height="705" class="img_ev3q"></p>
<p><strong>Hotel rating:</strong></p>
<p><img decoding="async" loading="lazy" alt="Hotel rating" src="https://crawlee.dev/assets/images/scrape-google-maps-with-crawlee-screenshot-ratings-7748ca46b1e14126de728add8313d286.webp" width="1908" height="706" class="img_ev3q"></p>
<p><strong>Hotel review count:</strong></p>
<p><img decoding="async" loading="lazy" alt="Hotel Review Count" src="https://crawlee.dev/assets/images/scrape-google-maps-with-crawlee-screenshot-reviews-521c92ebf7eeefb615659e0cd9cce6eb.webp" width="1908" height="709" class="img_ev3q"></p>
<p><strong>Hotel URL:</strong></p>
<p><img decoding="async" loading="lazy" alt="Hotel URL" src="https://crawlee.dev/assets/images/scrape-google-maps-with-crawlee-screenshot-url-ef8f37822fe579765ece5c37c1f8fdeb.webp" width="1905" height="679" class="img_ev3q"></p>
<p><strong>Hotel price:</strong></p>
<p><img decoding="async" loading="lazy" alt="Hotel Price" src="https://crawlee.dev/assets/images/scrape-google-maps-with-crawlee-screenshot-price-a2ab8516020bfcbfd6054d889f871743.webp" width="1894" height="751" class="img_ev3q"></p>
<p><strong>Hotel amenities:</strong></p>
<p>The amenities selector matches multiple elements, because each hotel lists several amenities, so we'll need to iterate through the matches.</p>
<p><img decoding="async" loading="lazy" alt="Hotel amenities" src="https://crawlee.dev/assets/images/scrape-google-maps-with-crawlee-screenshot-amenities-8a138b2fc9d7c4fad6a81bec55ee5db7.webp" width="1731" height="772" class="img_ev3q"></p>
<p><strong>Quick tips:</strong></p>
<ul>
<li class="">Always verify these selectors before scraping, as Google might update them.</li>
<li class="">Use Chrome DevTools (F12) to inspect elements and confirm selectors.</li>
<li class="">Some elements might not be present for all hotels (like prices during the off-season).</li>
</ul>
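<p>Because some elements can be missing for a given hotel, it helps to read text defensively. Playwright's <code>query_selector</code> returns <code>None</code> when nothing matches, so a sketch of a small guard could look like this. The helper name <code>text_or_none</code> is an assumption for illustration, and <code>FakeElement</code> is a stand-in that only imitates the <code>inner_text()</code> method of a real <code>ElementHandle</code> so the example runs without a browser.</p>

```python
import asyncio
from typing import Dict, Optional, Protocol


class HasInnerText(Protocol):
    """Anything exposing an async inner_text(), such as an ElementHandle."""

    async def inner_text(self) -> str: ...


async def text_or_none(element: Optional[HasInnerText]) -> Optional[str]:
    """Return the element's trimmed text, or None if it is missing or empty."""
    if element is None:
        return None
    text = (await element.inner_text()).strip()
    return text or None


class FakeElement:
    """Stand-in for a matched element (illustration only)."""

    def __init__(self, text: str) -> None:
        self._text = text

    async def inner_text(self) -> str:
        return self._text


async def demo() -> Dict[str, Optional[str]]:
    # The price lookup found nothing; the rating lookup found an element.
    return {
        "price": await text_or_none(None),
        "rating": await text_or_none(FakeElement(" 4.5 ")),
    }


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

<p>With a guard like this, a missing price becomes <code>None</code> in the output record rather than raising an error partway through a listing.</p>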
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="5-scraping-google-maps-data-using-identified-selectors">5. Scraping Google Maps data using identified selectors<a href="https://crawlee.dev/blog/scrape-google-maps#5-scraping-google-maps-data-using-identified-selectors" class="hash-link" aria-label="Direct link to 5. Scraping Google Maps data using identified selectors" title="Direct link to 5. Scraping Google Maps data using identified selectors" translate="no">​</a></h3>
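<p>Let's build a scraper to extract detailed hotel information from Google Maps. First, add the core extraction method, <code>_extract_listing_data</code>, to the <code>GoogleMapsScraper</code> class. It takes a single listing element and returns its data as a dictionary.</p>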
<p><em>gmap_scraper.py:</em></p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">_extract_listing_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> listing</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ElementHandle</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> Optional</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">Dict</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Extract structured data from a single listing element."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">try</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">        name_el </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> listing</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">query_selector</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">".qBF1Pd"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> name_el</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        name </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> name_el</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">inner_text</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" 
style="color:#00009f">if</span><span class="token plain"> name </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">processed_names</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        elements </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"rating"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> listing</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">query_selector</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">".MW4etd"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" 
style="color:#e3116c">"reviews"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> listing</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">query_selector</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">".UY7F9"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"price"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> listing</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">query_selector</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">".wcldff"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"link"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> listing</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">query_selector</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"a.hfpxzc"</span><span 
class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"address"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> listing</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">query_selector</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">".W4Efsd:nth-child(2)"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"category"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> listing</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">query_selector</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">".W4Efsd:nth-child(1)"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" 
style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        amenities </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        amenities_els </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> listing</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">query_selector_all</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">".dc6iWb"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> amenity </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> amenities_els</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            amenity_text </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> amenity</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_attribute</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"aria-label"</span><span 
class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> amenity_text</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                amenities</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">append</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">amenity_text</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        place_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"name"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> name</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"rating"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> elements</span><span class="token punctuation" 
style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"rating"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">inner_text</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> elements</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"rating"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"reviews"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> elements</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"reviews"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">inner_text</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token 
plain">strip</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"()"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> elements</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"reviews"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"price"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> elements</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"price"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">inner_text</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> elements</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"price"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" 
style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"address"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> elements</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"address"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">inner_text</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> elements</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"address"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"category"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> elements</span><span 
class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"category"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">inner_text</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> elements</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"category"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"amenities"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> amenities </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> amenities </span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"link"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> 
elements</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"link"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_attribute</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"href"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> elements</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"link"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">processed_names</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">name</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    
    </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> place_data</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">except</span><span class="token plain"> Exception </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> e</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">exception</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"Error extracting listing data"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><br></div></code></pre></div></div>
<p>In the code:</p>
<ul>
<li class=""><code>query_selector</code>: Returns the first DOM element matching a CSS selector, useful for single items like a name or rating</li>
<li class=""><code>query_selector_all</code>: Returns all matching elements, ideal for multiple items like amenities</li>
<li class=""><code>inner_text()</code>: Extracts the element's text content</li>
<li class="">Some hotels might not have all the information available, so any missing field is stored as <code>None</code></li>
</ul>
<p>When you run this script, you'll see output similar to this:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"name"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"GRAND KALINGA HOTEL"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"rating"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"4.2"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"reviews"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1,171"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"price"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">"\u20b91,760"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"link"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://www.google.com/maps/place/GRAND+KALINGA+HOTEL/data=!4m10!3m9!1s0x3bae160e0ce07789:0xb15bf736f4238e6a!5m2!4m1!1i2!8m2!3d12.9762259!4d77.5786043!16s%2Fg%2F11sp32pz28!19sChIJiXfgDA4WrjsRao4j9Db3W7E?authuser=0&amp;hl=en&amp;rclk=1"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"amenities"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"Pool available"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"Free breakfast available"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"Free Wi-Fi available"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">        </span><span class="token string" style="color:#e3116c">"Free parking available"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
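<p>The values above are raw strings, and missing fields come back as <code>None</code>. If you need numbers for sorting or filtering, a small post-processing step can convert them. The following is a minimal sketch and not part of the original scraper: the helper names <code>parse_count</code>, <code>parse_rating</code>, and <code>normalize_place</code> are hypothetical, and it assumes the formats shown in the sample output (thousands separators, a leading currency symbol, whole-number prices).</p>

```python
import re
from typing import Optional


def parse_count(text: Optional[str]) -> Optional[int]:
    """Turn a string like '1,171', '(1,171)' or '₹1,760' into an int.

    Assumption: the value is a whole number, so every non-digit
    character (commas, brackets, currency symbols) can be dropped.
    """
    if not text:
        return None
    digits = re.sub(r"[^0-9]", "", text)
    return int(digits) if digits else None


def parse_rating(text: Optional[str]) -> Optional[float]:
    """Turn a rating string like '4.2' into a float; None stays None."""
    if not text:
        return None
    try:
        return float(text.strip())
    except ValueError:
        return None


def normalize_place(place: dict) -> dict:
    """Return a copy of the record with numeric fields added.

    The raw strings are kept untouched; the new keys are hypothetical names.
    """
    cleaned = dict(place)
    cleaned["rating_value"] = parse_rating(place.get("rating"))
    cleaned["review_count"] = parse_count(place.get("reviews"))
    cleaned["price_value"] = parse_count(place.get("price"))
    return cleaned
```

<p>Keeping the original strings next to the parsed values makes it easy to spot records where the assumed format did not hold.</p>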
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="6-managing-infinite-scrolling">6. Managing Infinite Scrolling<a href="https://crawlee.dev/blog/scrape-google-maps#6-managing-infinite-scrolling" class="hash-link" aria-label="Direct link to 6. Managing Infinite Scrolling" title="Direct link to 6. Managing Infinite Scrolling" translate="no">​</a></h3>
<p>Google Maps uses infinite scrolling to load more results as users scroll down, so we handle it with a dedicated method.</p>
<p>The method scrolls the results feed and detects when we've hit the bottom. Add it to the <code>gmap_scraper.py</code> file:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">_load_more_items</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> page</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Page</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token builtin">bool</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Scroll down to load more items."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">try</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            feed </span><span class="token operator" style="color:#393A34">=</span><span 
class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">query_selector</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'div[role="feed"]'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> feed</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">False</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            prev_scroll </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> feed</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">evaluate</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"(element) =&gt; element.scrollTop"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> 
feed</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">evaluate</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"(element) =&gt; element.scrollTop += 800"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">wait_for_timeout</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">2000</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            new_scroll </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> feed</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">evaluate</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"(element) =&gt; element.scrollTop"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> new_scroll </span><span class="token operator" style="color:#393A34">&lt;=</span><span class="token plain"> prev_scroll</span><span 
class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">False</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">wait_for_timeout</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">1000</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">True</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">except</span><span class="token plain"> Exception </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> e</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">print</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" 
style="color:#e3116c">f"Error during scroll: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">e</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">False</span><br></div></code></pre></div></div>
<p>Run this code using:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">$ python3 gmap_scraper.py</span><br></div></code></pre></div></div>
<p>You should see output like this:</p>
<p><img decoding="async" loading="lazy" alt="scrape-google-maps-with-crawlee-screenshot-handle-pagination" src="https://crawlee.dev/assets/images/scrape-google-maps-with-crawlee-screenshot-handle-pagination-319232595ced535f175346ae0003e32f.webp" width="1125" height="120" class="img_ev3q"></p>
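<p>Because <code>_load_more_items</code> returns <code>False</code> once the scroll position stops changing, a caller can loop on it until the feed is exhausted. Below is a rough sketch of that pattern with a safety cap added, so a misbehaving page cannot keep the loop running forever. The <code>scroll_until_end</code> helper and its <code>max_scrolls</code> parameter are hypothetical and not part of the original scraper, and <code>FakeFeed</code> is only a stand-in for a real Playwright page so the example runs without a browser.</p>

```python
import asyncio
from typing import Awaitable, Callable


async def scroll_until_end(
    load_more: Callable[[], Awaitable[bool]],
    max_scrolls: int = 50,
) -> int:
    """Call load_more() until it returns False or the cap is reached.

    Returns the number of scrolls that actually moved the feed.
    """
    scrolls = 0
    while scrolls < max_scrolls:
        if not await load_more():
            break  # scroll position did not change: bottom reached
        scrolls += 1
    return scrolls


class FakeFeed:
    """Stand-in for the results feed (assumption for this demo only)."""

    def __init__(self, max_scroll: int, step: int = 800) -> None:
        self.max_scroll = max_scroll
        self.step = step
        self.scroll_top = 0

    async def load_more(self) -> bool:
        # Mirrors the prev_scroll / new_scroll comparison in the method above.
        prev_scroll = self.scroll_top
        self.scroll_top = min(self.scroll_top + self.step, self.max_scroll)
        return self.scroll_top > prev_scroll


if __name__ == "__main__":
    feed = FakeFeed(max_scroll=2000)
    print(asyncio.run(scroll_until_end(feed.load_more)))
```

<p>In the scraper itself, the same helper could be called as <code>await scroll_until_end(lambda: self._load_more_items(page))</code>.</p>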
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="7-scrape-listings">7. Scrape Listings<a href="https://crawlee.dev/blog/scrape-google-maps#7-scrape-listings" class="hash-link" aria-label="Direct link to 7. Scrape Listings" title="Direct link to 7. Scrape Listings" translate="no">​</a></h3>
<p>The main scraping function ties everything together. It scrapes listings from the page by repeatedly extracting data and scrolling.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">_scrape_listings</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Main scraping function to process all listings"""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">try</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        page </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">page</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">print</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f"\nProcessing URL: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">\n"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">wait_for_selector</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">".Nv2PK"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> timeout</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">30000</span><span class="token punctuation" style="color:#393A34">)</span><span 
class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">wait_for_timeout</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">2000</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">while</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            listings </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">query_selector_all</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">".Nv2PK"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            new_items </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> listing </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> listings</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                place_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_extract_listing_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">listing</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> place_data</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">place_data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token 
plain">                    new_items </span><span class="token operator" style="color:#393A34">+=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token keyword" style="color:#00009f">print</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f"Processed: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">place_data</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">[</span><span class="token string-interpolation interpolation string" style="color:#e3116c">'name'</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">]</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> new_items </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">and</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> </span><span class="token 
keyword" style="color:#00009f">await</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_load_more_items</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token keyword" style="color:#00009f">break</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> new_items </span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_load_more_items</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">print</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" 
style="color:#e3116c">f"\nFinished processing! Total items: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation builtin">len</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">(</span><span class="token string-interpolation interpolation">self</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">processed_names</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">)</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">except</span><span class="token plain"> Exception </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> e</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">print</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f"Error in scraping: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation builtin">str</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">(</span><span class="token string-interpolation interpolation">e</span><span class="token 
string-interpolation interpolation punctuation" style="color:#393A34">)</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>The scraper uses Crawlee's built-in storage system to manage scraped data. When you run the scraper, it creates a <code>storage</code> directory in your project with several key components:</p>
<ul>
<li class=""><code>datasets/</code>: Contains the scraped results in JSON format</li>
<li class=""><code>key_value_stores/</code>: Stores crawler state and metadata</li>
<li class=""><code>request_queues/</code>: Manages URLs to be processed</li>
</ul>
<p>The <code>push_data()</code> method we use in our scraper sends each record to Crawlee's dataset storage, as you can see below:</p>
<p><img decoding="async" loading="lazy" alt="Crawlee push_data" src="https://crawlee.dev/assets/images/How-to-scrape-Google-Maps-data-using-Python-and-Crawlee-metadata-a27257a5ffffad0fdcc598064445fe57.webp" width="1192" height="542" class="img_ev3q"></p>
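<p>To inspect the stored records outside the crawler, you can read the dataset files directly. The sketch below assumes the default local layout of one JSON file per item under <code>storage/datasets/default/</code>, plus a bookkeeping file whose name starts with <code>__</code>. Both the path and the helper name <code>load_dataset_items</code> are assumptions for illustration, so adjust them to what you actually see on disk.</p>

```python
import json
from pathlib import Path


def load_dataset_items(dataset_dir):
    """Read every per-item JSON file in a local dataset directory."""
    items = []
    # Sort by file name so items come back in the order they were written.
    for path in sorted(Path(dataset_dir).glob("*.json")):
        # Skip bookkeeping files such as __metadata__.json (assumed naming).
        if path.name.startswith("__"):
            continue
        with path.open(encoding="utf-8") as f:
            items.append(json.load(f))
    return items


if __name__ == "__main__":
    # Assumed default location; prints nothing if the directory is missing.
    for item in load_dataset_items("storage/datasets/default"):
        print(item.get("name"))
```

<p>This is only a convenience for quick checks. For regular use, the exported JSON file described in the next step is the simpler route.</p>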
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="8-running-the-scraper">8. Running the Scraper<a href="https://crawlee.dev/blog/scrape-google-maps#8-running-the-scraper" class="hash-link" aria-label="Direct link to 8. Running the Scraper" title="Direct link to 8. Running the Scraper" translate="no">​</a></h3>
<p>Finally, we need functions to execute our scraper:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> search_query</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Execute the scraper with a search query"""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">try</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" 
style="color:#00009f">await</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">setup_crawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        start_url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:#e3116c">f"https://www.google.com/maps/search/</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">search_query</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">replace</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">(</span><span class="token string-interpolation interpolation string" style="color:#e3116c">' '</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">,</span><span class="token string-interpolation interpolation"> </span><span class="token string-interpolation interpolation string" style="color:#e3116c">'+'</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">)</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> self</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">start_url</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">export_data_json</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'gmap_data.json'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">except</span><span class="token plain"> Exception </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> e</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">print</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f"Error running scraper: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation 
builtin">str</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">(</span><span class="token string-interpolation interpolation">e</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">)</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Entry point of the script"""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    scraper </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> GoogleMapsScraper</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">headless</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">)</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    search_query </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"hotels in bengaluru"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> scraper</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">search_query</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"__main__"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
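<p>The <code>replace(' ', '+')</code> call in <code>run()</code> handles spaces only. If a query may contain other characters, such as <code>/</code> or accented letters, one option is the standard library's <code>urllib.parse.quote_plus</code>. The helper name <code>build_search_url</code> below is an assumption for illustration and is not part of the original scraper.</p>

```python
from urllib.parse import quote_plus


def build_search_url(search_query):
    """Build a Google Maps search URL with a safely encoded query."""
    # quote_plus turns spaces into '+' and percent-encodes reserved characters.
    return f"https://www.google.com/maps/search/{quote_plus(search_query)}"


if __name__ == "__main__":
    print(build_search_url("hotels in bengaluru"))
    print(build_search_url("cafés/bars in paris"))
```

<p>For plain queries such as <code>"hotels in bengaluru"</code> the result matches the original approach, so the change only matters for less common input.</p>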
<p>The scraped data is stored automatically in the dataset, and the <code>run()</code> method exports it to a JSON file with this call:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">export_data_json</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'gmap_data.json'</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Here's what your exported JSON file will look like:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"name"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Vividus Hotels, Bangalore"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"rating"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"4.3"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"reviews"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"633"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span 
class="token property" style="color:#36acaa">"price"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"₹3,667"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"amenities"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token string" style="color:#e3116c">"Pool available"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token string" style="color:#e3116c">"Free breakfast available"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token string" style="color:#e3116c">"Free Wi-Fi available"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token string" style="color:#e3116c">"Free parking available"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token 
plain">    </span><span class="token property" style="color:#36acaa">"link"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://www.google.com/maps/place/Vividus+Hotels+,+Bangalore/..."</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">]</span><br></div></code></pre></div></div>
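<p>Note that <code>rating</code>, <code>reviews</code>, and <code>price</code> come out as strings. If you plan to sort or aggregate them, a small cleanup step helps. The sketch below is one possible approach, not part of the original scraper: the helper names <code>to_number</code> and <code>normalize_place</code> are hypothetical, and it assumes commas are thousands separators, as in the sample above.</p>

```python
import re


def to_number(text):
    """Parse strings like '4.3', '8,591', or '₹3,667' into a float, or None."""
    if not text:
        return None
    # Drop thousands separators (assumed), then take the first numeric run.
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


def normalize_place(place):
    """Return a copy of a scraped record with numeric fields converted."""
    cleaned = dict(place)
    cleaned["rating"] = to_number(place.get("rating"))
    reviews = to_number(place.get("reviews"))
    cleaned["reviews"] = int(reviews) if reviews is not None else None
    cleaned["price"] = to_number(place.get("price"))
    return cleaned


if __name__ == "__main__":
    sample = {"name": "Vividus Hotels, Bangalore", "rating": "4.3",
              "reviews": "633", "price": "₹3,667"}
    print(normalize_place(sample))
```

<p>The currency symbol is discarded here, so keep the original <code>price</code> string as well if you compare listings across currencies.</p>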
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="9-using-proxies-for-google-maps-scraping">9. Using proxies for Google Maps scraping<a href="https://crawlee.dev/blog/scrape-google-maps#9-using-proxies-for-google-maps-scraping" class="hash-link" aria-label="Direct link to 9. Using proxies for Google Maps scraping" title="Direct link to 9. Using proxies for Google Maps scraping" translate="no">​</a></h3>
<p>When you scrape Google Maps at scale, proxies help for a few key reasons:</p>
<ol>
<li class=""><strong>Avoid IP blocks</strong>: Google Maps can detect and block IP addresses that make an excessive number of requests in a short time. Using proxies helps you stay under the radar.</li>
<li class=""><strong>Bypass rate limits</strong>: Google implements strict limits on the number of requests per IP address. By rotating through multiple IPs, you can maintain a consistent scraping pace without hitting these limits.</li>
<li class=""><strong>Access location-specific data</strong>: Different regions may display different data on Google Maps. Proxies allow you to view listings as if you were browsing from a specific location.</li>
</ol>
<p>Here's a simple implementation using Crawlee's built-in proxy management. Replace the crawler setup in your previous code with the following to send requests through your proxies.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> timedelta</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">playwright_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> PlaywrightCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">proxy_configuration </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ProxyConfiguration</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Configure your proxy settings</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">proxy_configuration </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ProxyConfiguration</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    proxy_urls</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" 
style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"http://username:password@proxy.provider.com:12345"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Add more proxy URLs as needed</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Initialize crawler with proxy support</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    headless</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">    request_handler_timeout</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">timedelta</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">minutes</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">5</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    proxy_configuration</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">proxy_configuration</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
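<p>Hard-coding proxy credentials in the script makes them easy to leak. One option, sketched below, is to read the proxy URLs from an environment variable instead. The variable name <code>PROXY_URLS</code> and the helper <code>load_proxy_urls</code> are assumptions for illustration; the resulting list is what you would pass as <code>proxy_urls</code>.</p>

```python
import os


def load_proxy_urls(env_var="PROXY_URLS"):
    """Return proxy URLs from a comma-separated environment variable."""
    raw = os.environ.get(env_var, "")
    # Ignore blank entries and surrounding whitespace.
    return [url.strip() for url in raw.split(",") if url.strip()]


if __name__ == "__main__":
    proxy_urls = load_proxy_urls()
    # Pass the list on, e.g. ProxyConfiguration(proxy_urls=proxy_urls).
    print(f"Loaded {len(proxy_urls)} proxy URL(s)")
```

<p>An empty list means the variable is unset, so you may want to stop early with a clear message rather than start the crawler without proxies.</p>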
<p>Here, I use a proxy to scrape hotel data in New York City.</p>
<p><img decoding="async" loading="lazy" alt="Using a proxy" src="https://crawlee.dev/assets/images/scrape-google-maps-with-crawlee-screenshot-proxies-5c4dece0247a87e7d338328c472cea74.webp" width="1791" height="833" class="img_ev3q"></p>
<p>Here's an example of data scraped from New York City hotels using proxies:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"name"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"The Manhattan at Times Square Hotel"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"rating"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"3.1"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"reviews"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"8,591"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"price"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">"$120"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"amenities"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Free parking available"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Free Wi-Fi available"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Air-conditioned available"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Breakfast available"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"link"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span 
class="token string" style="color:#e3116c">"https://www.google.com/maps/place/..."</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="10-project-interactive-hotel-analysis-dashboard">10. Project: Interactive hotel analysis dashboard<a href="https://crawlee.dev/blog/scrape-google-maps#10-project-interactive-hotel-analysis-dashboard" class="hash-link" aria-label="Direct link to 10. Project: Interactive hotel analysis dashboard" title="Direct link to 10. Project: Interactive hotel analysis dashboard" translate="no">​</a></h3>
<p>After scraping hotel data from Google Maps, you can build an interactive dashboard that helps analyze hotel trends. Here’s a preview of how the dashboard works:</p>
<p><img decoding="async" loading="lazy" alt="Final dashboard" src="https://crawlee.dev/assets/images/scrape-google-maps-with-crawlee-screenshot-hotel-analysis-dashboard-c14806409a7c1db63943f58d855aa07e.webp" width="1905" height="833" class="img_ev3q"></p>
<p>Find the complete info for this dashboard on GitHub: <a href="https://github.com/triposat/Hotel-Analytics-Dashboard" target="_blank" rel="noopener noreferrer">Hotel Analysis Dashboard</a>.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="11-now-youre-ready-to-put-everything-into-action">11. Now you’re ready to put everything into action!<a href="https://crawlee.dev/blog/scrape-google-maps#11-now-youre-ready-to-put-everything-into-action" class="hash-link" aria-label="Direct link to 11. Now you’re ready to put everything into action!" title="Direct link to 11. Now you’re ready to put everything into action!" translate="no">​</a></h3>
<p>Take a look at the complete scripts in my GitHub Gists:</p>
<ul>
<li class=""><a href="https://gist.github.com/triposat/9a6fb03130f3c4332bab71b72a973940" target="_blank" rel="noopener noreferrer">Basic Scraper</a></li>
<li class=""><a href="https://gist.github.com/triposat/6c554b13c787a55348b48b6bfc5459c0" target="_blank" rel="noopener noreferrer">Code with Proxy Integration</a></li>
<li class=""><a href="https://gist.github.com/triposat/13ce4b05c36512e69b5602833e781a6c" target="_blank" rel="noopener noreferrer">Hotel Analysis Dashboard</a></li>
</ul>
<p>To make it all work:</p>
<ol>
<li class=""><strong>Run the basic scraper or proxy-integrated scraper</strong>: This will collect the hotel data and store it in a JSON file.</li>
<li class=""><strong>Run the dashboard script</strong>: Load your JSON data and view it interactively in the dashboard.</li>
</ol>
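<p>Before loading the scraped JSON into a dashboard, it usually helps to convert the string fields into numbers. The sketch below assumes records shaped like the sample output shown earlier, where <code>rating</code>, <code>reviews</code>, and <code>price</code> are strings. The helper names <code>normalize_hotel</code> and <code>load_hotels</code> are hypothetical and are not taken from the linked Gists; records with missing fields would need extra handling.</p>

```python
import json
from pathlib import Path


def normalize_hotel(record: dict) -> dict:
    """Return a copy of a scraped record with numeric rating, reviews, and price."""
    cleaned = dict(record)
    # "3.1" -> 3.1
    cleaned["rating"] = float(record["rating"])
    # "8,591" -> 8591 (drop the thousands separator)
    cleaned["reviews"] = int(record["reviews"].replace(",", ""))
    # "$120" -> 120.0 (assumes a leading dollar sign, as in the sample record)
    cleaned["price"] = float(record["price"].lstrip("$").replace(",", ""))
    return cleaned


def load_hotels(path: str) -> list[dict]:
    """Load a JSON array of scraped hotels and normalize every record."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return [normalize_hotel(item) for item in raw]


sample = {
    "name": "The Manhattan at Times Square Hotel",
    "rating": "3.1",
    "reviews": "8,591",
    "price": "$120",
}
hotel = normalize_hotel(sample)
print(hotel["rating"], hotel["reviews"], hotel["price"])
```

<p>With numeric fields in place, sorting by price or averaging ratings becomes a one-line operation in whichever dashboard library you use.</p>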
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="wrapping-up-and-next-steps">Wrapping up and next steps<a href="https://crawlee.dev/blog/scrape-google-maps#wrapping-up-and-next-steps" class="hash-link" aria-label="Direct link to Wrapping up and next steps" title="Direct link to Wrapping up and next steps" translate="no">​</a></h2>
<p>You've successfully built a comprehensive Google Maps scraper that collects and processes hotel data and presents it through an interactive dashboard. Along the way, you've learned about:</p>
<ul>
<li class="">Using Crawlee with Playwright to navigate and extract data from Google Maps</li>
<li class="">Using proxies to scale up scraping without getting blocked</li>
<li class="">Storing the extracted data in JSON format</li>
<li class="">Creating an interactive dashboard to analyze hotel data</li>
</ul>
<p>We’ve handpicked some great resources to help you further explore web scraping:</p>
<ul>
<li class=""><a href="https://www.crawlee.dev/blog/scrapy-vs-crawlee" target="_blank" rel="noopener noreferrer">Scrapy vs. Crawlee: Choosing the right tool</a></li>
<li class=""><a href="https://www.crawlee.dev/blog/proxy-management-in-crawlee" target="_blank" rel="noopener noreferrer">Mastering proxy management with Crawlee</a></li>
<li class=""><a href="https://www.crawlee.dev/blog/web-scraping-tips" target="_blank" rel="noopener noreferrer">Think like a web scraping expert: 12 pro tips</a></li>
<li class=""><a href="https://www.crawlee.dev/blog/linkedin-job-scraper-python" target="_blank" rel="noopener noreferrer">Building a LinkedIn job scraper</a></li>
</ul>]]></content:encoded>
            <category>community</category>
        </item>
        <item>
            <title><![CDATA[How to scrape Google search results with Python]]></title>
            <link>https://crawlee.dev/blog/scrape-google-search</link>
            <guid>https://crawlee.dev/blog/scrape-google-search</guid>
            <pubDate>Mon, 02 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to scrape Google search results using Crawlee for Python.]]></description>
            <content:encoded><![CDATA[<p>Scraping <code>Google Search</code> results supports <code>SERP analysis</code>, SEO optimization, and broader data collection. Modern scraping tools make the process faster and more reliable.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>One of our community members wrote this post as a contribution to the Crawlee Blog. If you would like to contribute a post like this one, please reach out to us on our <a href="https://apify.com/discord" target="_blank" rel="noopener noreferrer">Discord channel</a>.</p></div></div>
<p>In this guide, we'll create a Google Search scraper using <a href="https://github.com/apify/crawlee-python" target="_blank" rel="noopener noreferrer"><code>Crawlee for Python</code></a> that can handle result ranking and pagination.</p>
<p>We'll create a scraper that:</p>
<ul>
<li class="">Extracts titles, URLs, and descriptions from search results</li>
<li class="">Handles multiple search queries</li>
<li class="">Tracks ranking positions</li>
<li class="">Processes multiple result pages</li>
<li class="">Saves data in a structured format</li>
</ul>
<p><img decoding="async" loading="lazy" alt="How to scrape Google search results with Python" src="https://crawlee.dev/assets/images/google-search-a91bfdf17a4c2860798444b1be56f625.webp" width="1152" height="649" class="img_ev3q"></p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="prerequisites">Prerequisites<a href="https://crawlee.dev/blog/scrape-google-search#prerequisites" class="hash-link" aria-label="Direct link to Prerequisites" title="Direct link to Prerequisites" translate="no">​</a></h2>
<ul>
<li class="">Python 3.9 or higher</li>
<li class="">Basic understanding of HTML and CSS selectors</li>
<li class="">Familiarity with web scraping concepts</li>
<li class="">Crawlee for Python v0.4.2 or higher</li>
</ul>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="project-setup">Project setup<a href="https://crawlee.dev/blog/scrape-google-search#project-setup" class="hash-link" aria-label="Direct link to Project setup" title="Direct link to Project setup" translate="no">​</a></h3>
<ol>
<li class="">
<p>Install Crawlee with required dependencies:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pipx </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">beautifulsoup,curl-impersonate</span><span class="token punctuation" style="color:#393A34">]</span><br></div></code></pre></div></div>
</li>
<li class="">
<p>Create a new project using Crawlee CLI:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pipx run crawlee create crawlee-google-search</span><br></div></code></pre></div></div>
</li>
<li class="">
<p>When prompted, select <code>Beautifulsoup</code> as your template type.</p>
</li>
<li class="">
<p>Navigate to the project directory and complete installation:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token builtin class-name">cd</span><span class="token plain"> crawlee-google-search</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">poetry </span><span class="token function" style="color:#d73a49">install</span><br></div></code></pre></div></div>
</li>
</ol>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="development-of-the-google-search-scraper-in-python">Development of the Google Search scraper in Python<a href="https://crawlee.dev/blog/scrape-google-search#development-of-the-google-search-scraper-in-python" class="hash-link" aria-label="Direct link to Development of the Google Search scraper in Python" title="Direct link to Development of the Google Search scraper in Python" translate="no">​</a></h2>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-defining-data-for-extraction">1. Defining data for extraction<a href="https://crawlee.dev/blog/scrape-google-search#1-defining-data-for-extraction" class="hash-link" aria-label="Direct link to 1. Defining data for extraction" title="Direct link to 1. Defining data for extraction" translate="no">​</a></h3>
<p>First, let's define our extraction scope. Google's search results now include maps, notable people, company details, videos, common questions, and many other elements. We'll focus on analyzing standard search results with rankings.</p>
<p>Here's what we'll be extracting:</p>
<p><img decoding="async" loading="lazy" alt="Search Example" src="https://crawlee.dev/assets/images/search_example-53f4fdf556178b9478a8d4f3e3816669.webp" width="1873" height="813" class="img_ev3q"></p>
<p>Let's check whether the data we need is already present in the page's raw HTML, or whether we need deeper analysis or <code>JS</code> rendering. Note that this check is sensitive to HTML tags: text that is split across several tags won't match a plain-text search of the page source.</p>
<p><img decoding="async" loading="lazy" alt="Check Html" src="https://crawlee.dev/assets/images/check_html-e243b1a0eff6d4404b9034863969bedc.webp" width="1917" height="951" class="img_ev3q"></p>
<p>Based on the data obtained from the page, all necessary information is present in the HTML code. Therefore, we can use <a href="https://www.crawlee.dev/python/docs/examples/beautifulsoup-crawler" target="_blank" rel="noopener noreferrer"><code>beautifulsoup_crawler</code></a>.</p>
<p>The fields we'll extract:</p>
<ul>
<li class="">Search result titles</li>
<li class="">URLs</li>
<li class="">Description text</li>
<li class="">Ranking positions</li>
</ul>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-configure-the-crawler">2. Configure the crawler<a href="https://crawlee.dev/blog/scrape-google-search#2-configure-the-crawler" class="hash-link" aria-label="Direct link to 2. Configure the crawler" title="Direct link to 2. Configure the crawler" translate="no">​</a></h3>
<p>First, let's create the crawler configuration.</p>
<p>We'll use <a href="https://www.crawlee.dev/python/api/class/CurlImpersonateHttpClient" target="_blank" rel="noopener noreferrer"><code>CurlImpersonateHttpClient</code></a> as our <code>http_client</code> with preset <code>headers</code> and <code>impersonate</code> relevant to the <a href="https://www.google.com/intl/en/chrome/" target="_blank" rel="noopener noreferrer"><code>Chrome</code></a> browser.</p>
<p>We'll also configure <a href="https://www.crawlee.dev/python/api/class/ConcurrencySettings" target="_blank" rel="noopener noreferrer"><code>ConcurrencySettings</code></a> to control scraping aggressiveness. This is crucial to avoid getting blocked by Google.</p>
<p>If you need to extract data more intensively, consider setting up <a href="https://www.crawlee.dev/python/api/class/ProxyConfiguration" target="_blank" rel="noopener noreferrer"><code>ProxyConfiguration</code></a>.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">beautifulsoup_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> BeautifulSoupCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_clients</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">curl_impersonate </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> CurlImpersonateHttpClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> HttpHeaders</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" 
style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    concurrency_settings </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">max_concurrency</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">5</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> max_tasks_per_minute</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">200</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    http_client </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> CurlImpersonateHttpClient</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">impersonate</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"chrome124"</span><span class="token punctuation" 
style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                            headers</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">HttpHeaders</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">"referer"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://www.google.com/"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                                     </span><span class="token string" style="color:#e3116c">"accept-language"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"en"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                                     </span><span class="token string" style="color:#e3116c">"accept-encoding"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"gzip, deflate, br, zstd"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                                     </span><span class="token string" style="color:#e3116c">"user-agent"</span><span class="token punctuation" style="color:#393A34">:</span><span 
class="token plain"> </span><span class="token string" style="color:#e3116c">"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> BeautifulSoupCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_request_retries</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        concurrency_settings</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">concurrency_settings</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        http_client</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">http_client</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">        max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">10</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_crawl_depth</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">5</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://www.google.com/search?q=Apify'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
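<p>The configuration above starts from a single hard-coded URL. To cover several search queries, the start URLs can be generated with the standard library. This is a minimal sketch: the helper name <code>build_search_urls</code> is hypothetical, and the only query parameter it sets is <code>q</code>, as in the URL above.</p>

```python
from urllib.parse import urlencode

GOOGLE_SEARCH_URL = "https://www.google.com/search"


def build_search_urls(queries: list[str]) -> list[str]:
    """Build one Google Search start URL per query, URL-encoding each query."""
    return [f"{GOOGLE_SEARCH_URL}?{urlencode({'q': query})}" for query in queries]


start_urls = build_search_urls(["Apify", "web scraping with Python"])
for url in start_urls:
    print(url)
```

<p>The resulting list can be passed to <code>crawler.run(start_urls)</code> in place of the single URL.</p>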
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-implementing-data-extraction">3. Implementing data extraction<a href="https://crawlee.dev/blog/scrape-google-search#3-implementing-data-extraction" class="hash-link" aria-label="Direct link to 3. Implementing data extraction" title="Direct link to 3. Implementing data extraction" translate="no">​</a></h3>
<p>First, let's analyze the HTML code of the elements we need to extract:</p>
<p><img decoding="async" loading="lazy" alt="Check Html" src="https://crawlee.dev/assets/images/html_example-ccefa4ed63c38812ac5b8ca7b5122c8c.webp" width="1916" height="931" class="img_ev3q"></p>
<p>There's an obvious distinction between <em>readable</em> ID attributes and <em>generated</em> class names and other attributes. When creating selectors for data extraction, ignore any generated attributes. Even if you've read that Google has used a particular generated attribute for N years, don't rely on it: generated names can change without notice, and avoiding them is part of writing robust code.</p>
<p>Now that we understand the HTML structure, let's implement the extraction. As our crawler deals with only one type of page, we can use <code>router.default_handler</code> to process it. Within the handler, we'll use <code>BeautifulSoup</code> to iterate through each search result, extract the <code>title</code>, <code>url</code>, and <code>text_widget</code> fields, and save each result.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">beautifulsoup_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> BeautifulSoupCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">default_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> BeautifulSoupCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" 
style="color:#e3116c">"""Default request handler."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> item </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">soup</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">select</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"div#search div#rso div[data-hveid][lang]"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token 
punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'title'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> item</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">select_one</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"h3"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_text</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"url"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> item</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">select_one</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"a"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" 
style="color:#e3116c">"href"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"text_widget"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> item</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">select_one</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"div[style*='line']"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_text</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
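<p>The handler above calls <code>.get_text()</code> directly on the result of <code>select_one()</code>, which returns <code>None</code> when nothing matches, so a result without a description would raise an <code>AttributeError</code>. One way to guard against that is a small helper like the sketch below. The name <code>text_or_none</code> is hypothetical, and <code>FakeTag</code> is only a stand-in for a BeautifulSoup tag so that the example runs without extra dependencies.</p>

```python
from typing import Any, Optional


def text_or_none(element: Optional[Any]) -> Optional[str]:
    """Return the element's stripped text, or None if the selector matched nothing."""
    if element is None:
        return None
    return element.get_text(strip=True)


class FakeTag:
    """Stand-in for a BeautifulSoup tag, used only to demonstrate the helper."""

    def __init__(self, text: str) -> None:
        self._text = text

    def get_text(self, strip: bool = False) -> str:
        return self._text.strip() if strip else self._text


print(text_or_none(FakeTag("  Apify: Full-stack web scraping  ")))
print(text_or_none(None))
```

<p>In the handler, this would be used as <code>text_or_none(item.select_one("h3"))</code>, leaving a missing field as <code>None</code> instead of failing the whole request.</p>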
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="4-handling-pagination">4. Handling pagination<a href="https://crawlee.dev/blog/scrape-google-search#4-handling-pagination" class="hash-link" aria-label="Direct link to 4. Handling pagination" title="Direct link to 4. Handling pagination" translate="no">​</a></h3>
<p>Google localizes its results based on the IP geolocation of the search request, so the text of the pagination links changes with language and region, and we can't rely on link text to find the next page. Instead, we need a CSS selector that depends only on page structure and works regardless of geolocation and language settings.</p>
<p>The <code>max_crawl_depth</code> parameter limits how deep the crawler follows enqueued links, which here determines how many result pages it scans. Once we have a robust selector, we just need to find the next-page link and add it to the crawler's queue.</p>
<p>To write more efficient selectors, learn the basics of <a href="https://www.w3schools.com/cssref/css_selectors.php" target="_blank" rel="noopener noreferrer">CSS</a> and <a href="https://www.w3schools.com/xml/xpath_syntax.asp" target="_blank" rel="noopener noreferrer">XPath</a> syntax.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">enqueue_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">selector</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"div[role='navigation'] td[role='heading']:last-of-type &gt; a"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="5-exporting-data-to-csv-format">5. Exporting data to CSV format<a href="https://crawlee.dev/blog/scrape-google-search#5-exporting-data-to-csv-format" class="hash-link" aria-label="Direct link to 5. Exporting data to CSV format" title="Direct link to 5. Exporting data to CSV format" translate="no">​</a></h3>
<p>To save all search result data in a convenient tabular format like CSV, add an <code>export_data_csv</code> call right after running the crawler:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">export_data_csv</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"google_search.csv"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
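<p>For intuition about the shape of the resulting file, here is a minimal sketch of the same idea using only Python's standard <code>csv</code> module. It is not how Crawlee implements <code>export_data_csv</code>; the <code>rows_to_csv</code> helper and the sample row are hypothetical, with field names mirroring the handler above.</p>

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of flat dicts into CSV text, one row per dict."""
    if not rows:
        return ""
    buffer = io.StringIO()
    # Column order follows the keys of the first row.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample row shaped like the data pushed by the handler.
sample_rows = [
    {
        "title": "Example Domain",
        "url": "https://example.com/",
        "text_widget": "Sample snippet, with a comma",
    },
]
csv_text = rows_to_csv(sample_rows)
```

<p>Each dictionary key becomes a column, and values containing commas are quoted automatically, which is why flat dictionaries map cleanly onto CSV rows.</p>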
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="6-finalizing-the-google-search-scraper">6. Finalizing the Google Search scraper<a href="https://crawlee.dev/blog/scrape-google-search#6-finalizing-the-google-search-scraper" class="hash-link" aria-label="Direct link to 6. Finalizing the Google Search scraper" title="Direct link to 6. Finalizing the Google Search scraper" translate="no">​</a></h3>
<p>While our core crawler logic works, you might have noticed that our results currently lack ranking position information. To complete our scraper, we need to implement proper ranking position tracking by passing data between requests using <code>user_data</code> in <a href="https://www.crawlee.dev/python/api/class/Request" target="_blank" rel="noopener noreferrer"><code>Request</code></a>.</p>
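<p>The idea can be sketched without any network access. In the snippet below, <code>rank_page</code> is a hypothetical helper, not part of Crawlee: it numbers the results of one page starting from the <code>last_order</code> value it receives and returns the <code>user_data</code> to attach to the request for the next page.</p>

```python
def rank_page(titles, user_data):
    """Number one page of results, continuing from the previous page."""
    # The first page has no "last_order" yet, so ranking starts at 1.
    order = user_data.get("last_order", 1)
    query = user_data.get("query")
    rows = []
    for title in titles:
        rows.append({"query": query, "order_no": order, "title": title})
        order += 1
    # This dict would travel with the next-page request.
    next_user_data = {"last_order": order, "query": query}
    return rows, next_user_data


# Simulate two result pages for one query (titles are made up).
page_one, carried = rank_page(["Result A", "Result B"], {"query": "Crawlee"})
page_two, carried = rank_page(["Result C"], carried)
```

<p>Because the counter travels with each request rather than living in a global variable, every query keeps its own independent ranking even when several queries are crawled concurrently.</p>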
<p>Let's modify the script to handle multiple queries and track ranking positions for search results analysis. We'll also set the crawling depth as a top-level variable. Let's move the <code>router.default_handler</code> to <code>routes.py</code> to match the project structure:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># crawlee-google-search.main</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">beautifulsoup_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> BeautifulSoupCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> BeautifulSoupCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_clients</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">curl_impersonate </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> CurlImpersonateHttpClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> 
Request</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> HttpHeaders</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">routes </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> router</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">QUERIES </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"Apify"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Crawlee"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">CRAWL_DEPTH </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">2</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""The crawler entry point."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    concurrency_settings </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ConcurrencySettings</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">max_concurrency</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">5</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> max_tasks_per_minute</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">200</span><span class="token punctuation" style="color:#393A34">)</span><span 
class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    http_client </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> CurlImpersonateHttpClient</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">impersonate</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"chrome124"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                            headers</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">HttpHeaders</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">"referer"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://www.google.com/"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                                     </span><span class="token string" style="color:#e3116c">"accept-language"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"en"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                                     
</span><span class="token string" style="color:#e3116c">"accept-encoding"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"gzip, deflate, br, zstd"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                                     </span><span class="token string" style="color:#e3116c">"user-agent"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> BeautifulSoupCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        request_handler</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">router</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_request_retries</span><span class="token operator" 
style="color:#393A34">=</span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        concurrency_settings</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">concurrency_settings</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        http_client</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">http_client</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">100</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_crawl_depth</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">CRAWL_DEPTH</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    requests_lists </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">Request</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f"https://www.google.com/search?q=</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">query</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> user_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">"query"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> query</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> query </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> QUERIES</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">requests_lists</span><span 
class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">export_data_csv</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"google_ranked.csv"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
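<p>The queries above are single words, so interpolating them directly into the URL works. Queries containing spaces or special characters would need to be URL-encoded first. A minimal sketch using the standard library, where <code>build_search_url</code> is a hypothetical helper name:</p>

```python
from urllib.parse import urlencode


def build_search_url(query):
    """Return a Google Search URL with the query safely URL-encoded."""
    return "https://www.google.com/search?" + urlencode({"q": query})


urls = [build_search_url(q) for q in ["Apify", "web scraping python"]]
```

<p>The resulting URLs could then be passed to <code>Request.from_url</code> in place of the f-string, with the original query still stored in <code>user_data</code>.</p>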
<p>Let's also modify the handler to add <code>query</code> and <code>order_no</code> fields and basic error handling:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># crawlee-google-search.routes</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">beautifulsoup_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> BeautifulSoupCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">router </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Router</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">router </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Router</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">BeautifulSoupCrawlingContext</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" 
style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">default_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> BeautifulSoupCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Default request handler."""</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> ...'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    order </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">user_data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"last_order"</span><span class="token punctuation" 
style="color:#393A34">,</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    query </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">user_data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"query"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> item </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">soup</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">select</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"div#search div#rso div[data-hveid][lang]"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">try</span><span class="token punctuation" style="color:#393A34">:</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">"query"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> query</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">"order_no"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> order</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">'title'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> item</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">select_one</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"h3"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_text</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token 
plain">                </span><span class="token string" style="color:#e3116c">"url"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> item</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">select_one</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"a"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"href"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">"text_widget"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> item</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">select_one</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"div[style*='line']"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_text</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            order </span><span class="token operator" style="color:#393A34">+=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">except</span><span class="token plain"> AttributeError </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> e</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">warning</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Attribute error for query "</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">query</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">": </span><span class="token 
string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation builtin">str</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">(</span><span class="token string-interpolation interpolation">e</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">)</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">except</span><span class="token plain"> Exception </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> e</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">error</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Unexpected error for query "</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">query</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">": </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span 
class="token string-interpolation interpolation builtin">str</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">(</span><span class="token string-interpolation interpolation">e</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">)</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">enqueue_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">selector</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"div[role='navigation'] td[role='heading']:last-of-type &gt; a"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                user_data</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">"last_order"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> order</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"query"</span><span class="token 
punctuation" style="color:#393A34">:</span><span class="token plain"> query</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>And we're done!</p>
<p>Our Google Search crawler is ready. Let's look at the results in the <code>google_ranked.csv</code> file:</p>
<p><img decoding="async" loading="lazy" alt="Results CSV" src="https://crawlee.dev/assets/images/results-03c51354b4347837a24ec6977a442ce8.webp" width="1319" height="588" class="img_ev3q"></p>
<p>The code repository is available on <a href="https://github.com/Mantisus/crawlee-google-search" target="_blank" rel="noopener noreferrer"><code>GitHub</code></a>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="scrape-google-search-results-with-apify">Scrape Google Search results with Apify<a href="https://crawlee.dev/blog/scrape-google-search#scrape-google-search-results-with-apify" class="hash-link" aria-label="Direct link to Scrape Google Search results with Apify" title="Direct link to Scrape Google Search results with Apify" translate="no">​</a></h2>
<p>If you're working on a large-scale project that requires millions of data points, like the project featured in this <a href="https://backlinko.com/search-engine-ranking" target="_blank" rel="noopener noreferrer">article about Google ranking analysis</a>, you might need a ready-made solution.</p>
<p>Consider using <a href="https://www.apify.com/apify/google-search-scraper" target="_blank" rel="noopener noreferrer"><code>Google Search Results Scraper</code></a> by the Apify team.</p>
<p>It offers important features such as:</p>
<ul>
<li class="">Proxy support</li>
<li class="">Scalability for large-scale data extraction</li>
<li class="">Geolocation control</li>
<li class="">Integration with external services like <a href="https://zapier.com/" target="_blank" rel="noopener noreferrer"><code>Zapier</code></a>, <a href="https://www.make.com/" target="_blank" rel="noopener noreferrer"><code>Make</code></a>, <a href="https://airbyte.com/" target="_blank" rel="noopener noreferrer"><code>Airbyte</code></a>, <a href="https://www.langchain.com/" target="_blank" rel="noopener noreferrer"><code>LangChain</code></a> and others</li>
</ul>
<p>You can learn more in the Apify <a href="https://blog.apify.com/unofficial-google-search-api-from-apify-22a20537a951/" target="_blank" rel="noopener noreferrer">blog</a>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="what-will-you-scrape">What will you scrape?<a href="https://crawlee.dev/blog/scrape-google-search#what-will-you-scrape" class="hash-link" aria-label="Direct link to What will you scrape?" title="Direct link to What will you scrape?" translate="no">​</a></h2>
<p>In this blog, we've explored step-by-step how to create a Google Search crawler that collects ranking data. How you analyze this dataset is up to you!</p>
<p>As a reminder, you can find the full project code on <a href="https://github.com/Mantisus/crawlee-google-search" target="_blank" rel="noopener noreferrer"><code>GitHub</code></a>.</p>
<p>I'd like to think that in 5 years I'll need to write an article on "How to extract data from the best search engine for LLMs", but I suspect that in 5 years this article will still be relevant.</p>]]></content:encoded>
            <category>community</category>
        </item>
        <item>
            <title><![CDATA[Reverse engineering GraphQL persistedQuery extension]]></title>
            <link>https://crawlee.dev/blog/graphql-persisted-query</link>
            <guid>https://crawlee.dev/blog/graphql-persisted-query</guid>
            <pubDate>Fri, 15 Nov 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to do reverse engineering on persistedQuery extension by GraphQL and reveal the query hash needed for scraping.]]></description>
            <content:encoded><![CDATA[<p>GraphQL is a query language for getting deeply nested structured data from a website's backend, similar to MongoDB queries.</p>
<p>The request is usually a POST to some general <code>/graphql</code> endpoint with a body like this:</p>
<p><img decoding="async" loading="lazy" alt="GraphQL Query" src="https://crawlee.dev/assets/images/graphql-a3962ed441b2a078e43c8158ad64336a.webp" width="853" height="275" class="img_ev3q"></p>
<p>When scraping data from websites using GraphQL, it’s common to inspect the network requests in developer tools to find the exact queries being used. However, on some websites, you might notice that the GraphQL query itself isn’t visible in the request. Instead, you only see a cryptic hash value. This can be confusing and makes it harder to understand how data is being requested from the server.</p>
<p>This is because some websites use a feature called <a href="https://www.apollographql.com/docs/apollo-server/performance/apq/" target="_blank" rel="noopener noreferrer">"persisted queries"</a>. It's a performance optimization that reduces the amount of data sent with each request by replacing the full query text with a precomputed hash. While this improves website speed and efficiency, it introduces challenges for scraping because the query text isn’t readily available.</p>
<p><img decoding="async" loading="lazy" alt="Persisted Query Reverse Engineering" src="https://crawlee.dev/assets/images/graphql-persisted-query-6e36e61d76503e617fe4e7651bdf53a3.webp" width="1152" height="649" class="img_ev3q"></p>
<p>TL;DR: the client computes the SHA-256 hash of the <code>query</code> text and sends only that hash. The whole request can then often fit into the query string of a GET request, which makes it easy to cache. Below is an example request from Zillow:</p>
<p><img decoding="async" loading="lazy" alt="Request from Zillow" src="https://crawlee.dev/assets/images/zillow-ebd03223cb4ed6af11e972135e854851.webp" width="2396" height="512" class="img_ev3q"></p>
<p>As you can see, it’s just some metadata about the persistedQuery extension, the hash of the query, and variables to be embedded in the query.</p>
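<p>To make the mechanics concrete, here is a minimal sketch of how a client could build such a GET request. It follows the persistedQuery extension format shown above (<code>version</code> and <code>sha256Hash</code>). The endpoint, operation name, and query text are made up for illustration, and a real site may hash a differently formatted query string, so the hash you compute yourself won't necessarily match the one you see in the browser.</p>

```python
import hashlib
import json
from urllib.parse import urlencode


def persisted_query_hash(query: str) -> str:
    """Return the hex SHA-256 digest of the exact query text."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def build_persisted_get_url(endpoint: str, operation_name: str, variables: dict, query: str) -> str:
    """Build a GET URL that carries only the hash, never the query text."""
    extensions = {"persistedQuery": {"version": 1, "sha256Hash": persisted_query_hash(query)}}
    params = {
        "operationName": operation_name,
        # Compact JSON keeps the URL short, which helps caching.
        "variables": json.dumps(variables, separators=(",", ":")),
        "extensions": json.dumps(extensions, separators=(",", ":")),
    }
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical endpoint, operation, and query text, for illustration only.
EXAMPLE_QUERY = "query Listing($id: ID!) { listing(id: $id) { price } }"
url = build_persisted_get_url("https://example.com/graphql", "Listing", {"id": "42"}, EXAMPLE_QUERY)
print(url)
```

<p>Note that the query text itself never appears in the URL, which is exactly why it is hard to recover from the network tab alone.</p>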
<p>Here’s another request from expedia.com, sent as a POST, but with the same extension:</p>
<p><img decoding="async" loading="lazy" alt="Expedia Query" src="https://crawlee.dev/assets/images/expedia-2e5f3670fa2a7fe4b27c9e5f93e5ec5a.webp" width="1561" height="726" class="img_ev3q"></p>
<p>This primarily optimizes website performance, but it creates several challenges for web scraping:</p>
<ul>
<li class="">GET requests are usually more prone to being blocked.</li>
<li class="">Hidden Query Parameters: We don’t know the full query, so if the website responds with a “Persisted query not found” error (asking us to send the query in full, not just the hash), we can’t send it.</li>
<li class="">Once the website changes even a little bit and the clients start asking for a new query - even though the old one might still work, the server will very soon forget its ID/hash, and your request with this hash will never work again, since you can’t “remind” the server of the full query text.</li>
</ul>
<p>For various reasons, you might need to extract the entire GraphQL query text, but this can be tricky. While you could inspect the website’s JavaScript to find the query text, it’s often dynamically constructed from multiple fragments, making it hard to piece together.</p>
<p>Instead, we’ll take a more direct approach: tricking the client application (e.g., the browser) into revealing the full query. When the client uses a hash that the server doesn't recognize, the server typically responds with an error message like <code>PersistedQueryNotFound</code>. This prompts the client to resend the full query in a subsequent request. By intercepting and modifying the original request to include an invalid hash, we can trigger this behavior and capture the complete query text. This method avoids digging through JavaScript and relies on the natural error-handling flow of the client-server interaction.</p>
<p>For exactly this use case, a perfect tool exists: <a href="https://mitmproxy.org/" target="_blank" rel="noopener noreferrer">mitmproxy</a>, an open-source interactive HTTPS proxy written in Python. It intercepts requests made by your own devices, websites, or apps and lets you modify them with simple Python scripts.</p>
<p>Download <code>mitmproxy</code>, and prepare a Python script like this:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> json</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">request</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">flow</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">try</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        dat </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> json</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">loads</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">flow</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        dat</span><span class="token punctuation" style="color:#393A34">[</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"extensions"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"persistedQuery"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"sha256Hash"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"0d9e"</span><span class="token plain"> </span><span class="token comment" style="color:#999988;font-style:italic"># any bogus hex string here</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        flow</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> json</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">dumps</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">dat</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">except</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">pass</span><br></div></code></pre></div></div>
<p>This defines a hook that <code>mitmproxy</code> will run on every request: it tries to load the request's JSON body, modifies the hash to an arbitrary value, and writes the updated JSON as a new body of the request.</p>
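<p>The script above assumes the body is a JSON list and silently swallows every error. If you want something a little more defensive, here is a sketch of a variant that handles both a single operation and a batched list, and only rewrites the body when a hash was actually found. The helper name <code>replace_persisted_hash</code> is my own; only the <code>request(flow)</code> hook and <code>flow.request.text</code> come from the script above. Like the original, it only looks at JSON bodies, not at GET query strings.</p>

```python
import json

BOGUS_HASH = "0d9e"  # any value the server will not recognize


def replace_persisted_hash(body_text, bogus_hash=BOGUS_HASH):
    """Return the body with every persisted-query hash replaced, or None if nothing matched."""
    try:
        payload = json.loads(body_text)
    except (TypeError, ValueError):
        return None  # empty or non-JSON body, leave the request alone

    # Some clients send one operation, others send a batch (a JSON list).
    operations = payload if isinstance(payload, list) else [payload]
    changed = False
    for operation in operations:
        if not isinstance(operation, dict):
            continue
        extensions = operation.get("extensions")
        persisted = extensions.get("persistedQuery") if isinstance(extensions, dict) else None
        if isinstance(persisted, dict) and "sha256Hash" in persisted:
            persisted["sha256Hash"] = bogus_hash
            changed = True
    return json.dumps(payload) if changed else None


def request(flow):
    """mitmproxy hook: runs once for every intercepted request."""
    new_body = replace_persisted_hash(flow.request.text)
    if new_body is not None:
        flow.request.text = new_body
```

<p>Keeping the JSON logic in a plain function also means you can test it without starting the proxy.</p>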
<p>We also need to route our browser's requests through <code>mitmproxy</code>. For this purpose, we'll use a browser extension called <a href="https://chromewebstore.google.com/detail/foxyproxy/gcknhkkoolaabfmlnjonogaaifnjlfnp?hl=en" target="_blank" rel="noopener noreferrer">FoxyProxy</a>. It is available for both Firefox and Chrome.</p>
<p>Just add a route with these settings:</p>
<p><img decoding="async" loading="lazy" alt="mitmproxy settings" src="https://crawlee.dev/assets/images/mitmprpxy-1e6b253c473a57f3451077aae16640b6.webp" width="1418" height="394" class="img_ev3q"></p>
<p>Now we can run <code>mitmproxy</code> with this script: <code>mitmweb -s script.py</code></p>
<p>This will open a browser tab where you can watch all the intercepted requests in real-time.</p>
<p><img decoding="async" loading="lazy" alt="Browser tab" src="https://crawlee.dev/assets/images/browser-408715fa1be9f079c6672f7f3ae59644.webp" width="1439" height="809" class="img_ev3q"></p>
<p>If you open the relevant request and look at the query in the request section, you'll see that a garbage value has replaced the hash.</p>
<p><img decoding="async" loading="lazy" alt="Replaced hash" src="https://crawlee.dev/assets/images/request-6f8330f873c988f6dd07d358130627bd.webp" width="1439" height="809" class="img_ev3q"></p>
<p>Now, if you visit Zillow, open the request that uses the persistedQuery extension, and go to the response section, you'll see that the client receives the <code>PersistedQueryNotFound</code> error.</p>
<p><img decoding="async" loading="lazy" alt="Persisted query error" src="https://crawlee.dev/assets/images/error-2b5eed861143a45328231c6629406454.webp" width="1439" height="809" class="img_ev3q"></p>
<p>The front end of Zillow reacts by sending the whole query as a POST request.</p>
<p><img decoding="async" loading="lazy" alt="POST request" src="https://crawlee.dev/assets/images/query-b793b6bbe82994b3d38a565204f82e11.webp" width="1439" height="809" class="img_ev3q"></p>
<p>We extract the query and hash directly from this POST request. To ensure that the Zillow server does not forget about this hash, we periodically run this POST request with the exact same query and hash. This will ensure that the scraper continues to work even when the server's cache is cleaned or reset or the website changes.</p>
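<p>As a rough sketch, the periodic "reminder" request could be assembled like this. It assumes the body layout seen in the captured POST request: the full <code>query</code> next to the <code>persistedQuery</code> extension with the original hash. The endpoint, operation name, query, and hash below are placeholders, so copy the real values (and any headers the site requires) from the request you captured in <code>mitmproxy</code>.</p>

```python
import json
import urllib.request


def build_keepalive_request(endpoint, operation_name, query, variables, sha256_hash):
    """Build a POST request that sends the full query together with its hash."""
    body = {
        "operationName": operation_name,
        "query": query,  # full text captured from the intercepted POST
        "variables": variables,
        "extensions": {"persistedQuery": {"version": 1, "sha256Hash": sha256_hash}},
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values, for illustration only.
keepalive = build_keepalive_request(
    endpoint="https://example.com/graphql",
    operation_name="Listing",
    query="query Listing($id: ID!) { listing(id: $id) { price } }",
    variables={"id": "42"},
    sha256_hash="<hash captured from the POST request>",
)
print(keepalive.get_method(), keepalive.full_url)
```

<p>The sketch only builds the request. To actually send it, pass it to <code>urllib.request.urlopen</code> from whatever scheduler you already use, such as a cron job.</p>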
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conclusion">Conclusion<a href="https://crawlee.dev/blog/graphql-persisted-query#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>Persisted queries are a powerful optimization tool for GraphQL APIs, enhancing website performance by minimizing payload sizes and enabling GET request caching. However, they also pose significant challenges for web scraping, primarily due to the reliance on server-stored hashes and the potential for those hashes to become invalid.</p>
<p>Using <code>mitmproxy</code> to intercept and manipulate GraphQL requests is an efficient way to reveal the full query text without delving into complex client-side JavaScript. By forcing the server to respond with a <code>PersistedQueryNotFound</code> error, we can capture the full query payload and use it for scraping. Periodically running the extracted query keeps the scraper functional, even when the server-side cache is reset or the website evolves.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[12 tips on how to think like a web scraping expert]]></title>
            <link>https://crawlee.dev/blog/web-scraping-tips</link>
            <guid>https://crawlee.dev/blog/web-scraping-tips</guid>
            <pubDate>Sun, 10 Nov 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to think and scrape like a web scraping expert.]]></description>
            <content:encoded><![CDATA[<p>Typically, tutorials focus on the technical aspects, on what you can replicate: "Start here, follow this path, and you'll end up here." This is great for learning a particular technology, but it's sometimes difficult to understand why the author decided to do things a certain way or what guides their development process.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our <a href="https://apify.com/discord" target="_blank" rel="noopener noreferrer">discord channel</a>.</p></div></div>
<p>In this blog, I'll discuss the general rules and principles that guide me when I work on web scraping projects and allow me to achieve great results.</p>
<p>So, let's explore the mindset of a web scraping developer.</p>
<p><img decoding="async" loading="lazy" alt="How to think like a web scraping expert" src="https://crawlee.dev/assets/images/scraping-tips-8c538d5ae19dc1737b083169ad2a203b.webp" width="1152" height="649" class="img_ev3q"></p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-choosing-a-data-source-for-the-project">1. Choosing a data source for the project<a href="https://crawlee.dev/blog/web-scraping-tips#1-choosing-a-data-source-for-the-project" class="hash-link" aria-label="Direct link to 1. Choosing a data source for the project" title="Direct link to 1. Choosing a data source for the project" translate="no">​</a></h2>
<p>When you start working on a project, you likely have a target site from which you need to extract specific data. Check what possibilities this site or application provides for data extraction. Here are some possible options:</p>
<ul>
<li class=""><code>Official API</code> - the site may provide a free official API through which you can get all the necessary data. This is the best option for you. For example, you can consider this approach if you need to extract data from <a href="https://docs.developer.yelp.com/docs/fusion-intro" target="_blank" rel="noopener noreferrer"><code>Yelp</code></a></li>
<li class=""><code>Website</code> - in this case, we study the website, its structure, as well as the ways the frontend and backend interact</li>
<li class=""><code>Mobile Application</code> - in some cases, there's no website or API at all, or the mobile application provides more data, in which case, don't forget about the <a href="https://blog.apify.com/using-a-man-in-the-middle-proxy-to-scrape-data-from-a-mobile-app-api-e954915f979d/" target="_blank" rel="noopener noreferrer"><code>man-in-the-middle</code></a> approach</li>
</ul>
<p>If one data source fails, try accessing another available source.</p>
<p>For example, for <code>Yelp</code>, all three options are available, and if the <code>Official API</code> doesn't suit you for some reason, you can try the other two.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-check-robotstxt-and-sitemap">2. Check <a href="https://developers.google.com/search/docs/crawling-indexing/robots/intro" target="_blank" rel="noopener noreferrer"><code>robots.txt</code></a> and <a href="https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap" target="_blank" rel="noopener noreferrer"><code>sitemap</code></a><a href="https://crawlee.dev/blog/web-scraping-tips#2-check-robotstxt-and-sitemap" class="hash-link" aria-label="Direct link to 2-check-robotstxt-and-sitemap" title="Direct link to 2-check-robotstxt-and-sitemap" translate="no">​</a></h2>
<p>I think everyone knows about <code>robots.txt</code> and <code>sitemap</code> one way or another, but I regularly see people simply forgetting about them. If you're hearing about these for the first time, here's a quick explanation:</p>
<ul>
<li class=""><code>robots</code> is the established name for crawlers in SEO. Usually, this refers to crawlers of major search engines like Google and Bing, or services like Ahrefs and ChatGPT.</li>
<li class=""><code>robots.txt</code> is a file describing the allowed behavior for robots. It includes permitted crawler user-agents, wait time between page scans, patterns of pages forbidden for scanning, and more. These rules are typically based on which pages should be indexed by search engines and which should not.</li>
<li class=""><code>sitemap</code> describes the site structure to make it easier for robots to navigate. It also helps in scanning only the content that needs updating, without creating unnecessary load on the site</li>
</ul>
<p>Since you're not <a href="http://google.com/" target="_blank" rel="noopener noreferrer"><code>Google</code></a> or any other popular search engine, the robot rules in <code>robots.txt</code> will likely be against you. But combined with the <code>sitemap</code>, this is a good place to study the site structure, expected interaction with robots, and non-browser user-agents. In some situations, it simplifies data extraction from the site.</p>
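<p>If you'd rather not read <code>robots.txt</code> by eye, Python's standard library can parse it for you. Here is a small sketch using <code>urllib.robotparser</code>. The file content and the <code>my-scraper</code> user-agent are invented for the example. For a real site, you would download <code>/robots.txt</code> first and pass in its lines.</p>

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Which pages may a generic crawler visit?
print(parser.can_fetch("my-scraper", "https://example.com/blog/post"))
print(parser.can_fetch("my-scraper", "https://example.com/admin/users"))

# How long should it wait between requests, and where are the sitemaps?
print(parser.crawl_delay("my-scraper"))
print(parser.site_maps())
```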
<p>For example, using the <a href="https://www.crawlee.dev/sitemap.xml" target="_blank" rel="noopener noreferrer"><code>sitemap</code></a> for the <a href="http://www.crawlee.dev/" target="_blank" rel="noopener noreferrer">Crawlee website</a>, you can easily get direct links to posts both for the entire lifespan of the blog and for a specific period. One simple check, and you don't need to implement pagination logic.</p>
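<p>Here's roughly what that check can look like in code. This is a sketch that works on a tiny made-up sitemap in the standard <code>sitemaps.org</code> format. It assumes blog posts live under <code>/blog/</code> and that entries carry a <code>lastmod</code> date, which not every real sitemap does.</p>

```python
import xml.etree.ElementTree as ET
from datetime import date
from urllib.parse import urlsplit

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sitemap content, for illustration only.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/first-post</loc><lastmod>2024-03-01</lastmod></url>
  <url><loc>https://example.com/blog/second-post</loc><lastmod>2024-11-10T08:00:00+00:00</lastmod></url>
  <url><loc>https://example.com/docs/quick-start</loc><lastmod>2024-11-12</lastmod></url>
</urlset>
"""


def blog_links(sitemap_xml, path_prefix="/blog/", since=None):
    """Return sitemap URLs under path_prefix, optionally only those modified on or after `since`."""
    root = ET.fromstring(sitemap_xml.strip())
    links = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if not urlsplit(loc).path.startswith(path_prefix):
            continue
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if since is not None:
            # Keep only entries with a known date on or after `since`.
            if not lastmod or date.fromisoformat(lastmod[:10]) < since:
                continue
        links.append(loc)
    return links


print(blog_links(SAMPLE_SITEMAP))
print(blog_links(SAMPLE_SITEMAP, since=date(2024, 11, 1)))
```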
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-dont-neglect-site-analysis">3. Don't neglect site analysis<a href="https://crawlee.dev/blog/web-scraping-tips#3-dont-neglect-site-analysis" class="hash-link" aria-label="Direct link to 3. Don't neglect site analysis" title="Direct link to 3. Don't neglect site analysis" translate="no">​</a></h2>
<p>Thorough site analysis is an important prerequisite for creating an effective web scraper, especially if you're not planning to use browser automation. However, such analysis takes time, sometimes a lot of it.</p>
<p>It's also worth noting that the time spent on analysis and searching for a more optimal crawling solution doesn't always pay off - you might spend hours only to discover that the most obvious approach was the best all along.</p>
<p>Therefore, it's wise to set limits on your initial site analysis. If you don't see a better path within the allocated time, revert to simpler approaches. As you gain more experience, you'll more often be able to tell early on, based on the technologies used on the site, whether it's worth dedicating more time to analysis or not.</p>
<p>Also, in projects where you need to extract data from a site just once, thorough site analysis can sometimes eliminate the need to write scraper code altogether. Here's an example of such a site - <code>https://ricebyrice.com/nl/pages/find-store</code>.</p>
<p><img decoding="async" loading="lazy" alt="Ricebyrice" src="https://crawlee.dev/assets/images/ricebyrice_base-433dcb67f3debf8855b0043fb87a63c3.webp" width="1920" height="1032" class="img_ev3q"></p>
<p>By analyzing it, you'll easily discover that all the data can be obtained with a single request. You simply need to copy this data from your browser into a JSON file, and your task is complete.</p>
<p><img decoding="async" loading="lazy" alt="Ricebyrice Response" src="https://crawlee.dev/assets/images/ricebyrice_response-77221911846c701f7abd865673867d60.webp" width="1920" height="1032" class="img_ev3q"></p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="4-maximum-interactivity">4. Maximum interactivity<a href="https://crawlee.dev/blog/web-scraping-tips#4-maximum-interactivity" class="hash-link" aria-label="Direct link to 4. Maximum interactivity" title="Direct link to 4. Maximum interactivity" translate="no">​</a></h2>
<p>When analyzing a site, change the sort order, switch between pages, and interact with various elements while watching the <code>Network</code> tab in your browser's <a href="https://developer.chrome.com/docs/devtools" target="_blank" rel="noopener noreferrer">Dev Tools</a>. This will help you understand how the site interacts with the backend, what framework it's built on, and what behavior to expect from it.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="5-data-doesnt-appear-out-of-thin-air">5. Data doesn't appear out of thin air<a href="https://crawlee.dev/blog/web-scraping-tips#5-data-doesnt-appear-out-of-thin-air" class="hash-link" aria-label="Direct link to 5. Data doesn't appear out of thin air" title="Direct link to 5. Data doesn't appear out of thin air" translate="no">​</a></h2>
<p>This is obvious, but it's important to keep in mind while working on a project. If you see some data or request parameters, they were obtained somewhere earlier: possibly in another request, possibly they were already on the page, or possibly they were built by JS from other parameters. But they always come from somewhere.</p>
<p>If you don't understand where the data on the page comes from, or the data used in a request, follow these steps:</p>
<ol>
<li class="">Sequentially, check all requests the site made before this point.</li>
<li class="">Examine their responses, headers, and cookies.</li>
<li class="">Use your intuition: Could this parameter be a timestamp? Could it be another parameter in a modified form?</li>
<li class="">Does it resemble any standard hashes or encodings?</li>
</ol>
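<p>The last two checks in the list above can be partly automated. Below is a rough helper I'd sketch for a first guess at what an unknown parameter might be. The heuristics (10 or 13 digits for a Unix timestamp, hex length for common digests, base64 that decodes to readable text) are my own assumptions and will produce false positives, so treat the output as a hint and nothing more.</p>

```python
import base64
import binascii
import re
from datetime import datetime, timezone

# Hex lengths of a few common digests.
HEX_DIGEST_LENGTHS = {32: "MD5", 40: "SHA-1", 64: "SHA-256", 128: "SHA-512"}


def guess_parameter(value: str) -> list[str]:
    """Return human-readable guesses about what a request parameter could be."""
    guesses = []

    # Unix timestamps: 10 digits (seconds) or 13 digits (milliseconds).
    if re.fullmatch(r"\d{10}|\d{13}", value):
        seconds = int(value) / (1000 if len(value) == 13 else 1)
        moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
        guesses.append(f"timestamp: {moment.isoformat()}")

    # Hex strings whose length matches a common digest size.
    if re.fullmatch(r"[0-9a-fA-F]+", value) and len(value) in HEX_DIGEST_LENGTHS:
        guesses.append(f"hex digest, possibly {HEX_DIGEST_LENGTHS[len(value)]}")

    # Base64 that decodes to readable text.
    try:
        decoded = base64.b64decode(value, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        decoded = None
    if decoded and decoded.isprintable() and len(value) >= 8:
        guesses.append(f"base64 text: {decoded}")

    return guesses


print(guess_parameter("1700000000"))
print(guess_parameter(base64.b64encode(b'{"page": 2}').decode("ascii")))
```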
<p>Practice makes perfect here. As you become familiar with different technologies, frameworks, and their expected behaviors, you'll find it easier to understand how things work and how data is transferred. This accumulated knowledge will significantly improve your ability to trace data flow in web applications.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="6-data-is-cached">6. Data is cached<a href="https://crawlee.dev/blog/web-scraping-tips#6-data-is-cached" class="hash-link" aria-label="Direct link to 6. Data is cached" title="Direct link to 6. Data is cached" translate="no">​</a></h2>
<p>You may notice that when you open the same page several times, the requests sent to the server differ: something may have been cached and is already stored on your computer. For that reason, analyze the site in incognito mode, and try switching browsers as well.</p>
<p>This is especially relevant for mobile applications, which may keep some data in on-device storage. When analyzing them, you may need to clear the app's cache and storage.</p>
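<p>One way to see what the cache changes is to compare the headers of the same request captured on a first visit and on a repeat visit. Here is a rough sketch, assuming you copied both header sets from the browser's network tab into dictionaries. <code>diff_headers</code> is a hypothetical helper, not part of any library.</p>

```python
def diff_headers(first: dict, repeat: dict) -> dict:
    """Compare request headers from a first visit and a repeat visit."""
    # Header names are case-insensitive, so normalize them before comparing.
    first_l = {name.lower(): value for name, value in first.items()}
    repeat_l = {name.lower(): value for name, value in repeat.items()}
    return {
        "only_in_repeat": sorted(n for n in repeat_l if n not in first_l),
        "only_in_first": sorted(n for n in first_l if n not in repeat_l),
        "changed": sorted(
            n for n in first_l if n in repeat_l and first_l[n] != repeat_l[n]
        ),
    }
```

<p>Conditional headers such as <code>If-None-Match</code> or <code>If-Modified-Since</code> that appear only in the repeat visit are a typical sign that the browser already holds a cached copy of the response.</p>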
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="7-learn-more-about-the-framework">7. Learn more about the framework<a href="https://crawlee.dev/blog/web-scraping-tips#7-learn-more-about-the-framework" class="hash-link" aria-label="Direct link to 7. Learn more about the framework" title="Direct link to 7. Learn more about the framework" translate="no">​</a></h2>
<p>If during the analysis you discover that the site uses a framework you haven't encountered before, take some time to learn about it and its features. For example, if you notice a site is built with Next.js, understanding how it handles routing and data fetching could be crucial for your scraping strategy.</p>
<p>You can learn about these frameworks through official documentation or by using LLMs like <a href="https://openai.com/chatgpt/" target="_blank" rel="noopener noreferrer"><code>ChatGPT</code></a> or <a href="https://claude.ai/" target="_blank" rel="noopener noreferrer"><code>Claude</code></a>. These AI assistants are excellent at explaining framework-specific concepts. Here's an example of how you might query an LLM about Next.js:</p>
<div class="language-typescript codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-typescript codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token constant" style="color:#36acaa">I</span><span class="token plain"> am </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> the process </span><span class="token keyword" style="color:#00009f">of</span><span class="token plain"> optimizing my website using Next</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">js</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain"> Are there </span><span class="token builtin">any</span><span class="token plain"> files passed to the browser that describe all internal routing and how links are formed</span><span class="token operator" style="color:#393A34">?</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Restrictions</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token operator" style="color:#393A34">-</span><span class="token plain"> Accompany your answers </span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> code samples</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token operator" style="color:#393A34">-</span><span class="token plain"> Use </span><span class="token 
keyword" style="color:#00009f">this</span><span class="token plain"> message </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> the main message </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> all subsequent responses</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token operator" style="color:#393A34">-</span><span class="token plain"> Reference only those elements that are available on the client side</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> without access to the project code base</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div></code></pre></div></div>
<p>You can create similar queries for backend frameworks as well. For instance, with GraphQL, you might ask about available fields and query structures. These insights can help you understand how to better interact with the site's API and what data is potentially available.</p>
<p>To work effectively with LLMs, I recommend learning at least the basics of <a href="https://parlance-labs.com/education/prompt_eng/berryman.html" target="_blank" rel="noopener noreferrer"><code>prompt engineering</code></a>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="8-reverse-engineering">8. Reverse engineering<a href="https://crawlee.dev/blog/web-scraping-tips#8-reverse-engineering" class="hash-link" aria-label="Direct link to 8. Reverse engineering" title="Direct link to 8. Reverse engineering" translate="no">​</a></h2>
<p>Web scraping goes hand in hand with reverse engineering. You study how the frontend and backend interact, and you may need to read the code to understand how certain parameters are formed.</p>
<p>In some cases, though, reverse engineering demands more knowledge, effort, or time than the task justifies. At that point you need to decide whether to dig deeper or to change the data source or the technology. Most likely, this is the moment when you decide to abandon HTTP web scraping and switch to a headless browser.</p>
<p>The main principle of most web scraping protections is not to make web scraping impossible, but to make it expensive.</p>
<p>Just look at what the response to a search on <a href="https://www.zoopla.co.uk/" target="_blank" rel="noopener noreferrer"><code>zoopla</code></a> looks like:</p>
<p><img decoding="async" loading="lazy" alt="Zoopla Search Response" src="https://crawlee.dev/assets/images/zoopla_response-c6997e953965244f6293d44d2562f2dd.webp" width="1920" height="1020" class="img_ev3q"></p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="9-testing-requests-to-endpoints">9. Testing requests to endpoints<a href="https://crawlee.dev/blog/web-scraping-tips#9-testing-requests-to-endpoints" class="hash-link" aria-label="Direct link to 9. Testing requests to endpoints" title="Direct link to 9. Testing requests to endpoints" translate="no">​</a></h2>
<p>After identifying the endpoints you need to extract the target data, make sure you get a correct response when making a request. If the server returns a status code other than 200, or data different from what you expected, you need to figure out why. Here are some possible reasons:</p>
<ul>
<li class="">You need to pass some parameters, for example cookies, or specific technical headers</li>
<li class="">The site requires that when accessing this endpoint, there is a corresponding <code>Referrer</code> header</li>
<li class="">The site expects that the headers will follow a certain order. I've encountered this only a couple of times, but I have encountered it</li>
<li class="">The site uses protection against web scraping, for example with <code>TLS fingerprint</code></li>
</ul>
<p>There are many other possible reasons, each of which requires separate analysis.</p>
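<p>A first pass over an unexpected response can be scripted. The sketch below is only an illustration: <code>diagnose</code> is a hypothetical helper, and the mapping from status codes to likely causes is my assumption rather than a rule every site follows. It takes the status code, headers, and body you already received, so it works with any HTTP client.</p>

```python
import json


def diagnose(status: int, headers: dict, body: str, expect_json: bool = True) -> list[str]:
    """Return hints about why an endpoint response is not what was expected."""
    # Header names are case-insensitive, so normalize them first.
    headers = {name.lower(): value for name, value in headers.items()}
    hints = []

    if status in (401, 403):
        hints.append("access denied: compare cookies, Referer and other headers with the browser request")
    elif status == 429:
        hints.append("rate limited: reduce the request rate")
    elif status in (301, 302, 303, 307, 308):
        hints.append("redirect to " + headers.get("location", "an unknown location"))
    elif status != 200:
        hints.append(f"unexpected status {status}")

    # A 200 response can still carry a challenge or login page instead of data.
    if status == 200 and expect_json:
        try:
            json.loads(body)
        except ValueError:
            hints.append("status 200 but the body is not JSON: possibly a challenge or login page")

    return hints
```

<p>An empty list means the response passed these basic checks. It does not prove the data is complete, so still compare the payload with what the browser receives.</p>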
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="10-experiment-with-request-parameters">10. Experiment with request parameters<a href="https://crawlee.dev/blog/web-scraping-tips#10-experiment-with-request-parameters" class="hash-link" aria-label="Direct link to 10. Experiment with request parameters" title="Direct link to 10. Experiment with request parameters" translate="no">​</a></h2>
<p>Explore what results you get when changing request parameters, if any. Some parameters may be missing but supported on the server side. For example, <code>order</code>, <code>sort</code>, <code>per_page</code>, <code>limit</code>, and others. Try adding them and see if the behavior changes.</p>
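<p>To try such parameters systematically, you can generate one URL variant per candidate and compare the responses. This is a minimal sketch using only the standard library. The parameter names come from the list above, while the values in <code>CANDIDATES</code> and the helper name <code>with_extra_params</code> are hypothetical and need adjusting for each site.</p>

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Candidate parameters and example values (assumptions, adjust per site).
CANDIDATES = {"per_page": "100", "limit": "100", "sort": "date", "order": "desc"}


def with_extra_params(url: str, candidates: dict = CANDIDATES) -> list[str]:
    """Build one URL per candidate parameter that the original URL lacks."""
    parts = urlsplit(url)
    existing = parse_qsl(parts.query, keep_blank_values=True)
    present = {name for name, _ in existing}

    variants = []
    for name, value in candidates.items():
        if name in present:
            continue  # the site already sends this parameter
        query = urlencode(existing + [(name, value)])
        variants.append(urlunsplit(parts._replace(query=query)))
    return variants
```

<p>Request each variant and compare the number and order of items with the baseline response. A change suggests the server supports the parameter even though the frontend never sends it.</p>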
<p>Experimenting with parameters is especially relevant for sites using <a href="https://graphql.org/" target="_blank" rel="noopener noreferrer"><code>graphql</code></a>, where the client chooses which fields to request.</p>
<p>Let's consider this <a href="https://restoran.ua/en/posts?subsection=0" target="_blank" rel="noopener noreferrer"><code>example</code></a>.</p>
<p>If you analyze the site, you'll see a request that can be reproduced with the following code. I've formatted it a bit to improve readability:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> requests</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://restoran.ua/graphql"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"operationName"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Posts_PostsForView"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"variables"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span 
class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">"sort"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">"sortBy"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"startAt_DESC"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"query"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token triple-quoted-string string" style="color:#e3116c">"""query Posts_PostsForView(</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    $where: PostForViewWhereInput,</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    $sort: PostForViewSortInput,</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    $pagination: PaginationInput,</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    $search: String,</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    $token: 
String,</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    $coordinates_slice: SliceInput)</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    {</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">        PostsForView(</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                where: $where</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                sort: $sort</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                pagination: $pagination</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                search: $search</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                token: $token</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                ) {</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        id</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        title: ukTitle</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        summary: ukSummary</span><br></div><div class="token-line" 
style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        slug</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        startAt</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        endAt</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        newsFeed</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        events</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        journal</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        toProfessionals</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        photoHeader {</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            address: mobile</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            __typename</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            }</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        coordinates(slice: $coordinates_slice) 
{</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            lng</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            lat</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            __typename</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            }</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        __typename</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                    }</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    }"""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">response </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> requests</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">post</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> json</span><span class="token operator" 
style="color:#393A34">=</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">print</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">json</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Now I'll update it to get results in two languages at once and, most importantly, to include the internal text of the publications:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> requests</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://restoran.ua/graphql"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"operationName"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Posts_PostsForView"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"variables"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span 
class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">"sort"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">"sortBy"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"startAt_DESC"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"query"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token triple-quoted-string string" style="color:#e3116c">"""query Posts_PostsForView(</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    $where: PostForViewWhereInput,</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    $sort: PostForViewSortInput,</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    $pagination: PaginationInput,</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    $search: String,</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    $token: 
String,</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    $coordinates_slice: SliceInput)</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    {</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">        PostsForView(</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                where: $where</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                sort: $sort</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                pagination: $pagination</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                search: $search</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                token: $token</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                ) {</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        id</span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        uk_title: ukTitle</span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        
en_title: enTitle</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        summary: ukSummary</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        slug</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        startAt</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        endAt</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        newsFeed</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        events</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        journal</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        toProfessionals</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        photoHeader {</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            address: mobile</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            __typename</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">       
                     }</span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        mixedBlocks {</span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            index</span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            en_text: enText</span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            uk_text: ukText</span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            __typename</span><br></div><div class="token-line theme-code-block-highlighted-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            }</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        coordinates(slice: $coordinates_slice) {</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            lng</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            lat</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            __typename</span><br></div><div class="token-line" 
style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                            }</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                        __typename</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">                    }</span><br></div><div class="token-line" style="color:#393A34"><span class="token triple-quoted-string string" style="color:#e3116c">    }"""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">response </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> requests</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">post</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> json</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">print</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">json</span><span class="token punctuation" 
style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>As you can see, a small update of the request parameters spares me from visiting the internal page of each publication. You have no idea how many times this trick has saved me.</p>
<p>If you're facing <code>graphql</code> and don't know where to start, my advice about documentation and LLMs applies here too.</p>
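<p>Another starting point is GraphQL's standard introspection query, which asks the server to describe its own types and fields. The sketch below only builds the payload and reads a response. The helper names and the <code>ListTypes</code> operation name are hypothetical, and many production servers disable introspection, so treat it as something to try rather than something that always works.</p>

```python
# Standard introspection fields defined by the GraphQL specification.
INTROSPECTION_QUERY = """
query ListTypes {
  __schema {
    types {
      name
      kind
      fields {
        name
      }
    }
  }
}
"""


def introspection_payload() -> dict:
    """Build a JSON body in the same shape as the requests shown above."""
    return {"operationName": "ListTypes", "variables": {}, "query": INTROSPECTION_QUERY}


def field_names(schema_response: dict, type_name: str) -> list[str]:
    """List the fields of one type from a parsed introspection response."""
    for entry in schema_response["data"]["__schema"]["types"]:
        if entry["name"] == type_name:
            # Scalars and enums report `fields` as null.
            return [field["name"] for field in (entry["fields"] or [])]
    return []
```

<p>You would send <code>introspection_payload()</code> as the JSON body of a POST request to the GraphQL endpoint, then pass the parsed response to <code>field_names</code> with a type name taken from the schema to see which fields you could add to a query.</p>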
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="11-dont-be-afraid-of-new-technologies">11. Don't be afraid of new technologies<a href="https://crawlee.dev/blog/web-scraping-tips#11-dont-be-afraid-of-new-technologies" class="hash-link" aria-label="Direct link to 11. Don't be afraid of new technologies" title="Direct link to 11. Don't be afraid of new technologies" translate="no">​</a></h2>
<p>I know how easy it is to master a few tools and just use them because it works. I've fallen into this trap more than once myself.</p>
<p>But modern sites use modern technologies that have a significant impact on web scraping, and in response, new tools for web scraping are emerging. Learning these may greatly simplify your next project, and may even solve some problems that were insurmountable for you. I wrote about some tools <a href="https://www.crawlee.dev/blog/common-problems-in-web-scraping" target="_blank" rel="noopener noreferrer"><code>earlier</code></a>.</p>
<p>I especially recommend paying attention to <a href="https://curl-cffi.readthedocs.io/en/latest/" target="_blank" rel="noopener noreferrer"><code>curl_cffi</code></a> and the frameworks
<a href="https://www.omkar.cloud/botasaurus/" target="_blank" rel="noopener noreferrer"><code>botasaurus</code></a> and <a href="https://www.crawlee.dev/python/" target="_blank" rel="noopener noreferrer"><code>Crawlee for Python</code></a>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="12-help-open-source-libraries">12. Help open-source libraries<a href="https://crawlee.dev/blog/web-scraping-tips#12-help-open-source-libraries" class="hash-link" aria-label="Direct link to 12. Help open-source libraries" title="Direct link to 12. Help open-source libraries" translate="no">​</a></h2>
<p>Personally, I only recently came to realize the importance of this. All the tools I use for my work are either open-source projects or built on open-source. Web scraping literally lives thanks to open-source. This is especially noticeable if you're a <code>Python</code> developer: in pure <code>Python</code>, things look quite sad once you need to deal with a <code>TLS fingerprint</code>, and once again open-source saved us here.</p>
<p>And it seems to me that the least we could do is invest a little of our knowledge and skills in supporting open-source.</p>
<p>I chose to support <a href="https://www.crawlee.dev/python/" target="_blank" rel="noopener noreferrer"><code>Crawlee for Python</code></a>, and no, not because they allowed me to write for their blog, but because it shows excellent development dynamics and is aimed at making life easier for web crawler developers. It speeds up crawler development by handling critical aspects under the hood, such as session management, session rotation when blocked, managing concurrency of asynchronous tasks (if you write asynchronous code, you know what a pain this can be), and much more.</p>
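<p>To give a feel for the kind of boilerplate this saves, here is a minimal hand-rolled sketch of limiting concurrency with the standard library. It is not how Crawlee implements it; the names <code>fetch</code>, <code>crawl</code>, and <code>max_concurrency</code> are hypothetical, and the network call is replaced by a short sleep.</p>

```python
import asyncio


async def fetch(url, limiter, results):
    # The semaphore caps how many coroutines are inside this block at once.
    async with limiter:
        await asyncio.sleep(0.01)  # stand-in for a real HTTP request
        results.append(url)


async def crawl(urls, max_concurrency=3):
    # Hypothetical helper: schedule every URL, but let only a few run at a time.
    limiter = asyncio.Semaphore(max_concurrency)
    results = []
    await asyncio.gather(*(fetch(url, limiter, results) for url in urls))
    return results


urls = [f"https://example.com/page/{i}" for i in range(10)]
pages = asyncio.run(crawl(urls))
print(len(pages))
```

<p>Retries, session rotation, and error handling would all have to be layered on top of this by hand, which is exactly the part a framework takes off your plate.</p>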
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>If you like the blog so far, please consider <a href="https://github.com/apify/crawlee" target="_blank" rel="noopener noreferrer">giving Crawlee a star on GitHub</a>, it helps us to reach and help more developers.</p></div></div>
<p>And what choice will you make?</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conclusion">Conclusion<a href="https://crawlee.dev/blog/web-scraping-tips#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>I think some things in the article were obvious to you, and some you already follow yourself, but I hope you learned something new too. If most of it was new to you, try using these rules as a checklist in your next project.</p>
<p>I would be happy to discuss the article. Feel free to comment here, in the article, or contact me in the <a href="https://apify.com/discord" target="_blank" rel="noopener noreferrer">Crawlee developer community</a> on Discord.</p>
<p>You can also find me on the following platforms: <a href="https://github.com/Mantisus" target="_blank" rel="noopener noreferrer">GitHub</a>, <a href="https://www.linkedin.com/in/max-bohomolov/" target="_blank" rel="noopener noreferrer">LinkedIn</a>, <a href="https://apify.com/mantisus" target="_blank" rel="noopener noreferrer">Apify</a>, <a href="https://www.upwork.com/freelancers/mantisus" target="_blank" rel="noopener noreferrer">Upwork</a>, <a href="https://contra.com/mantisus" target="_blank" rel="noopener noreferrer">Contra</a>.</p>
<p>Thank you for your attention :)</p>]]></content:encoded>
            <category>community</category>
        </item>
        <item>
            <title><![CDATA[How to create a LinkedIn job scraper in Python with Crawlee]]></title>
            <link>https://crawlee.dev/blog/linkedin-job-scraper-python</link>
            <guid>https://crawlee.dev/blog/linkedin-job-scraper-python</guid>
            <pubDate>Mon, 14 Oct 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to scrape LinkedIn jobs and save them to a CSV file using Python.]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="introduction">Introduction<a href="https://crawlee.dev/blog/linkedin-job-scraper-python#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction" translate="no">​</a></h2>
<p>In this article, we will build a web application that scrapes LinkedIn for job postings using Crawlee and Streamlit.</p>
<p>We will create a LinkedIn job scraper in Python using Crawlee for Python. It extracts the company name, job title, time of posting, and link to the job posting, based on user input received dynamically through the web application.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our <a href="https://apify.com/discord" target="_blank" rel="noopener noreferrer">discord channel</a>.</p></div></div>
<p>By the end of this tutorial, you’ll have a fully functional web application that you can use to scrape job postings from LinkedIn.</p>
<p><img decoding="async" loading="lazy" alt="Linkedin Job Scraper" src="data:image/webp;base64,UklGRpgcAABXRUJQVlA4IIwcAAAQfAGdASpcBt0CPpFIoEylpCaioJN4QNASCWlu+F8+/k/AuBCbOFzUYLOjUh0O8tefHp3bdRuonqp/z31K/PL9aj/jZM75b/xP42eEf+Z8N/IZ799v+U91X5q/y375fwPNPwB+IGoR+V/0L/U72OAD8t/r368+Ob/l+jH2Q9gLzB8EGgN5Nf+j5Tvr/2EOmGB9D/YD8z5nzPmfM+Z8z5nzPmfM+Z8z5nzPmfM+Z8z5nzPmfM+Z8z5nzPmfM+Z8z5nzPmfM+Z8z5nzPmfM+Z8z5nzPmfM+Z8z5nzPmfM+Z8z5nzPmfM+Yq8OTW3tvbe29t7b23tvbe22iZO/M+Z8z5nzPmfM+Z8z5nzPmfM+Z8z5nzPmfwiTklmXH4pWwfHDhw4cOHDhw4cOHDhw4cOHDhw4cOHDhw4cOHDhw4cOHDhw4cOHDhw4b/YoBGhwWLFixYsWLFixYsWLFixYsWLFixYsWLFixYsWLFixYsWLFixYsWLFimOzM6zIyDLIC6inMYBQN4AdJ+xanZv1IVvZErjAbqWdBIkSJEiRIkSJEiRIkSJEiRIkSJEiRIkSJEiRIkSJEiRIkSJEiRIkSJEiRLgLnNevXr169evXr169evXr169evXr169evXr169evXr169evXr169evXr169e69y5cuXLly5cuXLly5cuXLly5cuXLly5cuXLly5cuXLly5cuXLly5cuXLly6R48ePHjx48ePHjx48ePHjx48ePHjx48ePHjx48ePHjx48ePHjx48ePHjx48iCtWrVq1atWrVq1atWrVq1atWrVq1atWrVq1atWrVq1atWrVq1atWrVq1atW8tkyZMmTJkyZMmTHkuiL1UwcSehb4g6f+/kZqrDYDuhB5g1lF0nZC9nZ+vVNIKgtI4eyNTRiq4pxF9r8+2qMYQezlaAB+FZCmqk8DlC4jj+06dOnTp06dOnTp06dOnTp06dOnT3kiRIkSJEiRIkSJC/a91YpfY9LY7LHPlSIuGOyl8hLfCadJCMOqQf2nFwzjL0sCEFlwSmiGpay2JlSr+iwgbQ/pr3/Ynau1aSaw0G/XGdFiZN0U3g9p2vfcqW19E3mXGJv0k2NLrU4el8Rntj/mUStqrFnao8rs/wEEb5SvrG/EAEcf8WauHMhiN9lYgl2y/WuvXS47Ad0sCQXpQ5Nij8jEYp+5SOO5My++pwpNPhyQp6DiawABYsWLFixYsWLFixYsWLFixYsWLPB48ePHjx48ePHjwyCazIpuZ2oqIwtbNyUGmr/YqnQDGqKoAqAOC+AcbvAl6wa8o9xi6LIPKbbUi45jqAXa2j2Gj95zrxLBMiuH02BvEolPoFjYmD0ZKtzDSDllWy8+Flzd+B+OzJt/0oD3bUj4YEAiLSXdQ+r41F0F9sCCT3Ev3MTUyqF7F08v9d3zuBQwIC8skNs/43cnF0y20CAU2fPnz58+fPnz58+fPnz58+fPn0AQkSJEiRIkSJEiRIk0lONsiRIkVM0IoWkPA0COSa9evXr169evXr169evXr169evde5cuXLly5cuXLlybh1/cSC+LPV6vV6vV6vV6vV6vV6vV6vV6vV6vV6HatiOGNmzZs2bNmzZs2bNuGBAgQIECBAgQIEBsEIXEFaMWp2OMZCEqZYsWLFixYsWLFixYsWLFixYs8Hjx48ePHjx48ePHpLVq1atWrVq1atWrGVxgdKMwI6H/Ym0ZEun08JZg6tAPH4t+yZMmTJkyZMmTJkyZMmTJkyZQDw4cOHDhw4cOHDhxYHxw4cOHDhw4cOHCIWNkEvOOewez7OeSlCkjWWuuTa9XtfwJxUszla8lJ/crS4InNevXr169evXr16917ly5cuXLly5cuXKjLb+Vq1
atWrVq1atWrVq1atWrVq1bBpsRIkSJEiRIkSJEiSG+fPnz58+fPnz587tyCt9HPAEWaQwGAwGAwGAwGAwGAwGAwGAwGAwGAwF/sqMz6biw4cOHDhw4cOHDhxFLZs2bNmzZs2bNmy1x9llmnoZJ6a7f4Qo9Rdv8IUeou3+EKPUXb/CFHqLt/hCj1FxJLVq1atWrVq1atWrWbyZMmTJkyZMmTJkwObpZg7BbwG3kXgHT+Ifw9We5EAuFPbxx2Nz+SJEiRIkSJEiRIkSJEiRIkSQ3z58+fPnz58+fPn1BIkSJEiRIkSJEiRFbm9gAP3Yig/dVuq8k79Vbvl+gB1E4kVDD3La9Xq9Xq9Xq9Xq9Xq9Xq9Xq9Xq9Wj91BRIkSJEiRIkSJEiROGfPnz58+fPnz58+PjNKh8L8OYiTklmXH4pZ6i7f4Qo9Rdv8IUeou3+EKPUXb/CFG8B1yuF2bNmzZs2bNmzZs24YECBAgQIECBAgQGwQjnYwyZMmTJkyZMmTJkyZMmTJkyZMocqXLly5cuXLly5cuXSPHjx48ePHjx48eO0d82+2TRbe4zjoOiFHqLt/hCj1F2/whR6i7f4Qo9Rdv8IUeou3+EKOf+PTb4Hxw4cOHDhw4cOHDmQiRIkSJEiRIkSJEVuc9x5l7TVdAzwpeAZ4UvAM8KXgGeFLwDPCl4BnhS8AzuuQgQIECBAgQIECBAjOCxYsWLFixYsWLFaZan17VjsKGW4mfhbr4uDWNmMlU5/DYpzUFDKudkJXnBM87dLMZhwLuBdANKzlFVJjrxPyBw4cOHDhw4cOHDhw4cOHDiwPjhw4cOHDhw4cOHMhEiRIkSJEiRIkSIrc31ectMWdFKFIYDAYDAYDAYDAYDAYDAYDAX9HO5dz+vXr169evXr169ews1atWrVq1atWrVqxlcxua1atWrVq1atWrVq1atWrVq1at7J4cOHDhw4cOHDhw4sD44cOHDhw4cOHDhELGz8ho0ggQIECBAgQIECBAgQIECBAgQFB2mEOa1atWrVq1atWrVq3lsmTJkyZMmTJkyZGeus1CmLLs9Rdv8IUeou3+EKPUXb/CFHqLt/hCj1F2/whR6i7f4QlQUougsWLFixYsWLFixZ4PHjx48ePHjx48eLwut/K2BZaHwJ/1E7OH/169evXr169evXr169evXsLNWrVq1atWrVq1atZvJkyZMmTJkyZMmTA5vfI1UIFuBXQxuyTahDxzn+3naEaFCUHjKr7Z0cAY25W4DZ4BCiMzT8bgjHr169evXr169evXr169evfNatWrVq1atWrVq1by2TJkyZMmTJkyZMjPXBCbVB4EhjESWdcDRjdynfp2vU3Lp06dOnTp06dOnTp06dOnWFEiRIkSJEiRIkSJE4Z8+fPnz58+fPnz4+R9TGVTFByOR1er1er1er1er1er1er1er1er1erzzKO2I2gcOHDhw4cOHDhw4cWB8cOHDhw4cOHDhw6SHDhw4cOHDhw4cOHDhw4cOHDhw4cOHDhw4cOHDhw4cOZCJEiRIkSJEiRIkSJEiRIkSJEiRIkSJEiRIkSJEiRIkSJEiRIkSJEiRIkSJFzmvXr169evXr169evXr169evXr169evXr169evXr169evXr169evXr169evde5cuXLly5cuXLly5cuXLly5cuXLly5cuXLly5cuXLly5cuXLly5cuXLly5dI8ePHjx48ePHjx48ePHjx48ePHjx48ePHjx48ePHjx48ePHjx48ePHjx48eRBWrVq1atWrVq1atWrVq1atWrVq1atWrVq1atWrVq1atWrVq1atWrVq1atWreWyZMmTJkyZMmTJkyZMmTJkyZMmTJkyZMmTJkyZMmTJkyZMmTJkyZMmTJkyZ0Bw4cOHDhw4cOHDhw4cOHDhw4cOHDhw4cOHDhw4cOHDhw4cOHDhw4cOHDhw7PPnz58+fPnz58+fPnz58+fPnz58+fPnz58+fPnz58+fPnz58+fPn
z58+fPoAhIkSJEiRIkSJEiRIkSJEiRIkSJEiRIkSJEiRIkSJEiRIkSJEiRIkSJEiRIlLNWrVq1atWrVq1atWrVq1atWrVq1atWrVq1atWrVq1atWrVq1atWrVq1atWvbLly5cuXLly5cuXLly5cuXLly5cuXLly5cuXLly5cuXLly5cuXLly5cuXLkYAP6AO7hn23LIvWT3M3vMst7B+vrx5e/ZlBrK08CPxP+D/oZWo2jCcFgsFgsFgsFjqxIDI/q0JC2E6Q6uQNNk+PZFrddrzPSK0FGCRVC1EavogH9xSgVxasOrkksNwPYjioVCoWodXHir8xLkjqxIBGNBdEp+ta4LBcqQdXGbxGr3lLDiCLxAo2l6DHkxJSoVC1wy0zVQ4tVYtVW0LzAnginiPvpXWx/eOlaivY/mfa8FgsFgsFguXAOrjLiUaViPxgV600NYu3cBGcQLNnHMCLd6t0Fm1HqWPeCKNMtyAAAAX9w7UT+nbjLsWFoAAAG0aKFHOqwbxgPSUn6VD01qVkVGzxq9DOalKLLfFdrCKMGk7Ri1lMwtOcMzf01s2OduQRBx/mYRhMNNDKvgrWKatcolqXGWPgEreD0n/tgs3WiHzwSspoRTkcz5zogVz21uiELFwRzBqFqw7kpZ74PK/6L73212+vDZSVYmytUmEz2pEOCPLTb0FmPxgFCygOKse06osrNc6X8siWviWy17381t9GwO6wzroCRVlbbpFydOnqWK3QnPbr2LY180iBGBKGn21nmp9E1ICTJj255HWH5H6PmnWWhdE7wSv/jrL4OlAA7b5BJBNMA2+i+QAAAAAAAAAAAAAAAAAAAAAAAFKwHSLbvIIml3X9mx7PjoUS30gxAQRlx7lfW7KBGB7v3LaMKXRiEBO3FNhEWH5DJHyzfg098I5D0+n8oBt6iFQHta9JXQ8agvoMYB3vFfgPyYmRfwBNFD3mZ69AI0mbVx8bcozR0lbRv6V++Me39KY/DDNJ5ikQiIA+em3gL1nIl5Rj7KcwFX9/q10mQbVxo+tboWxDo4s3iwu8djqrMkJmDuIN9r0f6eRkPbO0o2GoPSIjKnB2wKihBTFED/YUmnam69VhOXvGQNe6EbQ9SsozxE+Ax0PfeaTOmhYTRRyLmtSWJVP7U6YJcE7EOwTWTbOs/9zQibPYftSrMAzKEFAC2vvC7BSkNVQ0R6i6GrmtNXBV281fx0i5JEk1QgEH/vSnPEtVoCbTFAgIZ+DotsYbjvr218F4UspZu8Az+0XLs3VE7B4/UaDG1OoQ96GAAD8CZexB5fHSOqilCDQU8NRKtiujsjJHvNNPePGf6HdyZno/gnUzPNA1AALXmEzozCBneXJguHb9SkUznQxFKb+zRsVY3bVqa7eQrIqNDZzBn6k8M1M4aMwO/Rr60fvC7xxtJMChb2D0qWBjQzV840aI0yrDphsWsfdRJ58RGfSGthmJQ44QW7GMfRW0DczfQpX4NUYYudCC9kg3cH2KLaGGzr1wKQjUi1uP0XWGtC1GribJrlrgpWfGK3WibAT5AbVHhRaMLyeuo9JQk3yUJthoJQWKcEtyBEfTAigGu64QtlQFjC3MWBCULmu9R1slHE5mA7dPPBVWYE6wDCwX8nXWcnuEhBp6xTQnaIBBg665KMfjOWwYTkjXexBkzjNkgKUxtLTYO6VnNKU81flbNhdbod8+YoeBWMHnru94FhgYjNu30wugcG7sR6EYSUKfTQAuG6WG5vfff4zx9QDFflwYA+lQo4gZSoIgoJAwc1LdypG+cVr7Ht4xViZiibBUebK32Yg3gwP4ywZrD1ftYKRVmGi+YYmlTbcq+DwurGdc3OAujqiHYezOb/JzvkE1cQvizN1o2OCwR1SnvW6U9lcOUNZiMBB9gxZHE0cvnM7IWJ1yjooW6KJ/IIdj5OkS7Vc1l+pw2rgOv4sAmV4uzX7twZZtP3Ejgch6itSsKUAcFi4yZ71ctpXqpipsyrOCdKx
vqiiF5MbkAD5iPm1IUC4fW/MmYxGleN2VtCx2eW/ciSQtMQYHBYINjqujpxf/V4HUTTZaIlG2WTv9tcM/rvpuoiWDMWEOZbN7vWeyvUvpzY/peveh5oubVK6r2Wjh8KdMV/TgpzF5IMHK2Ocuwz/INqWzuvPXPD8FQbF3pzIPPbDP5lR1zgWNmEjhV4+nTfCUfUrriHBqRc/HtuCDdA7oDq6BomPCW+RNi1Zl3Fwo31w2xTHlws8VuelJUJVj0IvSYrBcp0pV8YoluCWTxxrEXKwq9pufG+oRm6CWxDTrtkHy4Bf42chCHTHBsxRBxSfQlDoD2OXPap40JwDjKKwej7l8WF3Hnfk0IMTZ5iVLwjneExnfimEuhNfj4smBGf1nQK+2irInWJzco+sp6PQl3jeczs4RlP9fZsCXzU9dqHxyMMNs3jYbMOCUxnJAk8IRoxN3ul436EnVeoRf9r7Fnakpd+KFMgNHs62+Kf3eFW83gvbbu/jZaMI/76k3VJbAZcezIMY8/tFr+rlLPEJUCN0xd/wtKTu+A0sn8b+S/aGoef0nVx03u2Yar3v3spduyfAixL8MM6nxOr5LAEVX/Awz80nDfviaZCMCgEQ63zch+xVJdR+aakXy7qat1qq1SvPbnjZpmtKwp05S6yoFfotiuAADG+7lE7rBa0aDn8vkC4fQymz61TcdKYty0IFt3C+Qxl+orJXeh/zZDFvdR/LxUFVM6pvHrfsBibZvliCYvB0mcGyNi4sWg/3Ao3U+M3OkP8TfCleFBCZOd8r7i00/fdp/otgaZLjZ8Bysc/xp2iyhT4LQmJHC/pKrHAeNhJGhbrpsYyVPc9NSNs4iOIerR5LzUMmL+an5aNSC5SbR5z727Ep1SlKocfyWZjEQQ4E0lOMgCKSCk5j0dHX8huFwRjSpCszqcM5nX8lpe4WjACdF+7kyOa26FiCPySLR8Iy4W+yS30OGoWqbvKjn/AH3bXaZvYwsTQAIA1TzudC8+7Rvcdb302oZyy9PvIwsr+ZsN2hFZmSuFdwATVOPHlwrlnUXbmrpRhg8SOHA7UOTZ2f0Tig1wIJ9nwaeO2b1JOVYH4Qc4jiLQSjwxk4cjbolKN+V2RzbMfQwlbFM6kXoHnAYZC8qv0Q3lyczdaccWf4Ka0oUZRMEU3zC7USdguTkzM0DK4n/OCNbubEPUrEPWEzS3PsLLK/XmXpZGKSuYlpdD0mF3H5OpbRJQRt5Pp4EvmeLSAYNOXP6FtvbRDDhHyTY91bfX04xsLLLnWk/uxUYg7HPiCNGHoy6v11s2fVJzKqtF0MkvZF6S9SqE+aplINFLSM+biUMHZq58vo6vm2OauUZkgYYBEfcLw+Q9oRMgEn+LVmDSkgTf5uPXZKmnRcXHaHH22y5BlHmV/biXKZAECO6SBgSE5sx0V/mP0WDcifW+KQIyKNbS2AYpreXVuvqjwziBaDJkVBTCoH9BwMsukgbczmIFYS1n/E88nleas2TCZj469Fwjrokr6GAAAAAAAAAUuj4L1wuzBr4ANeUEty1rRAWQAA98OKegz+OXSrc5LLfMIXjYdL1roEsPB8dO4D1eaeuN09cQZM1STJuU4Xev4XoaNwHzuryum3loNEdkwfgvJTTiAzgR2ABY+d+AaZ9eocK6s5S8L3w31jnkXWrHFGBML+4chZ6/k4HXYtk5lM9bBg7VeiBLbD99lVtzzDChneeJx+X+CQfWXuCqpQtUx8p+KB+D0hlCt4PDLt6jO+1owc+7aG+s49Ceq/tjGd3FJBA7uPy944jY8xb45hN/QEqioGF7mf/daYg9LuZUZtDDs98qUmrby76oKJ0PW1d/VypjDysJqe+926kU9mIvVPIAAADVVwjytyvACT+XHfug3t+IuN3PZZ9SFwXG7nss+pC4Ljdz2WfUhcFxu57LPqQuC43c9ln1IXBcbueyz6kLguN3PZZ9SEWABU/J4AAAAAAJ6GAAAAAAAABlVk1PjFjG9j
Kp8I8CAcdo81w+rJ0QzZdTsHjlIMTCIivF6+54ecBYjxNB+hDXPuUmIPX3DOZrscbjH9laqLeXobhiNV4wh9eCYSz8ffq4biO/L38voUVHkP4Z0ynmn8ZNB0hHf4zEOOZgxdcx15DtyHnwq0C7QxB5FH8deOx8KXzg+Gh1yOA6SQkydlG6BtYQkUGXLYZx0in0TAzVSGIFWUZxUxWFJWc91aOhD1hlq1ijqU5bETt16pY8/bXr4A8sqFVdS90VdbNDlvO2LRRXDEWkoCqaCTa2meY0QKmy/63mqG3dMcRQvKmO6XlQN4EDGAAAB0Zr7v83mc/84feuyZY6XEFtYM7HEY6SdMBTZyPsKq9jvMD3hQgpX2CeSZAS+9Ilh6RWup7PRD80o1xMUDC64q64x2FA0Y4xGWpkeM/pz0Zq0d+fRIrNfacyVig6yRCy8Sxw0BoCwYywAGIZeCCI8PAGYLAA2qaBgiwAF/BYHL4+AoxYAaCSDwJvZhwAAAIQlFTpUxgpS1A8nlPjZkgsDAJH5qukhvXydIed994fOggH8EMhTVDVtu7v7wlKjXi9WEEezbIGNnhgEVtfMvv85OMTHSvGFteDoST2J1nuWhm2fGSadUB0v1d+ugd5MtH7xL7X1gBWY0p/cGFi5S7ufPrZXXw1/ng5xGxUXRb6sIaeouXLb4zm2QfUSc1O5CUucu5e84lRA13hIJziA74lz3tvG0Avx5uOTx4dA+dHAvLCUFAeAODYh+QdcvkVMjbhI0cTcGgz2UXcJi0zYoe0rfhA4ndDRQiqquIrVnGmRjq+XCQVS1jT89F8RwrwWS1bnbbZFjMvwyxZwh4ks0aBiZIXfc/M8N0Du2CTPWclmraBcsfG+OpLXr9Tc+Yq0ZP/5t5Q1gv+dpwtxLLDt9fT3pgy5jpfXsoNcC9+qM7dLW3+2GDYmJR4fiWuoHYo9QdaXf8iAisso6PAXcrwBDRUXuiSAWJGD46WAU2vZQAGCisYsv1P3rSeNoSlIuPe92HFNXAMP3Sp8t1+wAAAHiiyh3+kb8VQWPWI9M7DOqTydZyeTyeTyeTyeTyeTyeTyeTyeTyeTyeTyeTyeTyeTyeTyeTyeTyeTyeTyeTydLBYAgOJAAAAcYTt24AAAA9YIAAAAAAD5LZIz1V5aIR5JR2TuI9wgAn7dWuPUAAABHLkZArjelrbGpqDoLo2qFqqVcEfqpk5QfNOQoCsYgOBLEBPXYOQnUXknc4jGqAyC//gdXbDUdDjwyNHufwY4lVjsvX2yQxB2VCeVhr22pxq9yWnbhhWX/iPpM0BfbLGe5I9Z08Xdamn8/8xen0W8cfzwVrKogWHMfJruMKL0dR8T0czWzc3s/XAmlBKWdmzAoRe8Fgn1RzlVnsDLN40nvgfvJh9OY6uewTufqhgI5HqvK9sK9k1undUY97F+Lyr3M4XEIL2QyawItQwg1ODjsSmTWtNd5YxWkTXfYv7bCuAyNGyDF3hvKa83ZnDpD5j4JYlA6kk+enAE4RTGgol40rRDWp2ZN8LIhXWs1hc7aPagXnEOdDrxlqWPwyb8G2/h5ufuO3oFcOiysQOYyzaLEVfPfsa5oMdrxpsISO66msXWSmfVdZeUmbFkEFfP/CbyYcW3LwDMJYN8i5Bf3BW3m3VZNr6yw7fHT8MDPTYB3NjukSN5cVRdaNK7YBk3/vu5TpvDyEB6pmdBl/RwzZTtYfvYCP80JWxW/GLTKGF6Rhl2TFAjsAAABreajb8jK/nT2a6PaZiGc1u46XXk7PWyY3vAAAAFo7TNu+VTSE0ADz3VxwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=" width="1628" height="733" class="img_ev3q"></p>
<p>Let's begin.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="prerequisites">Prerequisites<a href="https://crawlee.dev/blog/linkedin-job-scraper-python#prerequisites" class="hash-link" aria-label="Direct link to Prerequisites" title="Direct link to Prerequisites" translate="no">​</a></h2>
<p>Let's start by creating a new Crawlee for Python project with this command:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pipx run crawlee create linkedin-scraper</span><br></div></code></pre></div></div>
<p>Select <code>PlaywrightCrawler</code> in the terminal when Crawlee asks for it.</p>
<p>After installation, Crawlee for Python will create boilerplate code for you. Change directory (<code>cd</code>) to the project folder and run this command to install dependencies.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">poetry </span><span class="token function" style="color:#d73a49">install</span><br></div></code></pre></div></div>
<p>We are going to begin editing the files provided to us by Crawlee so we can build our scraper.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>Before going ahead if you like reading this blog, we would be really happy if you gave <a href="https://github.com/apify/crawlee-python/" target="_blank" rel="noopener noreferrer">Crawlee for Python a star on GitHub</a>!</p></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="building-the-linkedin-job-scraper-in-python-with-crawlee">Building the LinkedIn job Scraper in Python with Crawlee<a href="https://crawlee.dev/blog/linkedin-job-scraper-python#building-the-linkedin-job-scraper-in-python-with-crawlee" class="hash-link" aria-label="Direct link to Building the LinkedIn job Scraper in Python with Crawlee" title="Direct link to Building the LinkedIn job Scraper in Python with Crawlee" translate="no">​</a></h2>
<p>In this section, we will be building the scraper using the Crawlee for Python package. To learn more about Crawlee, check out their <a href="https://www.crawlee.dev/python/docs/quick-start" target="_blank" rel="noopener noreferrer">documentation</a>.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-inspecting-the-linkedin-job-search-page">1. Inspecting the LinkedIn job Search Page<a href="https://crawlee.dev/blog/linkedin-job-scraper-python#1-inspecting-the-linkedin-job-search-page" class="hash-link" aria-label="Direct link to 1. Inspecting the LinkedIn job Search Page" title="Direct link to 1. Inspecting the LinkedIn job Search Page" translate="no">​</a></h3>
<p>Open LinkedIn in your web browser and sign out from the website (if you already have an account logged in). You should see an interface like this.</p>
<p><img decoding="async" loading="lazy" alt="LinkedIn Homepage" src="https://crawlee.dev/assets/images/linkedin-homepage-8bec2b6a9ae97a18a7e49d4275c14cee.webp" width="1432" height="723" class="img_ev3q"></p>
<p>Navigate to the jobs section, search for a job and location of your choice, and copy the URL.</p>
<p><img decoding="async" loading="lazy" alt="LinkedIn Jobs Page" src="https://crawlee.dev/assets/images/linkedin-jobs-44e352d2233de5adb7af9838b75b9895.webp" width="1372" height="728" class="img_ev3q"></p>
<p>You should have something like this:</p>
<p><code>https://www.linkedin.com/jobs/search?keywords=Backend%20Developer&amp;location=Canada&amp;geoId=101174742&amp;trk=public_jobs_jobs-search-bar_search-submit&amp;position=1&amp;pageNum=0</code></p>
<p>We're going to focus on the search parameters, which are the part of the URL after <code>?</code>. The <code>keywords</code> and <code>location</code> parameters are the most important ones for us.</p>
<p>The job title the user supplies goes into the <code>keywords</code> parameter, while the location the user supplies goes into the <code>location</code> parameter. Lastly, the <code>geoId</code> parameter will be removed while we keep the other parameters constant.</p>
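<p>As a quick standalone check of this scheme, here is a minimal sketch that uses only the standard library. The helper name <code>build_search_url</code> is hypothetical, and passing <code>quote_via=quote</code> is a choice made here so that spaces are encoded as <code>%20</code>, matching the URL copied from the browser.</p>

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

BASE_URL = "https://www.linkedin.com/jobs/search"


def build_search_url(title, location):
    """Build a job search URL from user-supplied values (hypothetical helper)."""
    params = {
        "keywords": title,     # job title supplied by the user
        "location": location,  # location supplied by the user
        # The remaining parameters stay constant; geoId is simply left out.
        "trk": "public_jobs_jobs-search-bar_search-submit",
        "position": "1",
        "pageNum": "0",
    }
    # quote_via=quote encodes spaces as %20 instead of the default "+".
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


url = build_search_url("Backend Developer", "Canada")
query = parse_qs(urlsplit(url).query)
print(url)
print(query["keywords"], query["location"])
```

<p>Parsing the result back with <code>parse_qs</code> is a cheap way to confirm that the user input survived encoding intact and that <code>geoId</code> is gone.</p>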
<p>We are going to make changes to our <code>main.py</code> file. Copy and paste the code below into your <code>main.py</code> file.</p>
<div class="language-py codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-py codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">playwright_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> PlaywrightCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">routes </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> router</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> urllib</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">parse </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> urlencode</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> urljoin</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">title</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> location</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> data_name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    base_url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://www.linkedin.com/jobs/search"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># URL encode the parameters</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    params </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"keywords"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> title</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"location"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> location</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"trk"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"public_jobs_jobs-search-bar_search-submit"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"position"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"pageNum"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"0"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span 
class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    encoded_params </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> urlencode</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">params</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Encode parameters into a query string</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    query_string </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'?'</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">+</span><span class="token plain"> encoded_params</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Combine base URL with the encoded query string</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    encoded_url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> urljoin</span><span class="token punctuation" 
style="color:#393A34">(</span><span class="token plain">base_url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">""</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">+</span><span class="token plain"> query_string</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Initialize the crawler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        request_handler</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">router</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Run the crawler with the initial list of URLs</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">encoded_url</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Save the data in a CSV file</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    output_file </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:#e3116c">f"</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">data_name</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">.csv"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">export_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">output_file</span><span class="token 
punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
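<p>To see what the encoding step produces, here is a minimal standalone sketch. The <code>base_url</code> and <code>params</code> values below are hypothetical placeholders chosen for illustration, not necessarily the ones used in the project. Only the <code>urlencode</code> and <code>urljoin</code> calls mirror the code above.</p>

```python
from urllib.parse import urlencode, urljoin

# Hypothetical values for illustration only; the real base URL and
# parameters are the ones defined earlier in the scraper's main function.
base_url = "https://www.linkedin.com/jobs/search"
params = {"keywords": "backend developer", "location": "new york"}

# urlencode percent-encodes each key and value and joins the pairs;
# spaces become "+" by default.
query_string = "?" + urlencode(params)

# Joining with an empty string returns the base URL unchanged.
encoded_url = urljoin(base_url, "") + query_string

print(encoded_url)
```

<p>Note that spaces in the values are encoded as <code>+</code>, so free-form input such as a multi-word job title stays valid inside the query string.</p>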
<p>Now that the URL is encoded, the next step is to adjust the generated router to handle LinkedIn job postings.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-routing-your-crawler">2. Routing your crawler<a href="https://crawlee.dev/blog/linkedin-job-scraper-python#2-routing-your-crawler" class="hash-link" aria-label="Direct link to 2. Routing your crawler" title="Direct link to 2. Routing your crawler" translate="no">​</a></h3>
<p>We will use two handlers in the application:</p>
<ul>
<li class=""><strong>Default handler</strong>: the <code>default_handler</code> handles the start URL.</li>
<li class=""><strong>Job listing</strong>: the <code>job_listing</code> handler extracts the individual job details.</li>
</ul>
<p>The Playwright crawler will crawl the job search results page and extract the links to all job postings on that page.</p>
<p><img decoding="async" loading="lazy" alt="Identifying elements" src="https://crawlee.dev/assets/images/elements-a634b50a7ad31ae15db61e1a06f5125e.webp" width="1435" height="727" class="img_ev3q"></p>
<p>When you inspect the page, you will find that the job posting links sit inside an unordered list (<code>ul</code>) with the class <code>jobs-search__results-list</code>. We will extract the links using a Playwright locator and enqueue them with the <code>job_listing</code> label so that they are routed to the <code>job_listing</code> handler.</p>
<div class="language-py codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-py codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">router </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Router</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">default_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span 
class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Default request handler."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic">#select all the links for the job posting on the page</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    hrefs </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">locator</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'ul.jobs-search__results-list a'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">evaluate_all</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"links =&gt; links.map(link =&gt; link.href)"</span><span class="token punctuation" style="color:#393A34">)</span><span 
class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic">#add all the links to the job listing route</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_requests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">Request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">rec</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'job_listing'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> rec </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> hrefs</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" 
style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Now that we have the job listings, the next step is to scrape their details.</p>
<p>We'll extract each job's title, company name, time of posting, and the link to the job post. Open your browser's dev tools to find the CSS selector for each element.</p>
<p><img decoding="async" loading="lazy" alt="Inspecting elements" src="https://crawlee.dev/assets/images/inspect-90f77b162804bd1163b16bb23b315ed8.webp" width="1440" height="733" class="img_ev3q"></p>
<p>After scraping each listing, we'll use a regular expression to strip whitespace characters (spaces and newlines) from the extracted text, then push the data to local storage using the <code>context.push_data</code> function.</p>
<div class="language-py codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-py codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'job_listing'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">listing_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" 
style="color:#e3116c">"""Handler for job listings."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">wait_for_load_state</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'load'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    job_title </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">locator</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'div.top-card-layout__entity-info h1.top-card-layout__title'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text_content</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    company_name  </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">locator</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'span.topcard__flavor a'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text_content</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    time_of_posting</span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">locator</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'div.topcard__flavor-row span.posted-time-ago__text'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token 
plain">text_content</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># we are making use of regex to remove special characters for the extracted texts</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'title'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> re</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sub</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">r'[\s\n]+'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> 
</span><span class="token string" style="color:#e3116c">''</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> job_title</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'Company name'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> re</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sub</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">r'[\s\n]+'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">''</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> company_name</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'Time of posting'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> re</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sub</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">r'[\s\n]+'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">''</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> 
time_of_posting</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'url'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">loaded_url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
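<p>Note that the pattern <code>[\s\n]+</code> with an empty replacement removes every whitespace character, including the spaces between words. If you would rather keep titles readable, one possible variation is sketched below. The helper name <code>clean_text</code> is hypothetical and not part of Crawlee or the original project. It also handles <code>None</code>, which Playwright's <code>text_content()</code> can return.</p>

```python
import re


def clean_text(value):
    """Collapse runs of whitespace into one space and trim both ends.

    `clean_text` is a hypothetical helper for illustration. A None input
    is treated as an empty string.
    """
    if value is None:
        return ""
    return re.sub(r"\s+", " ", value).strip()


print(clean_text("\n      Backend   Developer\n    "))  # prints "Backend Developer"
```

<p>With a helper like this, the handler could pass <code>clean_text(job_title)</code> to <code>push_data</code> in place of the inline <code>re.sub</code> call.</p>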
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-creating-your-application">3. Creating your application<a href="https://crawlee.dev/blog/linkedin-job-scraper-python#3-creating-your-application" class="hash-link" aria-label="Direct link to 3. Creating your application" title="Direct link to 3. Creating your application" translate="no">​</a></h2>
<p>For this project, we will use Streamlit to build the web application. Make sure <a href="https://docs.streamlit.io/get-started/installation" target="_blank" rel="noopener noreferrer">Streamlit</a> is installed in your global Python environment, then create a new file named <code>app.py</code> in your project directory.</p>
<div class="language-py codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-py codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> streamlit </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> st</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> subprocess</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Streamlit form for inputs</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">st</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">title</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"LinkedIn Job Scraper"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> st</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">form</span><span class="token punctuation" 
style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"scraper_form"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    title </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> st</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text_input</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"Job Title"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> value</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"backend developer"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    location </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> st</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text_input</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"Job Location"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> value</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"newyork"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    data_name </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> st</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">text_input</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"Output File Name"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> value</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"backend_jobs"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    submit_button </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> st</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">form_submit_button</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"Run Scraper"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> submit_button</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Run the scraping script with the form inputs</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">    command </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:#e3116c">f"""poetry run python -m linkedin-scraper --title "</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">title</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"  --location "</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">location</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">" --data_name "</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">data_name</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">" """</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> st</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">spinner</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"Crawling in progress..."</span><span class="token punctuation" style="color:#393A34">)</span><span 
class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">         </span><span class="token comment" style="color:#999988;font-style:italic"># Execute the command and display the results</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        result </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> subprocess</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">command</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> shell</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> capture_output</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> text</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        st</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">write</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"Script Output:"</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        st</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">result</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">stdout</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> result</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">returncode </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            st</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">success</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f"Data successfully saved in </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">data_name</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">.csv"</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">else</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            st</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">error</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f"Error: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">result</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">stderr</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>The Streamlit web application takes the user's input and uses Python's <code>subprocess</code> module to run the Crawlee scraping script.</p>
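<p>The snippet above runs the command with <code>shell=True</code>. As a minimal sketch of an alternative, the command can be built as a list of arguments, so that user input containing spaces or quotes is never interpreted by a shell. This assumes the scraper is started with <code>python -m &lt;package&gt;</code>; the package name <code>linkedin_scraper</code> and both helper names are placeholders, not part of the tutorial code.</p>

```python
import subprocess
import sys


def build_command(title: str, location: str, data_name: str,
                  package: str = "linkedin_scraper") -> list[str]:
    # `package` is a hypothetical name: use the package that holds your __main__.py.
    return [
        sys.executable, "-m", package,
        "--title", title,
        "--location", location,
        "--data_name", data_name,
    ]


def run_command(argv: list[str]) -> subprocess.CompletedProcess:
    # No shell=True: every list item reaches the script as exactly one argument.
    return subprocess.run(argv, capture_output=True, text=True)
```

<p>In the Streamlit app, <code>run_command(build_command(title, location, data_name))</code> would take the place of the shell call, and <code>result.stdout</code>, <code>result.stderr</code>, and <code>result.returncode</code> can be used exactly as before.</p>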
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="4-testing-your-app">4. Testing your app<a href="https://crawlee.dev/blog/linkedin-job-scraper-python#4-testing-your-app" class="hash-link" aria-label="Direct link to 4. Testing your app" title="Direct link to 4. Testing your app" translate="no">​</a></h2>
<p>Before we test the application, we need to make a small change to the <code>__main__</code> file so that it accepts the command-line arguments.</p>
<div class="language-py codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-py codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> argparse</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">main </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> main</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">get_args</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># ArgumentParser object to capture command-line arguments</span><span 
class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    parser </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> argparse</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">ArgumentParser</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">description</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"Crawl LinkedIn job listings"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Define the arguments</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    parser</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_argument</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"--title"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">type</span><span class="token operator" style="color:#393A34">=</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> required</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">,</span><span 
class="token plain"> </span><span class="token builtin">help</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"Job title"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    parser</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_argument</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"--location"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">type</span><span class="token operator" style="color:#393A34">=</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> required</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">help</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"Job location"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    parser</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_argument</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"--data_name"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">type</span><span class="token operator" style="color:#393A34">=</span><span class="token 
builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> required</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">help</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"Name for the output CSV file"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Parse the arguments</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> parser</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">parse_args</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    args </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> get_args</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Run the main function with the parsed command-line arguments</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">args</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">title</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> args</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">location</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> args</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">data_name</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
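<p>To check the argument handling without starting a crawl, <code>parse_args()</code> also accepts an explicit list of strings. The sketch below rebuilds the same parser in a hypothetical <code>build_parser()</code> helper (the helper name and the sample values are assumptions for illustration) and parses a sample argument list:</p>

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: the same three required arguments as get_args() above.
    parser = argparse.ArgumentParser(description="Crawl LinkedIn job listings")
    parser.add_argument("--title", type=str, required=True, help="Job title")
    parser.add_argument("--location", type=str, required=True, help="Job location")
    parser.add_argument("--data_name", type=str, required=True, help="Name for the output CSV file")
    return parser


# Passing a list makes argparse ignore sys.argv, which is handy for quick checks.
sample_args = build_parser().parse_args(
    ["--title", "Backend Developer", "--location", "New York", "--data_name", "backend_jobs"]
)
print(sample_args.title, sample_args.location, sample_args.data_name)
```

<p>Because all three arguments are declared with <code>required=True</code>, leaving one out makes <code>argparse</code> print a usage message and exit instead of starting the crawler with incomplete input.</p>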
<p>We will start the Streamlit application by running this command in the terminal:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">streamlit run app.py</span><br></div></code></pre></div></div>
<p>This is what the application should look like in the browser:</p>
<p><img decoding="async" loading="lazy" alt="Running scraper" src="https://crawlee.dev/assets/images/running-555ab15f009be751f516aabd99e6c574.webp" width="1607" height="730" class="img_ev3q"></p>
<p>You will get this interface showing you that the scraping has been completed:</p>
<p><img decoding="async" loading="lazy" alt="Filling input form" src="https://crawlee.dev/assets/images/form-774ee8d03c87acfc38d3012d38a9c4ce.webp" width="1602" height="725" class="img_ev3q"></p>
<p>To access the scraped data, go over to your project directory and open the CSV file.</p>
<p><img decoding="async" loading="lazy" alt="CSV file with all scraped LinkedIn jobs" src="https://crawlee.dev/assets/images/excel-23850449d4d74099a1264cd93ca8565b.webp" width="1482" height="725" class="img_ev3q"></p>
<p>You should have something like this as the output of your CSV file.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conclusion">Conclusion<a href="https://crawlee.dev/blog/linkedin-job-scraper-python#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>In this tutorial, we have learned how to build an application that can scrape job posting data from LinkedIn using Crawlee. Have fun building great scraping applications with Crawlee.</p>
<p>You can find the complete working crawler code in the <a href="https://github.com/Arindam200/LinkedIn_Scraping" target="_blank" rel="noopener noreferrer">GitHub repository</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Optimizing web scraping: Scraping auth data using JSDOM]]></title>
            <link>https://crawlee.dev/blog/scrape-using-jsdom</link>
            <guid>https://crawlee.dev/blog/scrape-using-jsdom</guid>
            <pubDate>Mon, 30 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to scrape using JSDOM, an alternative to Cheerio and browser-based scraping.]]></description>
            <content:encoded><![CDATA[<p>As scraping developers, we sometimes need to extract authentication data, such as temporary keys, to perform our tasks. However, it is not always that simple. Usually, the auth data sits in the HTML or in XHR network requests, but sometimes it is computed. In that case, we can either reverse-engineer the computation, which means spending a lot of time deobfuscating scripts, or run the JavaScript that computes it. Normally, we would use a browser for that, but that is expensive. Crawlee supports running a browser scraper and a Cheerio scraper in parallel, but that is very complex and expensive in terms of compute resources. JSDOM lets us run page JavaScript with fewer resources than a browser and only slightly more than Cheerio.</p>
<p>This article discusses an approach that we use in one of our Actors to obtain the authentication data that the TikTok Ads Creative Center web application generates in the browser, without actually running a browser, by using JSDOM instead.</p>
<p><img decoding="async" loading="lazy" alt="JSDOM based approach from scraping" src="data:image/webp;base64,UklGRoYlAABXRUJQVlA4IHolAAAwdwGdASqABIgCPpFIoEwlpCalIROpKNASCWlu8hrke3XPLgHb9D1Wds7uP6XZF5P/t89k/4y9YOjyKelv5OR1wufOzmP5V+Z/rP7afDhc37F+Dvl1Ym/Ef2383dR/zBv0Z/33WF8wH7U+r35xPqAf2vqT/QA8un2R/3M9EP//6076W/3vrK81nzvcRey/PmfHth4ATxe0OwP8JP5HWyzxX/B5Uf23fH/vGHpLm3ht0692Id2Me7EO7qpM1Mx7EXjVX3Z+P0sNbmhaq+7Px+lhrc0LVSiIMc5QmllZTyQ6/UOl+HeivFrbDtSRugEP6BUI/Io7XtSExQKhH5FHa9qQmKBT9v3rKxJiZwlqpq2aTj/S243PsvniQmt2Me7EO7GPdiHdjHuxDuyVHuVr//+RRytVe5Xp42cEzTTpdQSRoMelzbw26de7EO7GPdiHdjHwUEH5RB5AgN7KtoB81ZAWCumRCeI3xh7nviHdjHuxDuxj3Yh3Yx7sQ79q6AzCmFssXeWHQihJQMLSDfUzVv8EH1Ps7sQ7sY92Id2Me7EO7GPdkIZfaen6Z1VcL5TTSj0cqLr1u+4OtYk6ViED5t4bdOvdiHdjHuxDvm1QJX5PR8lvcQmLDQlSVaSOyXUOS/TMD5t4bdOvdiHdjHuxDuxldSJ2EKAemmKo1uaCyZt46Gb5NF6sqSMPXuxDuxj3Yh3Yx7sQ7sY9/J2udY4xJydhkoMkml2RqBya2890xTSMqaTuxDuxj3Yh3Yx7sQ7sZVJ3XYB/hj4x16Cm99+WZUQdKpzXUP2cG6x2GPjHYc9lFTmMAcXVRZfYQeZXe1sTviHc30Wxu5gV6mbU609u7GV4boVQJeArWbeG3Ttfr1TZeVXjT4U9hRTGGyBJIoxrwpJKpweRHnFxFeQTf8w8he8rkWTNH5sca9IyHPYvJF0152hdyDaZvwU043TndjHofS90evVmMIWlSxyvnQ3HQWYTB3HQWf8oPF/K2CDePonCcndiH0L+/RR2NymLtMHOM5iJSX+iSgVYOHOzVVuGDXqwnw5Kex0udSaJPKhnatlyRm+zKn2d2G7FzF6m2YX9R7gmf5n8dePVSRoF92txlovLwtoxJ77P2Vtw6Bnr792IfRGc24UVNve6j3cXAxcJSdl6VK/+uvkGZD4O6SMPXuw4XOEJoc19BltrDUhMUCoQrEn8cT/Okrh+SYbdP8JsZ42ar/8kKP/tHILN/HizKI0b3Nz3xDue/jGqxXt1/kN/29L18Tc2JBXfOt1P4g9nUSgY5ht0/yjoDtBWxvM2sbdRoOElJylR03Av7+lUbO87rP92IdX36n00Hqj/TrpJ3Be5xiKftRbGHW97wIGOYbdP9YxipMOa3+D7sCoaigpQRxBxBNBCt7zEpICIQPlfJEgIjLd3UxVAmhlyqWYHGaSSZQxn5DDjJe7TBigXjnr/Pv0rEINL3QvSfoRmlBJ6uD/9150ca7asaQOFeBJIGkz6M4d+J7Ew26cm8oYI8NNLq6oIy8PmiT5pPg2NNaKBKC9Em+ffpWIQapo9KHsuGt/lRrxFyQCL9+zkp1MsiIw/WSSWcsRzmJPMHLDJvlH2nrb4pmwsSoXtD7TimKNyssrKaQf5vFMgaRK5gPmw0hDFUpb5BBUiaLyf7yYEk569D2mT+QHiHdrat3zD6LEuihoQVvkC++dkNjpZ8xmVFIC4dSjR5FnALGaFnjLnh48CBmJ+0p+yvSZ8aE0q4rIX7Kyt8j4xCQL9bjhgLVVIX+2ha9CFYx342uQKS93RSV+3Tr3YlFVabmIJRU9643pKXWswn9vuaznftXMRbdS2P0rA1Fmag1SgoQUsGizXtsvHqz3E2w6BYZf
JwUaQhiqSQpjRhUUpJk6KIapxeeEHn4Nq+Bj0RFBEPte7opLQUgQPm+HmBPyguBszs23OXDabCM6AR9iqpK7fTKb1fq05zou1NLEeW9BvMshQFgg26HeopgJ1pkhkaAIBddmwa2IQL9ekhttDlwnUn6lR3VQIBJ3RMUj0XuzBufWvwzrKmobJaNW5dIbcNRWlemrWNjFZCCAFwZmi2SSwbcZ4nDSZsH2cJjMjJ70m+KJYuzjyD7YMarAJrG2Zl9FgfEwelUcJEULEukZz4WjqATFJ6UkCWIghjHVUOo4AGZc5DzQIJkY0s5BMyfnp/ZTP/DhDNww/ar7oCnfwLnouFAuJ9qpfESiusYDedgzq5A1Oj4R/+vbFxvgkD5uf2jjmBt3Wf7ryeaj7QqOnKykB0o/mBKz+6cGgOF1nDBOCsW5sxBcmD9hvwV6S0yBA72Cu8i8csiV0e2a5aq7+K7EO7IAjwCT/z2gW354UuFWvbCrky5qSFV92fz7bksyxXIt6Y8fe00B+yR8nQnTI5LioG+hUxhF6iz4g7qiXOWZBCnKu0Nd0x7tdRJzUkvS7wQesdJtrcrKNy3OwPpJP9Q/XmxK48psHdgf9G9UoY1bAQbNXOnBbeftaiWo94GPRxZfiobF+VRa9UWZTAcW95Qu3fJIbYXKsFStMOOUEqS7F9SXMxnsOQLM+f6ifNQb01gjMqy/V6o7Z1HPKF3Y9hYkYbhs/jhXNwToPGIryKeaku/9LlPUD/nWgvhsNRzhA+bDPQSPvNxhKWiui6oBdGffmgl+ejplEWb9ogEw0edwPBhp+3214zZM1iWo/IY3WnYZ4bi2/Nt3BgpSYFo9BSKBnZ3xzXp1YZ7iI7Lrp4BU0v0HEDVv8j0fn/n5IWquhoKnZ/QY1Q0IEuWP4FzcxzYIOYrEiAblenlIt0pchSkomdSaJP92txlqk1XLCF/iSocc7oEm9LcAY9UZsepi9s+qymtS3eRN1QNKChNZ/vhTrU5t3XpRDCXnAVQklQsDYTd2OBcbckhm7+dlQxRWc8gyNq8J7Mp5DVcVrNt50MxZNB/QDMzZPEPHOATnAOE5iphg4c7SW6XK7yljOJN2QonixK09Q/m1gj1V8WITGLPqq7lg7ROYa1yNLs7I3swpdpa7QZRpCoUbKN/0YCfikbvfnCpVOKItMzKHU/d7OlfzpSULSdZIVMTYT8cAvN4FcKGQCEelgbZxvC6PTMSyHX3Lf+hHXwzH5xki2XsbJ2fB54SanofQ9+Vx4Xm8n3Dqivym9ewr9wceYzOQselcvpZk5P2xp3z0WrhUQrHlJBnRhjf4v+dDviKUT0G2fKWd+IxH001O3N38cpowikOEWxqHAPVdnmLZa/3Oxm7HFTT+Yl+SyYWuy12jhwzAQK3NC1WGvEa5ZL7u5eOjinRkbIYzP1MoCPWp+ifZdyc+dmDm1F9DqI1S955dgBb08sNxpVm2Kuu7l2YzNp8573sRlV27fi3PGKeq0B4lYWd12IMmrQ4LceRgdLLvG9qr1kGX0sNQt3lflHrarYhx2MwqQ5nmrc+nhM/PK7vnBnkZdVmGllOMzQhEp3AmUDMOngPIYP7Upmkc8ky8yBPi+i21mZ/A4Euos0XHnU44+7PzWUyGVPEjK2zaOJa9mA6NhIgeKkI8c6xfuV0RCh00UUseY8C65NXkP7QzH6M7ta9OTqtv7jwrG2Y0C+GHG3F0wX5N+slhBAO92Id2vole9+wzoxP7GN7uWG/xQK9+FDEa7l9J44r0WcMcHHpUrOM3di1z6IF0dKsKSDsO5PKmzcrIwBoi9xNPg8poRRUbY13nZ0rW7pV48xxai0JnqHNNSPWodFDEoZaaFA0bfvhjpj0hH+q0gQjJzvjMoBdIEDUWp+Xae1EYROK+CcMxhbpmSJ9787oR7UsW8z1VZvrt4dAaVkMb30qVHuSW4hCA33LfVQscWL3wZeLSJKoRND0b2qStKuWa/fiY/SpP8B9osXO+
tmgJ6AEqVCG4shzfJdos392fb13MPz4DlmojevdiHdq3JKMVshhVEHLIsALp4D50t/LEN+7sc3WFn/JpEkndiH0yGikF35Soeoj2pfhZpc6kjQLn6zivOyomdqXj1UkaBc/c1XPXqU9iBc6k0Sf7IAP70qt+Fw3yknAYA7LVKgOvTmhueEKfCy4Bt60SjxBgq/nyMoMIFeUqAIbvDD5idTJMUs6k2/U4apNwXa1R0vZcc/hbckkBk4CE5CEmNQTTn71Tx5eA1SqJvM2jQs8Th+R4xZ4nD8jxizxOH5HjFnicPyPGLPE4fkeMWeJw/I8Ys8Th+R4xZ4nD8jxizxOH5HjFnicPyPGLPE4fkeMWfk1JaYSL2sBIPU4lijrV83LuXubZbWNMbtYlJLyunBqwxhiXPhSHECu2B7Whe3OaN1B+AAAACgZSXguc4Z7wrczIZWB1dN4az8IAMYBTFGnsQFxatGRm3E+X5qKKmdtvbSFpnvAAAAAFAv7WwYxOZjj4JZ2xNprcvMngNbZC0yXcZbCWf+Ub36jieerr1o6RKjq4H/c4t9JUCBFW7PT4AAAABGc+sdqd/R2v/yRSMINAdmonAerozQP2B2tKdTLGRb6jWgUNFh7cFEjAFGqI6AKVcBAAAABDwZCJaZuPJuE+kF8Z8OA8b2KoIQmNZBKB21js8MxLaTXHWdQ4rASXyKrW3OBHWwdIAAAAFAUIX4IEmthvVUdZ19K2AbHS/1B0ThS65THEhBrfG05eiKuw9kj2PwpVeNkzgzqi3xSNiAAAACezDFfm01s9DrZ4vobHKFkWGP2DnB0SHl9B/C5qd8iMxRRJldNJw+IV7ifAPJmrY0oTKkUW3YROS0wAAAACe+wsV7E06xgCYcnN+dtfIsjMF+a5bBP4ANJshTFG4bJTYDIkv4M5UkIJgzEBXcHaIicsqAAAAAE9l9jzY8o5MT7/nINttrv5BV5BxXCYEJ1muPR8Eln+KvCWKefQ/L06OYOAAAAAise67IcdeSlQkU4pf8mnbi7Fn9q4zMy2sB005Zx71UVo2px4Fx3U6VUjFskwPToe2v66Bysmma/t4ndDrQWyZIJ7HTqsmcpmcpFkGoFig6aHacJg83flHnbbABkn3UAmueMH9Y00cW9lTLUfOY/YJsEIvnVcAdYJ8HlEIGfKE7GmtcRytQwd6SBo0/wR9zh0zeBvn8/A0K1nxq9njkTczwZ5Gsgpt379mp6QYzX+L/ZXyAoqJPP9DItNcsUTx1Q8cvUbzemTP0nkugv1WqyvTSWhlBWsdb2RIegwnl1iEz2i+ReqjjFop761Vrh3s0ixRuOrVjUAByHHe6lkvzRC3Z7sfxIdCNTw85wGXsV8LBvct4zP1ANxA36kCJtxd2oNdxoIs9+GSV0L/hU07CPm1eg1T/gWi5aPuz5QZjAnJjSZrPQuI2+14FE3EiVVK8o738CeuVsWYZV0G5uBAdKxks0CBjnTyTZfxbvjfl0ZRw489XB6RLPlmMA+eLREaM/jfbiWNSLS1QpqaFguiNcqLt+0mlwpY43mE4TMoB91tcSc0JynT1Fr/66aayPSu7MKlR6Sb+/f2Mtc+/enAcrnEqmtOt8xpPWcR2cwp7ItGs61dxpHie/OxEGnj+M/79bbj/pUEVZlSpqJhOKFxi3ideQ0goKlPy/DWQUTXzHnMEu7DxrobLcthlbblJm64fOshLTVyXrnJOWmdSG5fWItHNyAx6BjYReIiZJhcZviw/vkPNOVYMRjdUHFuPRByQpbGsfSf6ncUJi7oZKSMTU4lWKl4EXTWoaHAG7LmPb9V9d8B066Ap+OO4jpuwLBxajh+ngn3Trjm/pOy8fJchJT/JbcclJ2L7TeLeSbSxw8XG2IYnNu/gWtXA6fFtKMgMSvAStWv+0NO8l2/Oi7pEjTmPJBOPLUPuQKF/jDMn+iM/W09qEBe9Sxw/2ZcSi+FyC8fDw9gnkn0CRqwKcarOPf2EtfMpzC1SC9
7iVj+FquFqdQgacW51kXs+L+KvcxdHdiZ0oklhBtItEYFhMWtJ1ZSMbneoape4iSBh8Nl8n0XAwIzWE8bG9wY2yt6tCLFBbfoQE38A9KukJ0ueJpMRYmCcSlUXhAdX59oGlkxWRHZ72ilX6zOPE6JA/JshqGTqIxO/BDFckaFSQlCYX+ey0+u2tAIXc/IgbfF77X2256/s5xCJyylXXwDZMHbCW8fp8Lj/ZHyOXKka8GbjnpHkjFV3esGUsN3mDt5bOmc36OHS3dtzo9787ECrKVLjj2Xj/7/oFJSUNEi2S9vmZmJRT5D+dGI85GGVuQvrEbfaaCZGDUMD7ZAv3MIaxYxnYBn0ER+pLRYTFyLuSIY5fSWg4kCcnuupFX/i9MAEj93r1TrUQP3VCfK4EN3FpC3XNySvO+hxo1ehcSAFZrQGaH5MWQDKwyGtv+ljyhObmHMBcTj+LxW6iiQBxIHBVe7/fXCnLkngDsNvMMv7Ve7z0Ek6ZBBX7pNdXDOpByGkutTe9Sra/40XxiCIk+nPCcJPWbBTjYpr9ruFb5o1SuU0Qv5n8HYKv/nRd8MOrHI3/8qviuR73GgDBKD46Wq0dKGEhPe5cg2Cl+M17WRDitU7eI5nWPRGX2QZgkypLG4NGYHmxDCubFyhmABvF4rb8NdJ0fD6azOW/PtZ3tSkhrolI+gGXKCDUuZKwpNmZeLNR64tKsINJv5iYYp8XqeAwZQymTT9TcyquM+57bA8QOaZa8zvx3pI0VHF+LNrXex8R7yf2xGYEJlkEdY9aXGW2BC0qE/K/UrAoa0xmvGHDq5op/syk36LHKP2ybAjyvULOvYfA+4MkgsB+w4W4eVk+ZdY9TdBx8YRcDwoqCP3ngrBipb4ZbAl2HLZtKhXl2U3FABEiUcyUtvps9GuykLj1pAn45sZsymBkQ6Dzj1fAH8VSj+b0T7bCZpfKE3Tj7myMW5J9DuGdZ+QBR+UA01hqVrroOOgSRnEDyc9AhoqdbAVfv0A6b6ho2JPs4Hp3QjJvmaDH8XR1aibcTFe32+5LOpAi75dBPs5ljubeb3//TWtSLiPmlE6VBBGamhh41AZTEsNrYZWeHGDSGbw//LkD6lkp6QaK/4yG8Fl4voW+snkwonBoAIDKhjudqXFPLtBt0T+4lP5T+wZ/9mVeCq7CdS3syGmi383vVS3ast0nnpmp2OUYConOWp4a6TXtQEJQPk06Z1xPsvAMjNslqCn+y8numhXEpwL1z4fSAz1IcZ14C7GHnp82Rci4A0SNhqOXoH3DPRQZTOBxEUlb7I8XqAbz3bktt1BT9dkNcQ9P8oD3iqzXS4tyJo2Pj3WQ1RSl0f1XcsPNv/QFFw1oi7DRTzpqb/90wp5anJWIt5ZocLda6GN3d39ujh2X6/LY8fj5+q3sX4FYvW+oEcZ+Uv9Jg96dtbdJdqP582wDp869YIEgT+CxNCxNCxNC08a5zqvqU1vpVcY8JpdANT2/dDnLZuVaI31h26KszeFcmwJQYiqvnjuQdbLB6X8KnY9JsVYEikVA3MDZETgKGB2lE86NSdVwHokfN2p+RDrChCCj8v4zbu4ouQ+1NJ0Z2zIXc3ASoZ4SQz6sKAl0VgyPQMMXx2hpD+0b+R+KLLke1s8owlN4VYRE5iqCXx1PVBIv10eSFeY10Dc27kvNBHHPWdYSR8eZ5CNtjvh9/MNJ6u2YToBN/T1dOnX0VgPsO4yKHbaVSTWfJfU/JALE3dTnrNWLnWQz2WbMG4Wdt8cyO3m34Ah4nZBz19j/X5s36I3FWnhKQlBLE7RwcTFzi+huZoVvLELYjO8Z5lFp02279SCW9jD10ZXH0agtoxawa8T3Xsg/o5CHafdQv1FnEDqMwS/GaiADDipfU1wagstKLbhCeZrga8jVuZ0ZrVxY0WIb/RSoM14pmW8E6vZahRGg7oemlhGSFdC9HGuOljaYoaVXaiWKKzeGGeDE2dwmYZGDwG5OLB+YgWo9MiKom
Eqcmq+ykuvtQziB+JVAtgpc7czSq7O8L73B2Q9Fre8bC1sqsW5qzfHU2/10dJhcZnXM812+4Q4WAyvFPiPjFRs4MPkyGGwQ1bn08XmxEB6Cc1Yut4noTDL4EoCjYaNgWOzkF8gN8yEKiPNPLPls1j69bSTI61YlBrart2YHcKRqAN3loZOjzTJXY60+yrdO4oPMqNQKEFLY/3d+Udftrj8jV32V1PzNviY8IKUbCxgF/30KqHefjD/OhQ4Y3qeaVN8U3V9ced23AyPldtjn6rwt9+F/X/JvVj8z1EwNM/t7JVR91nSlejns609ZOBsilQGCDWbipid8RVKxgb0wADr1rcnWTKVcr0rObCAlUJQko7FugBsncvg7g1hYGyBPsQ7cCCOCWgtMOguCeU9rk2M2H+xnqW3cChfe9jFImpHcrREOmwf7Rl/hPldHY/1J8ol9y1UhCtNm1RYnFvC1YxQGk/tFxyGBrS86qwpcrKzYYYlt087ttysxy2QSbNUBESAEk+YNAvfDD2FddBUC1xFnARYI92nHS+lcyexLkTIwhYR/MPf4EHbI472EluNATMZNbxaszNlchIW+TQu7BlD3SLvkIX0UVRAUCp0lbt33zzNgUD2vXSTzejc3GFtdwdtpyXs3fPZvj4CSDwA30LTMRZH1HKQoMDKTHmn1s8TRhe+3bYTvFwl+rpbHDr42GhjZWGhkXsw6+XAkOuHt+863o+nMY7yEI6cMVLPJ8xd/pg/er8aAgC24pfTwHBFmOQbMw+gTeeU7kXx4dWEBS5vcBHr4nUd1gAg1BwAQf3skpo9EKRUjakytqwPWeNN8fl4gS1nAP6UnxKN+t6zOYaJkf4Ins7OcAJljs6VU0/UlXXBK5OS7a9ept6JYorNtheoez6IjQ49NSlRET9POE59rEcvbjlkWeAwuaQ0gYQuCBq0LsFspq7XKPhLnbmE1Ie0qzc4m1TTOX6QJ8Et9HGXBujGoN5DeZAQ+AfT8CJWYtGReUGf6wAPUb4N0negvGJoHFfEpeTNWCeMbAhTvMqcbkPzEAUEpF9AkwsR9BFOnzoZD8QfKG0BvHHGwc93fGLCIz0M+hzqC4D+AeDbq0Knzb771+FLhOgRT38DVYbznTIGLRbSSlEcLfvimOqJIpRCLwcUi4Adh9jyVnPUO9p/ZZphvtZ0wqPN8OGXqPohtMn4v6WawiTy5LW9TLoxKmbK7RTcP/e5xPfaoAwOuEgJMWNFgEJx/emx6kT/8sXxfBk+rqoMonvmynxkXvbq5Jighwe7STRAoqJagdBQoLQdyUjisWNHpzLUWKMUypATAgNvenjVdGP9XL4Ihui5IArtRT1L7Z/rkiW1Sqdz56+9v2ik7mtgRNeyyRUPa6VTcefdbxIi9BRsoHx+yRsfJp55pdjXJU4tfcAb8QvaT1ibKNupF10KlvbAteUVrO7todbx/qPp7Xdp2gCuu273OVENYbhdDC4jJBfe7kWVcWHn/sdN9QEu51NYYtUqLqs0FnRAsAMU/FgyYGr2I9ixne5VpWEsEzEwt9BX8uNmcKSGvL/iVKoc5/d+WQi6ToX2KM1eLoJG8z5hG8odHMAALXr3I5vwXviu+7ZeReJpb/BKiJeX0/Wozv5u3HyeGzSov3UxgsDJj8vqIpi8eSlllb8ObgXa3mNCX4y1GiZ5kL4ve8IN/4DgRQSis2ujmhHKEHeeL32jzopGde0AuYX/GeQtqLuvo9PORQm4q6JZt6cIoMC2yGx3WpkXR6OGybNeX8hxeUooojNzeTZfTDoNRLfitCMtukF4gsvMXTVcemhUkdddPDyBICua6Vdcrw7BIT5J1fSsD66eCDHnM5cQQoveFUNnOeiGVG4338+IX1Dmn05vw1Sa+O5fK2V7V4p3UDIlm0VdgeeaLB4dqBer6mCOXeHuz+uBQfeBcfLK4DWaZfhI+jfQ1ZSJ8Db4h5zu/j5j3NGpwTDAUeSeY/HY8L8vmyQsTqZrH2kS4d
SQGzUltko4sKs0zeUTqA3ZjOVkoXrnasKNWxbm5gs1W/10tiX7s+dU0VQwSd1SBDDdhHH89VJFChvw2gj7iYF/fjFnCTdMNi962mdTe+5282iskn/SvDUPXfsqQbg8suRWdBUvK/WvTbI8qcLuPx/gUI/G1bUAPWeQp/4iGIpdikW9hJ5RxVhgUPdBcXz7Yi8mYl8Irc1T2OkJP0nQGI+moI0c9pWBoG3NhIEVVqV8Z8h6/wEM4Vxhq7AxpOm2URCLwVsFA7vBhXegJDXwx8AOAfloXAYRAUnd/94bY0U82hH5EwnQVvFtzlclIWT8YBieFHc+eV9C/a3xnjT4DCEuuHUR/BEjOVvT/2dznReyGOL11jFYIJzOwkRNoiUIF5UbnBe4258Vc93Cl/YWYa3pS1g0OHrobaOWESgvAQmCq3Ug+c6Zk8aNNN6wcsUP+nW8z4OZaYR3mrMcIXsIrKl4NPP+MuyqS5dxRsVJiV5JCkZgjeSxnGNBHSbr2Y8ExR3W1Yl5vAGY3i3O75KyZi5R8VUtNGjG86JlhxebOKsmHcjBRlwxwsY54s4eMLxn6EO4L7QuOsOL/HB9f6CJ3W/LjAX8pSMkOQ7BuKMy1eSRBfQsavOHMz0OAZbeSi1aLSFGo4g08tPL3uIRswxqVPI/p75nE/PGps/ItImobbbQbfdNYmp5CE6NaGbf0bGqzaJmzGZ+Ty9C1i9RTmlTHGtiypNeWbpQg8hQM/gu+xWu8+IzdXyqR+SrIN50xFANydpLWxL5Bp03Nh/NiZtIJ+ULz7q1BisqyQ1v0WG2ZWnn5zrmm7TWlKQVeBDoCjZjI0RATnBazmbQlCJEsiFBSktaPdnNlkFjwjagJdAEiulancMu67iosEnDprowzKoc8FZHARaqXdHgwzaAcVEyRgJD+o5Ncdi5UU+yKjhuhFjGwfAAzzmsGfvnOWGOUGdeF38RTDNpD+GYlyjXaPj3ZKHN9N2OSjDoMKm+DIm0R7ElCtHcPgTQhaXm5YuBwVfgY4J2NxpOy7IWrQzqED3zXMDhlso9OFUdLVLP2yii+tQ4IIxE1ddKFXpUealUbNTjMrm9kjeWRi1TX9E4mCrQfexWa9nqgdI98nHp6+9sx+ZEZDkLzzTFNkhUUOzSWlNGhc02H87G1ZvOE5WN/AwutrCAFmVM+917ex+ilY7sqFTK6Y8psgLiChJiqxh+TRi/XGDVdK1cT0Td5Aggu+wmMJPlIAxCfWhFjxJCgIz/v938tSqygvglsZ6maEeWYyGCg8OwqY459ATfDLA1vaUNg5aH8JUKo4w4+GvuEu9C1lJjrjQaW1g6Fa2U9jdtMTPyxNUxEunpCgk4KdA9teViVBhVJ1kMDP1tOMNDUHvVO6WlUlSw9O7ofsg+H8/wdFGVieKOjHkEpClyjlc8HxUre66KyrX1+NaoS9Eskg94XASK2+fT7wsQbC81P3P/zaSsTMM2dCzvRcRYuYHqakiHnZwnB1x70qEJaazTBGCbEFPJ/KGojkE1DlSI2xBkQF7COEMpBP26WnW/f63impb483ZLZSJGJ10FOQ5hFh7AXrDgrwUiEOiT5OoNyVRvkUKmR1cgOeQyFY9REftEZEYqtPJurl0LzekQV/k71/TUuPW7z7RMTAbZGqB93pru1wl+lqTrQmPeRxRxNcxsJQuXjjlqa8CEcxEWEhPHCWm2dJwlptmcskqythSYCYmx2ouPLXfy2Z2EVU0d6Lb3kMVl5AWce0eE3/ou4Ir7sW8okWwRKqBUbYF/AXrIqJzJT6s/7rBSFJb2o5RyGMxlihNHOc89REmYTdSs19CtC9PlEfXXPJBkLDo+NdaSMZf/v6UG4FqyRyKGtjEHtYA/6Md5r1bHys/+2V7INVo8UBqC6Ebnqpi6xCB9wXq6EGOhadcdOjISVdqzovg93PVPA3C/kjhVNcprG+qfVVMSsp5+F69HPsEABj0wBgZQ0kAFNWwmlB/oR5Kc4Yck2pjKYc
PYRJiK9uYI/rsa8WeuUMsY8OowF5VdjuEYJ0peImt1VTBmbs1n3Z0gQfAAzgvA/qAAez+1j57asGP3wSXFokIsT6y6f/YOeaKlgPaFQPDx6Ptf4BVjH0mGfLqJN0UK2Gx3fyfyPlGYZBgUHTSdNUcgUY7Gr/K7NI4Xs5vMmG6cptvsQjWqpSqcHsAKDEAT9rMFlUu/N8xyyMgkHcADe7Pl18uuGi/7sbj1fJU+o+eAibl2YZkylocZCWrBQe/h10Nl2Kv2kYiRSeMK/vbp+hVtNuOy20D0yBR3LyIyIrya0JL2zo9cDhHPDBic92UfoytHDTFFAo7v6LOZgwHzmz8iAJOR9MG/qHjdFWRqqH9zBap7bZKfuRm3KpEOcIoLaS672lVzwEzMC75ABdEjtg2OjUQ4SYpcC+1azsFAeAJMK2Zi1T7iNXB1LgxcyblWAwiNL6gLouXLMxlj+1r2wr+MGDFGvWLwyJ+obzAYRBhO7rqrkt6+dcaHBpwdjvFvAYR6wg4Ciuj5+aGlOmfYCFweMWKRgMBhHrCLMnFLayVGqhQnhvAeHJ0fGS9FGhZEITOBWCfrGr7sDLhdpI8YBHFNd5IMwLyoZ7AgdEeGoGbvbQCvU7bzhHJ8joc7pfD/qcyIFoJ/4ADzmBXuPOXlt580YtHtyK864KXro/ogKyzC3GbNw9uKfCprIStLLKgVoxuTk6f2BQYPOToPpxSy0Z2g2qjA9fgtJZmUSQUDnpR9s1pNwq3To11J4W2gzzEWzVfAccvB0E77GGBAC/QAAAABYtF5UgQd+5iBPX/r7w7aRkWAzqRkHkiDj+cAAAAbEw3Sxc55BkJ/yQCl8XZiiu9tnkC6Qldg21zcIyev7dv8nwoPQMSjYChAA89sfRhAZD1PEJE/mFwQJQhYTCK6aBfHtU04PwEAAA=" width="1152" height="648" class="img_ev3q"></p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="analyzing-the-website">Analyzing the website<a href="https://crawlee.dev/blog/scrape-using-jsdom#analyzing-the-website" class="hash-link" aria-label="Direct link to Analyzing the website" title="Direct link to Analyzing the website" translate="no">​</a></h2>
<p>When you visit this URL:</p>
<p><code>https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pc/en</code></p>
<p>You will see a list of hashtags with their live ranking, number of posts, trend chart, creators, and analytics. You can also filter by industry, set the time period, and use a checkbox to show only the trends that are new to the top 100.</p>
<p><img decoding="async" loading="lazy" alt="tiktok-trends" src="https://crawlee.dev/assets/images/tiktok-trends-1b92bf04848ae6c440eb1e9fabb55a41.webp" width="2408" height="1314" class="img_ev3q"></p>
<p>Our goal here is to extract the top 100 hashtags from the list with the given filters.</p>
<p>There are two possible approaches: using <a href="https://crawlee.dev/js/docs/guides/cheerio-crawler-guide"><code>CheerioCrawler</code></a> or browser-based scraping. Cheerio gives results faster but does not work with JavaScript-rendered websites.</p>
<p>Cheerio is not the best option here because the <a href="https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en" target="_blank" rel="noopener noreferrer">Creative Center</a> is a web application whose data comes from an API, so we could only get the hashtags initially present in the HTML, not all 100 that we need.</p>
<p>The second approach is to use libraries like Puppeteer or Playwright for browser-based scraping and automate the page to collect all of the hashtags, but in our experience that takes a lot of time for such a small task.</p>
<p>This is where the new approach comes in. We developed it to make the process much more efficient than browser-based scraping and very close to CheerioCrawler-based crawling.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="jsdom-approach">JSDOM Approach<a href="https://crawlee.dev/blog/scrape-using-jsdom#jsdom-approach" class="hash-link" aria-label="Direct link to JSDOM Approach" title="Direct link to JSDOM Approach" translate="no">​</a></h2>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>Before diving deep into this approach, I would like to give credit to <a href="https://apify.com/alexey" target="_blank" rel="noopener noreferrer">Alexey Udovydchenko</a>, Web Automation Engineer at Apify, for developing this approach. Kudos to him!</p></div></div>
<p>In this approach, we are going to make API calls to <code>https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list</code> to get the required data.</p>
<p>Before making calls to this API, we need a few required headers (the auth data), so we will first make a call to <code>https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en</code>.</p>
<p>We will start by creating a function that creates the URL for the API call, makes the call, and gets the data.</p>
<div class="language-js codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-js codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword module" style="color:#00009f">export</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token function-variable function" style="color:#d73a49">createStartUrls</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token parameter">input</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token arrow operator" style="color:#393A34">=&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        days </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'7'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        country </span><span class="token operator" style="color:#393A34">=</span><span 
class="token plain"> </span><span class="token string" style="color:#e3116c">''</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        resultsLimit </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">100</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        industry </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">''</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        isNewToTop100</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> input</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> filterBy </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> isNewToTop100 </span><span class="token operator" style="color:#393A34">?</span><span class="token plain"> </span><span 
class="token string" style="color:#e3116c">'new_on_board'</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">''</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword control-flow" style="color:#00009f">return</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token literal-property property" style="color:#36acaa">url</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token template-string string" style="color:#e3116c">https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list?page=1&amp;limit=50&amp;period=</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">${</span><span class="token template-string interpolation">days</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">}</span><span class="token template-string string" style="color:#e3116c">&amp;country_code=</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">${</span><span class="token template-string interpolation">country</span><span class="token template-string interpolation 
interpolation-punctuation punctuation" style="color:#393A34">}</span><span class="token template-string string" style="color:#e3116c">&amp;filter_by=</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">${</span><span class="token template-string interpolation">filterBy</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">}</span><span class="token template-string string" style="color:#e3116c">&amp;sort_by=popular&amp;industry_id=</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">${</span><span class="token template-string interpolation">industry</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">}</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token literal-property property" style="color:#36acaa">headers</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token comment" style="color:#999988;font-style:italic">// required headers</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token 
plain">            </span><span class="token literal-property property" style="color:#36acaa">userData</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> resultsLimit </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
<p>The function above builds the start URL for the API call from the parameters we talked about earlier. Once the URL is built from these parameters, the crawler calls the <code>creative_radar_api</code> and fetches the results.</p>
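<p>As a side note, the same query string can be assembled with the standard <code>URL</code> and <code>URLSearchParams</code> classes, which take care of encoding. The sketch below is only an illustration: <code>buildHashtagListUrl</code> and its <code>page</code> and <code>limit</code> options are hypothetical names that are not part of the original code, which hardcodes <code>page=1</code> and <code>limit=50</code>.</p>

```javascript
// Hypothetical helper (not from the original project): builds the hashtag
// list API URL with URLSearchParams instead of a template string.
const API_BASE = 'https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list';

const buildHashtagListUrl = ({
    days = '7',
    country = '',
    industry = '',
    isNewToTop100 = false,
    page = 1, // assumption: exposed here only for illustration
    limit = 50, // assumption: exposed here only for illustration
} = {}) => {
    const url = new URL(API_BASE);
    // Same parameter names and order as in createStartUrls above.
    url.search = new URLSearchParams({
        page: String(page),
        limit: String(limit),
        period: days,
        country_code: country,
        filter_by: isNewToTop100 ? 'new_on_board' : '',
        sort_by: 'popular',
        industry_id: industry,
    }).toString();
    return url.toString();
};

console.log(buildHashtagListUrl({ country: 'US', isNewToTop100: true }));
```

<p>Empty filters are still sent as empty values (for example <code>industry_id=</code>), which matches what the template string in <code>createStartUrls</code> produces.</p>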
<p>But it won’t work until we get the headers. So, let’s create a function that will first create a session using <code>sessionPool</code> and <code>proxyConfiguration</code>.</p>
<div class="language-js codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-js codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword module" style="color:#00009f">export</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token function-variable function" style="color:#d73a49">createSessionFunction</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token parameter">sessionPool</span><span class="token parameter punctuation" style="color:#393A34">,</span><span class="token parameter"></span><br></div><div class="token-line" style="color:#393A34"><span class="token parameter">    proxyConfiguration</span><span class="token parameter punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token arrow operator" style="color:#393A34">=&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span 
class="token plain"> proxyUrl </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword control-flow" style="color:#00009f">await</span><span class="token plain"> proxyConfiguration</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">newUrl</span><span class="token punctuation" style="color:#393A34">(</span><span class="token known-class-name class-name">Math</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">random</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">toString</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en'</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" 
style="color:#999988;font-style:italic">// need url with data to generate token</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> response </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword control-flow" style="color:#00009f">await</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">gotScraping</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> proxyUrl </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> headers </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword control-flow" style="color:#00009f">await</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">getApiUrlWithVerificationToken</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">body</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" 
style="color:#d73a49">toString</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword control-flow" style="color:#00009f">if</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token operator" style="color:#393A34">!</span><span class="token plain">headers</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword control-flow" style="color:#00009f">throw</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">new</span><span class="token plain"> </span><span class="token class-name">Error</span><span class="token punctuation" style="color:#393A34">(</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token template-string string" style="color:#e3116c">Token generation blocked</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token template-string string" style="color:#e3116c">Generated API verification headers</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token known-class-name class-name">Object</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">values</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">headers</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword control-flow" style="color:#00009f">return</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">new</span><span class="token plain"> </span><span class="token class-name">Session</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" 
style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token literal-property property" style="color:#36acaa">userData</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            headers</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        sessionPool</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
<p>In this function, the main goal is to call <code>https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en</code> and get headers in return. To get the headers, we use the <code>getApiUrlWithVerificationToken</code> function.</p>
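<p>The session now carries the headers in <code>userData</code>, so they still have to be attached to each API request. How that wiring is done is not shown at this point, so the following is only a minimal sketch of one way to do it. <code>applySessionHeaders</code> is a hypothetical helper, and the plain objects stand in for the crawler's real request and session with placeholder header values.</p>

```javascript
// Hypothetical helper (an assumption, not the original implementation):
// merges the verification headers stored on a session into a request.
const applySessionHeaders = (request, session) => {
    const sessionHeaders = session?.userData?.headers;
    if (!sessionHeaders) {
        throw new Error('Session has no verification headers');
    }
    // Session headers win over request headers with the same name.
    request.headers = { ...request.headers, ...sessionHeaders };
    return request;
};

// Plain objects standing in for the crawler's request and session;
// the header values are placeholders.
const session = {
    userData: {
        headers: { 'anonymous-user-id': 'id', timestamp: '0', 'user-sign': 'sig' },
    },
};
const request = {
    url: 'https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list',
    headers: { accept: 'application/json' },
};

applySessionHeaders(request, session);
console.log(Object.keys(request.headers));
```

<p>In a crawler, a helper like this could be called from a hook that runs before each request, so that a retired session automatically brings a fresh set of headers with its replacement.</p>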
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>Before going ahead, I want to mention that Crawlee natively supports JSDOM through the <a href="https://crawlee.dev/js/api/jsdom-crawler">JSDOM Crawler</a>. It provides a framework for crawling web pages in parallel using plain HTTP requests and the jsdom DOM implementation. Because it downloads pages with raw HTTP requests, it is very fast and efficient on data bandwidth.</p></div></div>
<p>Let’s see how we are going to create the <code>getApiUrlWithVerificationToken</code> function:</p>
<div class="language-js codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-js codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token function-variable function" style="color:#d73a49">getApiUrlWithVerificationToken</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token parameter">body</span><span class="token parameter punctuation" style="color:#393A34">,</span><span class="token parameter"> url</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token arrow operator" style="color:#393A34">=&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token template-string string" style="color:#e3116c">Getting API session</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" 
style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> virtualConsole </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">new</span><span class="token plain"> </span><span class="token class-name">VirtualConsole</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> </span><span class="token dom variable" style="color:#36acaa">window</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">new</span><span class="token plain"> </span><span class="token class-name">JSDOM</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">body</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token 
plain">        </span><span class="token literal-property property" style="color:#36acaa">contentType</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'text/html'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token literal-property property" style="color:#36acaa">runScripts</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'dangerously'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token literal-property property" style="color:#36acaa">resources</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'usable'</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">||</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">new</span><span class="token plain"> </span><span class="token class-name">CustomResourceLoader</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic">// ^ 'usable' faster than custom and works without canvas</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        
</span><span class="token literal-property property" style="color:#36acaa">pretendToBeVisual</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">false</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        virtualConsole</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    virtualConsole</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">on</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'error'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token arrow operator" style="color:#393A34">=&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic">// ignore errors caused by the fake XMLHttpRequest</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> apiHeaderKeys </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'anonymous-user-id'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'timestamp'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'user-sign'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> apiValues </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">let</span><span class="token 
plain"> retries </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">10</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic">// api calls made outside of fetch, hack below is to get URL without actual call</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token dom variable" style="color:#36acaa">window</span><span class="token punctuation" style="color:#393A34">.</span><span class="token class-name">XMLHttpRequest</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">prototype</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method-variable function-variable method function property-access" style="color:#d73a49">setRequestHeader</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token parameter">name</span><span class="token parameter punctuation" style="color:#393A34">,</span><span class="token parameter"> value</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token arrow operator" style="color:#393A34">=&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword control-flow" style="color:#00009f">if</span><span class="token plain"> </span><span class="token 
punctuation" style="color:#393A34">(</span><span class="token plain">apiHeaderKeys</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">includes</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">name</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            apiValues</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">name</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> value</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword control-flow" style="color:#00009f">if</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token known-class-name class-name">Object</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">values</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">apiValues</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span 
class="token property-access">length</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">===</span><span class="token plain"> apiHeaderKeys</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">length</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            retries </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token dom variable" style="color:#36acaa">window</span><span class="token punctuation" style="color:#393A34">.</span><span class="token class-name">XMLHttpRequest</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">prototype</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method-variable function-variable method function property-access" style="color:#d73a49">open</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token 
punctuation" style="color:#393A34">(</span><span class="token parameter">method</span><span class="token parameter punctuation" style="color:#393A34">,</span><span class="token parameter"> urlToOpen</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token arrow operator" style="color:#393A34">=&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword control-flow" style="color:#00009f">if</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'static'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'scontent'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">find</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">(</span><span class="token parameter">x</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token arrow operator" style="color:#393A34">=&gt;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                urlToOpen</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" 
style="color:#d73a49">startsWith</span><span class="token punctuation" style="color:#393A34">(</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token template-string string" style="color:#e3116c">https://</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">${</span><span class="token template-string interpolation">x</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">}</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">debug</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'urlToOpen'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> urlToOpen</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" 
style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword control-flow" style="color:#00009f">do</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword control-flow" style="color:#00009f">await</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">sleep</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">4000</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        retries</span><span class="token operator" style="color:#393A34">--</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"> </span><span class="token keyword control-flow" style="color:#00009f">while</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">retries </span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword control-flow" style="color:#00009f">await</span><span class="token plain"> </span><span class="token dom variable" style="color:#36acaa">window</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">close</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword control-flow" style="color:#00009f">return</span><span class="token plain"> apiValues</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
<p>In this function, we create a JSDOM instance with a virtual console and a <code>CustomResourceLoader</code>, so the page's scripts run in the background in JSDOM instead of a real browser.</p>
<p>For this particular example, the API call needs three mandatory headers: <code>anonymous-user-id</code>, <code>timestamp</code>, and <code>user-sign</code>.</p>
<p>By overriding <code>XMLHttpRequest.prototype.setRequestHeader</code>, we check whether each header the page sets on its outgoing request is one of the three. If it is, we store its value, and we keep waiting (up to 10 retries, 4 seconds apart) until all three headers have been collected.</p>
<p>Then comes the most important part: we also override <code>XMLHttpRequest.prototype.open</code>, so the page's own API calls are never actually made. This lets us extract the auth data without using a real browser or exposing the bot activity.</p>
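<p>The header capture can be tried out in isolation. The following is a minimal sketch, not the code from the scraper itself: <code>FakeXMLHttpRequest</code>, <code>hasAllHeaders</code>, and the header values are made up for illustration, and the fake class stands in for the <code>XMLHttpRequest</code> that JSDOM exposes on <code>window</code>.</p>

```javascript
// Hypothetical stand-in for the XMLHttpRequest class that JSDOM exposes on `window`.
class FakeXMLHttpRequest {
    setRequestHeader(name, value) {}
}

const apiHeaderKeys = ['anonymous-user-id', 'timestamp', 'user-sign'];
const apiValues = {};

// Replace the prototype method so every header the page sets passes through our code.
FakeXMLHttpRequest.prototype.setRequestHeader = (name, value) => {
    if (apiHeaderKeys.includes(name)) {
        apiValues[name] = value;
    }
};

// True once every required header has been seen at least once (hypothetical helper).
const hasAllHeaders = () => apiHeaderKeys.every((key) => key in apiValues);

// Simulate what the page's own script would do before calling the API.
const xhr = new FakeXMLHttpRequest();
xhr.setRequestHeader('content-type', 'application/json'); // not a required header, ignored
xhr.setRequestHeader('anonymous-user-id', 'abc123');
xhr.setRequestHeader('timestamp', '1700000000');
const completeBeforeSign = hasAllHeaders();
xhr.setRequestHeader('user-sign', 'deadbeef');
const completeAfterSign = hasAllHeaders();

console.log(apiValues, completeBeforeSign, completeAfterSign);
```

<p>Because the method is replaced on the prototype, every instance the page creates afterwards goes through the same capture logic, which is why no reference to the page's own request objects is needed.</p>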
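<p>To illustrate the idea of replacing <code>open</code> so that nothing is sent, here is a small sketch under stated assumptions: <code>FakeXMLHttpRequest</code>, <code>seenUrls</code>, and the example URL are hypothetical and only mimic the shape of the real override.</p>

```javascript
// Hypothetical stand-in for the XMLHttpRequest class that JSDOM exposes on `window`.
class FakeXMLHttpRequest {
    open(method, url) {
        throw new Error('a real implementation would prepare a network request here');
    }
}

const seenUrls = [];

// Replacing `open` means the original method never runs; we only note where the call was going.
FakeXMLHttpRequest.prototype.open = (method, urlToOpen) => {
    seenUrls.push({ method, url: urlToOpen });
};

const xhr = new FakeXMLHttpRequest();
xhr.open('GET', 'https://api.example.com/search?page=1'); // hypothetical URL
console.log(seenUrls);
```

<p>In this sketch the original <code>open</code> never runs, so the only observable effect of the call is the recorded method and URL.</p>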
<p>At the end, <code>createSessionFunction</code> returns a session that carries the required headers.</p>
<p>Now, coming to our main code, we will use <code>CheerioCrawler</code> with <code>preNavigationHooks</code> to inject the headers we got from the earlier function into each request before it is sent, so that the <code>requestHandler</code> receives the API response.</p>
<div class="language-js codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-js codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">new</span><span class="token plain"> </span><span class="token class-name">CheerioCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token literal-property property" style="color:#36acaa">sessionPoolOptions</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token literal-property property" style="color:#36acaa">maxPoolSize</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token function-variable function" style="color:#d73a49">createSessionFunction</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" 
style="color:#00009f">async</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token parameter">sessionPool</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token arrow operator" style="color:#393A34">=&gt;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token function" style="color:#d73a49">createSessionFunction</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">sessionPool</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> proxyConfiguration</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token literal-property property" style="color:#36acaa">preNavigationHooks</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">(</span><span class="token parameter">crawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token arrow operator" style="color:#393A34">=&gt;</span><span class="token plain"> </span><span class="token punctuation" 
style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> request</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> session </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> crawlingContext</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">headers</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token spread operator" style="color:#393A34">...</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">headers</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token spread operator" style="color:#393A34">...</span><span class="token plain">session</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">userData</span><span class="token operator" 
style="color:#393A34">?.</span><span class="token plain">headers</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    proxyConfiguration</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
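<p>The hook relies on how object spread resolves duplicate keys. As a quick sketch with made-up values (the <code>request</code> and <code>session</code> objects below are plain hypothetical objects, not Crawlee instances):</p>

```javascript
// Hypothetical values; in the crawler they come from the request and from the session's userData.
const request = { headers: { accept: 'application/json', timestamp: 'stale' } };
const session = { userData: { headers: { timestamp: '1700000000', 'user-sign': 'deadbeef' } } };
const sessionWithoutHeaders = { userData: {} };

// Later spreads win, so session headers override request headers with the same name.
const merged = { ...request.headers, ...session.userData?.headers };

// Spreading `undefined` is a no-op, so a session without stored headers leaves the request untouched.
const untouched = { ...request.headers, ...sessionWithoutHeaders.userData?.headers };

console.log(merged);
console.log(untouched);
```

<p>Putting the session headers last means a freshly captured <code>timestamp</code> or <code>user-sign</code> always replaces whatever the request already had under the same name.</p>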
<p>Finally, in the request handler, we process the response of the call made with those headers and, to handle pagination, work out how many more calls are needed to fetch all the data.</p>
<div class="language-js codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-js codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">requestHandler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token parameter">context</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> log</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> request</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> json </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> userData </span><span class="token punctuation" style="color:#393A34">}</span><span class="token 
plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> request</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> itemsCounter </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> resultsLimit </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> userData</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword control-flow" style="color:#00009f">if</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token operator" style="color:#393A34">!</span><span class="token plain">json</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        
</span><span class="token keyword control-flow" style="color:#00009f">throw</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">new</span><span class="token plain"> </span><span class="token class-name">Error</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'BLOCKED'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> data </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> json</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> items </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">list</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> counter 
</span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> itemsCounter </span><span class="token operator" style="color:#393A34">+</span><span class="token plain"> items</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">length</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> dataItems </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> items</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">slice</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        resultsLimit </span><span class="token operator" style="color:#393A34">&amp;&amp;</span><span class="token plain"> counter </span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> resultsLimit</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token operator" style="color:#393A34">?</span><span class="token plain"> resultsLimit </span><span class="token operator" style="color:#393A34">-</span><span class="token plain"> itemsCounter</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token operator" 
style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword nil" style="color:#00009f">undefined</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword control-flow" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">pushData</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">dataItems</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token literal-property property" style="color:#36acaa">pagination</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> page</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> total </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span 
class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token template-string string" style="color:#e3116c">Scraped </span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">${</span><span class="token template-string interpolation">dataItems</span><span class="token template-string interpolation punctuation" style="color:#393A34">.</span><span class="token template-string interpolation property-access">length</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">}</span><span class="token template-string string" style="color:#e3116c"> results out of </span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">${</span><span class="token template-string interpolation">total</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">}</span><span class="token template-string string" style="color:#e3116c"> from search page </span><span class="token 
template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">${</span><span class="token template-string interpolation">page</span><span class="token template-string interpolation interpolation-punctuation punctuation" style="color:#393A34">}</span><span class="token template-string template-punctuation string" style="color:#e3116c">`</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> isResultsLimitNotReached </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        counter </span><span class="token operator" style="color:#393A34">&lt;</span><span class="token plain"> </span><span class="token known-class-name class-name">Math</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">min</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">total</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> resultsLimit</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword control-flow" style="color:#00009f">if</span><span class="token plain"> </span><span 
class="token punctuation" style="color:#393A34">(</span><span class="token plain">isResultsLimitNotReached </span><span class="token operator" style="color:#393A34">&amp;&amp;</span><span class="token plain"> data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">pagination</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">has_more</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> nextUrl </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">new</span><span class="token plain"> </span><span class="token class-name">URL</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">url</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        nextUrl</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">searchParams</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">set</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'page'</span><span class="token punctuation" style="color:#393A34">,</span><span 
class="token plain"> page </span><span class="token operator" style="color:#393A34">+</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword control-flow" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">addRequests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token literal-property property" style="color:#36acaa">url</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> nextUrl</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">toString</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token literal-property property" style="color:#36acaa">headers</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> 
request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">headers</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token literal-property property" style="color:#36acaa">userData</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token spread operator" style="color:#393A34">...</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">userData</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token literal-property property" style="color:#36acaa">itemsCounter</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> itemsCounter </span><span class="token operator" style="color:#393A34">+</span><span class="token plain"> dataItems</span><span class="token punctuation" style="color:#393A34">.</span><span class="token property-access">length</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            
</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<p>One important thing to note is that the code is written so that it can make any number of API calls.</p>
<p>In this particular example, we made just one request with a single session, but you can make more if you need to. When the first API call completes, it creates the second API call. Again, you can make more calls if needed, but we stopped at two.</p>
<p>To make things clearer, here is how the code flow looks:</p>
<p><img decoding="async" loading="lazy" alt="code flow" src="https://crawlee.dev/assets/images/code-flow-9b59d77892326bdf8ae27f1e99489c9e.webp" width="1536" height="884" class="img_ev3q"></p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conclusion">Conclusion<a href="https://crawlee.dev/blog/scrape-using-jsdom#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>This approach gives us a third way to extract the authentication data without actually using a browser and to pass that data to CheerioCrawler. It significantly improves performance and reduces the RAM requirement by 50%. While browser-based scraping is ten times slower than pure Cheerio, JSDOM is just 3-4 times slower, which makes it 2-3 times faster than browser-based scraping.</p>
<p>The project's codebase is <a href="https://github.com/souravjain540/tiktok-trends" target="_blank" rel="noopener noreferrer">available here</a>. The code is written as an Apify Actor; you can find more about it <a href="https://docs.apify.com/academy/getting-started/creating-actors" target="_blank" rel="noopener noreferrer">here</a>, but you can also run it without using the Apify SDK.</p>
<p>If you have any doubts or questions about this approach, reach out to us on our <a href="https://apify.com/discord" target="_blank" rel="noopener noreferrer">Discord server</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Web scraping of a dynamic website using Python with HTTP Client]]></title>
            <link>https://crawlee.dev/blog/scraping-dynamic-websites-using-python</link>
            <guid>https://crawlee.dev/blog/scraping-dynamic-websites-using-python</guid>
            <pubDate>Thu, 12 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to scrape dynamic websites using Crawlee for Python with HTTP client.]]></description>
            <content:encoded><![CDATA[<p>Dynamic websites that use JavaScript for content rendering and backend interaction often create challenges for web scraping. The traditional approach to solving this problem is browser emulation, but it's not very efficient in terms of resource consumption.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our <a href="https://apify.com/discord" target="_blank" rel="noopener noreferrer">discord channel</a>.</p></div></div>
<p>In this article, we'll explore an alternative method based on in-depth site analysis and the use of an HTTP client. We'll go through the entire process from analyzing a dynamic website to implementing an efficient web crawler using the <a href="https://www.crawlee.dev/python/" target="_blank" rel="noopener noreferrer"><code>Crawlee for Python</code></a> framework.</p>
<p><img decoding="async" loading="lazy" alt="How to scrape dynamic websites in Python" src="https://crawlee.dev/assets/images/dynamic-websites-d9a83deff0729330b2d3de2d1481cd6a.webp" width="1152" height="649" class="img_ev3q"></p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="what-youll-learn-in-this-tutorial">What you'll learn in this tutorial<a href="https://crawlee.dev/blog/scraping-dynamic-websites-using-python#what-youll-learn-in-this-tutorial" class="hash-link" aria-label="Direct link to What you'll learn in this tutorial" title="Direct link to What you'll learn in this tutorial" translate="no">​</a></h2>
<p>Our subject of study is the <a href="https://www.accommodationforstudents.com/" target="_blank" rel="noopener noreferrer">Accommodation for Students</a> website. Using this example, we'll examine the specifics of analyzing sites built with the Next.js framework and implement a crawler capable of efficiently extracting data without using browser emulation.</p>
<p>By the end of this article, you will have:</p>
<ul>
<li class="">A clear understanding of how to analyze sites with dynamic content rendered using JavaScript.</li>
<li class="">How to implement a crawler based on Crawlee for Python.</li>
<li class="">Insight into some of the details of working with sites that use <a href="https://nextjs.org/" target="_blank" rel="noopener noreferrer"><code>Next.js</code></a>.</li>
<li class="">A link to a GitHub repository with the full crawler implementation code.</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="website-analysis">Website analysis<a href="https://crawlee.dev/blog/scraping-dynamic-websites-using-python#website-analysis" class="hash-link" aria-label="Direct link to Website analysis" title="Direct link to Website analysis" translate="no">​</a></h2>
<p>To track all requests, open your Dev Tools and switch to the <code>network</code> tab before entering the site. Some data may be transmitted only when the site is first opened.</p>
<p>As the site is intended for students in the UK, let's go to London. We'll start the analysis from the <a href="https://www.accommodationforstudents.com/search-results?location=London&amp;beds=0&amp;occupancy=min&amp;minPrice=0&amp;maxPrice=500&amp;latitude=51.509865&amp;longitude=-0.118092&amp;geo=false&amp;page=1" target="_blank" rel="noopener noreferrer">search page</a>.</p>
<p>Interacting with elements on the site page, you'll quickly notice a request of this type:</p>
<div class="language-plaintext codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-plaintext codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">https://www.accommodationforstudents.com/search?limit=22&amp;skip=0&amp;random=false&amp;mode=text&amp;numberOfBedrooms=0&amp;occupancy=min&amp;countryCode=gb&amp;location=London&amp;sortBy=price&amp;order=asc</span><br></div></code></pre></div></div>
<p><img decoding="async" loading="lazy" alt="Request type" src="https://crawlee.dev/assets/images/request-185e9cf4845c0b0f07c004d155563ea7.webp" width="1920" height="1032" class="img_ev3q"></p>
<p>If we look at the format of the received response, we'll immediately notice that it comes in <a href="https://www.json.org/json-en.html" target="_blank" rel="noopener noreferrer"><code>JSON</code></a> format.</p>
<p><img decoding="async" loading="lazy" alt="JSON reposonse" src="https://crawlee.dev/assets/images/json-a85571ceba8b80c314af9a159db15511.webp" width="1920" height="1032" class="img_ev3q"></p>
<p>Great, we're getting data in a structured format that's very convenient to work with. We see the total number of results links to listings are in the <code>url</code> attribute for each <code>properties</code> element</p>
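<p>As a minimal sketch of working with this response, the snippet below collects listing links from a parsed search result. It assumes only what the screenshot shows: a <code>properties</code> list whose elements carry a <code>url</code> attribute. The helper name <code>listing_urls</code> is illustrative, and whether the links are relative or absolute is an assumption, so <code>urljoin</code> is used to handle both cases.</p>

```python
from urllib.parse import urljoin

BASE_URL = "https://www.accommodationforstudents.com"


def listing_urls(search_response: dict) -> list[str]:
    """Return absolute listing URLs found in a parsed search response."""
    urls = []
    for item in search_response.get("properties", []):
        link = item.get("url")
        if link:
            # urljoin leaves absolute links untouched and resolves relative ones.
            urls.append(urljoin(BASE_URL, link))
    return urls
```

<p>Elements without a <code>url</code> are skipped rather than raising an error, which keeps the sketch tolerant of partial data.</p>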
<p>Let's also take a look at the server response headers.</p>
<p><img decoding="async" loading="lazy" alt="server response" src="data:image/webp;base64,UklGRtYhAABXRUJQVlA4IMohAACQgACdASpzAsMAPpFGnUslo6KhpTU5sLASCWVu/HyZjugB159PE/+GG39eku2dXh+QHbh9BX/T6e3oP/9nRJ+iD/G9Lb/6vYG/u3/n9gDzt/VF/1PSAf/31AP//1v/R7+I/jd3xf1f+1/sb5s/jfxT9a/UX/Yf3X1yP7Lwa8xf8L0H/j32e/O/2fyc/vX27+gvt6/gvtp+QL8f/in97/r35E++h7F/n+0KAB+X/zv/Xf3b8pvbH8n/xX9I/GL4F/IP53/o/7J+UH2Afxv+af6j+9fjx8O/3D/JeIN8y/w/68fAD/Lf53/uv73/o/2R+jv9h/5v95/wn7r+yD8y/tf/U/w3+m+QT+Zf0//f/3b/K++D///bt+2H//9179of///zAz0RHd6I/5A4Rh8ndm/Y1S59TB7Us0tq5QYGbSdMM9X6dk9N/jRMmT8W2VF0tPnwDa/knroHOydjn+QiIiHqsaatZI8dG9PTu0ZUaxbJUy8rWPSUTopp1wA87PAsLEBC2xm683GRd25pu08RIrVGIzPhsz4bM+GzPhskbKZUmlxNW7nhAwHT+g41APt8a1+JtLhD3GfMy6ZbsnX273JQ9qrLT6Pdmchf7mSgSeZUfLeEOqgKJ+zHcZWL+Seugc7J2OpS+KGxpEwmNwGxtOIdAuUD8qjp0XNDx9JvPcAXMwIaebnC0kdhveRhHBmJBzmhmoyBJon5GoG1/JPXQOdlADyMHJw9SVw02TRTdska074IKu3dmw0HCYGG28dbmrUnaa6bx5cvlv335VnTxJtfmdk7HP8hERERWPtwUCgWt5kGs96v+b0imC7jjD0D8o3XeRpOcRMfr+8cAeu+WPb/EAvW6BtY7hJrstZVnqjro390FyBwdQdQfr+kbwtl+VVVVVVVV0OPjBgsLFCymAj2tCer+SZPJix6KBhuyfFetCKPU8/UschSrlw1ipn2ffwPNvo9UYabXs6rSHuXcNcjfia9WQiIiIiIiIrHfT8uBgw34l1idjn+Pdm3/YBtP4ofOk2y4atiWiAFnL+MwQlf6nBLEWfMaPZcZr0sAbwL8cG6UYBQvjdAU8CtkSK9S/8VzQLkFkmoG1/JPYqXA0cFcJlfaBpDxNBmFJbkIiHb0nqQ+lfjiBmY3gCZHAKKBYLflpLz10DnZO2rRXhCS8NdY8UhCUD8qqQDXP/GpBElzXhMoOdk7HP8hERERERWPxpqON4I8GkC4BbnJH3yERDyKFJCobh6iOVyGjkRERERERERERERWPx4r1k2Z5Bqvjuk1emn2RVQi3Wv32kvgazVHXKGdhf2011r+Seugc7J2Of5LHzF/AY9B2W3ZNbDdsEhye82EKn3ZckOT325BU+7Lkhye+3IKmWxxOKkAAD8iVwV6FM8oPeFAAkF3G6dTjvULyCJHNxEp4viEx2xwk8efiS3rScf9NBdtKzhSkZyQ6dm67BccUM2aaR9dGml4yXoVlEllg3+ntsMAkU+2s5X1QCInTto1IO1gkXFuc7dG5gzZOGdmEtbpEhRHBmpjot0gmD9iUsrc7eRU9rsDYXVgju1C4ks62K1QlMrz3b5wV33kiQkyolwXrcNskZMhjG6gPJ7k8hPm/wXxpeR10foq+FukWMZo3DOAxl78rzF97SGnqh32/a3P6ItaMMNuk6xW/FBp+WjUyLvw4Pd77JG04QTW/VbMANfsHnQrfYMWzI6jKZOcjwWU7vw6cBzM5YPE+rVYlp6iJziQUg5v2qMN5L5Xks6rbhC9p9uUA3GOsfY9DXLpj9fhs7CBr37G68AYjeSoTTRd+86/3q5XDzHfTHOw/yfQ87ZVORqPaWLuPogk84E4OPjJggIr79vwgXr418nNbrHNdefgIiP6uECOi9byN
zrjFVhUUcVkiVFihpbM0jF8BmMrZ8n31dcBxW9nFGzBlsABGVwD/ywDCLL0v/aqCerxpdKHQKTZrikGP6etfAwSzl5S1Atsohz9NzMNSN9bru+gRL1IKRxSGqgmsWdqpGpfts1+Y+8qWXwyg03r0pujVmdjU1HccQKWLx8pgA+PNm0W3rUwaM80RBepefKRsjDAerUeyBy9YFdhDcg0TZVJi6qxdE6nHQ47fxYYrhOVOJRwMSN05DXXZdSn8v5qIcCZYifz1x9l38ATW/fxMoyfOxoc4seZrAQAM8XXe/msU54uIU7g0++Z8qKYVDG4ljFBDbHRFl4NIZaKkdokocsKmdPKFJENwPtrfqYKZj4++DgizXoU80Xel/GUuDJ1o4+DaGZsWMAwqYvFMPEhwO4hBOL4W/dPc2Pi+lcKNlL9S4NLSmIH0GOENmzrSgQw6B0Cc7xKibh4lFkK+lUOrvl+m9vD11g68Y48Y4ha+HeZ2DziOXORj21mVULwvoizCQjETMPHGSMVvMCm1aWQHFQFz6hdFGMfpOV5Ec8Q9CZSqEp2aemIqMpzpCOS2tQ77KMOrZ20L0+GanZVCzIHQxeNAvtQCguWGfWwmLHfM13y7jhc1eQwnsnfa3C1Cd3YA5Kye4Eq1/FNhwD0UD1Nq8jSTIuvlCoaQL0mkjBZ74qGVK5h+AIeUQmaLTEh1ItKOj00Hx3vUC9ed5mOFigqu1aRWozgJlPw4LUOZdCyyd/aAdOgqceGx0uXGXfwPdCeieVyHem2KgZkRaisK8khH/aO0cuGbEpet7r4xZQDCljUll2jhgcaBuM7FrGmx/j9yjW+3xuyD6si063qPjU44YOksdp9vf+CzJLOlH4+wRGgPsUssp54isUlOhyOck0FTnAatXAvtSmLrh7C4TcChzowVLcKPDjHevX+sXC1ymQ7ppUvWheSbgal087V1wcVmxRiuxsY0pWrZZfIMs49WAUwKgs+kCRq5sEciauphVQqqUTxLAP1zmEy/r5zVJOSr/euDuOKnkaDZq5NWgPqmAh078aVPtKT0vnAoveM/mfjqMUkID9qjWlMhAPowX8Z16SzwVkwOwOpYjtVHHA3yjOlITOXyxIfynsLy3iSD3Bm4vlWjwM3lMGIwluBaZkKyKJsRzIaQwc9X1pddeUELx7GMU9z1BJs3MohIwOObNlGAmKAT9QsGI6c4YDjTKd2+/pqRXKYxxUInGweF7HXc8rzimvHU/By/SqKVtiUgPZpem0d6w4GOJpunov2VrL0YH1Crwn1wTwXLyWWXQwtmQ258yzADzZrgqJ9WDf66waAImP3Da35tzaRIM+p968rAy/tm6M8ZQyA4tQAraFWMrEny3uOXg6WaL2yL544Hlin4Rmq+hAx0wy+yubqTyPtgqn0n8EHQ+2wH/redI/1kr7RW6wRUGDXFtZ77XBO02F6PumAOKmJZz99YH82MycdUhr0/IMuRzXF2X/vhCpCHSy+byjdQMOne9NwB9NG49cPDsTL/Id1/YWiiB8kDguazgBpY5WrdmNflxBnN6Ivad5yjykBaH3+v2lR3sz9n87jSldZpCU/GkADqlhmxxuSWgnjtatIbO1RUc8sDM2OZlcZ1X5DnMM/PMfFky5yZIIOY6DRNKuixYkm1TFiQdkmb6nfAgTTLpkO2AQcj+NVwXzOuZt2enEMeYE+0zlfQnfZkXEMGWaEEZqZ+/a1gl75Z0r6qcHq/loNcFXEF5BVjbE/QDE4NYVeDZLkmW09DZxnUpjXpFA04WGiQfrWyHzBQSqjpeqGcXXtb0CDy2aqPqwLBk4UKySNwc+nuG1yUFVVhZNt/HR++O9ZvH8d6m2F4ziGeokEaUGep1fSO8dgK0CLBovRIfdgDQn12USruyIHg0YmuH3ZtVpSIhJp+GxAHGuJQ/yQM0tqagCe6Nt9LYCuDMyxgjIDF2n1Xyr3NCDlq9hh7G8VtpkocZ/0rFDWFXmykzyJZ7jiSbq58
ZLMtwaWCH9edXlZ40JLFtoVy1d+CbO+1Afk3k5bkT5qf9xlUr2X76xmyJXyeiAvJpzkam0KX4DCMW5romH6o72Wk687WJq/6MZfvOb8Sud0Tb309MaIS4cpjCvKgPuq9pYV/2PvYFhtSMdcEeYjySnx1oUQlAAminZ8ByTWs2bWKtrpf/czL8hoZNKuHf+8e/+r+KaRAOucqZ1fNFRjrye3MAzo31NtVh+dnxx771o7Gl4RYeQWKP5PLa3RzVlEabW20JahvQgEpRBSddN4LiAiyOebiN5OdorDJYKyB2jCrAAffznb6hxXbmio0hBIQWFidvVIhG9bl6OkDFmyb9FHCOqCBdmzenJUF5b/Gjh7DjDTGzznHPek8Q48wdcsv1WnOD44smFmmO8PRCCc0UQFxXVxNtLdQOFhgWgdae/WMwmoXV1UKo90c4gRongtUMWJalAV5x3givUEg3KSDDm+JfuNXASsYReyVZor5Ti06RLAAv6dhYRRFRHFnGBwi+L0pAS/w+FA4dvxhhck9wiCS64l9lPAWEjRQCJW+IKskw7vdkvCZCxSQop2FbR0PZNQfBzhc+nMKjLJuTSm7zDzUzNynQTDv+MHdHnLU3ZQoY9V9dSiHEzKIA+//VlLJuI7wNVNCB3iSTNHK2CFKufbCPMhdqcUdHhNtVbqtAZYTMZ9LtXS+uMWWIivzf0+T4UaGggjuksh/6OYNIRRwvWKhlH4T80MFX2X8EXqWx2z4VZtPqZRsRu82kTosIy/HbMyWDJk72h7oisUWSGKS7bMa3btadxEQ9UD1cCiPCWO9mFVNMbs0EmWHHJMMeXxkoeRlDzcAXKsxGSQ07LGApOIzvlNd00b80O57HiTe1mnmBy0knQvORjtNjJbn9LIs9GDQJdzP8L+91z+mmS6E6VWnfB6gCZWzqUZvfDQYcUfyEdKASD6t1f2v0Rs/DlCcd/l0n0LxSfyq8iRSOsbUxgOE+lgX3LCPtIXXUQKVL0ostdt5nFeLN0bcOuKrkl2BwTT2A/Dk2QRQeIqJt5SIjo/NgFZqlD18Z2Evnu/JRIS3Mn111YfpBhGKnODhrIfr/sSI3Of8y8fKJ585AiNikV4+ZSwgBMCNgNGtPrMq8MwMn9foX2gM+HQQmlZiGqcsYEydPcUCcU6PmR2eYZSnAuLAZ5rZxKJruLuqy4gDiSLqAlaVSIVgwcDqJFKjPr0iqSK7XDuVY8JCpSor8TgABXrQ6dmrxSOaLzl13w3cdjDn3/tZsj+Hw2ByiBoBvc1+H7N22wrHFtYf3AhTs6NT54szLITMSvkWX06gbyMO7kwWHBdTChNoqytm5mRkNuNd+BOGp70jxWUFmTFUXvbG9IR2Myrj+oSsO0ca+sQtrHQKg2N3vtws2LZjDjYePrsEgZLLmj5WIw7CuHoCOLED/ebJ2RzM0BvddBzHt54oi3BHoUgAvOxVbh3TI+v4PxVIiscGN4KvzdK23CnwFPPBlgjHTPl1Ym4/0xFcnHOnV7mH9CJ1AyEQoDLWmQ03Nn9qTEnDAgnfAIWz9Nq01SmMuC8qDoAm9zLAtwy3pls/LguQTg/3OSLij6LS5hLLaL3mbQVNf1dRwOTzQI0V2VTdZFy5GAZjpPHzUL6k49HgCzrk04Nsasl/HHq+Igm9xhhNonEbCtIAGz6zz26EJvJ6pP6TC90wUbd5r2IAosIyk0coOYTwkEUtI1bg+p1QsQqMK+Q2d+KO24crHnX1+mPIklcLDudyZT53fyXRx1nDpgYfxSlJgJjv7UXbB7mq2aBhACZvAHRRwjqd3+WT50o3kVxx/DK/WItp83bXMRqDbD9lX0yOV0zLvC1grTy8fwyRhbm2EyMOEJhiC1xjbVGzXieUXhj63TbqDl7xlVALiuK9F8ruBS+N4gthlGWuf4qabMdpRPjINyto4Y7xJA4zpBUt9EpSBsG616TIlKKfjL3OUM5QKDkozw8CXnXsDbG5XVfOdzzyi+coamQy
rYMuj5QqWN04oS/EjrU6x4BrSC2Tkm1sW6i6xcVaVUbwqwLk6yMd4CwNH6rQhjo0jVI+H59P4bFCk0Sx9IbiwlLZiVcfKzzLfgVCms5Ap07Cq2YiBqBhWF05S23jWqbR76nZy6/ijjQv7g5QzvReSZ0N2d1dQG9VZ+FokG0VUzVE0quOM23RUERELUSgdOEjeUgU4tjVkH4IsduaBf+mOOmy/AgnXlFrYuBKdsoSn10i19zTTFyyL75CqK0GgXXDvw/QiN9NZPE/AXhqsej6XL8WWmBKiLOXfTGOTjSxJUh4AdB+OOxP7iJ47o9SzvOyt6hYZtkDBHoBXii3qFyuz6KMHeFuihPEM2VXV1643BdEc2xefzs411YeK3wJemGrs09M9j6Ix6r7AhplBc/tdrMnQqRrUZtebXb92Fc+psxGwkyyYeU21ZNZquU1onpJcekWUBIAaFt8WqK4CEjtTXmgccLrOjT9y+VlYwq+ruuHhkwYqr4XK9pB3dk1XbLl5qLiz3D3Wluh0fqxAaFy4cGOv6jsuh6ajlW2fONprj7MvwQrWFI/U8oaJCHS83X+G2OznoHYogdmG+isXTGJkds7T+6Yc7JOGa1CixzF4Z8u8ZcdJ/cyNAxCD7yObXkQBmJmrnDatTbEdGCOc/qTrx5Kodwt4VE5QH0u4ecZk9EwWmCq3KMMpAhgeRjz+wRb5Fse0MQqzZzkoXq2dtEm3z7Fp6M1JkNdUb7DIik4qpWaJGRMv6erXaSvd0q6Oa7nm8TlSPgGW7Gd7z6MikJhys/IaJvVHIzPQPeuoWfYkRS+xK0PwN70VFvmDulya7nmgqCrUal0tbiIfZLuaSpeMwOAkRfvPCTO5hQA8Dn3kbUJ9d9FyelugnxbpgMMjIkdZ4cFzcziBbtsclQz7ObTYxHmvbn3SKZVrjcWjUX/bh+yvrlRoo6UgHL+9D2Ffk1kFUNfLLdNz+zmPjHImBcHTRK5Xkm0nTFD90m8iwYqj7HN6KZG6r/Czq5ToMFMS6S399MgxAy4aL+HjXF8a5POfbOf7hRLr0qX2Q1urKiJHpksbvRXI90jF/s+TgwFETpVaKzvC4YllPuO/MGZ6Q3+ybddwbdfk8DtSx924urg8P8dLifZccD3CGQmnlqkAp4oyAiJP9zAnCHrRSoNYW+tHMnWT7syw36lNXweQtrCGiy/7gwz3Q89YtbEFaCACt0gxe6Vl+Q5EGwBt78F8u04q2CB0zW34jIj+ZYPvzFbRGd2FOyDfeAjy+JgdYjJqpmUUlB1uXVwupQPvcKn/Y+Pur+ajiQSx4wKUVsS8LIY2+yBhfpR1W2iBG56y373PWXzlqLfeLo4ux/K/zVGkX+tnCtsbZdgLImwSE5CKV0z1CrRl/8L30mpPj0iBquaKlxHc1yJarSEWRntzT72/c0oC/2pr9cYeOdI0srhtSvBKiU8VRkBsRrMWHaKVIgQUH73Q3fl2n1h0H/w+pblNUf55ysTPk3xT5E2zQpZDbGs6N03ybNZxEQYdIZCmFP06wgqZ3DM+2gm8aRZgN+2uswo7/aN98dsaeU9Tx9AogJL6FP+ADr9hO/N+MQK1qXXYxWDBxJJr0DdVWAApe9WHPzFxbB2nH8iwtxpqlos4lAUFHzg7F/nBOD8/VUNP4fRYOOYz7dnsbgOIhZI0fuAXP13a2vmNuodlok+871sUHzKWaHBsCXbgyBD3C2F7n8R18icF6SyU7qHyl8nwSy3UFY/WC3avSExt5CDfmj91fFSMPsNBkJDd+6n7VgZVSUnlbQb9/2pHQ++B5cB7p6spgf7mco9mVm5gih/oLBCKm0dUCcJNNMGDZ/LheyVPLpYcnNIRZDBVhIkVJsprIvKWzHa3BCxbgrlu980BTGRYTWOwDGFiFNFVkw/8DEAKWqnDgS7mI9DiAuQ5NF9sNqu4QYMdIM4XRIX4SZzJltfVlGFwOpVQRyoBTT/MVD0zvfxCdAmLRfUS+9FJvEfOi8t
ZkuNa9JxPNTNIFpN6KrFwLTVHlaO0+O3eLNcKHxbtvuFJYCqeAZevfFvINOFFGDT9uny3ZR6lVYDcDtKHgpdCCe3s0+9OoF2k5hEyUljqoIIdIlpC5Ve0gtsl6fjZOn841p6ucQAOueEf2+KIUM4hdJQcxYhOz+ZSWlIEMB3JzwXaRnjzlZrQGMaIkIEJeHXjJYx9j0OqdyfckUcPfioT4DT3hURjXvlMzC7OtrhQqNZu59Vmex8NIJd5kWmREPnwrCMLGM2yQpkQx572TEMtkwgnQ5D9K9YcRQwEnJNe67ssX38fUbvjflVDqliXDDLTDg7NFBNBpClgxjBjSuaUiVNnJ6JyWkkn6UZxKmR7GxLB/aEMBdwsRWsMGzRnr2IEey47lZcnAsR8ejc9FFq1ayjb8ecAbKdfKpzEs3Y46rs2KXeF/BSNzCf3PNuDIjJPTT89m6tuEwHrSc6YuH/oCLDN7IvbBzALxxKRSse06ac/EeXt+h2OyvZj8h4e1EF6s5AtNU4gx/CdkUWB1QgRVOA8784rpHHR0WuNk0H8z6uU8nutO9lsDehgfHovDnFGw02d+YJt4rUW3iTEf7RLw8ZlzDd9X5h1iEtHQZJEazr1rzaRJd4CZMn7X9/gYZvPUqc4m0svU6JzSUKsuUFmcWuVoEkcQwk8svJ2L2R50i7DR6WhRf+TAcs2EN7efXdkBSkal4KOZRVO4FLz11wezTfO1Mq3AaZkkNV55yh18MehkzmJ0ptF+LXvtqsYnPLJSYL0e/47NXqTGqB1U+f7EjdqAlV9WtUPU+H9UmoJ1BVj5I+5W8ukGu33yICzNzz5iq9V/MYgwfvgw9HmJPjSX2Mq9V3Bs+updBzZUpMiWYGNTlzDbVfBzCiTYVI82kS6F1GDhyb6q49kZE2EbMDcyfjPsY5HZIYYsZfD/0gntEAGvTo/Ks6J4q/SrgVwA8WVLVhwxB2RKHG8J402Lf34XF15uWKdqc6rPVKRp3dniAN5pvA4gTkBFzf13h58lvRgu6XTJ0swNn0mndDWa4bs5oYa1InaM4UW2fh1KszptJfYRA9XINzRIvF8UpL8qEKedZzc8ySov7NBM09hoF6wTYXxOaZ/taI87y2eApFkL4s+z1xAwi0duxgY7ow21WQ7aWCl1KSB7UuxoeL/rhYFlQe6Aa0U9zk1mH4W98H0BQwnYFfLWxeEOu+HlkpkaATPupjUb6aqBnjboFYXXnd4ycIK1COlyP5JlCaBY5cHbihIOyyFXVk7fH0ppPztv/7BXK1ovsFTmRUDZCnyrypUHTIKzYFwcb0pLt3UN2S4ZTFg5s2B+hEKRPr2vZdAbXiK5eiS6U5bP9Z6mrw4soNdzQ978mviyF4GDL2ZcyTL82okXzVg8CIrGyXe2FiyNSohpy3d8yjMcgRI1zWaL/fMY9DiRkW/SjEcYozIBNFzJW37uKCa18WPsZqg6zoJ9pqiO5+TOBEgiFpgmx9/1Fi9avW/sUbeo/YYHSO1as/iMuw2jVLr2zjLsHO2tRobc+5jSLUwleaP2Pt4TecQlLZFZMgbi2brn10stGBQPrULHQ7y/ML3wnX/wRPtpr5hebHQS+OuoiC0NreVZeBzH9RelRe0VZA6TrNqOlO3yX36JhpXc44gbw1wQGLDn/70pcpyIjTP6gr1iadcosFI8WwPMQTxJRS5qQEuEDELIx1xJo88maFEm41xqJOPCwKeaLI7X5a9odcRDKNvkOWeglKcCssKUZX9Oi4/kANQbNP5l+6Gd+mUHRwB9beN8jAa+nGDkwrpfzJ2qvpV5Wn0eSc/euNxlL7emh3P4PYLoiUdz07+cpvzOqUtwau7dreNDO4j93YE49j/SPh3ssfotVq+bpT2DP4vNVOyrABsA3kzS6JMRyrJ/KPNqEV82X94FSC969CKUGOqtgr/dxorMwrZPRAojHUopBZTs0MfAFMquUnxKFsKLMYRM327L6tkTV31YkDRBTBfO5BwBaZ
bZ1YSq3DlpF36JjCpn19mzRq00Zyo1pT3YA0EOBXYnsrwjf6fLl3qLJM/B2aqvSrvWDVHAlwLeEN4juD0L3tDf/srsKkEiidA+m3gaciL5WkkoyOd6bf0upN9EgxxVRkNom0DTK5vNZM45VKNcw9y3OlOJKKlmzIMflikoYrbdjpRfQuDev1gPJa/AM+hNAydDIRuhpQpWuTUJPRGuwLsISVCJZ6zRZ1CFGsHFLiqvAT3NIfPXtf03e+tX2ZLO1H6UOlt5YWE5ynuKFsaY5abJ56G2awpIWQwEPXpqVMFK7pM8c5YvowyTBVs1A54EvL3m/+/V//WC13a4PAI0IPHO3KW3AyMnoYoqUSOHPwL6W7KwrMREylahhndK8x0dufG9+V3ZZSOXuCleiz504Gw6x1U8foRMdsR+eM0X7mmtroX4VT7jKfIAAAAAAWAixWvTN8hHfqjaFHDZ53VcY2OjVt92q5R12m/6WHcUiXszd2SVQ9wL1cONblBwWunZJ4qKFQ1cFvmL47L9hGtK9/NX5fH2RlaJkDlPQguWZS43jEAFRRR5Nyce23psI+gPXG7WX6vzGLqzzRu8U7lHtRi8bNujQai8+0gxMWvF4QzsRgTSraz9OkgEqUMCbLcBuepqSnzIhXUmF1SUazWVH7EiGIMtohcB9eZz2xvKbyMnAFbRgvVulxWEjPIBjMLTOpXyNMxghu9eihoaQIpKzKYDEADvD1Hi4XeaSsOoMZq3qRjLsv/v1yb3CQ26WLPVdQ9/E2VwxbdpEXYKmaO/YTfe/gjvJYSSzwMDf363VQkX7AAAkN6uWYbEONss22RdQW3Hbv1O2aj8dtsgrKrlH5ABD+GBYDKtGDJWbucEqvjmE0qiNod6NyUkwyJkDDPqlk75q0ejqXFwpfZwziykrne0mQXxHwBLHMGHQn4yHYOWGEEJRPvXRQ2HxpOqa6utjEIlQclaZ4o/XSrtpdDwijjiflryKib3fcYZpE7hm402MJTNqtMHesv9BoEVbdr5vIn7BhUYG6Jkusi/xSMhjtSX8cQj8Fc3dD8RSr8zXct55AYOrz71IK2RTKy9+pZBzSXOEVjS/XoKKK4E139pIRu0wqcqp8Yg9ItG3Ji3Q1N8C8aoEMcFDD3Y8hDsJAtgHCaeXM3JsUVyIRswn9dAROk1DAdxVhcjIQ85DDG3KRPEe6B1145UCTNo6HBkgEtWBTvz8W05YQAJi5VhrAYKZtq6dMU5cvLn4FOKt7UZGsrwY9AtV3ZKgUbYR51O0bTVtanI5RrxJgtp73zd0OAG5/yW31BEiEbLHJ97W91VBUjzPKWmgKlh5p+6msLybwvdYzxFKWto54IISmvCXN36Vr9HKbGITU/Eayt8E1ym3GGz17LY2hDNHKSPWIPJBg5U1MoLxT2S1QkmbJYBHvO7TadyRDeWm7u9YrbG/q8FcTPHo5OFKRk2qtxLv7HgzNzs3BJMnzKfLQs1YYMF9Ci2gpGNGRrm7jJmu5Mm+1lFWAXI6ouTuHF8vuZfmHS54I4Q5qt1ZmodC7BWAVXlPf7wthBOKqxAW3ANZvoYUG1Tnrs/MjxT/3jwz2SrCxX5QBrQvAj1IcQ/oEQrZ2gBynGMDQEclry+nFeEgAACuaBSnvWb8nuDAAAA" width="627" height="195" class="img_ev3q"></p>
<ul>
<li class=""><code>content-type: application/json; charset=utf-8</code> - It tells us that the server response comes in JSON format, which we've already confirmed visually.</li>
<li class=""><code>content-encoding: gzip</code> - It tells us that the response was compressed using <a href="https://www.gnu.org/software/gzip/" target="_blank" rel="noopener noreferrer"><code>gzip</code></a>, and therefore we should use appropriate decompression in our crawler.</li>
<li class=""><code>server: cloudflare</code> - The site is hosted on <a href="https://www.cloudflare.com/" target="_blank" rel="noopener noreferrer">Сloudflare</a> servers and uses their protection. We should consider this when creating our crawler.</li>
</ul>
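<p>To illustrate what the <code>content-encoding: gzip</code> header implies, here is a small sketch that decodes a response body using only the Python standard library. The function name <code>decode_json_body</code> is hypothetical. Many HTTP clients decompress gzip transparently, so a step like this is typically only needed when you handle the raw bytes yourself.</p>

```python
import gzip
import json


def decode_json_body(raw_body: bytes, content_encoding: str = "") -> dict:
    """Decode a JSON body, undoing gzip compression when the header says so."""
    if content_encoding.strip().lower() == "gzip":
        raw_body = gzip.decompress(raw_body)
    # The content-type header above declares charset=utf-8.
    return json.loads(raw_body.decode("utf-8"))
```

<p>Passing the header value in explicitly keeps the function independent of any particular HTTP client.</p>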
<p>Great, let's also look at the parameters used in the search API request and make hypotheses about what they're responsible for:</p>
<ul>
<li class=""><code>limit: 22</code> - The number of elements we get per request.</li>
<li class=""><code>skip: 0</code> - The element from which we'll start getting important data for pagination.</li>
<li class=""><code>random: false</code> - We don't change the random sorting as we benefit from strict sorting.</li>
<li class=""><code>mode: text</code> - An unusual parameter. If you decide to conduct several experiments, you'll find that it can take the following values: text, fallback, geo. - Interestingly, the geo parameter completely changes the output, returning about 5400 options. I assume it's necessary to search by coordinates, and if we don't pass any coordinates, we get all the available results.</li>
<li class=""><code>numberOfBedrooms: 0 </code>- filter by bedrooms.</li>
<li class=""><code>occupancy: min</code> - filter by occupancy.</li>
<li class=""><code>countryCode: gb</code> - country code, in our case it's Great Britain</li>
<li class=""><code>location: London</code> - search location</li>
<li class=""><code>sortBy: price</code> - the field by which sorting is performed</li>
<li class=""><code>order: asc</code> - type of sorting</li>
</ul>
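<p>The parameters above can be assembled into a request URL with the standard library. This is a sketch under stated assumptions: the endpoint and parameter names are copied from the request observed earlier, the helper names <code>build_search_url</code> and <code>page_offsets</code> are illustrative, and stepping <code>skip</code> in increments of <code>limit</code> is our hypothesis about how pagination works, not confirmed behavior.</p>

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.accommodationforstudents.com/search"


def build_search_url(location: str, skip: int = 0, limit: int = 22) -> str:
    """Build a search API URL using the parameters seen in the network tab."""
    params = {
        "limit": limit,
        "skip": skip,
        "random": "false",
        "mode": "text",
        "numberOfBedrooms": 0,
        "occupancy": "min",
        "countryCode": "gb",
        "location": location,
        "sortBy": "price",
        "order": "asc",
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def page_offsets(total: int, limit: int = 22) -> list[int]:
    """Return the skip values needed to cover `total` results (hypothesis)."""
    return list(range(0, total, limit))
```

<p>For example, once the total number of results is known from the first response, each value from <code>page_offsets</code> can be passed as <code>skip</code> to request the following pages.</p>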
<p>But there's another important point to pay attention to. Let's look at our link in the browser bar, which looks like this:</p>
<div class="language-plaintext codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-plaintext codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">https://www.accommodationforstudents.com/search-results?location=London&amp;beds=0&amp;occupancy=min&amp;minPrice=0&amp;maxPrice=500&amp;latitude=51.509865&amp;longitude=-0.118092&amp;geo=false&amp;page=1</span><br></div></code></pre></div></div>
<p>In it, we see the coordinate parameters <code>latitude</code> and <code>longitude</code>, which don't participate in any way when interacting with the backend, and the <code>geo</code> parameter with a false value. This also confirms our hypothesis regarding the mode parameter. This is quite useful if you want to extract all data from the site.</p>
<p>Great. We can get the site's search data in a convenient JSON format. We also have flexible parameters that let us extract either all the data available on the site or only the data for a specific city.</p>
<p>Let's move on to analyzing the property page.</p>
<p>Since a listing opens in a new window when you click it, make sure the <code>Auto-open DevTools for popups</code> option is enabled in DevTools.</p>
<p>Unfortunately, we don't see any interesting interaction with the backend after analyzing all requests. All listing data is obtained in one request containing HTML code and JSON elements.</p>
<p><img decoding="async" loading="lazy" alt="Listing data contained in HTML code and JSON elements" src="https://crawlee.dev/assets/images/listing-d32f6a4dabca952c5150d3e4705028fb.webp" width="1920" height="1032" class="img_ev3q"></p>
<p>After carefully studying the page's source code, we can say that all the data we're interested in is in the JSON located in the <code>script</code> tag, which has an <code>id</code> attribute with the value <code>__NEXT_DATA__</code>. We can easily extract this JSON using a regular expression or HTML parser.</p>
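<p>As a rough illustration of the regular-expression approach, the sketch below pulls the JSON out of a <code>__NEXT_DATA__</code> script tag. The sample HTML is a hypothetical, heavily trimmed page, and the helper name <code>extract_next_data</code> is an assumption of mine; the real page carries far more data.</p>

```python
import json
import re

# Captures the JSON payload inside the script tag whose id is __NEXT_DATA__.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> dict:
    """Return the parsed __NEXT_DATA__ JSON, or raise if the tag is missing."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        raise ValueError("__NEXT_DATA__ script tag not found")
    return json.loads(match.group(1))


# Hypothetical, trimmed-down page used only for this example.
SAMPLE_HTML = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"id": 123}}, "page": "/property/[id]"}'
    '</script></body></html>'
)

data = extract_next_data(SAMPLE_HTML)
print(data["page"])
# /property/[id]
```

<p>A regular expression is enough here because we only need one well-known tag; an HTML parser would be the more robust choice if the markup were less predictable.</p>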
<p>We already have everything necessary to build the crawler at this analysis stage. We know how to get data from the search, how pagination works, how to go from the search to the listing page, and where to extract the data we're interested in on the listing page.</p>
<p>But there's one obvious inconvenience: we get search data as JSON, while listing data comes as HTML with the JSON embedded inside it. This isn't a blocker, but it is inconvenient and consumes more traffic, since such an HTML page weighs much more than plain JSON.</p>
<p>Let's continue our analysis.</p>
<p>The data in <code>__NEXT_DATA__</code> signals that the site uses the Next.js framework. Each framework has its own established internal patterns, parameters, and features.</p>
<p>Let's analyze the listing page again by refreshing it and analyzing the <code>.js</code> files we receive.</p>
<p><img decoding="async" loading="lazy" alt="Javascript files" src="https://crawlee.dev/assets/images/javascript-52ed58b7cb5fca440f94193ff7687de3.webp" width="1920" height="1032" class="img_ev3q"></p>
<p>We're interested in the file containing <code>_buildManifest.js</code> in its name. The link to it changes regularly, so here is just an example:</p>
<div class="language-plaintext codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-plaintext codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">https://www.accommodationforstudents.com/_next/static/B5yLvSqNOvFysuIu10hQ5/_buildManifest.js</span><br></div></code></pre></div></div>
<p>This file contains all possible routes available on the site. After careful study, we can see a link format like <code>/property/[id]</code>, which is clearly related to the property page. After reading more about how Next.js serves page data, we arrive at the final link: <code>https://www.accommodationforstudents.com/_next/data/[build_id]/property/[id].json</code>.</p>
<p>This link has two variables:</p>
<ol>
<li class=""><code>build_id</code> - the current build of the <code>Next.js</code> application. It can be obtained from <code>__NEXT_DATA__</code> on any application page. In the example link for <code>_buildManifest.js</code>, its value is <code>B5yLvSqNOvFysuIu10hQ5</code>.</li>
<li class=""><code>id</code> - the identifier for the property object whose data we're interested in.</li>
</ol>
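<p>The two variables above can be combined with a small helper. This is only a sketch: the function name <code>property_data_url</code> and the property id <code>12345</code> are hypothetical, and the build id is the one from the example link, which will have changed by the time you read this.</p>

```python
from __future__ import annotations

from urllib.parse import quote

BASE_URL = "https://www.accommodationforstudents.com"


def property_data_url(build_id: str, property_id: int | str) -> str:
    """Build the Next.js data URL for a single property listing."""
    # quote() keeps unexpected characters from breaking the path.
    safe_build = quote(build_id, safe="")
    safe_id = quote(str(property_id), safe="")
    return f"{BASE_URL}/_next/data/{safe_build}/property/{safe_id}.json"


# 12345 is a made-up property id used only for illustration.
print(property_data_url("B5yLvSqNOvFysuIu10hQ5", 12345))
# https://www.accommodationforstudents.com/_next/data/B5yLvSqNOvFysuIu10hQ5/property/12345.json
```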
<p>Let's form a link and study the result in the browser.</p>
<p><img decoding="async" loading="lazy" alt="Study the result in browser" src="https://crawlee.dev/assets/images/result-64a7188999fd127f0b6e26bf94a4a7e5.webp" width="1920" height="1032" class="img_ev3q"></p>
<p>As you can see, we now get the listing data in JSON format. <code>Next.js</code> serves the search page as well, so let's get the equivalent link for it, so that our future crawler interacts with only one API. It's derived from the link you see in the browser bar and looks like this:</p>
<div class="language-plaintext codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-plaintext codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">https://www.accommodationforstudents.com/_next/data/[build_id]/search-results.json?location=[location]&amp;page=[page]</span><br></div></code></pre></div></div>
<p>You probably noticed right away that I dropped some of the search parameters. I did this because we simply don't need them: coordinates aren't used in basic interaction with the backend. I plan for the crawler to search by location, so I keep only the location and pagination parameters.</p>
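<p>A minimal sketch of building that search link, keeping only the two parameters we decided on. The helper name <code>search_data_url</code> is hypothetical; <code>urlencode</code> handles escaping for locations that contain spaces or other special characters.</p>

```python
from urllib.parse import urlencode

BASE_URL = "https://www.accommodationforstudents.com"


def search_data_url(build_id: str, location: str, page: int = 1) -> str:
    """Build the Next.js data URL for one page of search results."""
    # Only location and page are kept; the other browser parameters are dropped.
    query = urlencode({"location": location, "page": page})
    return f"{BASE_URL}/_next/data/{build_id}/search-results.json?{query}"


print(search_data_url("B5yLvSqNOvFysuIu10hQ5", "London", page=2))
# https://www.accommodationforstudents.com/_next/data/B5yLvSqNOvFysuIu10hQ5/search-results.json?location=London&page=2
```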
<p>Let's summarize our analysis.</p>
<ol>
<li class="">For search pages, we'll use links of the format - <code>https://www.accommodationforstudents.com/_next/data/[build_id]/search-results.json?location=[location]&amp;page=[page]</code></li>
<li class="">For listing pages, we'll use links of the format - <code>https://www.accommodationforstudents.com/_next/data/[build_id]/property/[id].json</code></li>
<li class="">We need to get the <code>build_id</code>. Let's use the main page of the site and a simple regular expression for this.</li>
<li class="">We need an HTTP client that allows bypassing Cloudflare, and we don't need any HTML parsers, as we'll get all target data from JSON.</li>
</ol>
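<p>The third point of the summary can be sketched as follows. I'm assuming the build id appears in the page source as a <code>"buildId"</code> key inside the <code>__NEXT_DATA__</code> JSON, which is how Next.js exposes it; the sample string is a made-up fragment, not the real main page.</p>

```python
import re

# Looks for "buildId":"<value>" anywhere in the page source.
BUILD_ID_RE = re.compile(r'"buildId"\s*:\s*"([^"]+)"')


def find_build_id(page_source: str) -> str:
    """Return the Next.js build id found in the page source."""
    match = BUILD_ID_RE.search(page_source)
    if match is None:
        raise ValueError("buildId not found; the page layout may have changed")
    return match.group(1)


# Made-up fragment standing in for the main page's source.
SAMPLE = '{"props":{"pageProps":{}},"page":"/","buildId":"B5yLvSqNOvFysuIu10hQ5"}'
print(find_build_id(SAMPLE))
# B5yLvSqNOvFysuIu10hQ5
```

<p>Because the build id changes with every deployment, a crawler should fetch it at startup rather than hard-code it.</p>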
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="crawler-implementation">Crawler implementation<a href="https://crawlee.dev/blog/scraping-dynamic-websites-using-python#crawler-implementation" class="hash-link" aria-label="Direct link to Crawler implementation" title="Direct link to Crawler implementation" translate="no">​</a></h2>
<p>I'm using Crawlee for Python version <code>0.3.5</code>. This is important, as the library is under active development and will have more capabilities in later versions. But this is an ideal moment to show how we can work with it for complex projects.</p>
<p>The library already has support for an HTTP client that allows bypassing Cloudflare - <a href="https://github.com/apify/crawlee-python/blob/v0.3.6/src/crawlee/http_clients/curl_impersonate.py" target="_blank" rel="noopener noreferrer"><code>CurlImpersonateHttpClient</code></a>. Since we have to work with JSON responses, we could use <a href="https://github.com/apify/crawlee-python/tree/v0.3.5/src/crawlee/parsel_crawler" target="_blank" rel="noopener noreferrer"><code>parsel_crawler</code></a>, added in version <code>0.3.0</code>, but I think this is excessive for such tasks. Besides, I like the high speed of <a href="https://github.com/ijl/orjson" target="_blank" rel="noopener noreferrer"><code>orjson</code></a>. Therefore, we'll need to implement our own crawler rather than using one of the ready-made ones.</p>
<p>As a reference implementation, we'll use <a href="https://github.com/apify/crawlee-python/tree/v0.3.5/src/crawlee/beautifulsoup_crawler" target="_blank" rel="noopener noreferrer">beautifulsoup_crawler</a>.</p>
<p>Let's install the necessary dependencies.</p>
<div class="language-Bash language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pip </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">curl-impersonate</span><span class="token punctuation" style="color:#393A34">]</span><span class="token operator" style="color:#393A34">==</span><span class="token number" style="color:#36acaa">0.3</span><span class="token plain">.5</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">pip </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"orjson&gt;=3.10.7,&lt;4.0.0"</span><br></div></code></pre></div></div>
<p>I'm using <a href="https://pypi.org/project/orjson/" target="_blank" rel="noopener noreferrer"><code>orjson</code></a> instead of the standard <a href="https://docs.python.org/3/library/json.html" target="_blank" rel="noopener noreferrer"><code>json</code></a> module due to its high performance, which is especially noticeable in asynchronous applications.</p>
<p>Now let's implement our custom crawler. We'll start by defining the <code>CustomContext</code> class with the necessary attributes.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># custom_context.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> __future__ </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> annotations</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> dataclasses </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> dataclass</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> typing </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> TYPE_CHECKING</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">basic_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> BasicCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> HttpCrawlingResult</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> TYPE_CHECKING</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> collections</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">abc </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Callable</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@dataclass</span><span class="token punctuation" 
style="color:#393A34">(</span><span class="token plain">frozen</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">class</span><span class="token plain"> </span><span class="token class-name">CustomContext</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">HttpCrawlingResult</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> BasicCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Crawling context used by CustomCrawler."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    page_data</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">dict</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># not `EnqueueLinksFunction`` because we are breaking protocol since we are not working with 
HTML</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># and we are not using selectors</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    enqueue_links</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Callable</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div></code></pre></div></div>
<p>Note that in my context, <code>enqueue_links</code> is just <code>Callable</code>, not <a href="https://github.com/apify/crawlee-python/blob/v0.3.5/src/crawlee/_types.py#L162" target="_blank" rel="noopener noreferrer"><code>EnqueueLinksFunction</code></a>. This is because we won't be using selectors or extracting links from HTML, so our function doesn't follow the agreed protocol. Still, I want the syntax in my crawler to stay as close to the standard one as possible.</p>
<p>Let's move on to the crawler functionality in the <code>CustomCrawler</code> class.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># custom_crawler.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> __future__ </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> annotations</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> logging</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> re </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> search</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> typing </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> TYPE_CHECKING</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> Any</span><span class="token punctuation" style="color:#393A34">,</span><span 
class="token plain"> Unpack</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Request</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">basic_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    BasicCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    BasicCrawlerOptions</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    BasicCrawlingContext</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ContextPipeline</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">errors </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> SessionError</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_clients</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">curl_impersonate </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> CurlImpersonateHttpClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> HttpCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> orjson </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> loads</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> afs_crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span 
class="token plain">constants </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> BASE_TEMPLATE</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> HEADERS</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">custom_context </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> CustomContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> TYPE_CHECKING</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> collections</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">abc </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> AsyncGenerator</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> Iterable</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain"></span><span class="token keyword" style="color:#00009f">class</span><span class="token plain"> </span><span class="token class-name">CustomCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">BasicCrawler</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">CustomContext</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""A crawler that fetches the request URL using `curl_impersonate` and parses the result with `orjson` and `re`."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">__init__</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token operator" style="color:#393A34">*</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        impersonate</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'chrome124'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        additional_http_error_status_codes</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Iterable</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">int</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ignore_http_error_status_codes</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Iterable</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">int</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token operator" 
style="color:#393A34">**</span><span class="token plain">kwargs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Unpack</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">BasicCrawlerOptions</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">CustomContext</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_build_id </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_base_url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> BASE_TEMPLATE</span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        kwargs</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'_context_pipeline'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            ContextPipeline</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">compose</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_make_http_request</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">compose</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_handle_blocked_request</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token 
punctuation" style="color:#393A34">.</span><span class="token plain">compose</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_parse_http_response</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Initialize curl_impersonate http client using TLS preset and necessary headers</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        kwargs</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">setdefault</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'http_client'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            CurlImpersonateHttpClient</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                additional_http_error_status_codes</span><span class="token operator" style="color:#393A34">=</span><span class="token 
plain">additional_http_error_status_codes</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                ignore_http_error_status_codes</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">ignore_http_error_status_codes</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                impersonate</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">impersonate</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                headers</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">HEADERS</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        kwargs</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">setdefault</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" 
style="color:#e3116c">'_logger'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> logging</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">getLogger</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">__name__</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token builtin">super</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">__init__</span><span class="token punctuation" style="color:#393A34">(</span><span class="token operator" style="color:#393A34">**</span><span class="token plain">kwargs</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>In <code>__init__</code>, we set <code>CurlImpersonateHttpClient</code> as the default <code>http_client</code>. Another important element is <code>_context_pipeline</code>, which defines the sequence of methods that the crawling context passes through.</p>
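<p>To make the pipeline idea more concrete, here is a minimal, self-contained sketch of how a context can flow through a chain of async generators. It is not Crawlee's actual <code>ContextPipeline</code> implementation: the <code>Context</code> alias, the three step functions, and <code>run_pipeline</code> are all hypothetical names made up for illustration, and the response is faked.</p>

```python
import asyncio
from collections.abc import AsyncGenerator, Callable
from typing import Any

# Hypothetical stand-in for a crawling context; Crawlee uses typed context classes.
Context = dict[str, Any]
Middleware = Callable[[Context], AsyncGenerator[Context, None]]


async def make_request(ctx: Context) -> AsyncGenerator[Context, None]:
    # Step 1: attach a (fake) response to the context.
    yield {**ctx, 'status_code': 200, 'body': b'{"ok": true}'}


async def check_blocked(ctx: Context) -> AsyncGenerator[Context, None]:
    # Step 2: fail early when the status code suggests blocking.
    # The set of codes here is an assumption chosen for the example.
    if ctx['status_code'] in {401, 403, 429}:
        raise RuntimeError(f"Blocked with status {ctx['status_code']}")
    yield ctx


async def parse_response(ctx: Context) -> AsyncGenerator[Context, None]:
    # Step 3: enrich the context with parsed data.
    yield {**ctx, 'parsed': ctx['body'].decode('utf-8')}


async def run_pipeline(steps: list[Middleware], ctx: Context) -> Context:
    # Pass the context through each step in order; each step yields exactly once.
    generators = []
    for step in steps:
        gen = step(ctx)
        ctx = await gen.__anext__()
        generators.append(gen)
    # Close each generator in reverse order so any `finally` cleanup runs.
    for gen in reversed(generators):
        await gen.aclose()
    return ctx


result = asyncio.run(
    run_pipeline([make_request, check_blocked, parse_response], {'url': 'https://example.com'})
)
```

<p>Each step receives the context produced by the previous one and yields an enriched version, which is why the order of the <code>compose</code> calls matters.</p>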
<p><code>_make_http_request</code> - identical to the same method in <code>BeautifulSoupCrawler</code>.</p>
<p><code>_handle_blocked_request</code> - since we get all data through the API, only the server response status code can signal that we are blocked.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">_handle_blocked_request</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> crawling_context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> CustomContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> AsyncGenerator</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">CustomContext</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token 
plain">_retry_on_blocked</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            status_code </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> crawling_context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">status_code</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> crawling_context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">session </span><span class="token keyword" style="color:#00009f">and</span><span class="token plain"> crawling_context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">session</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">is_blocked_status_code</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">status_code</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">status_code</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token keyword" style="color:#00009f">raise</span><span class="token plain"> SessionError</span><span class="token punctuation" style="color:#393A34">(</span><span class="token 
string-interpolation string" style="color:#e3116c">f'Assuming the session is blocked based on HTTP status code </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">status_code</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">yield</span><span class="token plain"> crawling_context</span><br></div></code></pre></div></div>
<p><code>_parse_http_response</code> - a function that encapsulates the main response-parsing logic.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">_parse_http_response</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> HttpCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> AsyncGenerator</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">CustomContext</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        page_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> 
</span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">headers</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'content-type'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'text/html; charset=utf-8'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Get Build ID for Next js from the start page of the site, form a link to next.js endpoints</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            build_id </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> search</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">rb'"buildId":"(.{21})"'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_response</span><span 
class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">group</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_build_id </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> build_id</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">decode</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'UTF-8'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_base_url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_base_url</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">build_id</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">self</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">_build_id</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">else</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># Convert json to python dictionary</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            page_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            page_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> page_data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">decode</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'ISO-8859-1'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">encode</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'utf-8'</span><span class="token 
punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            page_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> loads</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">page_data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">enqueue_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token operator" style="color:#393A34">*</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> path_template</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> items</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">list</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> user_data</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">dict</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> Any</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> label</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" 
style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            requests </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token builtin">list</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">Request</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            user_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> user_data </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> user_data </span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> item </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> items</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                link_user_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> user_data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token 
plain">copy</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> label </span><span class="token keyword" style="color:#00009f">is</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    link_user_data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">setdefault</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'label'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> label</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> link_user_data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'label'</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'SEARCH'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    link_user_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'location'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> item</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">_base_url </span><span class="token operator" style="color:#393A34">+</span><span class="token plain"> path_template</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">item</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">item</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">**</span><span class="token plain">user_data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                
requests</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">append</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">Request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> user_data</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">link_user_data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_requests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">requests</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">yield</span><span class="token plain"> CustomContext</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            request</span><span class="token operator" 
style="color:#393A34">=</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            session</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">session</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            proxy_info</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">proxy_info</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            enqueue_links</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">enqueue_links</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            add_requests</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_requests</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            send_request</span><span class="token operator" style="color:#393A34">=</span><span class="token 
plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">send_request</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            push_data</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            log</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            http_response</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">http_response</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            page_data</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">page_data</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>As you can see, if the server responds with HTML, we extract the <code>build_id</code> using a simple regular expression. This branch should run only once, for the first link, and is required for any further interaction with the Next.js API. In all other cases, we simply convert the JSON to a Python <code>dict</code> and save it in the context.</p>
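<p>To make the branching concrete, here is a minimal sketch of that logic outside the crawler. It assumes the HTML contains a <code>"buildId"</code> field, as Next.js pages normally do in their <code>__NEXT_DATA__</code> payload. The <code>parse_response</code> helper, the pattern, and the sample strings are illustrative assumptions, not the code used in this project.</p>

```python
import json
import re

# Assumption: the HTML contains a JSON fragment with a "buildId" field.
BUILD_ID_PATTERN = re.compile(r'"buildId":\s*"([^"]+)"')


def parse_response(body: str, content_type: str) -> tuple:
    """Return (build_id, page_data) for an HTML or JSON response body."""
    if 'text/html' in content_type:
        # HTML branch: expected only for the very first request.
        match = BUILD_ID_PATTERN.search(body)
        if match is None:
            raise ValueError('buildId not found in the HTML response')
        return match.group(1), {}
    # JSON branch: every later request goes to the Next.js data API.
    return None, json.loads(body)


# Hypothetical fragment of an HTML page and a hypothetical JSON payload.
html_fragment = 'page data: {"buildId": "abc123", "page": "/"}'
json_body = '{"pageProps": {"initialPageCount": 3}}'

build_id, _ = parse_response(html_fragment, 'text/html; charset=utf-8')
_, page_data = parse_response(json_body, 'application/json')
```

<p>Raising an error when the pattern does not match is a design choice for this sketch: without a build ID, none of the later API URLs can be built, so failing early is easier to debug than continuing with an empty value.</p>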
<p>In <code>enqueue_links</code>, I implement the logic for generating links from string templates and input parameters.</p>
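<p>As a rough illustration of template-based link generation, the sketch below fills a path template once per item. The base URL, the template string, and the <code>build_urls</code> helper are hypothetical; only the general <code>/_next/data/{build_id}/...</code> shape follows the way Next.js exposes page data as JSON.</p>

```python
from urllib.parse import quote

# Hypothetical values; the real templates live in the project's constants.
BASE_URL = 'https://example.com'
SEARCH_PATH_EXAMPLE = '/_next/data/{build_id}/search/{item}.json?page={page}'


def build_urls(path_template: str, items: list, build_id: str, page: int = 1) -> list:
    """Fill the template once per item and return absolute URLs."""
    return [
        BASE_URL + path_template.format(
            build_id=build_id,
            item=quote(str(item)),  # percent-encode spaces and other unsafe characters
            page=page,
        )
        for item in items
    ]


urls = build_urls(SEARCH_PATH_EXAMPLE, ['london', 'new york'], build_id='abc123')
for url in urls:
    print(url)
```

<p>In a real crawler, each generated URL would then be wrapped in a request carrying the handler label and the user data, so that the router can pick the right handler for it.</p>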
<p>That's it: our custom crawler class for Crawlee for Python is ready. It's based on the <code>CurlImpersonateHttpClient</code> client, works with JSON responses instead of HTML, and implements the link generation logic we need.</p>
<p>Let's finalize it by defining public classes for import.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># __init__.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">custom_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> CustomCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">types </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> CustomContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">__all__ </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'CustomCrawler'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">'CustomContext'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div></code></pre></div></div>
<p>Now that we have the crawler functionality, let's implement routing and data extraction from the site. We'll use the <a href="https://www.crawlee.dev/python/docs/introduction/refactoring" target="_blank" rel="noopener noreferrer">official documentation</a> as a template.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># router.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">router </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Router</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">constants </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> LISTING_PATH</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> SEARCH_PATH</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> TARGET_LOCATIONS</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span 
class="token plain">custom_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> CustomContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">router </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Router</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">CustomContext</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">default_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> CustomContext</span><span 
class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Handle the start URL to get the Build ID and create search links."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'default_handler is processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" 
style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">enqueue_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        path_template</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">SEARCH_PATH</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> items</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">TARGET_LOCATIONS</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'SEARCH'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> user_data</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'page'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token 
plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'SEARCH'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">search_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> CustomContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Handle the SEARCH URL: generate links to listings and to the next search page."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token 
punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'search_handler is processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    max_pages </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'pageProps'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'initialPageCount'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">    current_page </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">user_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'page'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> current_page </span><span class="token operator" style="color:#393A34">&lt;</span><span class="token plain"> max_pages</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">enqueue_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            path_template</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">SEARCH_PATH</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            items</span><span class="token operator" style="color:#393A34">=</span><span 
class="token punctuation" style="color:#393A34">[</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">user_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'location'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'SEARCH'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            user_data</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">'page'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> current_page </span><span class="token operator" style="color:#393A34">+</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" 
style="color:#00009f">else</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'Last page for </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">user_data</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">[</span><span class="token string-interpolation interpolation string" style="color:#e3116c">"location"</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">]</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> location'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    listing_ids </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token 
punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        listing</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'property'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'id'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> group </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'pageProps'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'initialListings'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'groups'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> listing </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> group</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" 
style="color:#e3116c">'results'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> listing</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'property'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">enqueue_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">path_template</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">LISTING_PATH</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> items</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">listing_ids</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'LISTING'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'LISTING'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">listing_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> CustomContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Handle the LISTING URL: extract data from 
the listing and save it to a dataset."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'listing_handler is processing </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    listing_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'pageProps'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" 
style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'viewModel'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'propertyDetails'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> listing_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'exists'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">log</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f'listing_handler, data is not available for url </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">context</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation 
interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    property_data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'property_id'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> listing_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'id'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'property_type'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> listing_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'propertyType'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" 
style="color:#e3116c">'location_latitude'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> listing_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'coordinates'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'lat'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'location_longitude'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> listing_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'coordinates'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'lng'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'address1'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> listing_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'address'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'address1'</span><span class="token 
punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'address2'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> listing_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'address'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'address2'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'city'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> listing_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'address'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'city'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'postcode'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> listing_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" 
style="color:#e3116c">'address'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'postcode'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'bills_included'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> listing_data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'terms'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'billsIncluded'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'description'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> listing_data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span 
class="token string" style="color:#e3116c">'description'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'bathrooms'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> listing_data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'numberOfBathrooms'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'number_rooms'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">len</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">listing_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'rooms'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> listing_data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'rooms'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> 
</span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'rent_ppw'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> listing_data</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'terms'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'rentPpw'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'value'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">)</span><span 
class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">property_data</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
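<p>The chained <code>.get(..., {})</code> lookups in the handler above could also be folded into a small helper. The sketch below is only an illustration: <code>dig</code> is a hypothetical helper (not part of Crawlee), and the sample dictionary is made up to resemble the <code>propertyDetails</code> object.</p>

```python
from typing import Any


def dig(data: Any, *keys: str, default: Any = None) -> Any:
    """Walk nested dicts key by key; return `default` as soon as a step is missing."""
    current = data
    for key in keys:
        # Stop early if the current level is not a dict or lacks the key.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up sample shaped like the `propertyDetails` object (an assumption).
listing_data = {'id': 1, 'terms': {'rentPpw': {'value': 120}}, 'rooms': None}

rent_ppw = dig(listing_data, 'terms', 'rentPpw', 'value')
bills_included = dig(listing_data, 'terms', 'billsIncluded')
room_size = dig(listing_data, 'rooms', 'size')
```

<p>One reason to consider this: <code>data.get('terms', {})</code> returns <code>None</code> when the key exists with a <code>None</code> value, so the next <code>.get</code> in the chain would raise an <code>AttributeError</code>. The helper returns the default in that case instead.</p>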
<p>Let's define our <code>main</code> function, which will launch the crawler.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># main.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">custom_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> CustomCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">router </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> router</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span 
class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""The main function that starts crawling."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> CustomCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">50</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> request_handler</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">router</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Run the crawler with the initial list of URLs.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token 
plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://www.accommodationforstudents.com/'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">export_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'results.json'</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
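<p>Since <code>main</code> is a coroutine, something has to start the event loop. A minimal sketch of an entry point could look like this, assuming you put it in a file such as <code>__main__.py</code>. The <code>main</code> below is a stand-in so the snippet runs on its own; in the real project you would import it with <code>from .main import main</code> instead.</p>

```python
import asyncio

events = []


async def main() -> None:
    """Stand-in for the real `main` coroutine from main.py (an assumption)."""
    events.append('crawl finished')


if __name__ == '__main__':
    # asyncio.run creates the event loop, runs the coroutine, and closes the loop.
    asyncio.run(main())
```

<p>With an entry point like this in the package, the crawler could be started with <code>python -m</code> followed by the package name.</p>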
<p>Let's look at the results.</p>
<p><img decoding="async" loading="lazy" alt="Final results file" src="https://crawlee.dev/assets/images/final-results-f14b378f9aa0cbd5d1185301c49a222e.webp" width="1920" height="1032" class="img_ev3q"></p>
<p>As I prefer to manage my projects as packages and use <code>pyproject.toml</code> according to <a href="https://peps.python.org/pep-0518/" target="_blank" rel="noopener noreferrer">PEP 518</a>, the final structure of our project will look like this.</p>
<p><img decoding="async" loading="lazy" alt="PEP 518 file structure" src="data:image/webp;base64,UklGRvweAABXRUJQVlA4IPAeAADwhwCdASqcAWUBPpFIn0slpCKhpTMJILASCWdu4XYGQBqTpq/L/u3ck4DjnVMuyL5PxMYbPQx/vvTV9PHoM8wHnOelD/mdKz/zP///6vgp/pv/j9gD9pPXB9WX/b+mf6AH//4IHw9/fP5J29f2b8tv6r6U+F32n+q+b5jz6cNRH5D9y/1v+C8w/8h/X/EX4Rf4HqBfmX81/3Hh5/2P9V7iDT/8X/v/UC9gPnH/N/tv5Ue3L6P/of8Z6l/lX92/8H2RfYB/Jv61/sfTn/b+CZ9l/yX7EfAF/J/7j/0v8H/nPhT/k//p/k/OP+ff4//1/5j/R/IR/Pf7j/4P8j7a///9xP7of//3c/3S/////GytcvQHasOaljN4jfbAMP3aj0uKwxzitE1urEUBfir8sOZbIsBaxOyFN1ZxsujKiYe3uiyZlHyHFTweFHjPK1URpBrU+XWNWnA9A0vYaP4/6EbJFzorMq9iR+Plm5oyx+ha3T8PHrq2hzZo+ERrzTFlUcXtBp4WNL1kNPIkhoYBwF/BTh0OyomHmVpkgWBHE+K5iINDAkBhUm2AUcrnRYlhDGFEcv3iH5fAKcOh2U4iO9zRhCyv7bmqcikf8D6hU2TB8BlnHRk84wk0F+tlS1j1FZsh2VEw90OtXbwaGR+d/PJdE79KTvKaEVbsmbymJxm+NxIon+y+AU4dDskhu5Kvhkej5J6UuUJHX/huNCb2U30KTvqkrpOefjFh9OpiismcC6MqJh7odkkDAkQyCe5BOWJejasQCoohmYpTHoFgIwe6eR6qeKSXwfl8Apw6HZJBRo1B4P3OkhvhVvSmYSpLAbezGTvsvgFOHQ7KiWsFeycSkXHcEaD4CqGIJhGSPmQXB/+BpQbLoyomHuh2SQOEylR2g3VOsH+3Q+hU3ktI+QP4NaNTdvoIWmn3wCqFh6uWFeDEKId2XRlRMPdDskgavdoFkfuHgGLBVJ1LLCyAaIMGomq2Qs2Ubzm8vgFOHQ7KiWsJti6MkuwilBBLwNjccNxGsKwlg/pg8DKiYe6HZUS1gruZFMSGqpnn+HPw8Z4auCzvpyqFdfQ7KiYe6HZUTC5OM4HlQbSsk3cNXSsz095tovxmyaqz0xl0ZUTD3Q7KiWsKDo3sit7/ZqO+tQf8O6pPHyqyvnLTh0OyomHuh2U4gAkqvkpsdoUiRns1KxE17RaFRtNNYdDsqJh7obFs31CQk7PPbBHmXBBnt0wophbvqccTtD3HbJ4IYVEnWYqnY3fAuuCnDodlRLWC1b8W84JAJMpI0AcrYOwfeil83+pDMpaZNlQeVo6u6YR64LYqqsx9xBYw90OyomFi7pOUQuuZESQVRSFGszC67aTxXorrrj01CkqacOh2VEw90OyP+xVjELlDdUmD6gIaEEff8HA6p/AKcOh2VEw9z8EtxsbHsFOHQ7KiYe6GxhqnDodlRMPdDsqJh4gAAP6Vp35aMhVa8zeSdOjP1bW3C12q6jRXPfDditGwDlvzmWjxscBvqs8fRarg5Vx+RlQ9K8xOGQV/43/HnltVJqO/hDra9fw+yfQxE+E0gAuQTcw10FnRie7TA3APnvZQO8HY1iu/qmmaActKWtJ3/xfHs9VRHH7nufV4fIk7IFjBXUY1c5Vam2oY8fXJqhf8/1QQx3g/jmTMx5TBta565OgTUXDY69dhe4j5NNSaIPwKC1wZJH/uyPScCvApzbChkiMjMLNLuvGECSwrmozPrwo7PzfRZ1PNhlWvBHVqPGuVZMQd5R7whOSBxWRdMWnFeYObUOxmbCpqH5X9ojC6B3CgdHOB07pU/6VhmWHZ4twvLL/kYStsFGgd0MhLyZTwPPN5Qyu15BsiDGRV2w3fRoSIc3cWXWt
raY7aABlXlofzhoUw2SBQ3xrm0jln9Njumb3mSmZuzGoJzFw85SnkhPNflxcydSYteMsnQg1+SR4v5hX8fs1TozPzxILpHOLe3hPggvcodHSKWsC/FMdjKzPxXhr9LbltlipUtAQzwLAi8swDvdikNDQonCQsUmu8aLG4FYpK/rtbS8tV8KFscHZdDGUPEK/mn3qBBtA72et7u0ZHvj87nFu+tuqlIJ1IuS/lA0RBqXOtZFEPw0EWw2aoE1QxlTHn7TrhahJIyBORmgVgvDGiyN1x7qKgB4xLuUWhfNFMf4jydTwT5wcQoorXUtfrEBfYplXP8PzFlMQJkBhgiN/5LqdyIHQ4OBlc7HKqgVw6JJpGdamcrrnUauRc02EWCaxDP4YFChaKn2j46UR/SQ7qNa6V0bzEomuyIJVh1SDSQU5PkB3KW0awzAgZoanm3i5NQinBKnzPpRpqa4Sd7iH7oUB12Ehv3ELp6BbPYD4Sz+G72fxH+9MAqYFWucfvZDTQbuuGnMyavfr539eXCf187+vLhP69NyhnmyF6au8jorx+vq7lwghYDK3Mi9NYv3yUV+TJhC/WhL84Zrp0228Pdi3QfVJjvZ+Y/JTmS1pWhG4rofgEtoBwMbvpuJmL8hzI7kvpapk975wrV79GuhJ0CTyZgn2YyTmUIIRWCO5AOXhTqLsw2LNFw97BEYF3Lv942HwG0wltWupxrMYhSPsLbQ8P05UcF1ekN4DHiluN/H8iVNeGdzjVBj4SqmLwZKMP6kQL3M+kjqMZnzLfQMEpNptlS0CltHyQLXLo2wk+kQZeJoWVJkL7FLz1z7HJE+V4uSTmZUs9jeXUQGlqQOYbARGQI3GbJhFnZq/zgyHQnXisz2RlE4+QtiE6hBQbCBdcY9a502n839m6iP/GwbobdUXPlS8cHyIfX0U0DFk4CqPGYagipU+2SJLyMRRF2yJVzpAABp84AW8gT2NFEIwRTTU1cPiiGA0TXVR6+4n5wWXN6MDpc3KRkQK9NFDM08Z00dUSZsYq8+49zqn2w4ICRSYMHwxz/uZP6ACVzvSJX49ivaKQTbwyvDrUK7JGnpzFe5tXe/0PH/yEx77yO/lf1Pdxqkni5C2en7FUp05CF8bF1RZqIGyDDsEAh0TNDPoHPt9mZ3RFf8soTXEnXWG6+sWNZ+7+l1gDL5ao3kcFDcwUiNGAu/Z/yyJUNCTE2sPz5jsDZnAmpbm1f8NZ2sE54ITzxM7eblJUzKpqnTAKFbAfknJM28EbZhbQQGh/QyLw7P71i+QD7QRyvfBYcDO98Rjv0iGZrHozkjQCmPq12Nkgge5rQ7BZvFHM419Edg2uQHvfDtUp0q0AncPMCnYAH+vlI73MXmW3czYKL41A+jZKURvuJ9uEVvx6BW6uodpBSeHaXQwqNE8eRqRMxvHRpVFf6sqM4ZZvBGG1BLrqNxmrnixoTjU4scsMEGBidxlLdW4xXw/DlgWftNZzyyJz7JFb7NthouBakIE0ujw9oF4hjMWi9PtSUdtatXxMGCXLXHIfVHuktHZii97u/ocJNA4+eSW+FQRuqPWc4aJmKSed49Tss6yI68WICmaHwci0KQrVyT7DarGYkz/OX0lqMbOjKB3iApw7xFr/ndCzl+XhTyJkdDZsue/ABrgEod5h6IBtUnKv6iaOLYn1zAzHBSF/Jvv2V9pxFIW8XhFJIsWVBCGMVSZYq7h3v4WgS1Fk3TzFiW2b0J9GFMsp0NHm3inUM+Id0DiWekRJoGhfXUdm9rX82ADp5k4YP23P54Rw3uT4YHb7cd8sVRepyQUkrGq13W4/LflMbf+jge1Kj/7y7wp6Fa0MRmogyaIDjPghIj4d7ket1QiJ/a5tob7iGGWe6zsY0dPZv8GIFJIjJ94P9cPPKPI+mmaa5pHp5a3PYoRY7b5Zz5aln3xFJU3OTDTjhBr0RCfyT5dn4cmw19QHVTI4WKVUUxVAnh0MPIroEpqifBJETmtxQMq
C1i8sPLiwvbo6psOsMcbfA9+G0WL6QIUpH7he33mt99DTpxe1B188Dk03/5L8U+7mOG0+LilFMRWqVo5KWeM9UxryUvOV4nSNEvJamOdkm9faow74yGILDbtYFsWi/tIB44DAZKDs1TVzbxBBdMv/AInLvbeFUP2BFMqsCL/WzyTMcjQMHiisAIm1nCthUhmHYnlCacGQ0b/J2pqgpTopwi/HZSwtVxYQsz7G/p4YAYL3iNQGvDTUHHSf8E/U4I0N0xgXBzgDX0qAQQkZUHAfHlU0H4GjZcKpoxK/1HMv2Wz2iSkZECvNIUcjz7pvp7tW/pewLzPJ7I6fSwWcwbVX+K1BHkdzCLCAZH9A2W0fwsEldzOG/sWNJVj2K/uD8oDcDrd1MdEY3XnjxEu2rLAsDaThej8//Q1ja4sOR3KmJq7eF1MLaXpRcohSS9O+9cWcdJQlfL7XcYITkbqceTloWJKMyUSzXs3BM0HEK7Z2HFJVigj/8nZjoPpwb50PkjHZfLSxFmtFRfETWe0kyVlTuc/3StNNMLjwXNWQ0ZcqDvZclqnVkenZnLZXXpdQOH4YmlrB16mJbFVBAnnp3NgLz63GSLjrlNGhNTvqM2uDjh2A4vRxkpmeif22vuufr7zz3wJYd2Vx2xL82urlzEhwErox4pkwmyvwT9yQykDvGc8/XAJt8ZyZ8kLBgKACZLXKmoi02YL9mJUNzSLK4GrXrr2m9tqDHqWnYpsFvG5Me0goK5wj8TkiBvch1wkJ2wbzSncQUcNPBlORtrosD2b3zuVb2WuY2I2GquanN6Ak9XfzFe2/uS4RqYUWiNtbQywMIpq2Yc5rEDRcMVrhLgiFzhQtTAwlSG5bCuYxqA/uL/r3BO4jA2maGSuPdPsfQZthhMNYxnZqazk3yMNSHQHSMZqkKIVaM+yeelwA1eM9mWtObmt6Dtd2/2DRkRJ7rr6hC/Yjn/H0mDv51YgmZQwGsOaPlEwaUbNbnvTZ8ev2zeJ2J8Ch/wBove0igS4h6tIe+qJn4QH5Po5Imgj/PWDmOFHwimbJmiDZl9PpBHZjAshMt/56Gn1DKXuAHjnaZ4Lkk7R7tfx2w1ciM+ueVXi9cubfWen8bF8gquDLjNjjc1IMIUpbdsQWdldwjpemakFk6xq6Hl43QUSp1/e//H519qAaE4kddOUUa2A/hXEY8NY0weEE91vAsRrIEOZvRjgEa9aCN/21C7BVvszqRcqGEPWSmnxUEVMw6tgUfxq/Lc4ZvkyEOopnUFePZNwxdMvMDJ1mbyt6ZqiL+CkCan6NAyaoONmABeG9qiaIojniYWmyKg8fAAtXx6IUu8CJQVn1Y7eeaU1DdwmxC/YyfiyaDu+U2YEvI4YOUzqX967TRR4qHvCsHx7MiaeXKso9Gib63x9nKMc9k9v/lnGbwGo99lWYealdaLlqiyU/ue+5yEe86ufRLXR9y4UPoTRPTWDsdLfQvwC+IQWuan6JCMgWJ54GZTffEyr76eNwLnjV5azYEhcQFR/ujLi7Ts7fCGIg6LUNAupiekp8LTYKGWqkyNyMbTlGWozet3vgl5KTkQbO3OZ7seiHlzWH61KrmfHYkVk6P3TXB7KMgM/Gui09YPpsXwfFmDCgTro383VhXwKmPouHVxhx4xiBdmFFG3F+LCX9z5wCH4k1CCDo5MY9WpDZuypWtWYdApWAIFG0I3bkSdyctxQFq25J4TRg6ylXeIZYtkNq9vpKb/8mb6OTnvITGEQZnn8rDOxnBdnebKs+piX2BJfSMgg9q3Kk366U9BWRLFiUoVSE4FwqnIxz3GeIKpHRnf4gExe4AdCLvoifiHdshgxHhkv9VfOVZikJkXr5zthjXlhPCGkTi+GXwpy+A8GWbBbkexrfxEM4z1Ps4ALqkqgxB26eutjBVwjPHoae9AOSRfFlwtzOx2ZbG70forUpXhYASuO4eQJAo16sYnLpqj5dMTxAOJgs9sYFVV5u/khGtkjdQQr
pbWWVfnyi88+Gsx1bcGeocmdCxRwbnkmY6dsD2xT5vVEUrOYAphbD1zv0IE2RaxkMeLrRDrrq1/ol2jjndT9IXI7Y2natu7fAamDTIv4LcSl3+XtV8bsFvTSANPDT5VOscx/rufD9ipIFun6T5TO9Icr2R6OL8Rjzyyy8kla/30a3+6D5aj5dMTxAUOv2UCmzgUhcPL/Jyy5k2i2ymAuaL9jSlsFQW0qZqs8gaEEBaiaG3t4Kvia9IepxLcqp8UTDAxTwsWyOTzM729gStRzGkWj3VCvVxRTzUw/cBbpj9kUIVmLFG6rp33MYYGzHWWgn+qoAWM5R2yj70xF+xx4khabHoeduoD/yp/L4H/6ENDsgWVL/XMIBGlZhEddgP1kgTW+GbnyL3Sowy/klH1Pddh6Fw1+hq6/yATzRE3cWF7dHLIXPwfJHXRiYku+lIyIFh6/t3pvp8zOduuwLKrx3aRtzr9hmHauIPUrc+GICcARLo6BOTDvGYXr1ReIL3LEsleVPOUa/3r6LSwMGvwIWEkDjBJwvm4YIYYZue4lYOjdjU9PajRHYvFaZbM74xOAVB6sG6bdYQNBFuAQrGSdBXBms4ekN/F+rpdTl7TnAczqZrk1aXGtycXzFaq8DFtM6Zz2MO1Yosij9ANhKTQoIt9onYeoRPraqAImT+rUapWDXca4zr2W54F8Am4quZHuFsX7XypOWgngdZrq3jONJS+Iwfdncc/LvIP0EUVxdB0/UVDfkGsKbgukYrGGTHkdLE4IRXmM2WGo4AUjKCkDQeVjN05blTpFpYc2kxacM4hjnmrwGD7Smzk8q7LJ1KdUcRpnc4vgVXm2WHKqEzNNXlx2xL82uhbolQPd3PbbBLysPpp4D6YDKosXaYEykMcv4VPYaEt55D2sVVQiXlRxsofkEX23BMVDBfZ4e2TZD5m5HUhte806vAIN8x9bANsdOQPS0NhthMJ9Ljl01ClBforhBR+mZZ0dM6BBwEjdxcOG+dsMqf6CO3dDGN4PX4dEiIGsTp/f2hfFYnAWgj7pR6Iars1r8+0B0686SemYMmZDgVo1+0abciLrnqA75HqjqUJKbe0cA96YyAOKGHZZblqh9tWen+yo6e7seo4Xi1NSaNdXKPGBS6VGUR9OijUQo74ZKqH5pXLB6tiyOcUclcZMxBnegonDqS7C3ozf+x0N/yGjKu7JjXkWNJs5uwMYAASp4OnVNbolD+rLMWc3gXPs+W+ozW533eV2l0MaDKd3Gtt3sp3ATBLwDZZ1Ppg8zE+JxiZa286NOfFtrzjgP6rafU0aog5GFsDqRB17hr6vo41bnnuaBHFGm2bPoOcINWappUQc6GddvMfIyz2IWVkmaOR6CEsKLCgj5hIl1YYNUycIjX3M5G9E62l7eAb+HuVeuTCDirm+RQLV6Gx4HwwrCrzeLhgG6Gh9nPmyjAqr8u6n7a0xhjkkGYfscuyJgTuFlFS39jE1/IAbaKry2Qel4bGzW3PK1yS9n7nwdvMqbgRgZUoW8howXOcFxPOybJFXw1Wi6JxJ6O2iU+4tVU21XUlGH7ewvN4COYtURvouMDZ18K6OVrDUvOKzYKJHk9EkmCtygUSRfCaDf4AkSloYs7hZRvfpkeqwCIm9N925ap4LqZmSbdnuBrnmRBkrDuz2/J/sYrZtHxEWnG/OgnhIhIvzn3SfehniPIYiQc6Z+KMPG/TprXqnIxC2ADLh8tJBvOcbEWM/2TfGxE1wjIYxniPEmHHHAi5yYe+x7Ywpl39Ryo3jIZtfQsP/uUwOTED4K8SvAm0MWTn3PU3+kKJA7sJi+hXkmuwGJci/0QPwwvHfG+giY4KTAfDcNWbOoHoA4VS0xNGkzxn3h0fBk4Rn9qIEbazmVsVBgATJqdOfjBAVBbOtizPNyF2goQCVM8F5g/czGfPY3zBnAhoO0crcPesUGGHuDVyGxPYi7cGQtPW0JBADvGyeDhMvVj1FGWYWeYzoEeMF72st
04kuFcHgGcjuAh+Udyy28hT4xIYJZO/2Jerj3VR790YLzQAYbC2140dRwVfTY/ezqYIKo3uRJmgRr97A8F8xbczX14Kq+i4X76vO+vXZjVfzgokBvOqKILw5Yn8CbR3YDZ58j34yUuFd4ISYnS0jjv6mvon3xNknJfTCXy1LpOlBy5/UgN6ecuXHaODQzSrerODLJ9L1ecUY1LFswscM/m5RJ8lWxhpw+eDIPWcwB7O4ZfS96dZCKSAFNd4ZWUzHVhzm7FaFVDMPJ0IY6I8PuSt04gaLHZiRhGSVZ369XgmfcEpebfQUvsZ36FOa1pQ784fSNZpkWS4wwB9G+it+uO5oZde4HtKOgzEfDjsMRAN9OX80ekZDGn2NaLfPMAx1jgw7MGp+VUURP+bQawBywwtvQ97S8aXsHaab+73xq23pDuE+U03pmDG4vNJ58sIFp0Iin2pEuDja2zKJTAki1J9HdQ4VnYrArSRiOan8DbFwoDkyrTJMeZP+HFZf2Pbtol57ivitVUkoqMdijbvYaoQ1zhUIwRBVNi5CgDmQg2w6eKMz7GHUZdjYFODBATBfGyZUWAavTps1KuSVIdLhlVVrHpDLWgItlzL0ZTOoQdVmy+dVu91auQl7e0rI86k3AoeAu7IHMYp0S7M3NUvgLzmoMypROKrq6YkFKYLZssxtN51B/Aez+J4K9sHTHGzfJuWW3rnJlFckOPN2UGVd2yfGF/jkFm6ptNwI68D81QAQOYdSxGuGmc1EiedIoLTG10s+OP2JJB8AmQbaKcQV+mVlS0U4gQRwGdc1BuZqPEAvprquRZ5M4AERIlGuTGggCCstAPV/X8AiMIH1OZb7+z8Fmk5ULVOtfCSe2hrXzU//jF2A/PnWfGd2FkHaTd8+Um806f1n6wanz5yJR5ntjYNieu/oDmb9f/tTNR3l6nkPSpn3vfPmbvQwxwjasA8LtGbBw+gZHaVT2k9uTYptrZ6F48FbJcEj8VRwIyujpPuZJy92H7wB9HbOn/YHuPOHQuE3tv3pvtKK7woJofNKYodgn6xEX9IpwY4k6gHsFqZeyVlWZQbA0bc6/YZiDEQ6qAjXOzuzcU8vsKV2QC6vpmB0fzsUvmD686z7k0SAqsb3DlXv931kkO7YtwgiLyIY5lByE9RrqsGd2eAad0BjlZJK8hzK8BaKxk//oS1ygIoDqlwtLsKcAdY93H2RI2e8nj64pww6uX/8tvxob9tUFPA3u9vzFVG+8myQW1bIlj734NaJBP+wb98vzhR/l0gD6qTeC7rU4oJfRosavg7n2kqi1ZBXk0vozf4jgoNkyGcJBDOtlp4dalYet6GirZN1A4kqHjYiuho/Z90pbUGp3ZOiqdEZTNz8LbwEiGPbo+FAesw7T3smwkz7D3JS5Ye39bFaQxJVOJDSL72GwvgewVfrQEpdBfWTziuw3pn8v/QxpQtVGjGyuMzevpfMGf/VtjtGT/tdHW5XNcMruD8w9xgOOdCeQ7fExcjWSTNqVUtgtDpFqnJ9/e2k1GDgbFcRZmOQ8U8884M3QmsGAS+5FsmiQiEIgr3qDB/veYV5N5oz7CULbtNW0bY7wR1ZQizXJq4y1X5vsmAh8Amb+uwgnyhKOVFWn81jh+hGCR1DbPE/e7P+LHUvpbA+qhqc92KY2L/cXtJec4XVH7Igbu3JNe5o6R7xxLTnom3y7YAzQMBGIT4VPvOSP48PdaS9TlG/S1pVGItHGND79b6p43pwUnouyGk/Fib+UAeH40wx3vsYd+3A/dfqww6foSc0B5UhCX75zFShfE6sDIualTD6/X64L9pJaGPOw4zEaq4JQ9RVN7ShajYIERFhGo0BqOC06/1HDbli40+jLppxA/lHOksIwmWsp50Ok12ZH0GT2vgbpCK3JY4QKe/1WSBKeqdadH3LclvUdku85VP7YyKkkOSw/g88z8eTNaZ4smlkF02di1z1Q/oiv4ejuEKA1jrm1gWGbcXRn2WJmd9L
31BspP/9+s/J3Zy5viDE07/1pV8xHabJQ4wMablz8wPtVJYZ7zmByO5ufUAHjwwkbOU0tC/tkF9+3mfFsfNPnb+pt3EC9r4h+gg33Ilu8Mok2OXnCVWWHmZbi6xXqxTYTe6bvRppOS40bCqgtquyNoVTMHCdM64ZLwFio3HCcCbAaMZZTXFPXa/SlzqjrlKLApGdz47VEcE453UtO9WKcJ+uj5xGkptnOql/mEUpn9Nl/u5X8TeQ/WOrNr2qtk8ugEGFGo2BqStSFnqusfEM7AiuZuiuraZt90Wg3bUQCMK0L6cn/QeEmfYANN8DFmczfOaLSr+nvabfb7TQfj0GMpm55W5lw8fjDGg/Qq/EMy6zHVR5K0ExJaLfNF2Tev67m6V+0jqMYdtveV3LAAMzWWVKDUW3WHM7mSFKk39o2w8cQ+MdOAsJgIDk4832vfsJiTkdAsjbSJXpH8+ouj6NCz2a00+9V5A7WfoHAP7YAwTJlLyzOBzmEaj2RmzyE7Puzs+3QFXm6TfoOLlwNI1AXX9J8VwfqyuCPydQRJ2EbpnNQJvcVOj5os8Dr9XDBTYv6v9n3cyiY9vzqGhJqMnmcMqbvYRUmdKn9ONCGJPM3XIJY79kMtpXnsLLsbGeHAvpm7d/So1D5LMFctpzYb/cGK97xWnoLd/GIoUEKAAAAAAA=" width="412" height="357" class="img_ev3q"></p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conclusion">Conclusion<a href="https://crawlee.dev/blog/scraping-dynamic-websites-using-python#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>In this project, we went through the entire cycle of crawler development, from analyzing a rather interesting dynamic site to full implementation of a crawler using <code>Crawlee for Python</code>. You can view the full project code on <a href="https://github.com/Mantisus/crawlee_python_example" target="_blank" rel="noopener noreferrer">GitHub</a>.</p>
<p>I would also like to hear your comments and thoughts on which web scraping topics you'd like to see covered in the next article. Feel free to comment on this article or contact me in the <a href="https://apify.com/discord" target="_blank" rel="noopener noreferrer">Crawlee developer community</a> on Discord.</p>
<p>If you are looking for a way to get started with scraping using Crawlee for Python, check out our <a href="https://blog.apify.com/crawlee-for-python-tutorial/" target="_blank" rel="noopener noreferrer">latest tutorial here</a>.</p>
<p>You can find me on the following platforms: <a href="https://github.com/Mantisus" target="_blank" rel="noopener noreferrer">GitHub</a>, <a href="https://www.linkedin.com/in/max-bohomolov/" target="_blank" rel="noopener noreferrer">LinkedIn</a>, <a href="https://apify.com/mantisus" target="_blank" rel="noopener noreferrer">Apify</a>, <a href="https://www.upwork.com/freelancers/mantisus" target="_blank" rel="noopener noreferrer">Upwork</a>, <a href="https://contra.com/mantisus" target="_blank" rel="noopener noreferrer">Contra</a>.</p>
<p>Thank you for your attention. I hope you found this information useful.</p>]]></content:encoded>
            <category>community</category>
        </item>
        <item>
            <title><![CDATA[How to scrape infinite scrolling webpages with Python]]></title>
            <link>https://crawlee.dev/blog/infinite-scroll-using-python</link>
            <guid>https://crawlee.dev/blog/infinite-scroll-using-python</guid>
            <pubDate>Tue, 27 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to scrape infinite scrolling pages with Python and scrape Nike shoes using Crawlee for Python.]]></description>
            <content:encoded><![CDATA[<p>Hello, Crawlee Devs, and welcome back to another tutorial on the Crawlee Blog. This tutorial will teach you how to scrape infinite-scrolling websites using Crawlee for Python.</p>
<p>For context, infinite-scrolling pages are a modern alternative to classic pagination. Instead of clicking through to the next page, users scroll to the bottom of the webpage, and the page automatically loads more data so they can keep scrolling.</p>
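<p>To make the idea concrete, here is a browser-free sketch of the loop a scraper typically runs against such a page: scroll, check whether anything new appeared, and stop once a scroll reveals nothing. <code>FakeFeed</code> and <code>scroll_until_exhausted</code> are hypothetical names used only for this illustration, and the item counts are made up.</p>

```python
class FakeFeed:
    """Hypothetical stand-in for a page that reveals items in batches as you scroll."""

    def __init__(self, total_items: int, batch_size: int) -> None:
        self.total_items = total_items
        self.batch_size = batch_size
        # The first batch is visible as soon as the page loads.
        self.visible = min(batch_size, total_items)

    def scroll_to_bottom(self) -> None:
        # Reaching the bottom triggers loading of the next batch, if any is left.
        self.visible = min(self.visible + self.batch_size, self.total_items)


def scroll_until_exhausted(feed: FakeFeed, max_scrolls: int = 100) -> int:
    """Scroll until a scroll reveals nothing new; return how many scrolls were made."""
    scrolls = 0
    while scrolls != max_scrolls:
        before = feed.visible
        feed.scroll_to_bottom()
        scrolls += 1
        if feed.visible == before:
            # Nothing new appeared, so we assume the end of the feed was reached.
            break
    return scrolls


feed = FakeFeed(total_items=95, batch_size=24)
scrolls = scroll_until_exhausted(feed)
```

<p>The <code>max_scrolls</code> cap is a safety net for feeds that never run out. On a real site the "nothing new appeared" check would also need to wait for network requests to finish before comparing.</p>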
<p>As a big sneakerhead, I'll take the Nike shoes infinite-scrolling <a href="https://www.nike.com/" target="_blank" rel="noopener noreferrer">website</a> as an example, and we'll scrape thousands of sneakers from it.</p>
<p><img decoding="async" loading="lazy" alt="How to scrape infinite scrolling pages with Python" src="https://crawlee.dev/assets/images/infinite-scroll-de1fd1c1791fdf8f6b5614a947ccc878.webp" width="1152" height="649" class="img_ev3q"></p>
<p>Crawlee for Python has some amazing initial features, such as a unified interface for HTTP and headless browser crawling, automatic retries, and much more.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="prerequisites-and-bootstrapping-the-project">Prerequisites and bootstrapping the project<a href="https://crawlee.dev/blog/infinite-scroll-using-python#prerequisites-and-bootstrapping-the-project" class="hash-link" aria-label="Direct link to Prerequisites and bootstrapping the project" title="Direct link to Prerequisites and bootstrapping the project" translate="no">​</a></h2>
<p>Let's start the tutorial by installing Crawlee for Python with this command:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pipx run crawlee create nike-crawler</span><br></div></code></pre></div></div>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>Before going ahead if you like reading this blog, we would be really happy if you gave <a href="https://github.com/apify/crawlee-python/" target="_blank" rel="noopener noreferrer">Crawlee for Python a star on GitHub!</a></p></div></div>
<p>We will scrape using headless browsers. Select <code>PlaywrightCrawler</code> in the terminal when Crawlee for Python asks for it.</p>
<p>After installation, Crawlee for Python will create boilerplate code for you. Change into the project folder and run this command to install all the dependencies:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">poetry </span><span class="token function" style="color:#d73a49">install</span><br></div></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="how-to-scrape-infinite-scrolling-webpages">How to scrape infinite scrolling webpages<a href="https://crawlee.dev/blog/infinite-scroll-using-python#how-to-scrape-infinite-scrolling-webpages" class="hash-link" aria-label="Direct link to How to scrape infinite scrolling webpages" title="Direct link to How to scrape infinite scrolling webpages" translate="no">​</a></h2>
<ol>
<li class="">
<p>Handling accept cookie dialog</p>
</li>
<li class="">
<p>Adding request of all shoes links</p>
</li>
<li class="">
<p>Extract data from product details</p>
</li>
<li class="">
<p>Accept Cookies context manager</p>
</li>
<li class="">
<p>Handling infinite scroll on the listing page</p>
</li>
<li class="">
<p>Exporting data to CSV format</p>
</li>
</ol>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="handling-accept-cookie-dialog">Handling accept cookie dialog<a href="https://crawlee.dev/blog/infinite-scroll-using-python#handling-accept-cookie-dialog" class="hash-link" aria-label="Direct link to Handling accept cookie dialog" title="Direct link to Handling accept cookie dialog" translate="no">​</a></h3>
<p>After all the necessary installations, we'll start looking into the files and configuring them accordingly.</p>
<p>When you look into the folder, you'll see many files, but for now, let's focus on <code>main.py</code> and <code>routes.py</code>.</p>
<p>In <code>main.py</code>, let's change the target location to the Nike website. Then, just to see how scraping will happen, we'll add <code>headless = False</code> to the <code>PlaywrightCrawler</code> parameters. Let's also increase the maximum requests per crawl option to 100 to see the power of parallel scraping in Crawlee for Python.</p>
<p>The final code will look like this:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">playwright_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> PlaywrightCrawler</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">routes </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> router</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" 
style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""The crawler entry point."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        headless</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">False</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        request_handler</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">router</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        max_requests_per_crawl</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">100</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'https://nike.com/'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
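<p>To get a feel for what a request cap combined with parallel processing means, here is a toy <code>asyncio</code> model. It is only an illustration under simplifying assumptions: the <code>crawl</code> function, its <code>concurrency</code> parameter, and the example URLs are hypothetical and do not reflect how Crawlee is implemented internally.</p>

```python
import asyncio


async def crawl(urls: list[str], max_requests: int, concurrency: int) -> list[str]:
    """Toy model: handle at most max_requests URLs, with at most concurrency in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    handled: list[str] = []

    async def handle(url: str) -> None:
        async with semaphore:
            await asyncio.sleep(0)  # placeholder for loading and parsing a page
            handled.append(url)

    # URLs beyond the cap are never requested at all.
    await asyncio.gather(*(handle(url) for url in urls[:max_requests]))
    return handled


urls = [f'https://example.com/page/{n}' for n in range(250)]
handled = asyncio.run(crawl(urls, max_requests=100, concurrency=10))
print(len(handled))  # prints: 100
```

<p>A cap like this is handy while developing, because a mistake in link discovery cannot turn into thousands of page loads.</p>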
<p>Now coming to <code>routes.py</code>, let's remove:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">enqueue_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>We remove this call because we don't want to scrape the whole website.</p>
<p>Now, run the crawler using this command:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">poetry run python </span><span class="token parameter variable" style="color:#36acaa">-m</span><span class="token plain"> nike-crawler</span><br></div></code></pre></div></div>
<p>When the crawler runs, the cookie dialog blocks us from crawling more than one page's worth of shoes, so let's get it out of the way.</p>
<p>To handle the cookie dialog, open Chrome DevTools and look up the test ID (the <code>data-testid</code> attribute) of the "accept cookies" button, which is <code>dialog-accept-button</code>.</p>
<p>Now, let's remove the <code>context.push_data</code> call that was left there from the project template and add the code to accept the dialog in <code>routes.py</code>. The updated code will look like this:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">router </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Router</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">playwright_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> PlaywrightCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">router </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Router</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator 
annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">default_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Default request handler."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Wait for the popup to be visible to ensure it has loaded on the page.</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" 
style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_by_test_id</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'dialog-accept-button'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">click</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
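<p>The handler above assumes the dialog always appears. If it might not (for example, when consent was already given), a defensive variant can give up after a timeout instead of failing the request. The sketch below is one possible approach rather than the tutorial's method: <code>accept_cookies</code> and the 5-second default are assumptions, while <code>click(timeout=...)</code> is Playwright's regular locator call.</p>

```python
async def accept_cookies(page, timeout_ms: float = 5000) -> bool:
    """Click the accept button if it shows up in time; report whether it was clicked."""
    button = page.get_by_test_id('dialog-accept-button')
    try:
        # click() waits for the button to become actionable, up to timeout_ms.
        await button.click(timeout=timeout_ms)
    except Exception:
        # Playwright raises its own TimeoutError here; it is caught broadly so that
        # this sketch needs no third-party import.
        return False
    return True
```

<p>Inside a handler, this would be called as <code>await accept_cookies(context.page)</code>.</p>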
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="adding-request-of-all-shoes-links">Adding request of all shoes links<a href="https://crawlee.dev/blog/infinite-scroll-using-python#adding-request-of-all-shoes-links" class="hash-link" aria-label="Direct link to Adding request of all shoes links" title="Direct link to Adding request of all shoes links" translate="no">​</a></h3>
<p>Now, if you hover over the top bar and look through the sections (Men, Women, and Kids), you'll notice an “All shoes” link in each of them. As we want to scrape all the sneakers, these links interest us. Let's use <code>get_by_test_id</code> with the filter <code>has_text='All shoes'</code> and enqueue every link with the text “All shoes” as a new request labeled <code>listing</code>. Since the snippet uses <code>Request</code>, also add <code>from crawlee import Request</code> to the imports at the top of the file. Let's add this code to the existing <code>routes.py</code> file:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">    shoe_listing_links </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_by_test_id</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'link'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">filter</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">has_text</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'All shoes'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">all</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">add_requests</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            Request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">from_url</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'listing'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> link </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> shoe_listing_links</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">url </span><span class="token operator" style="color:#393A34">:=</span><span class="token plain"> </span><span class="token 
keyword" style="color:#00009f">await</span><span class="token plain"> link</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_attribute</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'href'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'listing'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">listing_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span 
class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Handler for shoe listings."""</span><br></div></code></pre></div></div>
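<p>The snippet above passes each <code>href</code> to <code>Request.from_url</code> as it is. If a site returns relative or repeated links, it can help to normalize them first. Here is a small standard-library sketch of that idea. The <code>absolute_links</code> helper and the example paths are hypothetical and are not taken from the Nike website.</p>

```python
from typing import Iterable, Optional
from urllib.parse import urljoin


def absolute_links(base_url: str, hrefs: Iterable[Optional[str]]) -> list[str]:
    """Resolve hrefs against base_url, skipping empty values and duplicates."""
    seen: set[str] = set()
    result: list[str] = []
    for href in hrefs:
        if not href:
            continue  # get_attribute('href') may return None
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


links = absolute_links(
    'https://www.nike.com/',
    ['/w/mens-shoes', None, 'https://www.nike.com/w/womens-shoes', '/w/mens-shoes'],
)
print(links)
# prints: ['https://www.nike.com/w/mens-shoes', 'https://www.nike.com/w/womens-shoes']
```

<p>The resulting list can then be turned into requests in the same way as in the handler above.</p>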
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="extract-data-from-product-details">Extract data from product details<a href="https://crawlee.dev/blog/infinite-scroll-using-python#extract-data-from-product-details" class="hash-link" aria-label="Direct link to Extract data from product details" title="Direct link to Extract data from product details" translate="no">​</a></h3>
<p>Now that we have all the links to the pages with the title “All Shoes,” the next step is to scrape all the products on each page and the information provided on them.</p>
<p>We'll extract each shoe's URL, title, price, and description. Again, let's go to dev tools and extract each parameter's relevant <code>test_id</code>. After scraping each of the parameters, we'll use the <code>context.push_data</code> function to add it to the local storage. Now let's add the following code to the <code>listing_handler</code> and update it in the <code>routes.py</code> file:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'listing'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">listing_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Handler for shoe listings."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">enqueue_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">selector</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'a.product-card__link-overlay'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'detail'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'detail'</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">detail_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" style="color:#e3116c">"""Handler for shoe details."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    title </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_by_test_id</span><span class="token punctuation" style="color:#393A34">(</span><span class="token 
plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'product_title'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text_content</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    price </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_by_test_id</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'currentPrice-container'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">first</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">text_content</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    description </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_by_test_id</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">'product-description'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">text_content</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'url'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">loaded_url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'title'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> title</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'price'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> price</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'description'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> description</span><span class="token punctuation" style="color:#393A34">,</span><span 
class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="accept-cookies-context-manager">Accept Cookies context manager<a href="https://crawlee.dev/blog/infinite-scroll-using-python#accept-cookies-context-manager" class="hash-link" aria-label="Direct link to Accept Cookies context manager" title="Direct link to Accept Cookies context manager" translate="no">​</a></h3>
<p>Since we're crawling many pages, each with many links, and we want to scroll through all of them, we may encounter a cookie consent dialog on each page. That dialog prevents more shoes from loading via infinite scroll.</p>
<p>We'll need to check for cookies on every page, as each one may be opened with a fresh session (no stored cookies) and we'll get the accept cookie dialog even though we already accepted it in another browser window. However, if we don't get the dialog, we want the request handler to work as usual.</p>
<p>To solve this problem, we'll try to deal with the dialog in a parallel task that will run in the background. A context manager is a nice abstraction that will allow us to reuse this logic in all the router handlers. So, let's build a context manager:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> playwright</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">async_api </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> TimeoutError </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> PlaywrightTimeoutError</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@asynccontextmanager</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">accept_cookies</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Page</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    task </span><span class="token operator" 
style="color:#393A34">=</span><span class="token plain"> asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">create_task</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get_by_test_id</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'dialog-accept-button'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">click</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">try</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">yield</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">finally</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">not</span><span class="token plain"> task</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">done</span><span 
class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            task</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">cancel</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> suppress</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">CancelledError</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> PlaywrightTimeoutError</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> task</span><br></div></code></pre></div></div>
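<p>To see the pattern in isolation, here is a minimal, standard-library-only sketch of the same idea. The names <code>background_task</code> and <code>dismiss_dialog</code> are hypothetical stand-ins (the latter plays the role of the Playwright click on a dialog that never shows up); this is not the code from the demo repository.</p>

```python
import asyncio
from contextlib import asynccontextmanager, suppress


@asynccontextmanager
async def background_task(coro):
    """Run `coro` concurrently with the body of the `async with` block."""
    task = asyncio.create_task(coro)
    try:
        yield task
    finally:
        # If the body finished first, stop waiting for the background work.
        if not task.done():
            task.cancel()
        # Awaiting the task collects its outcome; cancellation is expected here.
        with suppress(asyncio.CancelledError):
            await task


async def dismiss_dialog(log: list[str]) -> None:
    """Hypothetical stand-in for clicking a dialog button that never appears."""
    await asyncio.sleep(3600)
    log.append('dialog dismissed')


async def main() -> list[str]:
    log: list[str] = []
    async with background_task(dismiss_dialog(log)):
        await asyncio.sleep(0)  # give the background task a chance to start
        log.append('page handled')
    return log


result = asyncio.run(main())
print(result)
```

<p>The handler body runs to completion even though the dialog never appears, and the pending background task is cancelled on exit instead of blocking the handler.</p>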
<p>This context manager makes sure we accept the cookie dialog, if it appears, before scrolling and scraping the page. Let's use it in the <code>routes.py</code> file; the updated code is <a href="https://github.com/janbuchar/crawlee-python-demo/blob/6ca6f7f1d1bbbf789a3b86f14bec492cf756251e/crawlee-python-webinar/routes.py" target="_blank" rel="noopener noreferrer">here</a>.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="handling-infinite-scroll-on-the-listing-page">Handling infinite scroll on the listing page<a href="https://crawlee.dev/blog/infinite-scroll-using-python#handling-infinite-scroll-on-the-listing-page" class="hash-link" aria-label="Direct link to Handling infinite scroll on the listing page" title="Direct link to Handling infinite scroll on the listing page" translate="no">​</a></h3>
<p>Now for the last and most interesting part of the tutorial: how to handle the infinite scroll on each shoe listing page and make sure our crawler keeps scrolling and scraping the data.</p>
<p>This tutorial is taken from the webinar held on August 5th where Jan Buchar, Senior Python Engineer at Apify, gave a live demo about this use case. Watch the tutorial here:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/ip8Ii0eLfRY?si=7ZllUhMhuC7VC23B&amp;start=667" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin"></iframe>
<p>To handle infinite scrolling in Crawlee for Python, we just need to make sure the page is loaded, which is done by waiting for the <code>networkidle</code> load state, and then use the <code>infinite_scroll</code> helper function, which keeps scrolling to the bottom of the page as long as that makes additional items appear.</p>
<p>Let's add two lines of code to the <code>listing</code> handler:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token decorator annotation punctuation" style="color:#393A34">@router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'listing'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">listing_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token triple-quoted-string string" 
style="color:#e3116c">"""Handler for shoe listings."""</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> accept_cookies</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">wait_for_load_state</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'networkidle'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">infinite_scroll</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">enqueue_links</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            selector</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'a.product-card__link-overlay'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> label</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'detail'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
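<p>The "keep scrolling as long as new items appear" idea can be sketched without a browser. The code below is only an illustration of that stopping rule, not Crawlee's implementation of <code>infinite_scroll</code>; <code>scroll_until_stable</code> and <code>FakeListing</code> are hypothetical names, and the fake page simply reveals 24 more items per scroll up to a total of 60.</p>

```python
import asyncio
from typing import Awaitable, Callable


async def scroll_until_stable(
    scroll_to_bottom: Callable[[], Awaitable[None]],
    count_items: Callable[[], Awaitable[int]],
    max_rounds: int = 100,
) -> int:
    """Scroll repeatedly until a round adds no new items; return the final count."""
    seen = await count_items()
    for _ in range(max_rounds):
        await scroll_to_bottom()
        current = await count_items()
        if current <= seen:  # nothing new appeared, so the feed is exhausted
            break
        seen = current
    return seen


class FakeListing:
    """Hypothetical page that reveals 24 more items per scroll, up to 60."""

    def __init__(self) -> None:
        self.visible = 24
        self.total = 60

    async def scroll(self) -> None:
        self.visible = min(self.visible + 24, self.total)

    async def count(self) -> int:
        return self.visible


listing = FakeListing()
final_count = asyncio.run(scroll_until_stable(listing.scroll, listing.count))
print(final_count)
```

<p>The <code>max_rounds</code> cap is a safety net for feeds that never stop producing items; a real crawler would typically pair it with a timeout.</p>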
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="exporting-data-to-csv-format">Exporting data to CSV format<a href="https://crawlee.dev/blog/infinite-scroll-using-python#exporting-data-to-csv-format" class="hash-link" aria-label="Direct link to Exporting data to CSV format" title="Direct link to Exporting data to CSV format" translate="no">​</a></h2>
<p>As we want to store all the shoe data into a CSV file, we can just add a call to the <code>export_data</code> helper into the <code>main.py</code> file just after the crawler run:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">export_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'shoes.csv'</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="working-crawler-and-its-code">Working crawler and its code<a href="https://crawlee.dev/blog/infinite-scroll-using-python#working-crawler-and-its-code" class="hash-link" aria-label="Direct link to Working crawler and its code" title="Direct link to Working crawler and its code" translate="no">​</a></h2>
<p>Now, we have a crawler ready that can scrape all the shoes from the Nike website while handling infinite scrolling and many other problems, like the cookies dialog.</p>
<p>You can find the complete working crawler code here on the <a href="https://github.com/janbuchar/crawlee-python-demo" target="_blank" rel="noopener noreferrer">GitHub repository</a>.</p>
<p>Learn more about Crawlee for Python from our latest step by step <a href="https://blog.apify.com/crawlee-for-python-tutorial/" target="_blank" rel="noopener noreferrer">tutorial</a>.</p>
<p>If you have any questions about this tutorial or about using Crawlee for Python, feel free to <a href="https://apify.com/discord/" target="_blank" rel="noopener noreferrer">join our Discord community</a> and ask fellow developers or the Crawlee team.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Current problems and mistakes of web scraping in Python and tricks to solve them!]]></title>
            <link>https://crawlee.dev/blog/common-problems-in-web-scraping</link>
            <guid>https://crawlee.dev/blog/common-problems-in-web-scraping</guid>
            <pubDate>Tue, 20 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Current problems and mistakes that developers encounter while scraping and crawling the internet, with advice and solutions from a web scraping expert.]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="introduction">Introduction<a href="https://crawlee.dev/blog/common-problems-in-web-scraping#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction" translate="no">​</a></h2>
<p>Greetings! I'm <a href="https://apify.com/mantisus" target="_blank" rel="noopener noreferrer">Max</a>, a Python developer from Ukraine with expertise in web scraping, data analysis, and processing.</p>
<p>My journey in web scraping started in 2016 when I was solving lead generation challenges for a small company. Initially, I used off-the-shelf solutions such as <a href="https://www.import.io/" target="_blank" rel="noopener noreferrer">Import.io</a> and Kimono Labs. However, I quickly encountered limitations such as blocking, inaccurate data extraction, and performance issues. This led me to learn Python. Those were the glory days when <a href="https://requests.readthedocs.io/en/latest/" target="_blank" rel="noopener noreferrer"><code>requests</code></a> and <a href="https://lxml.de/" target="_blank" rel="noopener noreferrer"><code>lxml</code></a>/<a href="https://beautiful-soup-4.readthedocs.io/en/latest/" target="_blank" rel="noopener noreferrer"><code>beautifulsoup</code></a> were enough to extract data from most websites. And if you knew how to work with threads, you were already a respected expert :)</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>One of our community members wrote this blog post as a contribution to the Crawlee Blog. If you want to contribute posts like this one, please reach out to us on our <a href="https://apify.com/discord" target="_blank" rel="noopener noreferrer">Discord channel</a>.</p></div></div>
<p>As a freelancer, I've built small solutions and large, complex data mining systems for products over the years.</p>
<p>Today, I want to discuss the realities of <a href="https://blog.apify.com/web-scraping-python/" target="_blank" rel="noopener noreferrer">web scraping with Python in 2024</a>. We'll look at the mistakes I sometimes see and the problems you'll encounter and offer solutions to some of them.</p>
<p>Let's get started.</p>
<p>Just take <code>requests</code> and <code>beautifulsoup</code> and start making a lot of money...</p>
<p>No, this is not that kind of article.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-i-got-a-200-response-from-the-server-but-its-an-unreadable-character-set">1. "I got a 200 response from the server, but it's an unreadable character set."<a href="https://crawlee.dev/blog/common-problems-in-web-scraping#1-i-got-a-200-response-from-the-server-but-its-an-unreadable-character-set" class="hash-link" aria-label="Direct link to 1. &quot;I got a 200 response from the server, but it's an unreadable character set.&quot;" title="Direct link to 1. &quot;I got a 200 response from the server, but it's an unreadable character set.&quot;" translate="no">​</a></h2>
<p>Yes, it can be surprising. But I saw this message from customers and developers six years ago, four years ago, and again in 2024. Just a few months ago, I read a Reddit post about this issue.</p>
<p>Let's look at a simple code example. The same thing happens with <code>requests</code>, <a href="https://www.python-httpx.org/" target="_blank" rel="noopener noreferrer"><code>httpx</code></a>, and <a href="https://docs.aiohttp.org/en/stable/client.html#aiohttp-client" target="_blank" rel="noopener noreferrer"><code>aiohttp</code></a> on a clean installation with no extensions.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> httpx</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'https://www.wayfair.com/'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">headers </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"User-Agent"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Accept"</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Accept-Language"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"en-US,en;q=0.5"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Accept-Encoding"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"gzip, deflate, br, zstd"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Connection"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"keep-alive"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token 
plain">response </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> httpx</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> headers</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">headers</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">print</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">response</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">content</span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">:</span><span class="token number" style="color:#36acaa">10</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>The print result will be similar to:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">b</span><span class="token string" style="color:#e3116c">'\x83\x0c\x00\x00\xc4\r\x8e4\x82\x8a'</span><br></div></code></pre></div></div>
<p>It's not an error; it's a perfectly valid server response. It's just encoded somehow.</p>
<p>The answer lies in the <code>Accept-Encoding</code> header. In the example above, I just copied it from my browser, so it lists all the compression methods my browser supports: "gzip, deflate, br, zstd". The Wayfair backend supports compression with "br", which is <a href="https://github.com/google/brotli" target="_blank" rel="noopener noreferrer">Brotli</a>, and uses it as the most efficient method.</p>
<p>This happens because none of the libraries listed above include <code>Brotli</code> among their default dependencies. However, they can all decompress this format once <code>Brotli</code> is installed.</p>
<p>Therefore, it's sufficient to install the appropriate library:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pip </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> Brotli</span><br></div></code></pre></div></div>
<p>After that, the same print gives a readable result:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">b</span><span class="token string" style="color:#e3116c">'&lt;!DOCTYPE '</span><br></div></code></pre></div></div>
<p>You can obtain the same result for <code>aiohttp</code> and <code>httpx</code> by installing them with the corresponding extras:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pip </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> aiohttp</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">speedups</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">pip </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> httpx</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">brotli</span><span class="token punctuation" style="color:#393A34">]</span><br></div></code></pre></div></div>
<p>By the way, adding the <code>brotli</code> dependency was my first contribution to <a href="https://github.com/apify/crawlee-python" target="_blank" rel="noopener noreferrer"><code>crawlee-python</code></a>. They use <code>httpx</code> as the base HTTP client.</p>
<p>You may have also noticed that a new supported data compression format <a href="https://github.com/facebook/zstd" target="_blank" rel="noopener noreferrer"><code>zstd</code></a> appeared some time ago. I haven't seen any backends that use it yet, but <code>httpx</code> will support decompression in versions above 0.28.0. I already use it to compress server response dumps in my projects; it shows incredible efficiency in asynchronous solutions with <a href="https://github.com/Tinche/aiofiles" target="_blank" rel="noopener noreferrer"><code>aiofiles</code></a>.</p>
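<p>If you copy <code>Accept-Encoding</code> from a browser, it may help to advertise only the formats your environment can actually decode. The following is a minimal sketch under stated assumptions: the helper name <code>supported_encodings</code> is hypothetical, and it assumes the <code>Brotli</code> (or <code>brotlicffi</code>) and <code>zstandard</code> packages are importable as <code>brotli</code>, <code>brotlicffi</code>, and <code>zstandard</code>. Whether your HTTP client picks them up depends on the client and its version.</p>

```python
import importlib.util


def supported_encodings():
    """Return the content encodings this environment can likely decode."""
    # gzip and deflate are covered by Python's standard library (zlib).
    encodings = ["gzip", "deflate"]
    # Brotli needs a third-party package: either `Brotli` or `brotlicffi`.
    if importlib.util.find_spec("brotli") or importlib.util.find_spec("brotlicffi"):
        encodings.append("br")
    # zstd is assumed here to come from the third-party `zstandard` package.
    if importlib.util.find_spec("zstandard"):
        encodings.append("zstd")
    return encodings


# Build the header from what is installed instead of copying it blindly.
headers = {"Accept-Encoding": ", ".join(supported_encodings())}
print(headers)
```

<p>This keeps compression enabled while avoiding responses in a format that nothing on your side can decode.</p>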
<p>The most common workaround I've seen is for developers to simply stop sending the <code>Accept-Encoding</code> header, thus getting an uncompressed response from the server. Why is that bad? The <a href="https://www.wayfair.com/" target="_blank" rel="noopener noreferrer">main page of Wayfair</a> takes about 1 megabyte uncompressed and about 0.165 megabytes compressed.</p>
<p>Therefore, in the absence of this header:</p>
<ul>
<li class="">You increase the load on your internet bandwidth.</li>
<li class="">If you use a proxy billed by traffic, you increase the cost of each of your requests.</li>
<li class="">You increase the load on the server's internet bandwidth.</li>
<li class="">You're revealing yourself as a scraper, since any browser uses compression.</li>
</ul>
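<p>To put rough numbers on the bandwidth and proxy points, here is a small worked calculation. The page sizes are the Wayfair figures quoted above; the request count and the price per gigabyte are hypothetical values chosen only for illustration.</p>

```python
# Page sizes quoted above for the Wayfair main page, in megabytes.
UNCOMPRESSED_MB = 1.0
COMPRESSED_MB = 0.165

# Hypothetical values, chosen only for illustration.
REQUESTS = 1_000_000
PRICE_PER_GB_USD = 5.0


def traffic_gb(size_mb, requests):
    """Total traffic in gigabytes, using 1 GB = 1000 MB."""
    return size_mb * requests / 1000


def cost_usd(size_mb, requests, price_per_gb):
    """Proxy cost for the given traffic at a flat per-gigabyte price."""
    return traffic_gb(size_mb, requests) * price_per_gb


plain = cost_usd(UNCOMPRESSED_MB, REQUESTS, PRICE_PER_GB_USD)
packed = cost_usd(COMPRESSED_MB, REQUESTS, PRICE_PER_GB_USD)
saving = 1 - packed / plain

print(f"uncompressed: {traffic_gb(UNCOMPRESSED_MB, REQUESTS):.0f} GB, ${plain:.0f}")
print(f"compressed:   {traffic_gb(COMPRESSED_MB, REQUESTS):.0f} GB, ${packed:.0f}")
print(f"saving:       {saving:.1%}")
```

<p>With these assumed inputs, compression cuts the traffic, and therefore the traffic-based bill, by roughly 83.5%.</p>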
<p>But I think the problem is a bit deeper than that. Many web scraping developers simply don't understand what the headers they use do. So if this applies to you, when you're working on your next project, read up on these things; they may surprise you.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-i-use-headers-as-in-an-incognito-browser-but-i-get-a-403-response-heres-johnn--i-mean-cloudflare">2. "I use headers as in an incognito browser, but I get a 403 response". Here's Johnn-... I mean, Cloudflare<a href="https://crawlee.dev/blog/common-problems-in-web-scraping#2-i-use-headers-as-in-an-incognito-browser-but-i-get-a-403-response-heres-johnn--i-mean-cloudflare" class="hash-link" aria-label="Direct link to 2. &quot;I use headers as in an incognito browser, but I get a 403 response&quot;. Here's Johnn-... I mean, Cloudflare" title="Direct link to 2. &quot;I use headers as in an incognito browser, but I get a 403 response&quot;. Here's Johnn-... I mean, Cloudflare" translate="no">​</a></h2>
<p>Yes, that's right. 2023 brought us not only Large Language Models like ChatGPT but also improved <a href="https://www.cloudflare.com/" target="_blank" rel="noopener noreferrer">Cloudflare</a> protection.</p>
<p>Those who have been scraping the web for a long time might say, "Well, we've already dealt with DataDome, PerimeterX, Incapsula, and the like."</p>
<p>But Cloudflare has changed the rules of the game. It is one of the largest CDN providers in the world, serving a huge number of sites, so its protection is available to many of them with a fairly low entry barrier. This makes it radically different from the technologies mentioned earlier, which site owners adopted deliberately when they wanted to protect a site from scraping.</p>
<p>Cloudflare is the reason why, when you start reading another course on "How to do web scraping using <code>requests</code> and <code>beautifulsoup</code>", you can close it immediately. Because there's a big chance that what you learn will simply not work on any "decent" website.</p>
<p>Let's look at another simple code example:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> httpx </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Client</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">client </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Client</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">http2</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'https://www.g2.com/'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">headers </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span 
class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"User-Agent"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Accept"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Accept-Language"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"en-US,en;q=0.5"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Accept-Encoding"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"gzip, deflate, br, zstd"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span 
class="token string" style="color:#e3116c">"Connection"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"keep-alive"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">response </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> headers</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">headers</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">print</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">response</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Of course, the response would be <a href="https://blog.apify.com/web-scraping-how-to-solve-403-errors/" target="_blank" rel="noopener noreferrer">403</a>.</p>
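<p>A bare 403 doesn't say who blocked you. As a rough heuristic, you can inspect the response headers. The sketch below assumes that a Cloudflare-served response carries <code>Server: cloudflare</code> and that a challenge response may carry <code>cf-mitigated: challenge</code>; the function name <code>looks_like_cloudflare_block</code> is hypothetical, and the check is a hint rather than proof.</p>

```python
def looks_like_cloudflare_block(status_code, headers):
    """Heuristic: does this response look like a Cloudflare block or challenge?"""
    # Header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    served_by_cloudflare = lowered.get("server", "") == "cloudflare"
    challenged = lowered.get("cf-mitigated", "") == "challenge"
    # Either an explicit challenge marker, or a blocking status from Cloudflare.
    return challenged or (status_code in (403, 503) and served_by_cloudflare)


# Example with hand-written header dictionaries instead of a live request.
print(looks_like_cloudflare_block(403, {"Server": "cloudflare"}))  # True
print(looks_like_cloudflare_block(403, {"Server": "nginx"}))  # False
```

<p>With a real response you would call it as <code>looks_like_cloudflare_block(response.status_code, response.headers)</code>.</p>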
<p>What if we use <a href="https://curl.se/docs/manpage.html" target="_blank" rel="noopener noreferrer"><code>curl</code></a>?</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token function" style="color:#d73a49">curl</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-XGET</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-H</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0'</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-H</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8'</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-H</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'Accept-Language: en-US,en;q=0.5'</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-H</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'Connection: keep-alive'</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'https://www.g2.com/'</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-s</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-o</span><span class="token plain"> /dev/null </span><span class="token parameter variable" 
style="color:#36acaa">-w</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"%{http_code}</span><span class="token string entity" style="color:#36acaa">\n</span><span class="token string" style="color:#e3116c">"</span><br></div></code></pre></div></div>
<p>Also 403.</p>
<p>Why is this happening?</p>
<p>Because Cloudflare recognizes the TLS fingerprints of many HTTP clients popular among developers. Site administrators can also customize how aggressively Cloudflare blocks clients based on these fingerprints.</p>
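<p>To make the idea of a TLS fingerprint more concrete, here is a minimal sketch using only the standard library. It lists the cipher suites a default <code>ssl</code> context is configured to offer and hashes that ordered list. The "toy fingerprint" is purely illustrative and is an assumption of this example: real fingerprinting schemes look at more handshake fields, and this is not how Cloudflare computes anything.</p>

```python
import hashlib
import ssl

# A default client-side context from the standard ssl module.
context = ssl.create_default_context()

# Cipher suites enabled on this context, in priority order.
cipher_names = [cipher["name"] for cipher in context.get_ciphers()]

# Toy fingerprint: a short hash of the ordered cipher list. It only shows
# that a client's TLS configuration can be reduced to a stable identifier.
toy_fingerprint = hashlib.sha256(",".join(cipher_names).encode()).hexdigest()[:16]

print(len(cipher_names), "cipher suites enabled")
print("toy fingerprint:", toy_fingerprint)
```

<p>The point is that this value stays the same no matter which <code>User-Agent</code> header you send, which is why copying browser headers alone doesn't help.</p>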
<p>For <code>curl</code>, we can solve it like this:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token function" style="color:#d73a49">curl</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-XGET</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-H</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0'</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-H</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8'</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-H</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'Accept-Language: en-US,en;q=0.5'</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-H</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'Connection: keep-alive'</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'https://www.g2.com/'</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">--tlsv1.3</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-s</span><span class="token plain"> </span><span class="token parameter variable" 
style="color:#36acaa">-o</span><span class="token plain"> /dev/null </span><span class="token parameter variable" style="color:#36acaa">-w</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"%{http_code}</span><span class="token string entity" style="color:#36acaa">\n</span><span class="token string" style="color:#e3116c">"</span><br></div></code></pre></div></div>
<p>You might expect me to give an equally elegant solution for <code>httpx</code> here, but no. About six months ago, you could do the "dirty trick" and change the basic <a href="https://www.encode.io/httpcore/" target="_blank" rel="noopener noreferrer"><code>httpcore</code></a> parameters that it passes to <a href="https://github.com/python-hyper/h2" target="_blank" rel="noopener noreferrer"><code>h2</code></a>, which are responsible for the HTTP/2 handshake. But now, as I'm writing this article, that doesn't work anymore.</p>
<p>There are different approaches to getting around this. But let's solve it by manipulating TLS.</p>
<p>The bad news is that all the Python clients I know of use the <a href="https://docs.python.org/3/library/ssl.html" target="_blank" rel="noopener noreferrer"><code>ssl</code></a> library to handle TLS. And it doesn't give you fine-grained control over the TLS handshake.</p>
<p>The good news is that the Python community is great and implements solutions that exist in other programming languages.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="the-first-way-to-solve-this-problem-is-to-use-tls-client">The first way to solve this problem is to use <a href="https://github.com/FlorianREGAZ/Python-Tls-Client" target="_blank" rel="noopener noreferrer">tls-client</a><a href="https://crawlee.dev/blog/common-problems-in-web-scraping#the-first-way-to-solve-this-problem-is-to-use-tls-client" class="hash-link" aria-label="Direct link to the-first-way-to-solve-this-problem-is-to-use-tls-client" title="Direct link to the-first-way-to-solve-this-problem-is-to-use-tls-client" translate="no">​</a></h3>
<p>This Python wrapper around the <a href="https://github.com/bogdanfinn/tls-client" target="_blank" rel="noopener noreferrer">Golang library</a> provides an API similar to <code>requests</code>.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pip </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> tls-client</span><br></div></code></pre></div></div>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> tls_client </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Session</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">client </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Session</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">client_identifier</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"firefox_120"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'https://www.g2.com/'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">headers </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" 
style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"User-Agent"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Accept"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Accept-Language"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"en-US,en;q=0.5"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Accept-Encoding"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"gzip, deflate, br, zstd"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">    </span><span class="token string" style="color:#e3116c">"Connection"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"keep-alive"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">response </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> headers</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">headers</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">print</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">response</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p><code>tls_client</code> supports TLS presets for popular browsers, which its developers keep up to date. To use one, pass the corresponding <code>client_identifier</code>. The library also allows fine-grained manual tuning of TLS.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="the-second-way-to-solve-this-problem-is-to-use-curl_cffi">The second way to solve this problem is to use <a href="https://github.com/yifeikong/curl_cffi" target="_blank" rel="noopener noreferrer">curl_cffi</a><a href="https://crawlee.dev/blog/common-problems-in-web-scraping#the-second-way-to-solve-this-problem-is-to-use-curl_cffi" class="hash-link" aria-label="Direct link to the-second-way-to-solve-this-problem-is-to-use-curl_cffi" title="Direct link to the-second-way-to-solve-this-problem-is-to-use-curl_cffi" translate="no">​</a></h3>
<p>This wrapper around a patched build of the C library <code>curl</code> provides an API similar to <code>requests</code>.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pip </span><span class="token function" style="color:#d73a49">install</span><span class="token plain"> curl_cffi</span><br></div></code></pre></div></div>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> curl_cffi </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> requests</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">url </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'https://www.g2.com/'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">headers </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"User-Agent"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span 
class="token string" style="color:#e3116c">"Accept"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Accept-Language"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"en-US,en;q=0.5"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Accept-Encoding"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"gzip, deflate, br, zstd"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"Connection"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"keep-alive"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" 
style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">response </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> requests</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> headers</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">headers</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> impersonate</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"chrome124"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">print</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">response</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>curl_cffi also provides <a href="https://curl-cffi.readthedocs.io/en/latest/impersonate.html#supported-browser-versions" target="_blank" rel="noopener noreferrer">TLS presets</a> for some browsers, which are specified via the <code>impersonate</code> parameter. It also provides options for <a href="https://curl-cffi.readthedocs.io/en/latest/impersonate.html#how-to-use-my-own-fingerprints-other-than-the-builtin-ones-e-g-okhttp" target="_blank" rel="noopener noreferrer">subtle manual manipulation of TLS</a>.</p>
<p>I think someone just said, "They're literally doing the same thing." That's right, and they're both still very raw.</p>
<p>Let's do some simple comparisons:</p>
<table><thead><tr><th style="text-align:center">Feature</th><th style="text-align:center">tls_client</th><th style="text-align:center">curl_cffi</th></tr></thead><tbody><tr><td style="text-align:center">TLS preset</td><td style="text-align:center">+</td><td style="text-align:center">+</td></tr><tr><td style="text-align:center">TLS manual</td><td style="text-align:center">+</td><td style="text-align:center">+</td></tr><tr><td style="text-align:center">async support</td><td style="text-align:center">-</td><td style="text-align:center">+</td></tr><tr><td style="text-align:center">big company support</td><td style="text-align:center">-</td><td style="text-align:center">+</td></tr><tr><td style="text-align:center">number of contributors</td><td style="text-align:center">-</td><td style="text-align:center">+</td></tr></tbody></table>
<p>Obviously, <code>curl_cffi</code> wins in this comparison. But as an active user, I have to say that sometimes there are some pretty strange errors that I'm just unsure how to deal with. And let's be honest, so far, they are both pretty raw.</p>
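<p>The async support row in the comparison can be illustrated with a minimal sketch. It assumes the <code>AsyncSession</code> class from <code>curl_cffi.requests</code> and reuses the <code>impersonate</code> preset from the synchronous example. The helper name <code>fetch_statuses</code> is hypothetical.</p>

```python
import asyncio


async def fetch_statuses(urls: list[str], impersonate: str = "chrome124") -> dict[str, int]:
    """Fetch several URLs concurrently and return a URL -> HTTP status mapping."""
    # Imported here so that the sketch can be loaded without curl_cffi installed.
    from curl_cffi.requests import AsyncSession

    async with AsyncSession() as session:
        # `impersonate` selects the browser TLS preset, as in the synchronous example.
        responses = await asyncio.gather(
            *(session.get(url, impersonate=impersonate) for url in urls)
        )
    return {url: response.status_code for url, response in zip(urls, responses)}
```

<p>It could be run with <code>asyncio.run(fetch_statuses(['https://www.g2.com/']))</code>. Whether the target site accepts the request still depends on the preset and the headers you send.</p>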
<p>I think we will soon see other libraries that solve this problem.</p>
<p>One might ask, what about <a href="https://scrapy.org/" target="_blank" rel="noopener noreferrer"><code>Scrapy</code></a>? I'll be honest: I don't really keep up with their updates. But I haven't heard about <a href="https://www.zyte.com/" target="_blank" rel="noopener noreferrer">Zyte</a> doing anything to bypass TLS fingerprinting. So out of the box <code>Scrapy</code> will also be blocked, but nothing is stopping you from using <code>curl_cffi</code> in your Scrapy Spider.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-what-about-headless-browsers-and-cloudflare-turnstile">3. What about headless browsers and Cloudflare Turnstile?<a href="https://crawlee.dev/blog/common-problems-in-web-scraping#3-what-about-headless-browsers-and-cloudflare-turnstile" class="hash-link" aria-label="Direct link to 3. What about headless browsers and Cloudflare Turnstile?" title="Direct link to 3. What about headless browsers and Cloudflare Turnstile?" translate="no">​</a></h2>
<p>Yes, sometimes we need to use headless browsers. Although I'll be honest, from my point of view, they are used too often even when clearly not necessary.</p>
<p>Even in a headless situation, the folks at Cloudflare have managed to make life difficult for the average web scraper by creating a monster called Cloudflare Turnstile.</p>
<p>To test different tools, you can use this demo <a href="https://2captcha.com/demo/cloudflare-turnstile" target="_blank" rel="noopener noreferrer">page</a>.</p>
<p>To quickly test whether a library works with the browser, you should start by checking the usual non-headless mode. You don't even need to use automation; just open the site using the desired library and act manually.</p>
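<p>As a rough sketch of such a manual check, the snippet below opens the page in a visible browser window and then simply waits. It assumes Playwright's synchronous API as one possible library; the function name <code>open_for_manual_check</code> is hypothetical, and other libraries would need their own equivalent.</p>

```python
def open_for_manual_check(url: str, browser_name: str = "chromium") -> None:
    """Open `url` in a visible browser window and wait until the user is done."""
    # Imported here so that the sketch can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        # getattr picks playwright.chromium, playwright.firefox or playwright.webkit.
        browser = getattr(playwright, browser_name).launch(headless=False)
        page = browser.new_page()
        page.goto(url)
        # No automation from here on: interact with the page by hand.
        input("Try to pass the challenge manually, then press Enter to close...")
        browser.close()
```

<p>Calling <code>open_for_manual_check('https://2captcha.com/demo/cloudflare-turnstile')</code> would let you see whether the challenge can be passed by hand before you invest in automation.</p>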
<p>What libraries are worth checking out for this?</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="candidate-1-playwright--playwright-stealth">Candidate #1 <a href="https://playwright.dev/python/docs/intro" target="_blank" rel="noopener noreferrer">Playwright</a> + <a href="https://github.com/AtuboDad/playwright_stealth" target="_blank" rel="noopener noreferrer">playwright-stealth</a><a href="https://crawlee.dev/blog/common-problems-in-web-scraping#candidate-1-playwright--playwright-stealth" class="hash-link" aria-label="Direct link to candidate-1-playwright--playwright-stealth" title="Direct link to candidate-1-playwright--playwright-stealth" translate="no">​</a></h3>
<p>It'll be blocked and won't let you solve the captcha.</p>
<p>Playwright is a great library for browser automation. However, the developers explicitly state that they don't plan to develop it as a web scraping tool.</p>
<p>And I haven't heard of any Python projects that effectively solve this problem.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="candidate-2-undetected_chromedriver">Candidate #2 <a href="https://github.com/ultrafunkamsterdam/undetected-chromedriver" target="_blank" rel="noopener noreferrer">undetected_chromedriver</a><a href="https://crawlee.dev/blog/common-problems-in-web-scraping#candidate-2-undetected_chromedriver" class="hash-link" aria-label="Direct link to candidate-2-undetected_chromedriver" title="Direct link to candidate-2-undetected_chromedriver" translate="no">​</a></h3>
<p>It'll be blocked and won't let you solve the captcha.</p>
<p>This is a fairly common library for working with headless browsers in Python, and in some cases, it allows bypassing Cloudflare Turnstile. But on the target website, it is blocked. Also, in my projects, I've encountered at least two other cases where Cloudflare blocked undetected_chromedriver.</p>
<p>In general, undetected_chromedriver is a good library for your projects, especially since it uses good old Selenium under the hood.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="candidate-3-botasaurus-driver">Candidate #3 <a href="https://github.com/omkarcloud/botasaurus-driver" target="_blank" rel="noopener noreferrer">botasaurus-driver</a><a href="https://crawlee.dev/blog/common-problems-in-web-scraping#candidate-3-botasaurus-driver" class="hash-link" aria-label="Direct link to candidate-3-botasaurus-driver" title="Direct link to candidate-3-botasaurus-driver" translate="no">​</a></h3>
<p>It allows you to go past the captcha after clicking.</p>
<p>I don't know how its developers pulled this off, but it works. Its main feature is that it was developed specifically for web scraping. It also has a higher-level library to work with - <a href="https://github.com/omkarcloud/botasaurus" target="_blank" rel="noopener noreferrer">botasaurus</a>.</p>
<p>On the downside, so far, it's pretty raw, and botasaurus-driver has no documentation and has a rather challenging API to work with.</p>
<p>To summarize, most likely, your main library for headless browsing will be <code>undetected_chromedriver</code>. But in some particularly challenging cases, you might need to use <code>botasaurus</code>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="4-what-about-frameworks">4. What about frameworks?<a href="https://crawlee.dev/blog/common-problems-in-web-scraping#4-what-about-frameworks" class="hash-link" aria-label="Direct link to 4. What about frameworks?" title="Direct link to 4. What about frameworks?" translate="no">​</a></h2>
<p>High-level frameworks are designed to speed up and ease development by allowing us to focus on business logic, although we often pay the price in flexibility and control.</p>
<p>So, what are the frameworks for web scraping in 2024?</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="scrapy"><a href="https://docs.scrapy.org/en/latest/" target="_blank" rel="noopener noreferrer">Scrapy</a><a href="https://crawlee.dev/blog/common-problems-in-web-scraping#scrapy" class="hash-link" aria-label="Direct link to scrapy" title="Direct link to scrapy" translate="no">​</a></h3>
<p>It's impossible to talk about Python web scraping frameworks without mentioning Scrapy. Scrapinghub (now Zyte) first released it in 2008. For 16 years, it has been developed as an open-source library upon which development companies built their business solutions.</p>
<p>You could write a separate article about the advantages of <code>Scrapy</code>, but I will emphasize two of them:</p>
<ul>
<li class="">The huge number of tutorials that have been released over the years.</li>
<li class="">Middleware libraries written by the community that extend its functionality, for example <a href="https://github.com/scrapy-plugins/scrapy-playwright" target="_blank" rel="noopener noreferrer"><code>scrapy-playwright</code></a>.</li>
</ul>
<p>But what are the downsides?</p>
<ul>
<li class="">In recent years, Zyte has been focusing more on developing its own platform. <code>Scrapy</code> mostly gets fixes only.</li>
<li class="">Lack of development towards bypassing anti-scraping systems. You have to implement them yourself, but then, why do you need a framework?</li>
<li class=""><code>Scrapy</code> was originally developed with the asynchronous framework <code>Twisted</code>. Partial support for <code>asyncio</code> was added only in <a href="https://docs.scrapy.org/en/latest/topics/asyncio.html" target="_blank" rel="noopener noreferrer"><code>version 2.0</code></a>. Looking through the source code, you may notice some workarounds that were added for this purpose.</li>
</ul>
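<p>For reference, the <code>asyncio</code> support mentioned above is switched on through a project setting. This is a minimal fragment for a hypothetical project's <code>settings.py</code>, based on the Scrapy asyncio documentation linked in the list. Newer project templates may already include it.</p>

```python
# settings.py of a hypothetical Scrapy project

# Install the asyncio-based Twisted reactor so that asyncio libraries
# can be awaited from spider callbacks and middlewares.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```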
<p>Thus, <code>Scrapy</code> is a good and proven solution for sites that are not protected against web scraping. You will need to develop and add the necessary solutions to the framework in order to bypass anti-scraping measures.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="botasaurus"><a href="https://www.omkar.cloud/botasaurus/" target="_blank" rel="noopener noreferrer">Botasaurus</a><a href="https://crawlee.dev/blog/common-problems-in-web-scraping#botasaurus" class="hash-link" aria-label="Direct link to botasaurus" title="Direct link to botasaurus" translate="no">​</a></h3>
<p>A new framework for web scraping using browser automation, built on <a href="https://github.com/omkarcloud/botasaurus-driver" target="_blank" rel="noopener noreferrer"><code>botasaurus-driver</code></a>. The initial commit was made on May 9, 2023.</p>
<p>Let's start with its advantages:</p>
<ul>
<li class="">Allows you to bypass any Cloudflare protection as well as many others using <code>botasaurus-driver</code>.</li>
<li class="">Good documentation for a quick start</li>
</ul>
<p>Downsides include:</p>
<ul>
<li class="">Browser automation only, not intended for HTTP clients.</li>
<li class="">Tight coupling with <code>botasaurus-driver</code>; you can't easily replace it with something better if it comes out in the future.</li>
<li class="">No asynchrony, only multithreading.</li>
<li class="">At the moment, it's quite raw and still requires fixes for stable operation.</li>
<li class="">There are very few training materials available at the moment.</li>
</ul>
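<p>To make the "no asynchrony, only multithreading" point concrete, here is a generic standard-library sketch of the two concurrency models. It does not use the Botasaurus API; the functions are hypothetical stand-ins, with <code>sleep</code> simulating the network wait.</p>

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_fetch(url: str) -> str:
    """Stand-in for a blocking browser or HTTP call."""
    time.sleep(0.05)  # simulated network wait, blocks the calling thread
    return f"done: {url}"


async def async_fetch(url: str) -> str:
    """Stand-in for a non-blocking call."""
    await asyncio.sleep(0.05)  # simulated network wait, yields to the event loop
    return f"done: {url}"


def run_with_threads(urls: list[str]) -> list[str]:
    # Each in-flight task occupies an OS thread; results come back in input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(blocking_fetch, urls))


async def run_with_asyncio(urls: list[str]) -> list[str]:
    # A single thread interleaves all tasks at their await points.
    return await asyncio.gather(*(async_fetch(url) for url in urls))
```

<p>Both variants return the same results. The difference shows up at scale, where threads cost more memory per task and cannot be combined directly with <code>asyncio</code>-based libraries.</p>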
<p>This is a good framework for quickly building a web scraper based on browser automation. It lacks flexibility and support for HTTP clients, which is crucial for users like me.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="crawlee-for-python"><a href="https://www.crawlee.dev/python/" target="_blank" rel="noopener noreferrer">Crawlee for Python</a><a href="https://crawlee.dev/blog/common-problems-in-web-scraping#crawlee-for-python" class="hash-link" aria-label="Direct link to crawlee-for-python" title="Direct link to crawlee-for-python" translate="no">​</a></h3>
<p>A new framework for web scraping in the Python ecosystem. The initial commit was made on Jan 10, 2024, and it was publicly announced on July 5, 2024.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>If you like the blog so far, please consider <a href="https://github.com/apify/crawlee" target="_blank" rel="noopener noreferrer">giving Crawlee a star on GitHub</a>, it helps us to reach and help more developers.</p></div></div>
<p>Developed by <a href="https://apify.com/" target="_blank" rel="noopener noreferrer">Apify</a>, it is a Python adaptation of their famous JS framework <a href="https://github.com/apify/crawlee" target="_blank" rel="noopener noreferrer"><code>crawlee</code></a>, first released on Jul 9, 2019.</p>
<p>As this is a completely new solution on the market, it is now in an active design and development stage. The community is also actively involved in its development. So, we can see that the use of <a href="https://github.com/apify/crawlee-python/issues/292" target="_blank" rel="noopener noreferrer">curl_cffi</a> is already being discussed. The possibility of creating their own Rust-based client was <a href="https://github.com/apify/crawlee-python/issues/80" target="_blank" rel="noopener noreferrer">previously discussed</a>. I hope the company doesn't abandon the idea.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>Crawlee team</div><div class="admonitionContent_BuS1"><p>"Yeah, for sure we will keep improving Crawlee for Python for years to come."</p></div></div>
<p>Personally, I would like to see an HTTP client for Python developed and maintained by a major company. Rust has also proven itself as a language for building Python libraries: just think of <a href="https://docs.astral.sh/ruff/" target="_blank" rel="noopener noreferrer"><code>Ruff</code></a> and <a href="https://docs.pydantic.dev/latest/" target="_blank" rel="noopener noreferrer"><code>Pydantic</code></a> v2.</p>
<p>Advantages:</p>
<ul>
<li class="">The framework was developed by an established company in the web scraping market, which has well-developed expertise in this sphere.</li>
<li class="">Support for both browser automation and HTTP clients.</li>
<li class="">Fully asynchronous, based on <code>asyncio</code>.</li>
<li class="">Active development phase and media activity. The developers listen to the community, which is quite important at this stage.</li>
</ul>
<p>On a separate note, it has a pretty good modular architecture. If developers introduce the ability to switch between several HTTP clients, we will get a rather flexible framework that allows us to easily change the technologies used, with a simple implementation from the development team.</p>
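<p>What such client switching could look like is sketched below. This is not Crawlee's actual API: every class name here is hypothetical, and the point is only that the crawler depends on a small protocol rather than on a concrete HTTP library.</p>

```python
from dataclasses import dataclass
from typing import Protocol
from urllib.request import urlopen


@dataclass
class Response:
    status: int
    body: bytes


class HttpClient(Protocol):
    """Anything with a matching `get` method can act as the crawler's client."""

    def get(self, url: str) -> Response: ...


class UrllibClient:
    """Standard-library client used here as the default backend."""

    def get(self, url: str) -> Response:
        with urlopen(url) as raw:
            return Response(status=raw.status, body=raw.read())


class StaticClient:
    """Stand-in for an alternative backend (for example, a TLS-impersonating one)."""

    def get(self, url: str) -> Response:
        return Response(status=200, body=b"stub page for " + url.encode())


class Crawler:
    def __init__(self, client: HttpClient) -> None:
        # The crawler depends only on the protocol, never on a concrete client.
        self.client = client

    def fetch_length(self, url: str) -> int:
        return len(self.client.get(url).body)
```

<p>With this shape, replacing <code>Crawler(UrllibClient())</code> with <code>Crawler(StaticClient())</code>, or with a wrapper around any other library, requires no changes to the crawler itself.</p>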
<p>Deficiencies:</p>
<ul>
<li class="">The framework is new. There are very few training materials available at the moment.</li>
<li class="">At the moment, it's quite raw and still requires fixes for stable operation, as well as convenient interfaces for configuration.</li>
<li class="">For now, there is no implementation of any means of bypassing anti-scraping systems other than changing sessions and proxies, but they are being discussed.</li>
</ul>
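<p>For readers unfamiliar with the idea of changing sessions and proxies, here is a simplified sketch of it. It is an illustration under my own assumptions, not Crawlee's implementation; the <code>Session</code> and <code>SessionPool</code> classes and the proxy URLs are hypothetical.</p>

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Session:
    proxy_url: str
    cookies: dict[str, str] = field(default_factory=dict)
    blocked: bool = False


class SessionPool:
    """Round-robin over sessions, skipping those that were marked as blocked."""

    def __init__(self, proxy_urls: list[str]) -> None:
        self.sessions = [Session(proxy_url=url) for url in proxy_urls]
        self._rotation = cycle(self.sessions)

    def next_session(self) -> Session:
        # Look at each session at most once per call.
        for _ in range(len(self.sessions)):
            session = next(self._rotation)
            if not session.blocked:
                return session
        raise RuntimeError("all sessions are blocked")

    def mark_blocked(self, session: Session) -> None:
        # Called when a response looks like a block page or a captcha.
        session.blocked = True
```

<p>Each request takes the next usable session, so cookies stay tied to one proxy, and a blocked identity is dropped from the rotation instead of being retried.</p>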
<p>I believe that how successful <code>crawlee-python</code> turns out to be depends primarily on the community. Due to the small number of tutorials, it is not suitable for beginners. However, experienced developers may decide to try it instead of <code>Scrapy</code>.</p>
<p>In the long run, it may turn out to be a better solution than Scrapy and Botasaurus. It already provides flexible tools for working with HTTP clients, automating browsers out of the box, and quickly switching between them. However, it lacks tools to bypass scraping protections, and their implementation in the future may be the deciding factor in choosing a framework for you.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conclusion">Conclusion<a href="https://crawlee.dev/blog/common-problems-in-web-scraping#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>If you have read all the way to here, I assume you found it interesting and maybe even helpful :)</p>
<p>The industry is changing and offering new challenges, and if you are professionally involved in web scraping, you will have to keep a close eye on the situation. In some other field, you would remain a developer who makes products using outdated technologies. But in modern web scraping, you become a developer who makes web scrapers that simply don't work.</p>
<p>Also, don't forget that you are part of the larger Python community, and your knowledge can be useful in developing tools that make things happen for all of us. As you can see, many of the tools you need are being built literally right now.</p>
<p>I'll be glad to read your comments. Also, if you need a web scraping expert or just want to discuss the article, you can find me on the following platforms: <a href="https://github.com/Mantisus" target="_blank" rel="noopener noreferrer">GitHub</a>, <a href="https://www.linkedin.com/in/max-bohomolov/" target="_blank" rel="noopener noreferrer">LinkedIn</a>, <a href="https://apify.com/mantisus" target="_blank" rel="noopener noreferrer">Apify</a>, <a href="https://www.upwork.com/freelancers/mantisus" target="_blank" rel="noopener noreferrer">Upwork</a>, <a href="https://contra.com/mantisus" target="_blank" rel="noopener noreferrer">Contra</a>.</p>
<p>Thank you for your attention :)</p>]]></content:encoded>
            <category>community</category>
        </item>
        <item>
            <title><![CDATA[Announcing Crawlee for Python: Now you can use Python to build reliable web crawlers]]></title>
            <link>https://crawlee.dev/blog/launching-crawlee-python</link>
            <guid>https://crawlee.dev/blog/launching-crawlee-python</guid>
            <pubDate>Fri, 05 Jul 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Launching Crawlee for Python, a web scraping and automation library to build reliable scrapers in Python quickly.]]></description>
            <content:encoded><![CDATA[<blockquote>
<p>Testimonial from early adopters</p>
<p>“Crawlee for Python development team did a great job in building the product, it makes things faster for a Python developer.”</p>
<p>~ <a href="https://apify.com/mantisus" target="_blank" rel="noopener noreferrer">Maksym Bohomolov</a></p>
</blockquote>
<p>We launched Crawlee in <a href="https://blog.apify.com/announcing-crawlee-the-web-scraping-and-browser-automation-library/" target="_blank" rel="noopener noreferrer">August 2022</a> and got an amazing response from the JavaScript community. With many early adopters in its initial days, we got valuable feedback, which gave Crawlee a strong base for its success.</p>
<p>Today, <a href="https://github.com/apify/crawlee" target="_blank" rel="noopener noreferrer">Crawlee, built in TypeScript</a>, has nearly <strong>13,000 stars on GitHub</strong>, with 90 open-source contributors worldwide building the best web scraping and automation library.</p>
<p>Since the launch, the feedback we’ve received most often <a href="https://discord.com/channels/801163717915574323/999250964554981446/1138826582581059585" target="_blank" rel="noopener noreferrer">[1]</a><a href="https://discord.com/channels/801163717915574323/801163719198638092/1137702376267059290" target="_blank" rel="noopener noreferrer">[2]</a><a href="https://discord.com/channels/801163717915574323/1090592836044476426/1103977818221719584" target="_blank" rel="noopener noreferrer">[3]</a> has been to build Crawlee in Python so that the Python community can use all the features the JavaScript community does.</p>
<p>With all these requests in mind and to simplify the life of Python web scraping developers, <strong>we’re launching <a href="https://github.com/apify/crawlee-python" target="_blank" rel="noopener noreferrer">Crawlee for Python</a> today.</strong></p>
<p>The new library is still in <strong>beta</strong>, and we are looking for <strong>early adopters</strong>.</p>
<p><img decoding="async" loading="lazy" alt="Crawlee for Python is looking for early adopters" src="https://crawlee.dev/assets/images/early-adopters-0c5f38327dd8e5fad85dc127dcabc1f0.webp" width="1920" height="1080" class="img_ev3q"></p>
<p>Crawlee for Python has some amazing initial features, such as a unified interface for HTTP and headless browser crawling, automatic retries, and much more.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="why-use-crawlee-instead-of-a-random-http-library-with-an-html-parser">Why use Crawlee instead of a random HTTP library with an HTML parser?<a href="https://crawlee.dev/blog/launching-crawlee-python#why-use-crawlee-instead-of-a-random-http-library-with-an-html-parser" class="hash-link" aria-label="Direct link to Why use Crawlee instead of a random HTTP library with an HTML parser?" title="Direct link to Why use Crawlee instead of a random HTTP library with an HTML parser?" translate="no">​</a></h2>
<ul>
<li class="">Unified interface for HTTP &amp; headless browser crawling.
<ul>
<li class="">HTTP - HTTPX with BeautifulSoup,</li>
<li class="">Headless browser - Playwright.</li>
</ul>
</li>
<li class="">Automatic parallel crawling based on available system resources.</li>
<li class="">Written in Python with type hints - enhances DX (IDE autocompletion) and reduces bugs (static type checking).</li>
<li class="">Automatic retries on errors or when you’re getting blocked.</li>
<li class="">Integrated proxy rotation and session management.</li>
<li class="">Configurable request routing - direct URLs to the appropriate handlers.</li>
<li class="">Persistent queue for URLs to crawl.</li>
<li class="">Pluggable storage of both tabular data and files.</li>
</ul>
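<p>To give a feel for what "request routing" in the list above means, here is a simplified sketch of label-based routing with decorator-registered handlers. It is not Crawlee's implementation; the <code>Router</code> class, its method names, and the <code>DETAIL</code> label are hypothetical and exist only for illustration.</p>

```python
from typing import Callable, Optional

Handler = Callable[[str], str]


class Router:
    """Maps a request label to the handler registered for it."""

    def __init__(self) -> None:
        self._handlers: dict[str, Handler] = {}
        self._default: Optional[Handler] = None

    def default_handler(self, func: Handler) -> Handler:
        # Used for requests that carry no label or an unknown one.
        self._default = func
        return func

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        def register(func: Handler) -> Handler:
            self._handlers[label] = func
            return func

        return register

    def dispatch(self, url: str, label: Optional[str] = None) -> str:
        func = self._handlers.get(label, self._default)
        if func is None:
            raise LookupError(f"no handler for label {label!r}")
        return func(url)


router = Router()


@router.default_handler
def handle_listing(url: str) -> str:
    return f"listing: {url}"


@router.handler("DETAIL")
def handle_detail(url: str) -> str:
    return f"detail: {url}"
```

<p>A crawler built this way can enqueue a URL together with a label and leave it to the router to pick the matching handler, which keeps listing-page and detail-page logic in separate functions.</p>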
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="understanding-the-why-behind-the-features-of-crawlee">Understanding the why behind the features of Crawlee<a href="https://crawlee.dev/blog/launching-crawlee-python#understanding-the-why-behind-the-features-of-crawlee" class="hash-link" aria-label="Direct link to Understanding the why behind the features of Crawlee" title="Direct link to Understanding the why behind the features of Crawlee" translate="no">​</a></h2>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="out-of-the-box-support-for-headless-browser-crawling-playwright">Out-of-the-box support for headless browser crawling (Playwright).<a href="https://crawlee.dev/blog/launching-crawlee-python#out-of-the-box-support-for-headless-browser-crawling-playwright" class="hash-link" aria-label="Direct link to Out-of-the-box support for headless browser crawling (Playwright)." title="Direct link to Out-of-the-box support for headless browser crawling (Playwright)." translate="no">​</a></h3>
<p>While libraries like Scrapy require additional installation of middleware, e.g., <a href="https://github.com/scrapy-plugins/scrapy-playwright" target="_blank" rel="noopener noreferrer"><code>scrapy-playwright</code></a>, which still doesn’t work on Windows, Crawlee for Python supports a unified interface for HTTP &amp; headless browsers.</p>
<p>Using a headless browser to download web pages and extract data, <code>PlaywrightCrawler</code> is ideal for crawling websites that require JavaScript execution.</p>
<p>For websites that don’t require JavaScript, consider using the <code>BeautifulSoupCrawler</code>, which utilizes raw HTTP requests and will be much faster.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> asyncio</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> crawlee</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">playwright_crawler </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> PlaywrightCrawlingContext</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" 
style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Create a crawler instance</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    crawler </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> PlaywrightCrawler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># headless=False,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># browser_type='firefox',</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token decorator annotation punctuation" style="color:#393A34">@crawler</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation punctuation" style="color:#393A34">router</span><span class="token decorator annotation punctuation" style="color:#393A34">.</span><span class="token decorator annotation 
punctuation" style="color:#393A34">default_handler</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">async</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">request_handler</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PlaywrightCrawlingContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        data </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'request_url'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">            </span><span class="token string" style="color:#e3116c">'page_url'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">url</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'page_title'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">title</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">'page_content'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">page</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">content</span><span class="token punctuation" style="color:#393A34">(</span><span class="token 
punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">:</span><span class="token number" style="color:#36acaa">10000</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> context</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">push_data</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">await</span><span class="token plain"> crawler</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'https://crawlee.dev'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" 
style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> __name__ </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'__main__'</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    asyncio</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">run</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>The above example uses Crawlee’s built-in <code>PlaywrightCrawler</code> to scrape the title and content of the <a href="https://crawlee.dev/">https://crawlee.dev/</a> website.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="small-learning-curve">Small learning curve<a href="https://crawlee.dev/blog/launching-crawlee-python#small-learning-curve" class="hash-link" aria-label="Direct link to Small learning curve" title="Direct link to Small learning curve" translate="no">​</a></h3>
<p>In other libraries like Scrapy, when you run a command to create a new project, you get many files. Then you need to learn about the architecture, including various components (spiders, middlewares, pipelines, etc.). <a href="https://crawlee.dev/blog/scrapy-vs-crawlee#language-and-development-environments">The learning curve is very steep</a>.</p>
<p>While building Crawlee, we made sure that the learning curve would be gentle and the setup fast.</p>
<p>With <a href="https://github.com/apify/crawlee-python/tree/master/templates" target="_blank" rel="noopener noreferrer">ready-made templates</a> and only a single file to add your code to, it's very easy to start building a scraper. You might need to learn a little about request handlers and storage, but that’s all.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="complete-type-hint-coverage">Complete type hint coverage<a href="https://crawlee.dev/blog/launching-crawlee-python#complete-type-hint-coverage" class="hash-link" aria-label="Direct link to Complete type hint coverage" title="Direct link to Complete type hint coverage" translate="no">​</a></h3>
<p>We know how much developers like their code to be high-quality, readable, and maintainable.</p>
<p>That's why the whole code base of Crawlee is fully type-hinted.</p>
<p>Thanks to that, you should have better autocompletion in your IDE, enhancing developer experience while developing your scrapers using Crawlee.</p>
<p>Type hinting should also reduce the number of bugs thanks to static type checking.</p>
<p><img decoding="async" loading="lazy" alt="Crawlee_Python_Type_Hint" src="https://crawlee.dev/assets/images/crawlee-python-type-hint-90bb0ec4fb86916d8a6b2512a80f965b.webp" width="877" height="457" class="img_ev3q"></p>
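<p>As a rough illustration of what static type checking buys you, here is a minimal sketch in plain Python. <code>PageRecord</code> and <code>truncate_content</code> are hypothetical names made up for this example; they are not part of Crawlee's API.</p>

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    # Hypothetical record shape, loosely mirroring the data pushed in the example above.
    url: str
    title: str
    content: str


def truncate_content(record: PageRecord, limit: int = 10000) -> PageRecord:
    # Return a copy of the record with its content cut to at most `limit` characters.
    return PageRecord(url=record.url, title=record.title, content=record.content[:limit])


if __name__ == '__main__':
    record = PageRecord(url='https://crawlee.dev', title='Crawlee', content='x' * 20)
    print(truncate_content(record, limit=5))
    # A static type checker such as mypy would reject the call below before it ever runs,
    # because `limit` is annotated as int:
    # truncate_content(record, limit='5')
```

<p>With annotations like these, your IDE can autocomplete the fields of <code>PageRecord</code>, and a type checker can flag the commented-out call without executing the code.</p>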
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="based-on-asyncio">Based on Asyncio<a href="https://crawlee.dev/blog/launching-crawlee-python#based-on-asyncio" class="hash-link" aria-label="Direct link to Based on Asyncio" title="Direct link to Based on Asyncio" translate="no">​</a></h3>
<p>Crawlee is fully asynchronous and based on <a href="https://docs.python.org/3/library/asyncio.html" target="_blank" rel="noopener noreferrer">Asyncio</a>. For scraping frameworks, where many I/O-bound operations occur, this should be crucial to achieving high performance.</p>
<p>Also, thanks to Asyncio, integration with other applications or the rest of your system should be easy.</p>
<p>How is this different from the Scrapy framework, which is also asynchronous?</p>
<p>Scrapy relies on the "legacy" Twisted framework. Integrating Scrapy with modern Asyncio-based applications can be challenging, often requiring more effort and debugging <a href="https://stackoverflow.com/questions/49201915/debugging-scrapy-project-in-visual-studio-code" target="_blank" rel="noopener noreferrer">[1]</a>.</p>
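<p>To see why Asyncio suits I/O-bound work, consider this minimal sketch that uses only the standard library. It does not use Crawlee itself: <code>fake_fetch</code> is a hypothetical stand-in that simulates a network request with <code>asyncio.sleep</code>, and the URLs are placeholders.</p>

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.2) -> str:
    # Hypothetical stand-in for a network request. While this coroutine
    # sleeps, the event loop is free to run the other coroutines.
    await asyncio.sleep(delay)
    return f'fetched {url}'


async def fetch_all(urls: list[str]) -> list[str]:
    # gather runs all coroutines concurrently and keeps results in input order.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def run_demo() -> tuple[list[str], float]:
    # Fetch five placeholder URLs and measure the total wall-clock time.
    urls = [f'https://example.com/page/{i}' for i in range(5)]
    start = time.perf_counter()
    results = asyncio.run(fetch_all(urls))
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == '__main__':
    results, elapsed = run_demo()
    print(results)
    print(f'{elapsed:.2f}s')
```

<p>Because the five simulated requests wait concurrently, the total time should be close to a single delay rather than the sum of all five. Any other Asyncio-based code can await the same coroutines, which is what makes this kind of integration straightforward.</p>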
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="power-of-open-source-community-and-early-adopters-giveaway">Power of open source community and early adopters giveaway<a href="https://crawlee.dev/blog/launching-crawlee-python#power-of-open-source-community-and-early-adopters-giveaway" class="hash-link" aria-label="Direct link to Power of open source community and early adopters giveaway" title="Direct link to Power of open source community and early adopters giveaway" translate="no">​</a></h2>
<p>Crawlee for Python is fully open source, and the codebase is available in the <a href="https://github.com/apify/crawlee-python" target="_blank" rel="noopener noreferrer">GitHub repository of Crawlee for Python</a>.</p>
<p>We have already started receiving initial and very <a href="https://github.com/apify/crawlee-python/pull/226" target="_blank" rel="noopener noreferrer">valuable contributions from the Python community</a>.</p>
<blockquote>
<p>Early adopters also said:</p>
<p>“Crawlee for Python development team did a great job in building the product, it makes things faster for a Python developer.”</p>
<p>~ <a href="https://apify.com/mantisus" target="_blank" rel="noopener noreferrer">Maksym Bohomolov</a></p>
</blockquote>
<p>There’s still room for improvement. Feel free to open issues, make pull requests, and <a href="https://github.com/apify/crawlee-python/" target="_blank" rel="noopener noreferrer">star the repository</a> to spread the word to other developers.</p>
<p><strong>We will award the first 10 pieces of feedback</strong> that add value and are accepted by our team with exclusive Crawlee for Python swag (the first Crawlee for Python swag ever). Check out the <a href="https://github.com/apify/crawlee-python/issues/269/" target="_blank" rel="noopener noreferrer">GitHub issue here</a>.</p>
<p>With such contributions, we’re excited and looking forward to building an amazing library for the Python community.</p>
<p>Check out a step-by-step guide on how to use Crawlee for Python in one of our <a href="https://blog.apify.com/crawlee-for-python-tutorial/" target="_blank" rel="noopener noreferrer">latest tutorials</a>.</p>
<p><a href="https://apify.com/discord" target="_blank" rel="noopener noreferrer">Join our Discord community</a> with nearly 8,000 web scraping developers, where our team would be happy to help you with any problems or discuss any use case for Crawlee for Python.</p>]]></content:encoded>
        </item>
    </channel>
</rss>