Proxy management

IP address blocking is one of the oldest and most effective ways of preventing access to a website. It is therefore paramount that a good web scraping library provides easy-to-use yet powerful tools that can work around IP blocking. The most powerful weapon in our anti-blocking arsenal is the proxy server.

With Crawlee we can use our own proxy servers or proxy servers acquired from third-party providers.

Quick start

If you already have proxy URLs of your own, you can start using them immediately in only a few lines of code.

import asyncio

from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )

    # The proxy URLs are rotated in a round-robin.
    proxy_url_1 = await proxy_configuration.new_url()  # http://proxy-1.com/
    proxy_url_2 = await proxy_configuration.new_url()  # http://proxy-2.com/
    proxy_url_3 = await proxy_configuration.new_url()  # http://proxy-1.com/


if __name__ == '__main__':
    asyncio.run(main())

Examples of how to use the proxy URLs with crawlers are shown in the Crawler integration section below.

Proxy configuration

All our proxy needs are managed by the ProxyConfiguration class. We create an instance by calling the ProxyConfiguration constructor with the desired options.

Crawler integration

ProxyConfiguration integrates seamlessly into BeautifulSoupCrawler and PlaywrightCrawler.

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create a ProxyConfiguration object and pass it to the crawler.
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }
        context.log.info(f'Extracted data: {data}')

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Our crawlers will now use the selected proxies for all connections.

IP rotation and session management

The proxy_configuration.new_url() method allows us to pass a session_id parameter. This creates a session_id-proxy_url pair, ensuring that subsequent new_url() calls with the same session_id return the same proxy_url. This is extremely useful in scraping, because we want to create the impression of a real user. See the SessionPool class for more information on how maintaining a real session helps avoid blocking.

When no session_id is provided, our proxy URLs are rotated round-robin.

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create a ProxyConfiguration object and pass it to the crawler.
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )
    crawler = BeautifulSoupCrawler(
        proxy_configuration=proxy_configuration,
        use_session_pool=True,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Log the proxy used for the current request.
        context.log.info(f'Proxy for the current request: {context.proxy_info}')

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
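
For a more direct look at the session_id to proxy pairing described above, here is a minimal sketch that calls new_url() directly; the session IDs 'user-1' and 'user-2' are purely illustrative.

import asyncio

from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )

    # The same session_id is always paired with the same proxy URL.
    url_a = await proxy_configuration.new_url(session_id='user-1')
    url_b = await proxy_configuration.new_url(session_id='user-1')
    assert url_a == url_b

    # A different session_id gets its own proxy from the rotation.
    url_c = await proxy_configuration.new_url(session_id='user-2')
    print(url_a, url_c)


if __name__ == '__main__':
    asyncio.run(main())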

Tiered proxies

When you use HTTP proxies in real world crawling scenarios, you have to decide which type of proxy to use to reach the sweet spot between cost efficiency and reliably avoiding blocking. Some websites may allow crawling with no proxy, on some you may get away with using datacenter proxies, which are cheap but easily detected, and sometimes you need to use expensive residential proxies.

To take the guesswork out of this process, Crawlee allows you to configure multiple tiers of proxy URLs. When crawling, it will automatically pick the lowest tier (smallest index) where it doesn't encounter blocking. If you organize your proxy server URLs in tiers so that the lowest tier contains the cheapest, least reliable ones and each higher tier contains more expensive, more reliable ones, you will get an optimal anti-blocking performance.

Within the active tier, Crawlee rotates through the proxies in round-robin fashion, just as it does with proxy_urls.

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create a ProxyConfiguration object and pass it to the crawler.
    proxy_configuration = ProxyConfiguration(
        tiered_proxy_urls=[
            # No-proxy tier. (Optional: include it if the lowest tier should try without any proxy.)
            [None],
            # Lower tier, cheaper, preferred as long as they work.
            ['http://cheap-datacenter-proxy-1.com/', 'http://cheap-datacenter-proxy-2.com/'],
            # Higher tier, more expensive, used as a fallback.
            ['http://expensive-residential-proxy-1.com/', 'http://expensive-residential-proxy-2.com/'],
        ]
    )
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Log the proxy used for the current request.
        context.log.info(f'Proxy for the current request: {context.proxy_info}')

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Inspecting current proxy in crawlers

BeautifulSoupCrawler and PlaywrightCrawler expose information about the proxy used for the current request through the proxy_info object available in the request handler. It provides easy access to the proxy URL.

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create a ProxyConfiguration object and pass it to the crawler.
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Log the proxy used for the current request.
        context.log.info(f'Proxy for the current request: {context.proxy_info}')

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
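
If you only need the proxy URL itself rather than the whole ProxyInfo object, it is available as an attribute. The following is a minimal sketch of an alternative handler for the example above; it assumes the url field on ProxyInfo and guards against proxy_info being None when no proxy is configured.

from crawlee.crawlers import BeautifulSoupCrawlingContext


@crawler.router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    # proxy_info may be None if the request was made without a proxy.
    if context.proxy_info is not None:
        # Assumed attribute: ProxyInfo.url holds the full proxy URL.
        context.log.info(f'Proxy URL for the current request: {context.proxy_info.url}')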