
Proxy management

IP address blocking is one of the oldest and most effective ways of preventing access to a website. It is therefore paramount for a good web scraping library to provide easy-to-use but powerful tools that can work around IP blocking. The most powerful weapon in our anti-IP-blocking arsenal is a proxy server.

With Crawlee we can use our own proxy servers or proxy servers acquired from third-party providers.

Quick start

If you already have proxy URLs of your own, you can start using them immediately in only a few lines of code.

import asyncio

from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )

    # The proxy URLs are rotated in a round-robin.
    proxy_url_1 = await proxy_configuration.new_url()  # http://proxy-1.com/
    proxy_url_2 = await proxy_configuration.new_url()  # http://proxy-2.com/
    proxy_url_3 = await proxy_configuration.new_url()  # http://proxy-1.com/


if __name__ == '__main__':
    asyncio.run(main())

Examples of how to use proxy URLs with crawlers are shown in the Crawler integration section below.

Proxy configuration

All our proxy needs are managed by the ProxyConfiguration class. We create an instance by calling the ProxyConfiguration constructor with the desired options.
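
Proxy URLs follow the standard proxy URL format, so credentials for authenticated proxies can be embedded directly in each URL. Below is a minimal sketch; the hostnames, port, and credentials are placeholders:

import asyncio

from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            # Standard format: scheme://[username:password@]hostname[:port]
            # The hosts and credentials below are placeholders.
            'http://user:password@proxy-1.com:8000/',
            'http://user:password@proxy-2.com:8000/',
        ]
    )
    proxy_url = await proxy_configuration.new_url()


if __name__ == '__main__':
    asyncio.run(main())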

Crawler integration

ProxyConfiguration integrates seamlessly into BeautifulSoupCrawler and PlaywrightCrawler.

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create a ProxyConfiguration object and pass it to the crawler.
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }
        context.log.info(f'Extracted data: {data}')

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Our crawlers will now use the selected proxies for all connections.

IP Rotation and session management

The proxy_configuration.new_url() method allows us to pass a session_id parameter. This creates a session_id-proxy_url pair, ensuring that subsequent new_url() calls with the same session_id return the same proxy_url. This is extremely useful in scraping, because we want to create the impression of a real user. See the SessionPool class for more information on how maintaining a real session helps avoid blocking.

When no session_id is provided, our proxy URLs are rotated round-robin.

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create a ProxyConfiguration object and pass it to the crawler.
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )
    crawler = BeautifulSoupCrawler(
        proxy_configuration=proxy_configuration,
        use_session_pool=True,
        persist_cookies_per_session=True,
    )

    # ...
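
The session_id-proxy_url pairing described above can also be observed directly on ProxyConfiguration, without running a crawler. Below is a minimal sketch; the session IDs are arbitrary placeholders:

import asyncio

from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )

    # Calls that share a session_id are paired with the same proxy URL.
    proxy_url_1 = await proxy_configuration.new_url(session_id='session-1')
    proxy_url_2 = await proxy_configuration.new_url(session_id='session-1')
    # proxy_url_1 == proxy_url_2

    # A different session_id may be paired with a different proxy from the pool.
    proxy_url_3 = await proxy_configuration.new_url(session_id='session-2')


if __name__ == '__main__':
    asyncio.run(main())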

Inspecting current proxy in crawlers

The BeautifulSoupCrawler and PlaywrightCrawler expose information about the proxy used for the current request to the request handler via the proxy_info object, which provides easy access to the proxy URL.

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create a ProxyConfiguration object and pass it to the crawler.
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Log the proxy used for the current request.
        context.log.info(f'Proxy for the current request: {context.proxy_info}')

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
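
Since the object gives access to the proxy URL, the handler above can also log just that field rather than the whole object. Below is a minimal sketch of an alternative handler body, assuming the proxy_info object exposes the URL as a url attribute (check the ProxyInfo reference for the exact fields):

    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # proxy_info is None when no proxy is configured for the request.
        if context.proxy_info is not None:
            # Assumption: the proxy URL is available as proxy_info.url.
            context.log.info(f'Proxy URL for the current request: {context.proxy_info.url}')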