Proxy management
IP address blocking is one of the oldest and most effective ways of preventing access to a website. It is therefore paramount for a good web scraping library to provide easy-to-use yet powerful tools that can work around IP blocking. The most powerful weapon in our anti-IP-blocking arsenal is a proxy server.
With Crawlee, we can use our own proxy servers or proxy servers acquired from third-party providers.
Quick start
If you already have proxy URLs of your own, you can start using them immediately in only a few lines of code.
import asyncio

from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )

    # The proxy URLs are rotated in a round-robin.
    proxy_url_1 = await proxy_configuration.new_url()  # http://proxy-1.com/
    proxy_url_2 = await proxy_configuration.new_url()  # http://proxy-2.com/
    proxy_url_3 = await proxy_configuration.new_url()  # http://proxy-1.com/


if __name__ == '__main__':
    asyncio.run(main())
Examples of how to use these proxy URLs with crawlers are shown below in the Crawler integration section.
Proxy configuration
All our proxy needs are managed by the ProxyConfiguration class. We create an instance by calling the ProxyConfiguration constructor with the desired options.
Crawler integration
ProxyConfiguration integrates seamlessly into BeautifulSoupCrawler and PlaywrightCrawler.

BeautifulSoupCrawler:
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create a ProxyConfiguration object and pass it to the crawler.
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }
        context.log.info(f'Extracted data: {data}')

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
PlaywrightCrawler:

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create a ProxyConfiguration object and pass it to the crawler.
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )
    crawler = PlaywrightCrawler(proxy_configuration=proxy_configuration)

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def default_handler(context: PlaywrightCrawlingContext) -> None:
        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
        }
        context.log.info(f'Extracted data: {data}')

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
Our crawlers will now use the selected proxies for all connections.
IP rotation and session management
The proxy_configuration.new_url() method allows us to pass a session_id parameter. This creates a session_id-proxy_url pair, ensuring that subsequent new_url() calls with the same session_id return the same proxy_url. This is extremely useful in scraping, because we want to create the impression of a real user. See the SessionPool class for more information on how maintaining a real session helps avoid blocking.

When no session_id is provided, our proxy URLs are rotated round-robin.
BeautifulSoupCrawler:

from crawlee.crawlers import BeautifulSoupCrawler
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create a ProxyConfiguration object and pass it to the crawler.
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )
    crawler = BeautifulSoupCrawler(
        proxy_configuration=proxy_configuration,
        use_session_pool=True,
    )
PlaywrightCrawler:

from crawlee.crawlers import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create a ProxyConfiguration object and pass it to the crawler.
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )
    crawler = PlaywrightCrawler(
        proxy_configuration=proxy_configuration,
        use_session_pool=True,
    )
Tiered proxies
When you use HTTP proxies in real-world crawling scenarios, you have to decide which type of proxy to use to reach the sweet spot between cost efficiency and reliably avoiding blocking. Some websites may allow crawling with no proxy, on some you may get away with using datacenter proxies, which are cheap but easily detected, and sometimes you need to use expensive residential proxies.

To take the guesswork out of this process, Crawlee allows you to configure multiple tiers of proxy URLs. When crawling, it will automatically pick the lowest tier (smallest index) where it doesn't encounter blocking. If you organize your proxy server URLs in tiers so that the lowest tier contains the cheapest, least reliable ones and each higher tier contains more expensive, more reliable ones, you will get optimal anti-blocking performance.

In an active tier, Crawlee will alternate between proxies in a round-robin fashion, just like it would with proxy_urls.
BeautifulSoupCrawler:

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create a ProxyConfiguration object and pass it to the crawler.
    proxy_configuration = ProxyConfiguration(
        tiered_proxy_urls=[
            # No-proxy tier. (Optional; use it if you want the lowest tier to try crawling without any proxy.)
            [None],
            # Lower tier: cheaper proxies, preferred as long as they work.
            ['http://cheap-datacenter-proxy-1.com/', 'http://cheap-datacenter-proxy-2.com/'],
            # Higher tier: more expensive proxies, used as a fallback.
            ['http://expensive-residential-proxy-1.com/', 'http://expensive-residential-proxy-2.com/'],
        ]
    )
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Log the proxy used for the current request.
        context.log.info(f'Proxy for the current request: {context.proxy_info}')

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
PlaywrightCrawler:

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create a ProxyConfiguration object and pass it to the crawler.
    proxy_configuration = ProxyConfiguration(
        tiered_proxy_urls=[
            # No-proxy tier. (Optional; use it if you want the lowest tier to try crawling without any proxy.)
            [None],
            # Lower tier: cheaper proxies, preferred as long as they work.
            ['http://cheap-datacenter-proxy-1.com/', 'http://cheap-datacenter-proxy-2.com/'],
            # Higher tier: more expensive proxies, used as a fallback.
            ['http://expensive-residential-proxy-1.com/', 'http://expensive-residential-proxy-2.com/'],
        ]
    )
    crawler = PlaywrightCrawler(proxy_configuration=proxy_configuration)

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def default_handler(context: PlaywrightCrawlingContext) -> None:
        # Log the proxy used for the current request.
        context.log.info(f'Proxy for the current request: {context.proxy_info}')

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
Inspecting current proxy in crawlers
The BeautifulSoupCrawler and PlaywrightCrawler provide access to information about the currently used proxy in the request handler via the proxy_info object. This object allows easy access to the proxy URL.

BeautifulSoupCrawler:
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create a ProxyConfiguration object and pass it to the crawler.
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Log the proxy used for the current request.
        context.log.info(f'Proxy for the current request: {context.proxy_info}')

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
PlaywrightCrawler:

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create a ProxyConfiguration object and pass it to the crawler.
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.com/',
            'http://proxy-2.com/',
        ]
    )
    crawler = PlaywrightCrawler(proxy_configuration=proxy_configuration)

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def default_handler(context: PlaywrightCrawlingContext) -> None:
        # Log the proxy used for the current request.
        context.log.info(f'Proxy for the current request: {context.proxy_info}')

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
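If only the proxy URL itself is needed, it can be read from proxy_info directly. Below is a minimal sketch of such a handler, assuming proxy_info exposes the URL via a url attribute and may be None when the crawler runs without a proxy:

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    proxy_configuration = ProxyConfiguration(proxy_urls=['http://proxy-1.com/'])
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)

    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # proxy_info may be None when no proxy is configured for the request.
        if context.proxy_info is not None:
            context.log.info(f'Proxy URL for the current request: {context.proxy_info.url}')

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())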