
Session management

The SessionPool class provides a robust way to manage the rotation of proxy IP addresses, cookies, and other custom settings in Crawlee. Its primary advantage is the ability to filter out blocked or non-functional proxies, ensuring that your scraper avoids retrying requests through known problematic proxies.

Additionally, it enables storing information tied to specific IP addresses, such as cookies, authentication tokens, and custom headers. This association reduces the probability of detection and blocking by ensuring cookies and other identifiers are used consistently with the same IP address.
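For instance, a request handler can keep identifiers tied to the session it was given, so later requests made with that session re-use them. The sketch below is only an illustration of that idea; it assumes the Session object exposes a user_data mapping for arbitrary per-session values (check the API reference for the exact attribute), and the auth_token key is a made-up example.

import asyncio

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext


async def main() -> None:
    crawler = BasicCrawler(use_session_pool=True)

    @crawler.router.default_handler
    async def handler(context: BasicCrawlingContext) -> None:
        if context.session is None:
            return

        # Store a value obtained with this session so it is re-used only with
        # the same session (and therefore the same proxy IP and cookies).
        # Assumes `Session.user_data` is a plain dict-like mapping.
        token = context.session.user_data.get('auth_token')
        if token is None:
            # ... obtain the token with a request made through this session ...
            context.session.user_data['auth_token'] = 'example-token'

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())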

Finally, it ensures even IP address rotation by randomly selecting sessions. This helps prevent overuse of a limited pool of available IPs, reducing the risk of IP bans and enhancing the efficiency of your scraper.

For more details on configuring proxies, refer to the Proxy management guide.
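As a quick reminder before the full example, a ProxyConfiguration is typically created from a list of your own proxy URLs. The addresses below are placeholders, not working proxies:

from crawlee.proxy_configuration import ProxyConfiguration

# Substitute your own proxy URLs here.
proxy_configuration = ProxyConfiguration(
    proxy_urls=[
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
)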

Now, let's explore an example of how to use the SessionPool:

import asyncio
import re

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration
from crawlee.sessions import SessionPool


async def main() -> None:
    # To use the proxy IP session rotation logic, you must turn proxy usage on.
    proxy_configuration = ProxyConfiguration(
        # options
    )

    # Initialize the crawler with a custom SessionPool configuration
    # to manage concurrent sessions and proxy rotation.
    crawler = BasicCrawler(
        proxy_configuration=proxy_configuration,
        # Activates the session pool (default is True).
        use_session_pool=True,
        # Overrides the default session pool configuration.
        session_pool=SessionPool(max_pool_size=100),
    )

    # Define the default request handler that manages session states.
    @crawler.router.default_handler
    async def default_handler(context: BasicCrawlingContext) -> None:
        # Send a request; BasicCrawler automatically selects a session from the pool
        # and sets a proxy for it. You can check them with `context.session`
        # and `context.proxy_info`.
        response = await context.send_request(context.request.url)

        page_content = response.read().decode()
        title_match = re.search(r'<title(?:.*?)>(.*?)</title>', page_content)

        if context.session and (title := title_match.group(1) if title_match else None):
            if title == 'Blocked':
                context.session.retire()
            elif title == 'Not sure if blocked, might also be a connection error':
                context.session.mark_bad()
            else:
                context.session.mark_good()  # BasicCrawler handles this automatically.

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

This example demonstrates the basics of configuring and using the SessionPool.
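Beyond the crawler integration shown above, a SessionPool can also be used on its own. The following is a minimal sketch that assumes SessionPool works as an async context manager and exposes a get_session() coroutine, as described in the API reference; the request logic itself is elided.

import asyncio

from crawlee.sessions import SessionPool


async def main() -> None:
    # Open the pool as an async context manager so its sessions
    # are created and maintained for you.
    async with SessionPool(max_pool_size=10) as session_pool:
        # Pick a random usable session from the pool.
        session = await session_pool.get_session()

        # ... perform your request with this session's cookies/proxy here ...

        # Report the outcome back so the pool can rotate bad sessions out.
        session.mark_good()


if __name__ == '__main__':
    asyncio.run(main())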

Please bear in mind that the SessionPool requires some time to establish a stable pool of working IPs. During the initial setup, you may encounter errors as the pool identifies and filters out blocked or non-functional IPs. This stabilization period is expected, and reliability will improve over time.