Session management

The SessionPool class provides a robust way to manage the rotation of proxy IP addresses, cookies, and other custom settings in Crawlee. Its primary advantage is the ability to filter out blocked or non-functional proxies, ensuring that your scraper avoids retrying requests through known problematic proxies.

Additionally, it enables storing information tied to specific IP addresses, such as cookies, authentication tokens, and custom headers. This association reduces the probability of detection and blocking by ensuring cookies and other identifiers are used consistently with the same IP address.

Finally, it ensures even IP address rotation by randomly selecting sessions. This helps prevent overuse of a limited pool of available IPs, reducing the risk of IP bans and enhancing the efficiency of your scraper.

For more details on configuring proxies, refer to the Proxy management guide.

Now, let's explore examples of how to use the SessionPool in different scenarios:

import asyncio
import re

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration
from crawlee.sessions import SessionPool


async def main() -> None:
    # To use the proxy IP session rotation logic, you must turn proxy usage on.
    proxy_configuration = ProxyConfiguration(
        # options
    )

    # Initialize the crawler with a custom SessionPool configuration
    # to manage concurrent sessions and proxy rotation.
    crawler = BasicCrawler(
        proxy_configuration=proxy_configuration,
        # Activate the session pool (default is True).
        use_session_pool=True,
        # Override the default session pool configuration.
        session_pool=SessionPool(max_pool_size=100),
    )

    # Define the default request handler that manages session states.
    @crawler.router.default_handler
    async def default_handler(context: BasicCrawlingContext) -> None:
        # Send a request; BasicCrawler automatically selects a session from the pool
        # and sets a proxy for it. You can check them with `context.session`
        # and `context.proxy_info`.
        response = await context.send_request(context.request.url)

        page_content = response.read().decode()
        title_match = re.search(r'<title(?:.*?)>(.*?)</title>', page_content)

        if context.session and (title := title_match.group(1) if title_match else None):
            if title == 'Blocked':
                context.session.retire()
            elif title == 'Not sure if blocked, might also be a connection error':
                context.session.mark_bad()
            else:
                context.session.mark_good()  # BasicCrawler handles this automatically.

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
asyncio.run(main())

This example demonstrates the basics of configuring and using the SessionPool with a crawler.

Please bear in mind that SessionPool requires some time to establish a stable pool of working IPs. During the initial setup, you may encounter errors as the pool identifies and filters out blocked or non-functional IPs. This stabilization period is expected, and performance will improve over time.
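
The session life cycle shown above (pick a session, then call mark_good, mark_bad, or retire depending on the result) isn't tied to a crawler; the pool can also be used on its own, which is convenient for experimenting. Below is a minimal sketch of that, assuming the standalone SessionPool API (used as an async context manager with a get_session method) and the Session.user_data mapping for session-scoped values; exact names may differ between versions, so treat it as an illustration rather than a reference.

import asyncio

from crawlee.sessions import SessionPool


async def main() -> None:
    # Using the pool as an async context manager takes care of its setup and teardown.
    async with SessionPool(max_pool_size=10) as session_pool:
        # Pick a random usable session from the pool; this is the rotation step.
        session = await session_pool.get_session()

        # Keep data tied to this session (and therefore to its cookies and proxy),
        # e.g. an auth token or custom headers. The key and value are illustrative.
        session.user_data['token'] = 'example-token'

        # Report how the request made with this session went, so the pool can
        # keep good sessions in rotation and drop blocked ones.
        session.mark_good()
        # session.retire()  # call this instead if the session appears to be blocked


if __name__ == '__main__':
    asyncio.run(main())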

Configuring a single session

In some cases, you need full control over session usage, for example when working with websites that require authentication or the initialization of certain parameters, such as cookies.

When working with a site that requires authentication, we typically don't want multiple sessions with different browser fingerprints or client parameters accessing the site. In this case, we need to configure the SessionPool appropriately:

import asyncio
from datetime import timedelta

from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import BasicCrawlingContext, HttpCrawler, HttpCrawlingContext
from crawlee.errors import SessionError
from crawlee.sessions import SessionPool


async def main() -> None:
    crawler = HttpCrawler(
        # Limit requests per minute to reduce the chance of being blocked
        concurrency_settings=ConcurrencySettings(max_tasks_per_minute=50),
        # Disable session rotation
        max_session_rotations=0,
        session_pool=SessionPool(
            # Only one session in the pool
            max_pool_size=1,
            create_session_settings={
                # High value for session usage limit
                'max_usage_count': 999_999,
                # High value for session lifetime
                'max_age': timedelta(hours=999_999),
                # High score allows the session to encounter more errors
                # before crawlee decides the session is blocked
                # Make sure you know how to handle these errors
                'max_error_score': 100,
                # 403 status usually indicates you're already blocked
                'blocked_status_codes': [403],
            },
        ),
    )

    # Basic request handling logic
    @crawler.router.default_handler
    async def basic_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    # Handler for session initialization (authentication, initial cookies, etc.)
    @crawler.router.handler(label='session_init')
    async def session_init(context: HttpCrawlingContext) -> None:
        if context.session:
            context.log.info(f'Init session {context.session.id}')

    # Monitor if our session gets blocked and explicitly stop the crawler
    @crawler.error_handler
    async def error_processing(context: BasicCrawlingContext, error: Exception) -> None:
        if isinstance(error, SessionError) and context.session:
            context.log.info(f'Session {context.session.id} blocked')
            crawler.stop()

    await crawler.run([Request.from_url('https://example.org/', label='session_init')])


if __name__ == '__main__':
asyncio.run(main())
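
In the example above, the session_init handler only logs the session id. In a real authentication flow it would typically also store whatever the site hands back (a token, extra cookies) on the session itself, so that later requests made through the same session can reuse it. The handler could be replaced with something along these lines; the response header name is purely hypothetical, and keeping the value in Session.user_data is an assumption about where such state belongs:

    # A sketch of a richer `session_init` handler (replacing the logging-only one).
    @crawler.router.handler(label='session_init')
    async def session_init(context: HttpCrawlingContext) -> None:
        if not context.session:
            return

        # Read a token from the first response made with this session.
        # 'x-example-token' is a hypothetical header used only for illustration.
        token = context.http_response.headers.get('x-example-token')

        if token:
            # Store the token on the session so it stays tied to the same
            # cookies and proxy for the rest of the crawl.
            context.session.user_data['token'] = token

        context.log.info(f'Init session {context.session.id}')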

Binding requests to specific sessions

In the previous example, there's one obvious limitation: you're restricted to a single session.

In some cases, we need the same behavior but with multiple sessions working in parallel, such as authenticating with different profiles or using different proxies.

To do this, use the session_id parameter of the Request object to bind a request to a specific session:

import asyncio
from datetime import timedelta
from itertools import count
from typing import Callable

from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import BasicCrawlingContext, HttpCrawler, HttpCrawlingContext
from crawlee.errors import RequestCollisionError
from crawlee.sessions import Session, SessionPool


# Define a function for creating sessions with simple logic for unique `id` generation.
# This is necessary if you need to specify a particular session for the first request,
# for example during authentication
def create_session_function() -> Callable[[], Session]:
    counter = count()

    def create_session() -> Session:
        return Session(
            id=str(next(counter)),
            max_usage_count=999_999,
            max_age=timedelta(hours=999_999),
            max_error_score=100,
            blocked_status_codes=[403],
        )

    return create_session


async def main() -> None:
    crawler = HttpCrawler(
        # Adjust request limits according to your pool size
        concurrency_settings=ConcurrencySettings(max_tasks_per_minute=500),
        # Requests are bound to specific sessions, no rotation needed
        max_session_rotations=0,
        session_pool=SessionPool(
            max_pool_size=10, create_session_function=create_session_function()
        ),
    )

    @crawler.router.default_handler
    async def basic_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    # Initialize the session and bind the next request to this session if needed
    @crawler.router.handler(label='session_init')
    async def session_init(context: HttpCrawlingContext) -> None:
        next_requests = []
        if context.session:
            context.log.info(f'Init session {context.session.id}')
            next_request = Request.from_url(
                'https://placeholder.dev', session_id=context.session.id
            )
            next_requests.append(next_request)

        await context.add_requests(next_requests)

    # Handle errors when a session is blocked and no longer available in the pool
    # when attempting to execute requests bound to it
    @crawler.failed_request_handler
    async def error_processing(context: BasicCrawlingContext, error: Exception) -> None:
        if isinstance(error, RequestCollisionError) and context.session:
            context.log.error(
                f'Request {context.request.url} failed, because the bound '
                'session is unavailable'
            )

    # Create a pool of requests bound to their respective sessions
    # Use `always_enqueue=True` if session initialization happens on a non-unique address,
    # such as the site's main page
    init_requests = [
        Request.from_url(
            'https://example.org/',
            label='session_init',
            session_id=str(session_id),
            use_extended_unique_key=True,
        )
        for session_id in range(1, 11)
    ]

    await crawler.run(init_requests)


if __name__ == '__main__':
asyncio.run(main())