
Logging in with a crawler

Many websites require authentication to access their content. This guide demonstrates how to implement login functionality using both PlaywrightCrawler and HttpCrawler.

Session management for authentication

When implementing authentication, you'll typically want to maintain the same Session throughout your crawl to preserve login state. This requires proper configuration of the SessionPool. For more details, see our session management guide.

If your use case requires multiple authenticated sessions with different credentials, you can:

  • Use the new_session_function parameter in SessionPool to customize session creation.
  • Specify the session_id parameter in Request to bind specific requests to particular sessions, as sketched below.
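For illustration, here is a minimal sketch of both options. It assumes that new_session_function accepts a callable returning a Session and that session_id refers to the id of a session in the pool; the exact constructor arguments and parameter names can differ between Crawlee versions, so check the SessionPool and Request references before relying on it.

from crawlee import Request
from crawlee.sessions import Session, SessionPool


# Hypothetical factory: create each new session with an explicit id so that
# requests can be bound to it later. The constructor arguments shown here
# are assumptions, not a fixed API.
def create_account_session() -> Session:
    return Session(id='account-a-session')


session_pool = SessionPool(
    max_pool_size=1,
    # Customize how new sessions are created, as described above.
    new_session_function=create_account_session,
)

# Bind a specific request to that session so it reuses the session's cookies.
request = Request.from_url(
    'https://demoqa.com/profile',
    session_id='account-a-session',
)

In practice you would perform the login inside a handler bound to each session (as in the examples below) and route the requests for each account to the matching session_id.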

For this guide, we'll use demoqa.com, a testing site designed for automation practice that provides a login form and protected content.

Login with Playwright crawler

The following example demonstrates how to authenticate on a website using PlaywrightCrawler, which provides browser automation capabilities for filling out login forms.

import asyncio
from datetime import timedelta

from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import (
    PlaywrightCrawler,
    PlaywrightCrawlingContext,
)
from crawlee.sessions import SessionPool


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=10,
        headless=True,
        browser_type='chromium',
        # We only have one session and it shouldn't rotate
        max_session_rotations=0,
        # Limit crawling intensity to avoid blocking
        concurrency_settings=ConcurrencySettings(max_tasks_per_minute=30),
        session_pool=SessionPool(
            # Limit the pool to one session
            max_pool_size=1,
            create_session_settings={
                # High value for the session usage limit
                'max_usage_count': 999_999,
                # High value for the session lifetime
                'max_age': timedelta(hours=999_999),
                # A high score lets the session encounter more errors
                # before Crawlee decides it is blocked.
                # Make sure you know how to handle these errors.
                'max_error_score': 100,
            },
        ),
    )

    # The main handler for processing requests
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # A handler for the login page
    @crawler.router.handler('login')
    async def login_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing login {context.request.url} ...')

        # Check that the session is available
        if not context.session:
            raise RuntimeError('Session not found')

        # Enter data into the form; `delay` simulates human typing.
        # Without it, the data would be entered instantly.
        await context.page.type('#userName', 'crawlee_test', delay=100)
        await context.page.type('#password', 'Test1234!', delay=100)
        await context.page.click('#login', delay=100)

        # Wait for an element confirming that we have successfully
        # logged in to the site
        await context.page.locator('#userName-value').first.wait_for(state='visible')
        context.log.info('Login successful!')

        # Move on to the basic crawling flow
        await context.add_requests(['https://demoqa.com/books'])

    # Start crawling with the login request; this is necessary
    # to access the rest of the pages.
    await crawler.run([Request.from_url('https://demoqa.com/login', label='login')])


if __name__ == '__main__':
    asyncio.run(main())

Login with HTTP crawler

You can also use HttpCrawler (or its more specific variants like ParselCrawler or BeautifulSoupCrawler) to authenticate by sending a POST Request with your credentials directly to the authentication endpoint.

HTTP-based authentication often varies significantly between websites. Using browser DevTools to analyze the Network tab during manual login can help you understand the specific authentication flow, required headers, and body parameters for your target website.
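As a quick standalone check (not part of the crawler example that follows), you can replay the POST you observed in DevTools with a plain HTTP client and confirm the response shape before wiring it into the crawler. The sketch below uses httpx, an extra dependency chosen only for illustration; the endpoint, credentials, and response fields mirror the HttpCrawler example below.

import asyncio

import httpx


async def check_login() -> None:
    # Replay the login request captured in the browser's Network tab.
    async with httpx.AsyncClient() as client:
        response = await client.post(
            'https://demoqa.com/Account/v1/Login',
            json={'userName': 'crawlee_test', 'password': 'Test1234!'},
        )
        response.raise_for_status()
        data = response.json()

    # These are the fields the crawler's login handler turns into session cookies.
    print(data['token'], data['expires'], data['userId'], data['username'])


if __name__ == '__main__':
    asyncio.run(check_login())

Once the endpoint and payload are confirmed, the same request can be issued through HttpCrawler, as shown in the full example below.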

import asyncio
import json
from datetime import datetime, timedelta

from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import (
    HttpCrawler,
    HttpCrawlingContext,
)
from crawlee.sessions import SessionPool


async def main() -> None:
    crawler = HttpCrawler(
        max_requests_per_crawl=10,
        # Configure to use a single persistent session throughout the crawl
        max_session_rotations=0,
        # Limit request rate to avoid triggering anti-scraping measures
        concurrency_settings=ConcurrencySettings(max_tasks_per_minute=30),
        session_pool=SessionPool(
            max_pool_size=1,
            create_session_settings={
                # Set a high value to ensure the session isn't replaced during crawling
                'max_usage_count': 999_999,
                # Set a high value to prevent session expiration during crawling
                'max_age': timedelta(hours=999_999),
                # Higher error tolerance before the session is considered blocked.
                # Make sure you implement proper error handling in your code.
                'max_error_score': 100,
            },
        ),
    )

    # Default request handler for normal page processing
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # Specialized handler for the login API request
    @crawler.router.handler('login')
    async def login_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing login at {context.request.url} ...')

        # Verify that a session is available before proceeding
        if not context.session:
            raise RuntimeError('Session not found')

        # Parse the API response containing authentication tokens and user data
        data = json.loads(context.http_response.read())

        # Extract authentication data from the response
        token = data['token']
        expires = data['expires'].replace('Z', '+00:00')
        expires_int = int(datetime.fromisoformat(expires).timestamp())
        user_id = data['userId']
        username = data['username']

        # Set authentication cookies in the session that will be used
        # for subsequent requests
        context.session.cookies.set(name='token', value=token, expires=expires_int)
        context.session.cookies.set(name='userID', value=user_id)
        context.session.cookies.set(name='userName', value=username)

        # After successful authentication, continue crawling with the
        # authenticated session
        await context.add_requests(['https://demoqa.com/BookStore/v1/Books'])

    # Create a POST request to the authentication API endpoint.
    # This will trigger the login_handler when executed.
    request = Request.from_url(
        'https://demoqa.com/Account/v1/Login',
        label='login',
        method='POST',
        payload=json.dumps(
            {'userName': 'crawlee_test', 'password': 'Test1234!'}
        ).encode(),
        headers={'Content-Type': 'application/json'},
    )

    # Start the crawling process with the login request
    await crawler.run([request])


if __name__ == '__main__':
    asyncio.run(main())