Creating a web archive
Archiving webpages is one of the tasks a web crawler can be used for. Common use cases include archiving for future reference, speeding up web crawler development, and creating top-level regression tests for web crawlers.
There are various existing web archives with massive amounts of data collected over their years of existence, for example the Wayback Machine or Common Crawl. There are also dedicated tools for archiving web pages, to name some: simple browser extensions such as Archive Webpage, open-source tools such as pywb or warcio, or even web crawlers specialized in archiving, such as Browsertrix.
The common file format used for archiving is WARC. Crawlee does not offer any out-of-the-box functionality for creating WARC files, but in this guide we will show approaches that you can easily adapt to create WARC files with Crawlee in your own use case.
Crawling through a proxy recording server
This approach can be especially attractive, as it requires almost no changes to the crawler itself and the correct WARC creation is handled by the well-maintained pywb package. The trick is to run a properly configured wayback proxy server, use it as a proxy for the crawler, and record all traffic. Another advantage of this approach is that it is language agnostic: you can record both your Python-based and your JavaScript-based crawlers this way. It is very straightforward and a good place to start.
This approach assumes that you have already created your crawler and that you just want to archive all the pages it visits during its crawl.
Install pywb, which will allow you to use the wb-manager and wayback commands.
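For example, you can install it from PyPI into the same environment as your crawler:
pip install pywb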
Create a new collection that will be used for this archiving session and start the wayback server:
wb-manager init example-collection
wayback --record --live -a --auto-interval 10 --proxy example-collection --proxy-record
Instead of passing many configuration arguments to the wayback command, you can configure the server by adding configuration options to config.yaml. See the details in the pywb documentation.
Configure the crawler
Now you should use this locally hosted server as a proxy in your crawler. There are two more steps before starting the crawler:
- Make the crawler use the proxy server.
- Deal with the pywb Certificate Authority.
For example, in PlaywrightCrawler, this is the simplest setup, which takes a shortcut and ignores the CA-related errors:
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Use the local wayback server as a proxy.
        proxy_configuration=ProxyConfiguration(proxy_urls=['http://localhost:8080/']),
        # Ignore HTTPS errors if you have not followed the pywb CA setup instructions.
        browser_launch_options={'ignore_https_errors': True},
        max_requests_per_crawl=10,
        headless=False,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Archiving {context.request.url} ...')
        # For some sites where the content loads dynamically, you may need to
        # scroll the page to load all of it. This slows down the crawling,
        # but ensures that all content is loaded.
        await context.infinite_scroll()
        await context.enqueue_links(strategy='same-domain')

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
After you run the crawler, you will find the archived data in the wayback collection directory, for example .../collections/example-collection/archive. You can then access the recorded pages directly through the proxy recording server or use them with any other WARC-compatible tool.
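As a quick sanity check, you can inspect the recorded WARC files with warcio. The following is only a minimal sketch; it assumes you run it from the directory where you created the collection, so the path below may differ in your setup:

from pathlib import Path

from warcio.archiveiterator import ArchiveIterator

# Path to the collection archive created by the wayback recording server.
# Adjust it to match your own pywb working directory.
archive_dir = Path('collections/example-collection/archive')

for warc_path in sorted(archive_dir.glob('*.warc.gz')):
    with warc_path.open('rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                # Print the URL of every archived response.
                print(record.rec_headers.get_header('WARC-Target-URI'))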
Manual WARC creation
A different approach is to create WARC files manually in the crawler, which gives you full control over them. This is a more complex and lower-level approach, as you have to ensure that all the relevant data is collected and correctly stored, and that the archiving functions are called at the right time. This is by no means a trivial task, and the example archiving functions below are only minimal examples that will be insufficient for many real-world use cases. You will need to extend and improve them to properly fit your specific needs.
Simple crawlers
With non-browser crawlers such as ParselCrawler, you will not be able to create a high-fidelity archive of the page, as you will be missing all the dynamically loaded JavaScript content. However, you can still create a WARC file with the HTML content of the page, which can be sufficient for some use cases. Let's take a look at the example below:
import asyncio
import io
from pathlib import Path

from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


def archive_response(context: ParselCrawlingContext, writer: WARCWriter) -> None:
    """Helper function for archiving the response in WARC format."""
    # Create a WARC record for the response.
    response_body = context.http_response.read()
    response_payload_stream = io.BytesIO(response_body)
    response_headers = StatusAndHeaders(
        str(context.http_response.status_code),
        context.http_response.headers,
        protocol='HTTP/1.1',
    )
    response_record = writer.create_warc_record(
        context.request.url,
        'response',
        payload=response_payload_stream,
        length=len(response_body),
        http_headers=response_headers,
    )
    writer.write_record(response_record)


async def main() -> None:
    crawler = ParselCrawler(
        max_requests_per_crawl=10,
    )

    # Create a WARC archive file and prepare the writer.
    archive = Path('example.warc.gz')
    with archive.open('wb') as output:
        writer = WARCWriter(output, gzip=True)
        # Create a WARC info record to store metadata about the archive.
        warcinfo_payload = {
            'software': 'Crawlee',
            'format': 'WARC/1.1',
            'description': 'Example archive created with ParselCrawler',
        }
        writer.write_record(writer.create_warcinfo_record(archive.name, warcinfo_payload))

        # Define the default request handler, which will be called for every request.
        @crawler.router.default_handler
        async def request_handler(context: ParselCrawlingContext) -> None:
            context.log.info(f'Archiving {context.request.url} ...')
            archive_response(context=context, writer=writer)
            await context.enqueue_links(strategy='same-domain')

        await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
The example above calls the archiving function for each request from the request_handler.
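The archive produced this way contains only response records. If you also want to store the outgoing requests, you could add a second helper that writes a request record next to each response. The sketch below is only an illustration; it assumes that context.request exposes method, headers, and payload as in the example above and that a simplified request line is acceptable, while real-world archives may need more careful header handling:

import io

from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

from crawlee.crawlers import ParselCrawlingContext


def archive_request(context: ParselCrawlingContext, writer: WARCWriter) -> None:
    """Store the outgoing HTTP request as a WARC request record (simplified sketch)."""
    request_body = context.request.payload or b''
    if isinstance(request_body, str):
        request_body = request_body.encode()
    # Build a minimal HTTP request line; a faithful archive may need
    # the exact headers that were sent on the wire.
    request_headers = StatusAndHeaders(
        f'{context.request.method} {context.request.url} HTTP/1.1',
        context.request.headers,
        is_http_request=True,
    )
    request_record = writer.create_warc_record(
        context.request.url,
        'request',
        payload=io.BytesIO(request_body),
        length=len(request_body),
        http_headers=request_headers,
    )
    writer.write_record(request_record)

You would call this helper from the same request_handler as archive_response, right after the response record is written.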
Browser-based crawlers
With browser crawlers such as PlaywrightCrawler, you should be able to create a high-fidelity archive of a web page. Let's take a look at the example below:
import asyncio
import io
import logging
from functools import partial
from pathlib import Path

from playwright.async_api import Request
from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

from crawlee.crawlers import (
    PlaywrightCrawler,
    PlaywrightCrawlingContext,
    PlaywrightPreNavCrawlingContext,
)


async def archive_response(
    request: Request, writer: WARCWriter, logger: logging.Logger
) -> None:
    """Helper function for archiving the response in WARC format."""
    response = await request.response()
    if not response:
        logger.warning(f'Could not get response for {request.url}')
        return
    try:
        response_body = await response.body()
    except Exception as e:
        logger.warning(f'Could not get response body for {response.url}: {e}')
        return

    logger.info(f'Archiving resource {response.url}')
    response_payload_stream = io.BytesIO(response_body)
    response_headers = StatusAndHeaders(
        str(response.status), response.headers, protocol='HTTP/1.1'
    )
    response_record = writer.create_warc_record(
        response.url,
        'response',
        payload=response_payload_stream,
        length=len(response_body),
        http_headers=response_headers,
    )
    writer.write_record(response_record)


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=1,
        headless=False,
    )

    # Create a WARC archive file and prepare the writer.
    archive = Path('example.warc.gz')
    with archive.open('wb') as output:
        writer = WARCWriter(output, gzip=True)
        # Create a WARC info record to store metadata about the archive.
        warcinfo_payload = {
            'software': 'Crawlee',
            'format': 'WARC/1.1',
            'description': 'Example archive created with PlaywrightCrawler',
        }
        writer.write_record(writer.create_warcinfo_record(archive.name, warcinfo_payload))

        @crawler.pre_navigation_hook
        async def archiving_hook(context: PlaywrightPreNavCrawlingContext) -> None:
            # Ensure that responses for all additional resources are archived as well.
            context.page.on(
                'requestfinished',
                partial(archive_response, logger=context.log, writer=writer),
            )

        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            # For some sites where the content loads dynamically, you may need to
            # scroll the page to load all of it. This slows down the crawling,
            # but ensures that all content is loaded.
            await context.infinite_scroll()
            await context.enqueue_links(strategy='same-domain')

        await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
The example above registers an archiving callback for every finished request in the pre-navigation archiving_hook. This ensures that the additional resources requested by the browser are also archived.
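If you want more control over what ends up in the archive, you could wrap the helper with a filter before registering it in the hook. The snippet below is only a sketch that builds on the previous example (it reuses its archive_response helper, writer, and imports); the skipped resource types are just an illustration:

async def archive_filtered_response(
    request: Request, writer: WARCWriter, logger: logging.Logger
) -> None:
    """Archive only resource types that are useful for this archive (example filter)."""
    # Playwright exposes the resource type of each request (document, script, image, ...).
    if request.resource_type in {'websocket', 'eventsource'}:
        return
    await archive_response(request, writer=writer, logger=logger)

You would then register archive_filtered_response instead of archive_response in the 'requestfinished' handler.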
Using the archived data
In this section, we will describe an example use case: using the recorded WARC files to speed up the development of your web crawler. The idea is to use the archived data as a source of responses so that you can test the crawler against real data without having to crawl the web again.
It is assumed that you already have the WARC files. If not, please read the previous sections on how to create them first.
Let's use pywb again. This time we will not use it as a recording server, but as a proxy server that serves the previously archived pages to your crawler during development.
wb-manager init example-collection
wb-manager add example-collection /your_path_to_warc_file/example.warc.gz
wayback --proxy example-collection
The previous commands start the wayback server so that crawler requests are served from the archived pages in example-collection instead of being sent to the real website. This is again the proxy mode of the wayback server, but this time without the recording capability. Now you need to configure your crawler to use this proxy server, as described above. Once everything is set up, you can just run your crawler, and it will crawl the offline, archived version of the website from your WARC file.
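For HTTP-based crawlers, the proxy is configured in the same way as in the Playwright example above. The following is a minimal sketch, assuming the replay server from the commands above is listening on localhost:8080; depending on your HTTP client settings, you may also need to trust or ignore the pywb certificate authority for HTTPS pages:

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    crawler = ParselCrawler(
        # Route all requests through the local wayback replay server.
        proxy_configuration=ProxyConfiguration(proxy_urls=['http://localhost:8080/']),
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Replaying {context.request.url} ...')
        await context.enqueue_links(strategy='same-domain')

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())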
You can also manually browse the archived pages in the wayback server by going to the locally hosted server and entering the collection and URL of the archived page, for example: http://localhost:8080/example-collection/https:/crawlee.dev/. The wayback server will serve the page from the WARC file if it exists, or return a 404 error if it does not. For more details about the server, please refer to the pywb documentation.
If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community.