htmlToText
Callable
Parameters
htmlOrCheerioElement: string | CheerioAPI
HTML text or parsed HTML represented using a cheerio function.
Returns string
Plain text
The function converts an HTML document to plain text.
The plain text generated by the function is similar to the text captured by pressing Ctrl+A and Ctrl+C on a page loaded in a web browser. The function doesn't aspire to preserve the formatting or to be perfectly correct with respect to HTML specifications. However, it attempts to generate newlines and whitespace in and around HTML elements to avoid merging distinct parts of the text, which makes it possible to extract data (e.g. phone numbers) from the result.
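To see why whitespace around elements matters for data extraction, here is a minimal sketch. It is not how htmlToText is implemented; the helpers stripTagsNaive and stripTagsSpaced, the sample markup, and the NNN-NNNN phone format are hypothetical and exist only for this illustration.

```javascript
// Illustration only: NOT the implementation of htmlToText.

// Naive approach: delete every tag. Text from neighbouring elements gets glued together.
function stripTagsNaive(html) {
    return html.replace(/<[^>]*>/g, '');
}

// Whitespace-aware approach: replace each tag with a space,
// then collapse runs of whitespace and trim the ends.
function stripTagsSpaced(html) {
    return html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Two adjacent table cells: an order number and a phone number.
const html = '<tr><td>Order 42</td><td>555-0199</td></tr>';

const naive = stripTagsNaive(html);
const spaced = stripTagsSpaced(html);

// Hypothetical pattern for a standalone phone-like token (NNN-NNNN).
const phonePattern = /\b\d{3}-\d{4}\b/;

console.log(JSON.stringify(naive), phonePattern.test(naive));
console.log(JSON.stringify(spaced), phonePattern.test(spaced));
```

In the naive output the order number and the phone number merge into a single run of digits, so the pattern no longer finds a standalone phone number. Once the cells are separated by whitespace, the same pattern matches.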
Example usage
Note that the function uses cheerio to parse the HTML. Optionally, to avoid duplicate parsing of HTML and thus improve performance, you can pass an existing Cheerio object to the function instead of the HTML text. The HTML should be parsed with the decodeEntities option set to true. For example: