Beautifulsoup4

Latest version: v4.12.3


4.0.0b5

* Rationalized Beautiful Soup's treatment of CSS class. A tag
belonging to multiple CSS classes is treated as having a list of
values for the 'class' attribute. Searching for a CSS class will
match *any* of the CSS classes.

This actually affects all attributes that the HTML standard defines
as taking multiple values (class, rel, rev, archive, accept-charset,
and headers), but 'class' is by far the most common. [bug=41034]
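A short sketch of the new behavior, using the bundled html.parser tree-builder (the parser choice here is incidental):

```python
from bs4 import BeautifulSoup

# A tag belonging to two CSS classes: 'class' now parses as a list of values.
soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
tag = soup.p
print(tag["class"])  # ['body', 'strikeout']

# Searching for either class matches the tag.
print(soup.find_all("p", "strikeout") == [tag])  # True
print(soup.find_all("p", "body") == [tag])       # True
```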

* If you pass anything other than a dictionary as the second argument
to one of the find* methods, it'll assume you want to use that
object to search against a tag's CSS classes. Previously this only
worked if you passed in a string.
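For instance, a regular expression or a list as the second argument is now matched against a tag's CSS classes (a sketch against a current bs4; the markup is made up for illustration):

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="body strikeout"></p><p class="header"></p>', "html.parser"
)

# A regular expression (not a dict) as the second argument is matched
# against each tag's CSS classes.
matches = soup.find_all("p", re.compile("^strike"))
print(matches)  # [<p class="body strikeout"></p>]

# A list of class names works the same way: any class in the list matches.
print(soup.find_all("p", ["header", "footer"]))  # [<p class="header"></p>]
```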

* Fixed a bug that caused a crash when you passed a dictionary as an
attribute value (possibly because you mistyped "attrs"). [bug=842419]

* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags
like <meta charset="utf-8" />. [bug=837268]
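A sketch of the detection, assuming the current bs4 API where Unicode, Dammit is exposed as `UnicodeDammit` (the markup bytes are made up for illustration):

```python
from bs4 import UnicodeDammit

# UTF-8 bytes whose only encoding clue is an HTML5-style <meta charset> tag.
markup = (b'<html><head><meta charset="utf-8"/></head>'
          b'<body>Sacr\xc3\xa9 bleu!</body></html>')

# is_html=True tells Unicode, Dammit to look for encoding declarations
# inside the markup itself.
dammit = UnicodeDammit(markup, is_html=True)
print(dammit.original_encoding)  # 'utf-8'
```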

* If Unicode, Dammit can't figure out a consistent encoding for a
page, it will try each of its guesses again, with errors="replace"
instead of errors="strict". This may mean that some data gets
replaced with REPLACEMENT CHARACTER, but at least most of it will
get turned into Unicode. [bug=754903]
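The codec mechanics behind that fallback can be seen with plain `bytes.decode`: `errors="strict"` raises on bytes that don't fit the guessed encoding, while `errors="replace"` substitutes U+FFFD REPLACEMENT CHARACTER and keeps the rest of the data:

```python
# Latin-1 bytes, wrongly guessed as UTF-8.
data = b"caf\xe9 au lait"

try:
    data.decode("utf-8", errors="strict")
except UnicodeDecodeError as e:
    print("strict:", e.reason)

print("replace:", data.decode("utf-8", errors="replace"))
# replace: caf� au lait
```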

* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
on certain kinds of markup. [bug=838800]

* Fixed a bug that wrecked the tree if you replaced an element with an
empty string. [bug=728697]
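With the fix, replacing an element with an empty string simply removes it cleanly (a minimal sketch using html.parser):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one <b>two</b> three</p>", "html.parser")
# Previously this corrupted the tree; now the <b> tag is removed cleanly.
soup.b.replace_with("")
print(soup)  # <p>one  three</p>
```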

* Improved Unicode, Dammit's behavior when you give it Unicode to
begin with.

4.0.0b4

* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag()

* BeautifulSoup.new_tag() will follow the rules of whatever
tree-builder was used to create the original BeautifulSoup object. A
new <p> tag will look like "<p />" if the soup object was created to
parse XML, but it will look like "<p></p>" if the soup object was
created to parse HTML.
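A sketch of the pair in use; since the "<p />" rendering requires an XML tree-builder (lxml), this example sticks to the HTML side:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body></body></html>", "html.parser")

# Build a new subtree from scratch and graft it into the parse tree.
tag = soup.new_tag("p")
tag.append(soup.new_string("Hello"))
soup.body.append(tag)

print(soup.body)  # <body><p>Hello</p></body>
```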

* We pass in strict=False to html.parser on Python 3, greatly
improving html.parser's ability to handle bad HTML.

* We also monkeypatch a serious bug in html.parser that made
strict=False disastrous on Python 3.2.2.

* Replaced the "substitute_html_entities" argument with the
more general "formatter" argument.

* Bare ampersands and angle brackets are always converted to XML
entities unless the user prevents it.
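Both points can be seen by serializing the same tag with different formatters (a sketch against the current bs4 API, where `formatter` is also accepted by `decode()`):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Caf\u00e9 & cr\u00e8me</p>", "html.parser")

# "minimal" (the default) escapes only bare ampersands and angle brackets.
print(soup.p.decode(formatter="minimal"))  # <p>Café &amp; crème</p>

# "html" additionally converts characters to named HTML entities.
print(soup.p.decode(formatter="html"))     # <p>Caf&eacute; &amp; cr&egrave;me</p>
```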

* Added PageElement.insert_before() and PageElement.insert_after(),
which let you put an element into the parse tree with respect to
some other element.
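A minimal sketch of the two methods:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>stop</b>", "html.parser")

tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.insert_before(tag)    # place the <i> tag just before <b>
soup.b.insert_after(" now")  # plain strings work too

print(soup)  # <i>Don't</i><b>stop</b> now
```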

* Raise an exception when the user tries to do something nonsensical
like insert a tag into itself.
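In the current bs4 the exception raised for this case is a `ValueError` (a sketch; the exact message may vary between versions):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>x</b></p>", "html.parser")
try:
    soup.p.append(soup.p)  # nonsensical: a tag can't contain itself
except ValueError as exc:
    print("refused:", exc)
```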

4.0.0b3

Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
Soup's custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.

Beautiful Soup 4.0 comes with glue code for four parsers:

* Python's standard HTMLParser (html.parser in Python 3)
* lxml's HTML and XML parsers
* html5lib's HTML parser

HTMLParser is the default, but I recommend you install lxml if you
can.

For complete documentation, see the Sphinx documentation in
bs4/doc/source/. What follows is a summary of the changes from
Beautiful Soup 3.

=== The module name has changed ===

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

>>> from bs4 import BeautifulSoup

=== It works with Python 3 ===

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.

Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.

=== CDATA sections are normal text, if they're understood at all. ===

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

<p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

<svg><![CDATA[foo]]></svg> => <svg>foo</svg>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

<p><![CDATA[foo]]></p> => <p>foo</p>

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.

=== Miscellaneous other stuff ===

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

<?xml version="1.0" encoding="utf-8"?>
<markup>
...
</markup>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.
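A sketch of flipping the flag by hand, assuming serialization still consults `.is_xml` in the bs4 you are running (the sample markup is made up for illustration):

```python
from bs4 import BeautifulSoup

# The html.parser tree-builder sets .is_xml to False...
soup = BeautifulSoup("<markup><p>foo</p></markup>", "html.parser")
print(str(soup).startswith("<?xml"))  # False

# ...but setting it manually makes serialization emit an XML declaration.
soup.is_xml = True
print(str(soup).startswith("<?xml"))  # True
```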

3.2

to make it obvious which one you should use.

3.2.0

3.1.0

A hybrid version that supports Python 2.4 and can be automatically converted
to run under Python 3.0. There are three backwards-incompatible
changes you should be aware of, but no new features or deliberate
behavior changes.

1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.

The effect of this is that you can't pass an encoding to .__str__
anymore. Use encode() to get a string and decode() to get Unicode, and
you'll be ready (well, readier) for Python 3.
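The same advice holds with today's bs4 API: `encode()` always gives bytes and `decode()` always gives text, regardless of the Python version (a minimal sketch):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf\u00e9</p>", "html.parser")

# encode() always returns bytes, decode() always returns text --
# the unambiguous replacements for the version-dependent str().
print(type(soup.encode("utf-8")))  # <class 'bytes'>
print(type(soup.decode()))         # <class 'str'>
print(soup.encode("utf-8"))        # b'<p>caf\xc3\xa9</p>'
```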

2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

<a href="foo</a>, </a><a href="bar">baz</a>
<a b="<a>">

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse this string and leave the &eacute; intact.

<a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.
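That Python 3 behavior is what a current bs4 on top of html.parser shows (a sketch; the URL fragment is made up for illustration):

```python
from bs4 import BeautifulSoup

# html.parser resolves entities inside attribute values during parsing,
# so the &eacute; comes out as the character itself.
soup = BeautifulSoup('<a href="sacr&eacute;bleu">x</a>', "html.parser")
print(soup.a["href"])  # sacrébleu
```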
