Beautifulsoup4

Latest version: v4.12.3

2.0.2

Added the unit tests in a separate module, and packaged it with
distutils.

Fixed a bug that sometimes caused renderContents() to return a Unicode
string even if there was no Unicode in the original string.

Added the done() method, which closes all of the parser's open
tags. It gets called automatically when you pass in some text to the
constructor of a parser class; otherwise you must call it yourself.
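
The idea behind done() can be pictured as emptying a stack of open tags. The following is a rough sketch, not the parser's actual code; the function name close_open_tags and the list-of-names representation are assumptions made for illustration:

```python
def close_open_tags(open_tags):
    """Emit closing tags for everything still open, innermost first.

    Hypothetical helper: `open_tags` is a list of tag names, with the
    outermost tag first and the most recently opened tag last.
    """
    closing = []
    while open_tags:
        # Pop the innermost tag so that nesting is unwound in order.
        closing.append("</%s>" % open_tags.pop())
    return "".join(closing)

print(close_open_tags(["html", "body", "p"]))  # </p></body></html>
```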

Reinstated some backwards compatibility with 1.x versions: referencing
the string member of a NavigableText object returns the NavigableText
object instead of throwing an error.

2.0.1

Fixed a bug that caused bad results when you tried to reference a tag
name shorter than 3 characters as a member of a Tag, e.g. tag.table.td.

Made sure all Tags have the 'hidden' attribute so that an attempt to
access tag.hidden doesn't spawn an attempt to find a tag named
'hidden'.

Fixed a bug in the comparison operator.

2.0

This is the release to get if you want Python 1.5 compatibility.

The desired value of an attribute can now be any of the following:

* A string
* A string with SQL-style wildcards
* A compiled RE object
* A callable that returns None/false/empty string if the given value
  doesn't match, and any other value otherwise.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I no longer
recommend you use SQL-style wildcards. They may go away in a future
release to clean up the code.
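
Moving from a SQL-style wildcard to a regular expression is mostly mechanical. Here is a minimal sketch, assuming that a leading or trailing % means "anything may appear here"; the helper name wildcard_to_regex is hypothetical and is not part of Beautiful Soup:

```python
import re

def wildcard_to_regex(pattern):
    """Turn a SQL-style pattern such as '%foo' into a compiled regex.

    Hypothetical helper. Assumes '%' is only meaningful as the first
    or last character of the pattern.
    """
    starts_open = pattern.startswith("%")
    ends_open = pattern.endswith("%") and len(pattern) > 1
    core = pattern[1:] if starts_open else pattern
    core = core[:-1] if ends_open else core
    prefix = "" if starts_open else "^"   # anchor unless the start is open
    suffix = "" if ends_open else "$"     # anchor unless the end is open
    return re.compile(prefix + re.escape(core) + suffix)

ends_with_foo = wildcard_to_regex("%foo")
print(bool(ends_with_foo.search("barfoo")))  # True
print(bool(ends_with_foo.search("fo")))      # False
```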

Made Beautiful Soup handle processing instructions as text instead of
ignoring them.

Applied patch from Richie Hindle (richie at entrian dot com) that
makes tag.string a shorthand for tag.contents[0].string when the tag
has only one string-owning child.

Added still more nestable tags. The nestable tags thing won't work in
a lot of cases and needs to be rethought.

Fixed an edge case where searching for "%foo" would match any string
shorter than "foo".

2.0.0

Beautiful Soup version 1 was very useful but also pretty stupid. I
originally wrote it without noticing any of the problems inherent in
trying to build a parse tree out of ambiguous HTML tags. This version
solves all of those problems to my satisfaction. It also adds many new
clever things to make up for the removal of the stupid things.

== Parsing ==

The parser logic has been greatly improved, and the BeautifulSoup
class should much more reliably yield a parse tree that looks like
what the page author intended. For a particular class of odd edge
cases that now causes problems, there is a new class,
ICantBelieveItsBeautifulSoup.

By default, Beautiful Soup now performs some cleanup operations on
text before parsing it. This is to avoid common problems with bad
definitions and self-closing tags that crash SGMLParser. You can
provide your own set of cleanup operations, or turn it off
altogether. The cleanup operations include fixing self-closing tags
that don't close, and replacing Microsoft smart quotes and similar
characters with their HTML entity equivalents.

You can now get a pretty-print version of parsed HTML to get a visual
picture of how Beautiful Soup parses it, with the Tag.prettify()
method.

== Strings and Unicode ==

There are separate NavigableText subclasses for ASCII and Unicode
strings. These classes directly subclass the corresponding base data
types. This means you can treat NavigableText objects as strings
instead of having to call methods on them to get the strings.
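
The practical effect of subclassing the base string type can be sketched as follows. This illustration uses modern Python, where str is already Unicode, and the class TextNode is a stand-in invented for the example rather than Beautiful Soup's own NavigableText:

```python
class TextNode(str):
    """Stand-in for a text node that *is* a string (hypothetical class)."""

    def __new__(cls, value, parent=None):
        node = super().__new__(cls, value)
        node.parent = parent  # tree bookkeeping rides along on the string
        return node

node = TextNode("Hello, soup", parent="p")
print(node.upper())           # ordinary string methods work directly
print(node == "Hello, soup")  # compares like any other string
print(node.parent)            # the extra attribute is still available
```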

str() on a Tag always returns a string, and unicode() always returns
Unicode. Previously it was inconsistent.

== Tree traversal ==

In a first() or fetch() call, the tag name or the desired value of an
attribute can now be any of the following:

* A string (matches that specific tag or that specific attribute value)
* A list of strings (matches any tag or attribute value in the list)
* A compiled regular expression object (matches any tag or attribute
  value that matches the regular expression)
* A callable object that takes the Tag object or attribute value as a
  string. It returns None/false/empty string if the given string
  doesn't match, and any other value if it does.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I took out
SQL-style wildcards. I'll put them back if someone complains, but
their removal simplifies the code a lot.
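
The four kinds of criteria listed above could be dispatched roughly as follows. This is only a sketch of the idea, not the library's matching code; the function name matches is an assumption, and it compares plain strings rather than real Tag objects:

```python
import re

def matches(criterion, value):
    """Return True if `value` satisfies `criterion` (hypothetical helper)."""
    if callable(criterion):
        # Any truthy return value counts as a match.
        return bool(criterion(value))
    if hasattr(criterion, "search"):
        # Compiled regular expression object.
        return criterion.search(value) is not None
    if isinstance(criterion, (list, tuple)):
        # Match any string in the list.
        return value in criterion
    # Plain string: exact match only.
    return criterion == value

print(matches("td", "td"))                         # True
print(matches(["td", "th"], "th"))                 # True
print(matches(re.compile("^t[dh]$"), "tr"))        # False
print(matches(lambda name: len(name) == 2, "td"))  # True
```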

You can use fetch() and first() to search for text in the parse tree,
not just tags. There are new alias methods fetchText() and firstText()
designed for this purpose. As with searching for tags, you can pass in
a string, a regular expression object, or a method to match your text.

If you pass in something besides a map to the attrs argument of
fetch() or first(), Beautiful Soup will assume you want to match that
thing against the "class" attribute. When you're scraping
well-structured HTML, this makes your code a lot cleaner.
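
One way to picture this shortcut is as a normalization step applied before matching. A minimal sketch, assuming the attrs argument is otherwise a dictionary; normalize_attrs is a hypothetical name, not a Beautiful Soup function:

```python
def normalize_attrs(attrs):
    """Treat any non-mapping `attrs` as a match on the 'class' attribute."""
    if attrs is None:
        return {}
    if isinstance(attrs, dict):
        return attrs
    # A bare string, regex, or callable is taken to describe "class".
    return {"class": attrs}

print(normalize_attrs("headline"))        # {'class': 'headline'}
print(normalize_attrs({"id": "masthead"}))  # {'id': 'masthead'}
```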

1.x and 2.x both let you call a Tag object as a shorthand for
fetch(). For instance, foo("bar") is a shorthand for
foo.fetch("bar"). In 2.x, you can also access a specially-named member
of a Tag object as a shorthand for first(). For instance, foo.barTag
is a shorthand for foo.first("bar"). By chaining these shortcuts you
traverse a tree in very little code:

for header in soup.bodyTag.pTag.tableTag('th'):
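
Both shortcuts map naturally onto Python's __call__ and __getattr__ hooks. The toy class below is a sketch of that mechanism under simplifying assumptions (tags only, matching by name only, None for a missing tag); it is not the real Tag class:

```python
class Tag:
    """Toy element with the two shortcuts described above."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def fetch(self, name):
        """Return every descendant tag called `name`, in document order."""
        found = []
        for child in self.children:
            if child.name == name:
                found.append(child)
            found.extend(child.fetch(name))
        return found

    def first(self, name):
        """Return the first matching descendant, or None if there is none."""
        results = self.fetch(name)
        return results[0] if results else None

    def __call__(self, name):
        # foo("bar") is shorthand for foo.fetch("bar")
        return self.fetch(name)

    def __getattr__(self, attr):
        # Only reached when normal attribute lookup fails.
        # foo.barTag is shorthand for foo.first("bar")
        if attr.endswith("Tag") and len(attr) > 3:
            return self.first(attr[:-3])
        raise AttributeError(attr)

table = Tag("table", [Tag("tr", [Tag("th"), Tag("th")])])
soup = Tag("html", [Tag("body", [Tag("p", [table])])])

headers = soup.bodyTag.pTag.tableTag("th")
print(len(headers))  # 2
```

Because __getattr__ runs only after ordinary lookup fails, real attributes such as name and children are never mistaken for tag searches.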

If an element relationship (like parent or next) doesn't apply to a
tag, it'll now show up as Null instead of None. first() will also return
Null if you ask it for a nonexistent tag. Null is an object that's
just like None, except you can do whatever you want to it and it'll
give you Null instead of throwing an error.

This lets you do tree traversals like soup.htmlTag.headTag.titleTag
without having to worry if the intermediate stages are actually
there. Previously, if there was no 'head' tag in the document, headTag
in that instance would have been None, and accessing its 'titleTag'
member would have thrown an AttributeError. Now, you can get what you
want when it exists, and get Null when it doesn't, without having to
do a lot of conditionals checking to see if every stage is None.
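
The behaviour described here is the Null object pattern. A minimal sketch in modern Python follows; the class name NullType and the exact set of methods are assumptions for illustration, not the library's implementation:

```python
class NullType:
    """Sketch of a 'do anything to it, get Null back' object."""

    def __getattr__(self, name):
        return self   # Null.anything is Null

    def __call__(self, *args, **kwargs):
        return self   # Null(...) is Null

    def __getitem__(self, key):
        return self   # Null[...] is Null

    def __bool__(self):
        return False  # falsy, like None

    def __repr__(self):
        return "Null"

Null = NullType()

# Chained access never raises, even though nothing is there.
title = Null.headTag.titleTag
print(title)  # Null
if not title:
    print("no title found")
```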

There are two new relations between page elements: previousSibling and
nextSibling. They reference the previous and next element at the same
level of the parse tree. For instance, if you have HTML like this:

<p><ul><li>Foo<br /><li>Bar</ul>

The first 'li' tag has a previousSibling of Null and its nextSibling
is the second 'li' tag. The second 'li' tag has a nextSibling of Null
and its previousSibling is the first 'li' tag. The previousSibling of
the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the
'br' tag.
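
Sibling relations follow directly from an element's position in its parent's list of children. A small sketch under that assumption; the siblings helper is hypothetical, it uses plain strings in place of elements, and it returns None at the edges where the text above describes Null:

```python
def siblings(children, index):
    """Return (previous, next) neighbours of children[index].

    Either neighbour is None when the element sits at an edge.
    """
    previous = children[index - 1] if index > 0 else None
    following = children[index + 1] if index + 1 < len(children) else None
    return previous, following

ul_children = ["li-1", "li-2"]
print(siblings(ul_children, 0))  # (None, 'li-2')
print(siblings(ul_children, 1))  # ('li-1', None)
```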

I took out the ability to use fetch() to find tags that have a
specific list of contents. See, I can't even explain it well. It was
really difficult to use, I never used it, and I don't think anyone
else ever used it. To the extent anyone did, they can probably use
fetchText() instead. If it turns out someone needs it I'll think of
another solution.

== Tree manipulation ==

You can add new attributes to a tag, and delete attributes from a
tag. In 1.x you could only change a tag's existing attributes.

== Porting Considerations ==

There are three changes in 2.0 that break old code:

In the post-1.2 release you could pass a function into fetch(). The
function took a string, the tag name. In 2.0, the function takes the
actual Tag object.

It's no longer possible to pass in SQL-style wildcards to fetch(). Use a
regular expression instead.

The different parsing algorithm means the parse tree may not be shaped
like you expect. This will only actually affect you if your code uses
one of the affected parts. I haven't run into this problem yet while
porting my code.

1.2

Applied patch from Ben Last (ben at benlast dot com) that made
Tag.renderContents() correctly handle Unicode.

Made BeautifulStoneSoup even dumber by making it not implicitly close
a tag when another tag of the same type is encountered; only when an
actual closing tag is encountered. This change courtesy of Fuzzy (mike
at pcblokes dot com). BeautifulSoup still works as before.

1.1

Added more 'nestable' tags. Changed popping semantics so that when a
nestable tag is encountered, tags are popped up to the previously
encountered nestable tag (of whatever kind). I will revert this if
enough people complain, but it should make more people's lives easier
than harder. This enhancement was suggested by Anthony Baxter (anthony
at interlink dot com dot au).
