Finding security vulnerabilities in third party packages
Around a year and a half ago, we started building a security database
for third-party Python packages. This allows us to give users fine-grained
control over what kind of dependency updates they want to receive. Additionally,
we flag security-related updates to give them more visibility and urgency.
The database has been fully integrated into the online service for over a year,
and the standalone Safety command line utility is available as well.
Let's take a look at how we started building the database and how
we keep it up to date.
But what about CVE?
Before we begin, we need to take a brief look at the CVE project and why
it's largely unusable for third party Python packages.
CVE stands for Common Vulnerabilities and Exposures. It is a program
launched around 20 years ago to catalog security vulnerabilities in software.
When it comes to end products and/or critical infrastructure libraries,
this works great. For small to mid sized language specific libraries, not so much.
There are three problems:
1) In the majority of cases, developers simply don't bother to go through
the process of getting a CVE id assigned. Most libraries (even
larger ones) have 1-3 developers actively working on them. If they find a
security issue, they fix it, write a changelog entry (sometimes) and move on to
the next issue.
2) Security researchers oftentimes focus on widely used projects. The
underlying issue might be a third party package, but the larger software project
gets the CVE id assigned and thus hides this critical piece of information.
3) And finally, another common problem is name mapping. A lot of entries in the CVE catalog
come from Linux distributions that use slightly different names for their
packages than the original package manager. This is a minor issue, but
one that sometimes leads to confusion nevertheless.
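To make the name mapping problem concrete, here is a minimal sketch of how a distribution package name could be mapped back to a PyPI project name. The prefix list and the guess_pypi_name helper are assumptions made up for this illustration, not part of our tooling; only the normalization rule (lowercase, runs of "-", "_" and "." collapsed to "-") follows PEP 503.

```python
import re

# Hypothetical prefixes for illustration; real distributions use more varied schemes.
DISTRO_PREFIXES = ("python3-", "python-")


def normalize_pypi_name(name):
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def guess_pypi_name(distro_package):
    """Guess the PyPI project behind a distribution package name.

    Heuristic only: normalize the name, then strip one known prefix.
    """
    candidate = normalize_pypi_name(distro_package)
    for prefix in DISTRO_PREFIXES:
        if candidate.startswith(prefix) and len(candidate) > len(prefix):
            return candidate[len(prefix):]
    return candidate


for package in ("python-django", "python3-Flask_Login", "requests"):
    print(package, "->", guess_pypi_name(package))
```

A heuristic like this gets some names wrong. A project whose PyPI name really starts with "python-" (python-dateutil, for example) would lose its prefix, which is exactly the kind of confusion described above.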
Building the database and keeping it up-to-date
When we started building the database, the obvious first source of information was the CVE catalog.
As pointed out above, this didn't work out so well. Of the more than 700 security
vulnerabilities currently tracked, only around 200 have a CVE id assigned.
A lot of those weren't discovered through the CVE catalog directly, though. The vulnerability was
discovered through a different source and the CVE mapping was added later on.
The best source to learn about security vulnerabilities is where developers work on their code.
As part of the dependency update service we are constantly monitoring changelogs which are attached to
pull requests. In hindsight, that would have been the perfect place to get started. Proper changelog
coverage was at around 20% back then, but it turned out to be an invaluable source to build up the database.
Today, this is still the main source to learn about new security vulnerabilities. Overall changelog coverage
is now up to 50%, and above 80% for the top 300 packages. Additionally, we are monitoring commit logs and
the issue tracker on GitHub. This ensures that we don't miss anything security-related going on around the project.
To make sure that the database is always up to date, the system that flags an entry as security-related is
hooked directly into the update service running on pyup.io. If a new release is uploaded to PyPI, it usually
takes around 5 minutes until a worker process is started that goes through all new changelogs and commit logs and
starts filtering them for security-related keywords.
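As a rough illustration of this kind of keyword filtering, here is a minimal sketch. The keyword list and the function names are assumptions chosen for the example, not the ones the update service actually uses.

```python
import re

# Hypothetical keyword list for illustration only.
SECURITY_KEYWORDS = (
    "security", "vulnerability", "vulnerable", "cve",
    "xss", "csrf", "sql injection", "exploit",
)

# One case-insensitive pattern that matches any keyword as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in SECURITY_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def find_security_keywords(text):
    """Return the sorted, lowercased keywords that occur in text."""
    return sorted({match.group(1).lower() for match in _PATTERN.finditer(text)})


def flag_entries(entries):
    """Return (version, keywords) for each changelog entry that mentions a keyword."""
    flagged = []
    for version, text in entries.items():
        hits = find_security_keywords(text)
        if hits:
            flagged.append((version, hits))
    return flagged


changelog = {
    "1.2.1": "Fixed an XSS vulnerability in the admin view.",
    "1.2.0": "Added a new caching backend.",
    "1.1.9": "Improved the password security docs (no code change).",
}
for version, hits in flag_entries(changelog):
    print(version, hits)
```

In this example the 1.1.9 entry is flagged even though nothing was fixed. Plain keyword matching produces false positives like this one, so a flagged entry is only a candidate until someone has looked at it.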
If something security-related is found, the entry is flagged and a new review task is created (see screenshot below).
Additionally, a mail with a short preview is sent out to make sure that super critical vulnerabilities can be handled
immediately. This turned out to be the best solution because quickly skimming an email is not too distracting
and still lets us respond fast.
Unfortunately, there’s still a lot of manual work involved. There are lots and lots of false positives that need to
be reviewed by hand. At the time of this writing, the database has around 700 entries while the internal review panel
counts more than 35,000 flagged entries. That’s a hit rate of around 2%.
On top of that, we are currently working on an internal build server that runs various security tools against
all new package releases on PyPI. This is still being tested internally but might be an option for the future. Our
plan is to open source the project and the dataset sometime later this year.
All in all, we are pretty happy with the result so far. Despite its relatively young age, it has turned out to be a
valuable resource for the whole Python community. Apart from the tools and services pyup.io has, the database is now
being used by pipenv and a couple of other security-related tools and startups as well.
It was a serious effort to build the database from the ground up and get it into the shape it is in today, and it's great
to see it used by such a huge audience.
In terms of money to run the project, it is still largely subsidized by private organizations paying for a pyup.io plan.
Our long term goal is to make it an independent project that carries itself. If you want to help with that, grab an API
key that comes with a paid pyup.io account or tell your CTO to get one for your company.
Try it out
The easiest way to check your project's dependencies for known security vulnerabilities is the
Safety command line tool, ideally as part of your CI pipeline.
If you are using pipenv, try out the
pipenv check command.
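For a CI pipeline, one option is a small Python wrapper around the Safety command line tool. The following is a minimal sketch, assuming the safety executable is installed and on the PATH and that your dependencies live in a requirements file. The helper names are made up for this example, and newer Safety releases may prefer a different subcommand, so check the help output of your installed version.

```python
import subprocess


def build_safety_command(requirements_path):
    """Build the argument list for running `safety check` on one requirements file."""
    return ["safety", "check", "-r", requirements_path]


def dependencies_are_safe(requirements_path):
    """Run Safety and return True if it exits with status 0.

    Assumes the `safety` executable is installed and on the PATH.
    """
    result = subprocess.run(build_safety_command(requirements_path))
    return result.returncode == 0
```

A CI step could call dependencies_are_safe("requirements.txt") and fail the build when it returns False.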