
Finding security vulnerabilities in third party packages

Around a year and a half ago, we started building up a security database for third party Python packages. This allows us to give users fine grained control over what kind of dependency updates they want to receive. Additionally, we flag security related updates to give them more visibility and urgency.

The database has been fully integrated into the online service for over a year, and the standalone Safety command line utility is available as well.

Let's take a look at how we got started building up the database and how we keep it up to date.

But what about CVE?

Before we begin, we need to take a brief look at the CVE project and why it's largely unusable for third party Python packages.

CVE stands for Common Vulnerabilities and Exposures. It is a program launched around 20 years ago to catalog security vulnerabilities in software. When it comes to end products and/or critical infrastructure libraries, this works great. For small to mid sized language specific libraries, not so much.

There are three problems:

1) In the majority of cases developers simply don't bother to go through the process of getting a CVE id assigned. Most of the libraries (even larger ones) have 1-3 developers actively working on them. If they find a security issue, they fix it, write a changelog entry (sometimes) and move on to the next issue.

2) Security researchers oftentimes focus on widely used projects. The underlying issue might be a third party package, but the larger software project gets the CVE id assigned and thus hides this critical piece of information.

3) And finally, another common problem is name mapping. A lot of entries in the CVE catalog come from Linux distributions that use slightly different names for their packages than the original package manager. This is a minor issue, but it sometimes leads to confusion nevertheless.

Building the database and keeping it up-to-date

When we started building the database, the obvious first source of information was the CVE catalog. As pointed out above, this didn't work out so well. Of the more than 700 security vulnerabilities currently tracked, only around 200 have a CVE id assigned. A lot of them weren't discovered through the CVE catalog directly, though. The vulnerability was discovered through a different source and the CVE mapping was added later on.

The best source to learn about security vulnerabilities is where developers work on their code. As part of the dependency update service we are constantly monitoring changelogs which are attached to pull requests. In hindsight, that would have been the perfect place to get started. Proper changelog coverage was at around 20% back then, but it turned out to be an invaluable source to build up the database.

Today, this is still the main source to learn about new security vulnerabilities. Overall changelog coverage is now up to 50%, and more than 80% for the top 300 packages. Additionally, we are monitoring commit logs and the issue tracker on GitHub. This ensures that we don't miss anything security related going on around the project.

In order to make sure that the database is always up to date, the system that flags an entry as security related is hooked directly into the update service running on pyup.io. When a new release is uploaded to PyPI, it usually takes around 5 minutes until a worker process is started that goes through all new change and commit logs and filters them for security related keywords.
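To make the filtering step more concrete, here is a minimal sketch of what a keyword filter over changelog entries could look like. It is not the code running on pyup.io: the keyword list, the `SECURITY_KEYWORDS` name and the `flag_security_entries` function are assumptions made up for illustration.

```python
import re

# Hypothetical keyword list, for illustration only. The keywords actually
# used by the service are not part of this post.
SECURITY_KEYWORDS = ("security", "vulnerability", "cve", "xss", "csrf", "injection", "exploit")

# One case-insensitive pattern that matches any keyword as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in SECURITY_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def flag_security_entries(entries):
    """Return (entry, matched_keywords) pairs for entries that need a manual review."""
    flagged = []
    for entry in entries:
        # Collect the distinct keywords found in this entry, lowercased and sorted.
        matches = sorted({m.group(1).lower() for m in _PATTERN.finditer(entry)})
        if matches:
            flagged.append((entry, matches))
    return flagged
```

A filter this simple catches an entry like "Fixed XSS in the admin template", but it also flags harmless ones such as "Added a security policy document", which is why every flagged entry still has to be reviewed by hand.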

If something security related is found, the entry is flagged and a new review task is created (see screenshot below). Additionally, a mail with a short preview is sent out to make sure that super critical vulnerabilities can be handled immediately. This turned out to be the best solution, because quickly going over an email is not too distracting and still lets us respond fast.

screenshot

Unfortunately, there’s still a lot of manual work involved. There are lots and lots of false positives that need to be reviewed by hand. At the time of this writing, the database has around 700 entries while the internal review panel counts more than 35000 flagged entries. That’s a hit rate of around 2%.

On top of that, we are currently working on an internal buildserver that runs various security tools against all new package releases on PyPI. This is still being tested internally but might be an option for the future. Our plan is to open source the project and the dataset sometime later this year.

Summary

All in all, we are pretty happy with the result so far. Despite its relatively young age, the database has turned out to be a valuable resource for the whole Python community. Apart from the tools and services pyup.io has, the database is now being used by pipenv and a couple of other security related tools and startups as their datasource.
It was a serious effort to build the database from the ground up and get it into the shape it is in today, and it's great to see it used by such a huge audience.

In terms of money to run the project, it is still largely subsidized by private organizations paying for a pyup.io plan. Our long term goal is to make it an independent project that carries itself. If you want to help with that, grab an API key that comes with a paid pyup.io account or tell your CTO to get one for your company.

Try it out

The easiest way to check your project's dependencies for known security vulnerabilities is the Safety command line tool, ideally as part of your CI pipeline.
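As a rough sketch of how this could be wired into a CI step, the snippet below wraps the Safety command line tool in a small Python helper. It assumes that Safety is installed and offers the `safety check -r <file>` invocation; the exact subcommand and options may differ between Safety versions. The helper names `build_safety_command` and `check_dependencies` are made up for this example.

```python
import shutil
import subprocess


def build_safety_command(requirements_path):
    """Return the argument list for a Safety run against one requirements file."""
    # Assumption: the installed Safety version supports `safety check -r <file>`.
    return ["safety", "check", "-r", requirements_path]


def check_dependencies(requirements_path="requirements.txt"):
    """Run Safety and return its exit code (non-zero means the check did not pass)."""
    if shutil.which("safety") is None:
        raise RuntimeError("the safety command line tool is not installed")
    completed = subprocess.run(build_safety_command(requirements_path))
    return completed.returncode
```

In a CI script, passing the return value of `check_dependencies()` to `sys.exit()` makes the build fail whenever Safety exits with a non-zero code.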

If you are using pipenv, try out the pipenv check command.