Changelogs » Baleen


Extended the Baleen export functionality to dump either an HTML or JSON corpus to disk in a format suitable for NLP analysis, particularly with NLTK. The new export functionality is still single-process, but it does some smart things to reduce both the time the export takes and the memory it requires. Additionally, we have improved the visual interface of the web application, making status messages more noticeable as we monitor continued data ingestion.
  The app can be found online at [](
  **Deployed**: Monday, April 18, 2016
  **Contributors**:  Benjamin Bengfort, Sasan Bahadaran
  - Updated the exporter to use no_dereferencing and no_cache
  - Updated the exporter to write out a JSON meta file of feeds
  - Exporter can now export in either JSON (default) or HTML
  - Exporter is now memory and query optimized as well as we can get it
  - No HTML sanitization occurs in the exporter any more
  - Added a bit of colorization to the web app status page for quick duration identification
  - Added iconography to the feeds and status page for better visualization
  - Better datetime formatting, with time zone display and improved readability
  - Inclusion of the humanize package for timesince and intcomma readability
  - Made the status page responsive
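
A minimal sketch of the kind of export described above, not Baleen's actual exporter: posts are consumed one at a time from any iterable (for example a generator over a database cursor), so memory use stays flat, and a JSON meta file of feeds is written alongside. The function name `export_corpus`, the post keys `id`, `category`, and `content`, the `feeds.json` file name, and the one-directory-per-category layout are all assumptions made for illustration.

```python
import json
import os


def export_corpus(posts, feeds, root, fmt="json"):
    """Write one file per post under root/<category>/ plus a feeds.json meta file.

    `posts` may be any iterable, including a generator, so only one post
    needs to be held in memory at a time. Returns a per-category post count.
    """
    if fmt not in ("json", "html"):
        raise ValueError(f"unknown export format: {fmt}")

    counts = {}
    for post in posts:
        folder = os.path.join(root, post["category"])
        os.makedirs(folder, exist_ok=True)
        path = os.path.join(folder, f"{post['id']}.{fmt}")
        with open(path, "w", encoding="utf-8") as f:
            if fmt == "json":
                json.dump(post, f)
            else:
                # Raw HTML is written as-is; no sanitization is applied.
                f.write(post["content"])
        counts[post["category"]] = counts.get(post["category"], 0) + 1

    # Meta file describing the feeds the corpus was built from.
    with open(os.path.join(root, "feeds.json"), "w", encoding="utf-8") as f:
        json.dump(list(feeds), f, indent=2)
    return counts
```

Grouping files by category directory is one layout that categorized corpus readers can work with; the real exporter may organize the corpus differently.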


Some changes to the web application to attempt to solve SEGFAULT errors and to make the status and the logs more readable. This is just a quick hotfix to make sure we have decent monitoring in the app.
  The app can be found online at [](
  **Deployed**: Wednesday, April 13, 2016
  **Contributors**:  Benjamin Bengfort
  - Bootstrapified the status page
  - Added a job history listing to status page
  - Added a duration computation to the Job model
  - Created a mongoengine Log model
  - Added in a helper for flask.ext.MongoEngine to improve database connections
  - Removed log file reading and now read from the database
  - Added in Flask humanize for better visibility in the status page
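
As an illustration of the duration computation mentioned above, here is a small standalone sketch. The `Job` class is a hypothetical plain-Python stand-in, not the mongoengine model in Baleen, and `describe` is an assumed helper that plays the role a humanizing filter might play on the status page.

```python
from datetime import datetime, timedelta
from typing import Optional


class Job:
    """Hypothetical stand-in for an ingestion job record (not Baleen's model)."""

    def __init__(self, started: datetime, finished: Optional[datetime] = None):
        self.started = started
        self.finished = finished

    def duration(self, now: Optional[datetime] = None) -> timedelta:
        """Elapsed time; for a job still running, measured up to `now`."""
        end = self.finished or now or datetime.now()
        return end - self.started


def describe(delta: timedelta) -> str:
    """Render a timedelta as a short string such as '1h 2m 5s'."""
    total = int(delta.total_seconds())
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if minutes:
        parts.append(f"{minutes}m")
    parts.append(f"{seconds}s")
    return " ".join(parts)
```

A job that started at 12:00:00 and finished at 13:02:05 would be described as `1h 2m 5s`.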


Very happy to have had lauralorenz and bahadasx contribute to Baleen by building a web admin app. The app is a very simple Flask app that reads from the database and reports on the status, including the list of available feeds. It also reports information from the log file.
  The app can be found online now at [](
  **Deployed**: Thursday, April 7, 2016
  **Contributors**:  Sasan Bahadaran, Laura Lorenz, and Benjamin Bengfort
  - Created a Docker configuration and setup for easier development
  - Improved the export functionality for a quick corpus
  - Created a Flask web application for Baleen administration
  - Added a feeds listing page to quickly see what feeds are being ingested
  - Added a job status page that reports on the current Baleen status
  - Added a log file reader to inspect what's going on in the log file
  - Added Bootstrap and Baleen integration
  - Created a serve sub command for Baleen for easy management
  - Created a deployment method with uWSGI + Nginx


Releases one day after another! The reason is that Baleen needs to be running in production to gather a large enough corpus for PyCon. Version 0.3 is a big release that implements the revised component architecture. It should hopefully be more stable, give more visibility into what's going on, be easier to update and fix, and have a few more features. Features include tracking ingestion jobs in the Mongo database (so we can add a web application) and decoupling the synchronization of feeds from the wrangling of posts. We also added Commis for easier console utility management, along with some other tools and tests.
  **Deployed**: Thursday, March 3, 2016
  **Contributors**:  Benjamin Bengfort
  - Added the Commis library for our new console utility, which gives us more flexibility in the application.
  - Added a feed synchronization utility that decouples the feedparser interaction from anything but a feed object.
  - Added a new decorators library inspired by previous libraries
  - Added a reraise decorator that wraps exceptions and makes them Baleen exceptions
  - Added _plenty_ of tests for various modules
  - Added a post wrangling method that decouples the post interaction and web fetch from anything but a post object.
  - Created a better `info` command with more information about the app
  - Modified the ingest and run commands to be a bit more stable
  - Created a Job model for saving information about each ingestion run for application views
  - Did I say more tests?
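
The reraise decorator listed above could look roughly like the following. This is a sketch under assumptions, not the code in Baleen: `BaleenError`, the `klass` and `message` parameters, and the `parse` example function are hypothetical, and the `raise ... from` syntax is Python 3.

```python
import functools


class BaleenError(Exception):
    """Hypothetical base exception used only for this sketch."""


def reraise(klass=BaleenError, message=None):
    """Wrap a function so that any exception it raises becomes a `klass`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except klass:
                # Already the target type; let it propagate unchanged.
                raise
            except Exception as exc:
                # Chain the original so the traceback keeps the root cause.
                raise klass(message or str(exc)) from exc
        return wrapper
    return decorator


@reraise(message="could not parse value")
def parse(text):
    return int(text)
```

Callers can then catch a single exception type at the boundary of the ingestion code while the original error remains available on `__cause__`.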


Hotfix for an error that caused unicode strings to kill the ingestion in a try/except block (as it was being written to the logger)! This error was so serious it needed to be fixed right away, even in the middle of Version 0.3 updates.
  **Deployed**: Wednesday, March 2, 2016
  **Contributors**:  Benjamin Bengfort
  - Eliminated the traceback capture from the baleen console utility
  - Fixed the unicode decode exception in the error logging try/except block
  - Added some stability measures


This update was a push to get Baleen running on EC2 on an hourly basis in preparation for PyCon. We updated all of Baleen's dependencies to their latest versions, added tests and other important fixtures, and organized the code a bit better. New functionality includes the ability to fetch the post webpage from the link, export the corpus to disk using the command line utility, and run in the background using the schedule library.
  **Deployed**: Tuesday, March 1, 2016
  **Contributors**:  Benjamin Bengfort
  - Refactoring of the code to a more organized structure
  - Added some tests for safety on a number of modules
  - Updated all the dependencies from 2014
  - Added an export command to the CLI
  - Uses to fetch the full webpage from the link
  - Slightly better logging configuration
  - Use schedule to run every hour
  - Created Upstart configuration for background on Ubuntu


This was the initial version of Baleen before the revamp occurred thanks to the PyCon tutorial. Baleen in this form was a command line utility that fetched RSS feeds on demand and stored them in a Mongo database. The input to Baleen is an OPML file that contains an RSS feed listing as well as their topics.
  Baleen was originally used to produce a corpus for the District Data Labs _NLP with NLTK_ course. The corpus was then adapted for use in the []( online course of the same name. The problem was that because Baleen had to be run manually, it was difficult to get a high-quality corpus on demand.
  **Release**: Tuesday, September 23, 2014
  **Deployed**: Thursday, February 18, 2016
  **Contributor**: Benjamin Bengfort
  - CLI Program to import OPML files and kick off ingestion
  - OPML parser to read RSS feeds
  - Ingestion module to download and parse RSS using feedburner
  - Logging module for better information about ingestion
  - Mongo database integration
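
To make the OPML input concrete, here is a minimal sketch of reading feeds and their topics from an OPML document with the standard library. It is not Baleen's parser: the `iter_feeds` name, the returned keys, and the sample document with its example.com URLs are assumptions, and it presumes topics are top-level `outline` elements that contain the feed outlines.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two topics, each containing one RSS feed.
SAMPLE_OPML = """<opml version="1.0">
  <head><title>Example feeds</title></head>
  <body>
    <outline text="news" title="news">
      <outline type="rss" text="Example News" xmlUrl="http://example.com/news.rss"/>
    </outline>
    <outline text="tech" title="tech">
      <outline type="rss" text="Example Tech" xmlUrl="http://example.com/tech.rss"/>
    </outline>
  </body>
</opml>
"""


def iter_feeds(opml_text):
    """Yield one dict per feed with its topic, title, and feed URL."""
    root = ET.fromstring(opml_text)
    body = root.find("body")
    if body is None:
        return
    for topic in body.findall("outline"):
        category = topic.get("title") or topic.get("text")
        # iter() also visits the topic element itself; it is skipped
        # unless it carries an xmlUrl attribute of its own.
        for feed in topic.iter("outline"):
            url = feed.get("xmlUrl")
            if url:
                yield {"category": category, "title": feed.get("text"), "url": url}
```

Calling `list(iter_feeds(SAMPLE_OPML))` gives two entries, one in the `news` topic and one in the `tech` topic.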