mrjob

Latest version: v0.7.4




0.7.4

* library requirement changes:
  * [emr] requires boto3>=1.10.0, botocore>=1.13.26 (2193)
  * [google] requires google-cloud-dataproc<=1.1.0
* cloud runners (Dataproc, EMR):
  * mrjob is now bootstrapped through py_files, not at bootstrap time
* EMR runner:
  * default image_version is now 6.0.0
  * support Docker on 6.x AMIs (2179)
    * added docker_client_config, docker_image, docker_mounts opts
  * allow concurrent steps on EMR clusters (2185)
    * max_concurrent_steps option
    * for multi-step jobs, can add steps to the cluster one at a time
      * by default, does this if the cluster supports concurrent steps
      * can be controlled directly with the add_steps_in_batch option
  * pooling:
    * join pooled clusters based on YARN cluster metrics (2191)
      * min_available_mb, min_available_virtual_cores opts
    * upgrades to timing and cluster management:
      * max_clusters_in_pool option (2192)
      * pool_timeout_minutes option (2199)
      * pool_jitter_seconds option to prevent race conditions (2200)
      * wait for S3 sync after uploading to S3, not before launching the cluster
      * don't wait pool_wait_minutes if there are no clusters to wait for (2198)
* get_job_steps() is deprecated
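
The new pooling and concurrency options above are ordinary runner options, so they could be set in mrjob.conf. A hypothetical fragment, assuming the EMR runner; option names come from the release notes, but the values are illustrative only, not defaults:

```yaml
# Hypothetical mrjob.conf fragment -- values are examples, not defaults
runners:
  emr:
    image_version: 6.0.0
    max_concurrent_steps: 2        # allow concurrent steps (2185)
    add_steps_in_batch: false      # add steps to the cluster one at a time
    min_available_mb: 4096         # join a pooled cluster only if YARN
    min_available_virtual_cores: 4 # reports this much headroom (2191)
    max_clusters_in_pool: 5        # (2192)
    pool_timeout_minutes: 30       # (2199)
    pool_jitter_seconds: 60        # avoid pooling race conditions (2200)
```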

0.7.3

* cluster pooling changes:
  * cluster locking now uses EMR tags, not S3 objects (2160)
    * cluster locks always expire after one minute (2162)
    * deprecated the --max-mins-locked switch to terminate-idle-clusters; it now does nothing
  * pooling uses the API more efficiently
    * most cluster pooling info is in the job name (2160)
    * don't list pooled clusters' steps (2159)
  * use any matching cluster, not just the "best" one (2164)
    * "best" cluster is determined by NormalizedInstanceHours / hours run
  * matching rules are slightly more strict:
    * mrjob version must always match
    * application list must match exactly
  * terminate_idle_clusters no longer locks pooled clusters
* Spark runner:
  * counters work when spark_tmp_dir is a local path (2176)
  * manifest download script correctly handles errors with dash (2175)

0.7.2

* archives work on non-YARN Spark installations (1993)
  * mrjob.util.file_ext() ignores initial dots
  * archives in setup scripts etc. are auto-named without their file extension
  * bootstrap now recognizes archives with names like *.0.7.tar.gz
* don't copy the SSH key to the master node when accessing other nodes on EMR (1209)
  * added ssh_add_bin option
* extra_cluster_params merges dict params rather than overwriting them (2154)
* default python_bin on Python 2 is now 'python2.7' (2151)
* ensure working PyYAML installs on Python 3.4 (2149)
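
The extra_cluster_params change above means nested dict params are now merged into the underlying API parameters key by key, instead of replacing a whole sub-dict. A stdlib-only sketch of that merge semantics (merge_params is a hypothetical helper for illustration, not mrjob's actual implementation):

```python
def merge_params(base, extra):
    """Recursively merge *extra* into *base*: nested dicts are merged
    key by key rather than overwritten wholesale (hypothetical sketch
    of the behavior described in the 0.7.2 notes)."""
    merged = dict(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_params(merged[key], value)
        else:
            merged[key] = value
    return merged

# Overriding one field inside Instances no longer drops its siblings
# (all parameter values below are made-up examples):
base = {'Instances': {'Ec2KeyName': 'mykey', 'KeepJobFlowAliveWhenNoSteps': True}}
extra = {'Instances': {'Ec2SubnetId': 'subnet-12345678'}}
merged = merge_params(base, extra)
```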

0.7.1

* enable mrjob to show invoked runner with kwargs (2129)
* set default value of VisibleToAllUsers to true (2131)
* added archives to EMR pool hash during bootstrapping (2136)

0.7.0

* moved support for AWS and Google Cloud to extras_require (1935)
  * use e.g. `pip install mrjob[aws]`
* removed support for non-Python MRJobs (2087)
  * removed the interpreter and steps_interpreter options (see below)
  * removed the `mrjob run` command
  * removed mr_wc.rb from mrjob/examples/
* merged the MRJobLauncher class back into MRJob
  * MRJob classes initialized without args read them from sys.argv (2124)
    * use SomeMRJob([]) to simulate running with no args (e.g. for tests)
* revamped and tested mrjob/examples/ (2122)
  * mr_grep.py no longer errors on no matches
  * mr_log_sampler.py correctly randomizes lines
  * mr_spark_wordcount.py is no longer case sensitive
    * same with mr_spark_wordcount_script.py
  * mr_text_classifier.py now reads text files directly, no need to encode them
    * public-domain examples are in mrjob/examples/docs-to-classify
  * renamed mr_words_containing_u_freq_count.py
  * removed some examples that were difficult to test or maintain
* mrjob audit-emr-usage no longer reads pre-v0.6.0 cluster pool names (1815)
* filesystem methods now have consistent arg naming
* removed the following deprecated code:
  * runner options:
    * emr_api_params
    * interpreter
    * max_hours_idle
    * mins_to_end_of_hour
    * steps_interpreter
    * steps_python_bin
    * visible_to_all_users
  * singular switches (use --archives, etc.):
    * --archive
    * --dir
    * --file
    * --hadoop-arg
    * --libjar
    * --py-file
    * --spark-arg
  * --steps switch from MRJobs (2046)
    * use --help -v to see help for --mapper etc.
  * MRJob:
    * optparse simulation:
      * add_file_option()
      * add_passthrough_option()
      * configure_options()
      * load_options()
      * pass_through_option()
      * self.args
      * self.OPTION_CLASS
    * parse_output_line()
  * MRJobRunner:
    * file_upload_args kwarg to the runner constructor
    * stream_output()
  * mrjob.util:
    * parse_and_save_options()
    * read_file()
    * read_input()
  * filesystems:
    * arguments to the CompositeFilesystem constructor (use add_fs() instead)
    * useless local_tmp_dir arg to the GCSFilesystem constructor
    * chunk_size arg to GCSFilesystem.put()
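
The v0.7.0 constructor change above (args are read from sys.argv unless given explicitly) can be sketched with a plain stdlib class. JobLike is a hypothetical stand-in for illustration, not mrjob's MRJob:

```python
import sys

class JobLike:
    # Hypothetical stand-in sketching the v0.7.0 behavior described above:
    # with no args passed, read them from sys.argv; pass an explicit empty
    # list (like SomeMRJob([])) to simulate running with no arguments,
    # e.g. in tests.
    def __init__(self, args=None):
        self.args = sys.argv[1:] if args is None else list(args)

test_job = JobLike([])   # simulate a run with no command-line args
cli_job = JobLike()      # picks up whatever is on the command line
```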

0.6.12

* default image_version on Dataproc is now 1.3 (2110)
* local filesystem can now handle file:// URIs (1986)
  * sim runners accept file:// URIs as input files, upload files, and archives
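
The file:// support above amounts to mapping file:// URIs onto ordinary local paths. A stdlib-only sketch under that assumption (to_local_path is a hypothetical helper, not mrjob's filesystem code):

```python
from urllib.parse import urlparse
from urllib.request import url2pathname

def to_local_path(path_or_uri):
    # Hypothetical helper: turn a file:// URI into a plain local path,
    # leaving ordinary paths and other URI schemes (s3://, gs://, ...)
    # untouched.
    parsed = urlparse(path_or_uri)
    if parsed.scheme == 'file':
        return url2pathname(parsed.path)
    return path_or_uri
```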

