Backend.ai-agent

19.03.0b7

----------------------

- Make logs and error messages more detailed.

- Implement RW/RO permissions when mounting vfolders (lablup/backend.ai-manager#82)

- Change the statistics collector to use UNIX domain sockets, for specific environments
where locally bound sockets are not accessible via network-local IP addresses.

- Update Alpine-based kernel runners with a fix for their uid-match functionality.

- Fix some bugs related to allocation maps and the ImageRef class.

19.03.0b6

----------------------

- NEW: Jupyter notebooks now have our Backend.AI logo and a slightly customized look.

- Fix the Jupyter notebook service-port to work with conda-based images,
where "python -m jupyter notebook" does not work but "python -m notebook"
does.

- Let agent fail early and cleanly if there is an initialization error,
for ease of debugging with supervisord.

- Fix restoration of resource allocation maps upon agent restarts.

19.03.0b5

----------------------

- Handle failures of accelerator plugin initialization more gracefully.

19.03.0b4

----------------------

- Fix duplicate resource allocation when a compute device plugin defines
multiple resource slots.

- Fix handling multiple sets of docker container configuration arguments
generated by different compute device plugins.

19.03.0b3

----------------------

- Restore support for fractionally scaled accelerators and a reusable
FractionAllocMap class for them.

- Fix a bug after automatically pull-updating kernel images from registries.

- Fix heartbeat serialization error.

19.03.0b2

----------------------

- Add missing implementation for authenticated image pulls from private docker
registries.

19.03.0b1

----------------------

- BIG: Support dynamic resource slots and full private Docker registries. (98)

- Expand support for various kernel environments: Python 2, R, Julia, JupyterHub

19.03.0a3

----------------------

- Replace "--skip-jail" option with "--sandbox-type", which now defaults to use
Docker-provided sandboxing until we get our jail stabilized.

19.03.0a2

----------------------

- Fix missing stderr outputs in the query mode.  Now standard Python exception logs
may contain ANSI color codes as ``jupyter_client`` automatically highlights them.
(93)

19.03.0a1

----------------------

- NEW: Rewrite the kernel image specification.  Now it is much easier to build
your own kernel image by adding just a few more labels in Dockerfiles.
(ref: https://github.com/lablup/backend.ai-kernels/howto-adding-a-new-image)

- We now support official NVIDIA GPU Cloud images in this way.

- We are now able to support Python 2.x kernels again!

- Now agent/kernel-runner/jail/hook are all managed together and the kernel
images are completely separated from their changes.

- NEW: New command-line options

- ``--skip-jail``: disables our jail and falls back to Docker's default seccomp
filter.  Useful for troubleshooting with our jail.

- ``--jail-arg``: when using our jail, add extra command-line arguments to the jail
by specifying this option multiple times.
Note that options starting with a dash must be prepended with an extra space to
avoid parsing issues imposed by Python's standard argparse module.

- ``--kernel-uid``: when the agent is executed as root, use this to make the kernel
containers run as a specific user/UID.

- ``--scratch-in-memory``: moves the scratch and /tmp directories into a separate
in-memory filesystem (tmpfs) to avoid inode/quota exhaustion issues in
multi-tenant setups.

This option is only available on Linux and the agent must be run as root. When
used, the size of each directory is limited to 64 MiB. (In the future this will
become configurable.)

- CHANGE: The kernel runner now preserves container-defined environment variables.
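
The leading-space note for ``--jail-arg`` above comes from how argparse classifies arguments: a value that begins with a dash looks like another option, so the preceding option appears to have no value. A minimal sketch of that behavior follows. The parser is a hypothetical stand-in that models only ``--jail-arg``, and stripping the guard space afterwards is an assumption about how a consumer might clean up the values, not the agent's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the agent CLI; only --jail-arg is modeled.
    parser = argparse.ArgumentParser(prog="agent-sketch")
    parser.add_argument("--jail-arg", action="append", default=[],
                        help="extra argument passed to the jail (repeatable)")
    return parser


def parse_jail_args(argv):
    """Collect the jail arguments and drop the guard space (assumed cleanup)."""
    args = build_parser().parse_args(argv)
    return [value.lstrip(" ") for value in args.jail_arg]


def is_rejected(argv) -> bool:
    """Return True if argparse refuses the given command line."""
    try:
        build_parser().parse_args(argv)
    except SystemExit:
        # argparse reports usage errors by exiting the process.
        return True
    return False


# With the extra leading space, the dash-prefixed values are accepted.
print(parse_jail_args(["--jail-arg", " -v", "--jail-arg", " --debug"]))

# Without it, "-v" is taken for an option and --jail-arg has no value.
print(is_rejected(["--jail-arg", "-v"]))
```

On a shell command line, the equivalent is to quote the value so the space survives, for example ``--jail-arg " -v"``.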

18.12.1

--------------------

- Technical release to fix a packaging mistake in 18.12.0.

18.12.0

--------------------

- Version numbers now follow year.month releases like Docker.
We plan to release stable versions every 3 months (e.g., 18.12, 19.03, ...).

- NEW: Support TPUs (Tensor Processing Units) on Google Cloud.

- Clean up log messages for on-premise devops & IT admins.

18.12.0a4

----------------------

- NEW: Support specifying credentials for private Docker registries.

- CHANGE: Now it prefers etcd-based docker registry configs over CLI arguments.

18.12.0a3

----------------------

- Technical release to fix the backend.ai-common dependency version.

18.12.0a2

----------------------

- NEW: Support user-specified ranges for the service ports published by containers
via the ``--container-port-range`` CLI argument for firewall-sensitive setups.
(The default range is 30000-31000) (90)

- CHANGE: The agent now automatically pulls the image if it is not available on the host.

- CHANGE: The process monitoring tools will now show prettified process names for
Backend.AI's daemon processes which exhibit the role and key configurations (e.g.,
namespace) at a glance.

- Improve support for using custom/private Docker registries.

18.12.0a1

----------------------

- NEW: App service ports!  You can start a compute session and directly connect to a
service running inside it, such as Jupyter Notebook! (89)

- Internal refactoring to clean up and fix bugs related to image name references.

- Fix bugs in statistics collection.

- Monitoring tools are separated as plugins.

1.4.0

------------------

- Generalizes accelerator support

- Accelerators such as CUDA GPUs can be installed as a separate plugin (66)

- Adds support for nvidia-docker v2 (64)

- Adds support for allocation of multiple accelerators for one kernel container as
well as partial shares of each accelerator (66)

- Revamp the agent restart and kernel initialization processes (35, 73)

- The view of the agent can be limited to specific CPU cores and GPUs
using extra CLI arguments: ``--limit-cpus``, ``--limit-gpus`` for
debugging and performance benchmarks. (65)

1.3.7

------------------

- Hotfix for handling of dotted image names when they are terminated.

1.3.6

------------------

- Hotfix for handling subdirectories in batch-mode file uploads.

1.3.5

------------------

- Fix vfolder mounts to use the configuration specified in etcd.
(No longer fixed to "/mnt"!)

1.3.4

------------------

- Fix occasional KeyError when destroying kernels. (56)

- Deploy a debug log for occasional FileNotFoundError when uploading files
in the batch mode. (57)

1.3.3

------------------

- Fix wrong kernel_host sent back to the manager when not overridden.

1.3.2

------------------

- Technical release to fix the backend.ai-common dependency version.

1.3.1

------------------

- Technical release to update CI configuration.

1.3.0

------------------

- Fix repeating docker event polling even when there are connection/client-side
aiohttp errors.

- Upgrade aiohttp to v3.0 release.

- Improve dockerization. (55)

- Improve inner beauty.

1.2.0

------------------

**NOTICE**

- From this release, the manager and agent versions will go together to indicate
their compatibility, even when one of them has relatively few improvements.

**CHANGES**

- Include the exit code of the last executed in-kernel process when returning
``build-finished`` or ``finished`` results in the batch mode.

- Improve logging to support rotating file-based logs.

- Upgrade aiotools to v0.5.2 release.

- Remove the image name prefix when reporting available images. (51)

- Improve debug-kernel mode to mount host-side kernel runner source into the kernel
containers so that they use the latest, editable source clone of the kernel runner.

1.1.0

------------------

- Automatically assign the run ID if it is set to None when starting a run.

- Pass environment variables in the start-config to the kernels via
``/home/work/.config/environ.txt`` file mounted inside kernels.

- Include the list of kernel images available to the agent when sending
heartbeats. (51)

- Remove simplejson from dependencies in favor of the standard library.
The stdlib has been updated to support all required features and use
an internal C-based module for performance.

1.0.6

------------------

- Update aioredis to v1.0.0 release.

- Remove "mode" argument from completion RPC calls.

- Fix a bug when terminating overlapped execute streams, which has caused
indefinite hangs in the client side due to missing "finished" notification.

1.0.5

------------------

- Implement virtual folder mounting (assuming /mnt is already configured)

1.0.4

------------------

- Fix synchronization issues when restarting kernels

- Improve "debug-kernel" mode to use the given kernel name

1.0.3

------------------

- Fix a bug in duplicate-check of our Docker event stream monitoring coroutine

1.0.2

------------------

- Fix automatic mounting of deeplearning-samples Docker volume for ML kernels

- Stabilize statistics collection

- Fix typos

1.0.1

------------------

- Prevent duplicate Docker event generation

- Various bug fixes and improvements (44, 45, 46, 47)

1.0.0

------------------

- This release is replaced with v1.0.1 due to many bugs.

**CHANGES**

- Rename the package to "Backend.AI" and the import path to ``ai.backend.agent``

- Rewrite interaction with the manager

- Read configuration from etcd shared with the manager

- Add FIFO-style scheduling of overlapped execution requests

- Implement I/O and network statistic collection using sysfs

0.9.14

-------------------

**FIX**

- Fix and improve version reference mechanisms.

- Fix a missing import that vanished during a hotfix cherry-pick

0.9.12

-------------------

**IMPROVEMENTS**

- It now applies the same UID to the spawned containers if they have the "uid-match"
feature label flag. (backported from develop)

0.9.11

-------------------

**FIX**

- Add missing "sorna-common" dependency and update other requirements.

0.9.10

-------------------

**FIX**

- Fix the wrong version range of an optional dependency package "datadog"

0.9.9

------------------

**IMPROVEMENTS**

- Improve packaging so that setup.py has the source list of dependencies
whereas requirements.txt has additional/local versions from exotic
sources.

- Support exception/event logging with Sentry and runtime statistics with Datadog.

0.9.8

------------------

**FIX**

- Fix interactive user inputs in the batch-mode execution.

0.9.7

------------------

**NEW**

- Add support for the batch-mode API with compiled languages such as
C/C++/Java/Rust.

- Add support for the file upload API for use with the batch-mode API.
(up to 20 files per request and 1 MiB per file)

**CHANGES**

- Only files stored in "/home/work.output" directories of kernel containers
are auto-uploaded to S3 as downloadable files, as now we rely on our
dedicated multi-media output interfaces to show plots and other graphics.
Previously, all non-hidden files in "/home/work" were uploaded.

0.9.6

------------------

- Fix a regression in console output streaming.

0.9.5

------------------

- Add PyTorch support.

- Upgrade aiohttp to v2 and relevant dependencies as well.

0.9.4

------------------

- Update missing long_description.

0.9.3

------------------

- Improve packaging: auto-converted README.md as long description and unified
requirements.txt and setup.py dependencies.

0.9.2

------------------

- Fix sorna-common requirement version.

0.9.1

------------------

**CHANGES**

- Separate console output formats for API v1 and v2.

- Deprecate unused matching option for execution API.

- Remove control messages in API responses.

0.9.0

------------------

**NEW**

- PUSH/PULL-based kernel interaction protocol to support streaming outputs.
This enables interactive input functions and streaming outputs for long-running code,
and also makes kernel execution more resilient to network failures.
(ZeroMQ's REQ/REP sockets break the system if any messages get dropped)

0.8.2

------------------

**FIXES**

- Fix a typo that generates errors during GPU kernel initialization.

- Fix regression of the ``--agent-ip-override`` CLI option.

0.8.1

------------------

- Minor internal polishing release.

0.8.0

------------------

**CHANGES**

- Bump version to 0.8 to match with sorna-manager and sorna-client.

**FIXES**

- Fix events lost by HTTP connection timeouts when using ``docker.events.run()`` from
aiodocker.  (This is due to the default 5-minute timeout set by aiohttp.)

- Correct task cancellation

0.7.5

------------------

**CHANGES**

- Add new aliases for "git" kernel: "git-shell" and "shell"

0.7.4

------------------

**CHANGES**

- Now it uses `aiodocker`_ instead of `docker-py`_ to
prevent timeouts with many concurrent requests.

NOTE: You need to run ``pip install -r requirements.txt`` to install the
non-pip (GitHub) version of aiodocker correctly, before running
``pip install sorna-agent``.

**FIXES**

- Fix corner-case exceptions in statistics/heartbeats.

.. _aiodocker: https://github.com/achimnol/aiodocker

.. _docker-py: https://github.com/docker/docker-py

0.7.3

------------------

**CHANGES**

- Increase docker API timeouts.

**FIXES**

- Fix heartbeats that stop working after kernel/agent timeouts.

- Fix exception logging in the main server loop.

0.7.2

------------------

**FIXES**

- Hotfix for missing dependency: coloredlogs

0.7.1

------------------

**NEW**

- ``--agent-ip-override`` CLI option to override the IP address of agent
reported to the manager.

0.7.0

------------------

**NEW**

- Add support for kernel restarts.
Restarting preserves kernel metadata and its ID, but removes and recreates
the working volume and the container itself.

- Add ``--debug`` option to the CLI command.

0.6.0

------------------

**NEW**

- Add support for GPU-enabled kernels (using `nvidia-docker plugin`_).
The kernel images must be built upon nvidia-docker's base Ubuntu images and
have the label "io.sorna.nvidia.enabled" set to ``yes``.

**CHANGES**

- Change the agent to add "lablup/" prefix when creating containers from
kernel image names, to ease setup and running using the public docker
repository.  (e.g., "lablup/kernel-python3" instead of "kernel-python3")

- Change the prefix of kernel image labels from "com.lablup.sorna." to
"io.sorna." for simplicity.

- Increase the default idle timeout to 30 minutes for offline tutorials/workshops.

- Limit the CPU cores available in kernel containers.
It uses an optional "io.sorna.maxcores" label (default is 1 when not
specified) to determine the requested number of CPU cores in kernels, with a
hard limit of 4.

NOTE: You will still see the full count of CPU cores of the underlying
system when running ``os.cpu_count()``, ``multiprocessing.cpu_count()`` or
``os.sysconf("SC_NPROCESSORS_ONLN")`` because the limit is enforced by the CPU
affinity mask.  To get the correct result, try
``len(os.sched_getaffinity(os.getpid()))``.
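
The core limit and the affinity-based count described above can be sketched as follows. The helper names and the plain ``dict`` of labels are hypothetical; only the label name, the default of 1, the hard limit of 4, and the ``os.sched_getaffinity()`` call come from the notes above. The fallback for platforms without an affinity API is an added assumption.

```python
import os

HARD_CORE_LIMIT = 4  # hard limit stated in the entry above
DEFAULT_CORES = 1    # used when the label is not specified


def requested_cores(labels: dict) -> int:
    """Resolve the core count for a kernel from its image labels.

    Hypothetical helper: the real agent may read the labels differently.
    """
    raw = labels.get("io.sorna.maxcores", DEFAULT_CORES)
    return min(int(raw), HARD_CORE_LIMIT)


def usable_cores() -> int:
    """Count the cores the current process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: honors the CPU affinity mask (0 means the current process).
        return len(os.sched_getaffinity(0))
    # Assumed fallback where the os module has no affinity API.
    return os.cpu_count() or 1


print(requested_cores({}))                           # label missing
print(requested_cores({"io.sorna.maxcores": "8"}))  # capped by the hard limit
print(usable_cores(), os.cpu_count())                # may differ inside a kernel
```

Inside a kernel container the first number on the last line reflects the affinity mask, while ``os.cpu_count()`` still reports the host's full core count.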

.. _nvidia-docker plugin: https://github.com/NVIDIA/nvidia-docker

0.5.0

------------------

**NEW**

- First public release.

