Koalas

Latest version: v1.8.2

1.8.2

Koalas 1.8.2 is a maintenance release.
Koalas is [officially included in PySpark as **pandas API on Spark** in Apache Spark 3.2](https://issues.apache.org/jira/browse/SPARK-34849). In Apache Spark 3.2+, please use Apache Spark directly.

Although moving to pandas API on Spark is recommended, Koalas 1.8.2 still works with Spark 3.2 (2203).
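
For code that has to run on both older and newer installations, one way to follow this advice is to prefer `pyspark.pandas` and fall back to `databricks.koalas`. The sketch below is only an illustration: the helper name `pick_pandas_on_spark_module` is hypothetical, and it only checks which module is importable without starting Spark.

```python
import importlib.util


def pick_pandas_on_spark_module():
    """Return the name of the first importable module, or None.

    Hypothetical helper: prefers pandas API on Spark (Spark 3.2+),
    then falls back to the standalone Koalas package.
    """
    for name in ("pyspark.pandas", "databricks.koalas"):
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except (ImportError, ValueError):
            # The parent package (pyspark or databricks) is not usable.
            continue
    return None


print(pick_pandas_on_spark_module())
```

The returned name can then be passed to `importlib.import_module` so that the rest of the code uses a single alias regardless of which package is present.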

Improvements and bug fixes

- Fix the `_builtin_table` import in groupby apply (changed in pandas>=1.3.0). (2184)

1.8.1

Koalas 1.8.1 is a maintenance release. Koalas will be [officially included in PySpark in the upcoming Apache Spark 3.2](http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html). In Apache Spark 3.2+, please use Apache Spark directly.

Improvements and bug fixes

- Remove the upperbound for numpy. (2166)
- Allow Python 3.9 when the underlying PySpark is 3.1 and above. (2167)

Along with the following fixes:
- Support x and y properly in plots (both matplotlib and plotly). (2172)
- Fix `Index.difference` to work properly. (2173)
- Fix backward compatibility for Python version 3.5.*. (2174)

1.8.0

Koalas 1.8.0 is the last minor release because Koalas will be [officially included in PySpark in the upcoming Apache Spark 3.2](http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html). In Apache Spark 3.2+, please use Apache Spark directly.

Categorical type and `ExtensionDtype`

We added support for pandas' categorical type (2064, 2106).

```python
>>> s = ks.Series(list("abbccc"), dtype="category")
>>> s
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> s.cat.categories
Index(['a', 'b', 'c'], dtype='object')
>>> s.cat.codes
0    0
1    1
2    1
3    2
4    2
5    2
dtype: int8
>>> idx = ks.CategoricalIndex(list("abbccc"))
>>> idx
CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'],
                 categories=['a', 'b', 'c'], ordered=False, dtype='category')

>>> idx.codes
Int64Index([0, 1, 1, 2, 2, 2], dtype='int64')
>>> idx.categories
Index(['a', 'b', 'c'], dtype='object')
```

We also added support for [ExtensionDtype](https://pandas.pydata.org/docs/reference/api/pandas.api.extensions.ExtensionDtype.html) as type arguments to annotate return types (2120, 2123, 2132, 2127, 2126, 2125, 2124):

```python
def func() -> ks.Series[pd.Int32Dtype()]:
    ...
```
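
To make the `cat.categories` and `cat.codes` output in the categorical example earlier in this section easier to read, here is a plain-Python model of the encoding. It is a sketch of the idea only, not how Koalas or pandas implement it, and the function name `encode_categories` is made up for illustration.

```python
def encode_categories(values):
    """Model a categorical column as (categories, codes).

    Illustrative only: categories are the sorted unique values, and each
    code is the position of the value in that category list.
    """
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    codes = [position[value] for value in values]
    return categories, codes


categories, codes = encode_categories(list("abbccc"))
print(categories)  # ['a', 'b', 'c']
print(codes)       # [0, 1, 1, 2, 2, 2]
```

Storing small integer codes alongside a short list of categories is what lets a categorical column represent repeated values compactly.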

Other new features, improvements and bug fixes

We added the following new features:

DataFrame:

- `first` (2128)
- `at_time` (2116)

Series:

- `at_time` (2130)
- `first` (2128)
- `between_time` (2129)

DatetimeIndex:

- `indexer_between_time` (2104)
- `indexer_at_time` (2109)
- `between_time` (2111)

Along with the following fixes:

- Support tuple to (DataFrame|Series).replace() (2095)
- Check index_dtype and data_dtypes more strictly. (2100)
- Return actual values via toPandas. (2077)
- Add lines and orient to read_json and to_json to improve error message (2110)
- Fix isin to accept numpy array (2103)
- Allow multi-index column names for inferring return type schema with names. (2117)
- Add a short JDBC user guide (2148)
- Remove upper bound pandas 1.2 (2141)
- Standardize exceptions of arithmetic operations on Datetime-like data (2101)

1.7.0

Switch the default plotting backend to Plotly

We switched the default plotting backend from Matplotlib to Plotly (2029, 2033). In addition, we added more Plotly methods such as `DataFrame.plot.kde` and `Series.plot.kde` (2028).

```python
import databricks.koalas as ks
kdf = ks.DataFrame({
    'a': [1, 2, 2.5, 3, 3.5, 4, 5],
    'b': [1, 2, 3, 4, 5, 6, 7],
    'c': [0.5, 1, 1.5, 2, 2.5, 3, 3.5]})
kdf.plot.hist()
```

![Koalas_plotly_hist_plot](https://user-images.githubusercontent.com/44108233/110273113-33fcc380-800f-11eb-8d6f-e12fb3bf7bd0.png)

The plotting backend can be switched back to Matplotlib by setting `ks.options.plotting.backend` to `matplotlib`.

```python
ks.options.plotting.backend = "matplotlib"
```

Add Int64Index, Float64Index, DatetimeIndex

We added more types of `Index` such as `Int64Index`, `Float64Index` and `DatetimeIndex` (2025, 2066).

Previously, creating an index always returned an `Index` instance regardless of the data type.

Now `Int64Index`, `Float64Index` or `DatetimeIndex` is returned depending on the data type of the index.

```python
>>> type(ks.Index([1, 2, 3]))
<class 'databricks.koalas.indexes.numeric.Int64Index'>
```
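
The dispatch described above can be pictured with a small toy function. This is an assumption-laden sketch: `index_class_name` is a hypothetical helper that inspects Python values, whereas Koalas decides based on the underlying Spark data type.

```python
import datetime


def index_class_name(values):
    """Toy model: name of the Index subclass a list of values would map to."""
    if not values:
        return "Index"
    # bool is a subclass of int in Python, so exclude it explicitly.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if numeric:
        if all(isinstance(v, int) for v in values):
            return "Int64Index"
        return "Float64Index"
    if all(isinstance(v, datetime.datetime) for v in values):
        return "DatetimeIndex"
    return "Index"


print(index_class_name([1, 2, 3]))
print(index_class_name([1.0, 2.5]))
print(index_class_name([datetime.datetime(2021, 3, 1)]))
print(index_class_name(["a", "b"]))
```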

1.6.0

Improved Plotly backend support

We improved plotting support by implementing pie, histogram and box plots with the Plotly backend. Koalas can now plot data with Plotly via:

- `DataFrame.plot.pie` and `Series.plot.pie` (1971)
![Screen Shot 2021-01-22 at 6 32 48 PM](https://user-images.githubusercontent.com/6477701/105473278-48dffa80-5ce0-11eb-8438-b513d205d5b4.png)

- `DataFrame.plot.hist` and `Series.plot.hist` (1999)
![Screen Shot 2021-01-22 at 6 32 38 PM](https://user-images.githubusercontent.com/6477701/105473276-47aecd80-5ce0-11eb-93bf-f91b61291aad.png)

- `Series.plot.box` (2007)
![Screen Shot 2021-01-22 at 6 32 31 PM](https://user-images.githubusercontent.com/6477701/105473275-467da080-5ce0-11eb-8a81-3330799e1229.png)

In addition, we optimized histogram calculation for `DataFrame` to run as a single pass (1997) instead of launching a separate job for each `Series` in the `DataFrame`.
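
The single-pass idea can be sketched in plain Python: walk the rows once and update the bin counts of every column as you go, rather than scanning the data once per column. This is an illustration under assumptions, not the Koalas implementation; the function name `count_bins_single_pass` and the precomputed `edges` argument are hypothetical.

```python
from bisect import bisect_right


def count_bins_single_pass(rows, columns, edges):
    """Count histogram bins for several columns in one pass over the rows.

    `edges` are shared, sorted bin edges; the last bin includes its right edge.
    """
    counts = {column: [0] * (len(edges) - 1) for column in columns}
    for row in rows:  # one pass over the data
        for column in columns:  # every column updated from the same row
            value = row[column]
            if value < edges[0] or value > edges[-1]:
                continue  # outside the histogram range
            bin_index = min(bisect_right(edges, value) - 1, len(edges) - 2)
            counts[column][bin_index] += 1
    return counts


rows = [{"a": 1, "b": 1}, {"a": 2, "b": 3}, {"a": 5, "b": 7}]
result = count_bins_single_pass(rows, ["a", "b"], [1, 3, 5, 7])
print(result)  # {'a': [2, 0, 1], 'b': [1, 1, 1]}
```

With many columns, the saving comes from reading each row once instead of once per column.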

Operations between Series and Index

The operations between `Series` and `Index` are now supported as below (1996):

```python
>>> kser = ks.Series([1, 2, 3, 4, 5, 6, 7])
>>> kidx = ks.Index([0, 1, 2, 3, 4, 5, 6])

>>> (kser + 1 + 10 * kidx).sort_index()
0     2
1    13
2    24
3    35
4    46
5    57
6    68
dtype: int64
>>> (kidx + 1 + 10 * kser).sort_index()
0    11
1    22
2    33
3    44
4    55
5    66
6    77
dtype: int64
```

Support setting a `Series` via attribute access

We added support for setting a column via attribute assignment on a `DataFrame` (1989).

```python
>>> kdf = ks.DataFrame({'A': [1, 2, 3, None]})
>>> kdf.A = kdf.A.fillna(kdf.A.median())
>>> kdf
A
```
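
As a rough mental model of attribute assignment, the toy class below routes `obj.name = value` to a column store when `name` is an existing column. `TinyFrame` is a hypothetical stand-in written for this note; it is not Koalas code and makes no claim about how Koalas implements the feature.

```python
class TinyFrame:
    """Hypothetical stand-in for a DataFrame that keeps columns in a dict."""

    def __init__(self, data):
        # Bypass our own __setattr__ while creating the column store.
        object.__setattr__(self, "_columns", dict(data))

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        columns = self.__dict__.get("_columns", {})
        if name in columns:
            return columns[name]
        raise AttributeError(name)

    def __setattr__(self, name, value):
        if name in self._columns:
            # Existing column: replace its values.
            self._columns[name] = value
        else:
            # Anything else stays an ordinary Python attribute.
            object.__setattr__(self, name, value)


frame = TinyFrame({"A": [1, 2, 3, None]})
frame.A = [2 if v is None else v for v in frame.A]
print(frame.A)
```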

1.5.0

Index operations support

We improved Index operations support (1944, 1955).

Here are some examples:

- Before

```py
>>> kidx = ks.Index([1, 2, 3, 4, 5])
>>> kidx + kidx
Int64Index([2, 4, 6, 8, 10], dtype='int64')
>>> kidx + kidx + kidx
Traceback (most recent call last):
  ...
AssertionError: args should be single DataFrame or single/multiple Series
```

```py
>>> ks.Index([1, 2, 3, 4, 5]) + ks.Index([6, 7, 8, 9, 10])
Traceback (most recent call last):
  ...
AssertionError: args should be single DataFrame or single/multiple Series
```

- After

```python
>>> kidx = ks.Index([1, 2, 3, 4, 5])
>>> kidx + kidx + kidx
Int64Index([3, 6, 9, 12, 15], dtype='int64')
```

```python
>>> ks.options.compute.ops_on_diff_frames = True
>>> ks.Index([1, 2, 3, 4, 5]) + ks.Index([6, 7, 8, 9, 10])
Int64Index([7, 9, 13, 11, 15], dtype='int64')
```

Other new features and improvements

We added the following new features:

DataFrame:

- `swaplevel` (1928)
- `swapaxes` (1946)
- `dot` (1945)
- `itertuples` (1960)

Series:

- `swaplevel` (1919)
- `swapaxes` (1954)

Index:

- `to_list` (1948)

MultiIndex:

- `to_list` (1948)

GroupBy:
- `tail` (1949)
- `median` (1957)

Other improvements and bug fixes

- Support DataFrame parameter in Series.dot (1931)
- Add a best practice for checkpointing. (1930)
- Remove implicit switch-ons of "compute.ops_on_diff_frames" (1953)
- Fix Series._to_internal_pandas and introduce Index._to_internal_pandas. (1952)
- Fix first/last_valid_index to support empty column DataFrame. (1923)
- Use pandas' transpose when the data is expected to be small. (1932)
- Fix tail to use the resolved copy (1942)
- Avoid unneeded reset_index in DataFrameGroupBy.describe. (1951)
- Raise TypeError when Index.name / Series.name is not a hashable type (1883)
- Adjust data column names before attaching default index. (1947)
- Add plotly into the optional dependency in Koalas (1939)
- Add plotly backend test cases (1938)
- Don't pass stacked in plotly area chart (1934)
- Set upperbound of matplotlib to avoid failure on Ubuntu (1959)
- Fix GroupBy.describe for multi-index columns. (1922)
- Upgrade pandas version in CI (1961)
- Compare Series from the same anchor (1956)
- Add videos from Data+AI Summit 2020 EUROPE. (1963)
- Set PYARROW_IGNORE_TIMEZONE for binder. (1965)
