Koalas

1.4.0

Better type support

We improved the type mapping between pandas and Koalas (1870, 1903). We added more types and string expressions for specifying data types, and fixed mismatches between pandas and Koalas.

Here are some examples:

- Added `np.float32` and `"float32"` (matched to `FloatType`)

```python
>>> ks.Series([10]).astype(np.float32)
0    10.0
dtype: float32

>>> ks.Series([10]).astype("float32")
0    10.0
dtype: float32
```

- Added `np.datetime64` and `"datetime64[ns]"` (matched to `TimestampType`)

```python
>>> ks.Series(["2020-10-26"]).astype(np.datetime64)
0   2020-10-26
dtype: datetime64[ns]

>>> ks.Series(["2020-10-26"]).astype("datetime64[ns]")
0   2020-10-26
dtype: datetime64[ns]
```

- Fixed `np.int` to match `LongType`, not `IntegerType`.

```python
>>> pd.Series([100]).astype(np.int)
0    100
dtype: int64

>>> ks.Series([100]).astype(np.int)
0    100
dtype: int32
```

  In Koalas 1.4.0 this now returns `int64`, matching pandas.

- Fixed `np.float` to match `DoubleType`, not `FloatType`.

```python
>>> pd.Series([100]).astype(np.float)
0    100.0
dtype: float64

>>> ks.Series([100]).astype(np.float)
0    100.0
dtype: float32
```

  In Koalas 1.4.0 this now returns `float64`, matching pandas.

We also added a document describing the supported/unsupported pandas data types and the data type mapping between pandas and PySpark. See: [Type Support In Koalas](https://koalas.readthedocs.io/en/latest/user_guide/types.html).
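As a quick, hedged illustration (assuming the `Series.spark.data_type` accessor available in Koalas 1.x, with `repr` output as printed by older PySpark versions), you can inspect the PySpark type backing a Koalas Series:

```python
>>> import numpy as np
>>> import databricks.koalas as ks
>>> ks.Series([10]).astype(np.float32).spark.data_type
FloatType
>>> ks.Series([10]).astype(np.float).spark.data_type
DoubleType
```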

Return type annotations for major Koalas objects

To improve Koalas' auto-completion in various editors and avoid misuse of APIs, we added return type annotations to major Koalas objects. These objects include DataFrame, Series, Index, GroupBy, Window objects, etc. (1852, 1857, 1859, 1863, 1871, 1882, 1884, 1889, 1892, 1894, 1898, 1899, 1900, 1902).

The return type annotations help auto-completion libraries, such as [Jedi](https://jedi.readthedocs.io/en/latest/), to infer the actual data type and provide proper suggestions:

- Before

![Before](https://user-images.githubusercontent.com/506656/98856023-d1553b80-2411-11eb-9bde-ed2e2012c8b1.png)

- After

![After](https://user-images.githubusercontent.com/506656/98856035-d4e8c280-2411-11eb-8d47-05f546695f20.png)

The annotations also enable mypy to run static analysis over method bodies.
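For instance, a minimal sketch of the style of annotation added (signatures illustrative, not the exact Koalas source):

```python
class DataFrame:
    def head(self, n: int = 5) -> "DataFrame":
        """Return the first n rows as a new DataFrame."""
        ...

    def sum(self) -> "Series":
        """Aggregate columns, so editors can suggest Series methods on the result."""
        ...

class Series:
    def to_frame(self) -> DataFrame:
        """Convert to a single-column DataFrame."""
        ...
```

With such annotations, Jedi can suggest `Series` methods on the result of `df.sum()`, and mypy can type-check code that consumes these return values.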

pandas 1.1.4 support

We verified the behaviors of pandas 1.1.4 in Koalas.

As pandas 1.1.4 introduced a behavior change related to `MultiIndex.is_monotonic` (`MultiIndex.is_monotonic_increasing`) and `MultiIndex.is_monotonic_decreasing` (pandas-dev/pandas#37220), Koalas follows the new behavior (1881).
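A hedged sketch of the pandas-side change (assuming pandas >= 1.1.4): a MultiIndex containing NaN is no longer reported as monotonic:

```python
>>> import pandas as pd
>>> midx = pd.MultiIndex.from_tuples([(1, "a"), (2, None), (3, "c")])
>>> midx.is_monotonic_increasing  # reported True in some earlier pandas versions
False
```

Koalas now returns the same result for the equivalent Koalas MultiIndex.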

Other new features and improvements

We added the following new features:

DataFrame:

- `__neg__` (1847)
- `rename_axis` (1843)
- `spark.repartition` (1864)
- `spark.coalesce` (1873)
- `spark.checkpoint` (1877)
- `spark.local_checkpoint` (1878)
- `reindex_like` (1880)

Series:

- `rename_axis` (1843)
- `compare` (1802)
- `reindex_like` (1880)

Index:

- `intersection` (1747)

MultiIndex:

- `intersection` (1747)

Other improvements and bug fixes

- Use SF.repeat in Series.str.repeat (1844)
- Remove warning when using cache in the context manager (1848)
- Support a non-string name in Series' boxplot (1849)
- Calculate fliers correctly in Series.plot.box (1846)
- Show type name rather than type class in error messages (1851)
- Fix DataFrame.spark.hint to reflect internal changes. (1865)
- DataFrame.reindex supports named columns index (1876)
- Separate InternalFrame.index_map into index_spark_column_names and index_names. (1879)
- Fix DataFrame.xs to handle internal changes properly. (1896)
- Explicitly disallow empty list as index_spark_column_names and index_names. (1895)
- Use nullable inferred schema in function apply (1897)
- Introduce InternalFrame.index_level. (1890)
- Remove InternalFrame.index_map. (1901)
- Force the use of Spark's system default precision and scale when the inferred data type contains DecimalType. (1904)
- Upgrade PyArrow from 1.0.1 to 2.0.0 in CI (1860)
- Fix read_excel to support squeeze argument. (1905)
- Fix to_csv to avoid duplicated option 'path' for DataFrameWriter. (1912)

1.3.0

pandas 1.1 support

We verified the behaviors of pandas 1.1 in Koalas. Koalas now supports pandas 1.1 officially (1688, 1822, 1829).

Support for non-string names

We now support non-string names (1784). Previously, names in Koalas, e.g., `df.columns`, `df.columns.names`, `df.index.names`, had to be strings or tuples of strings; now any data type supported by Spark is allowed.

**Before:**

```py
>>> kdf = ks.DataFrame([[1, 'x'], [2, 'y'], [3, 'z']])
>>> kdf.columns
Index(['0', '1'], dtype='object')
```

**After:**

```py
>>> kdf = ks.DataFrame([[1, 'x'], [2, 'y'], [3, 'z']])
>>> kdf.columns
Int64Index([0, 1], dtype='int64')
```

Improve `distributed-sequence` default index

Performance when creating a `distributed-sequence` default index is improved by avoiding interaction between Python and the JVM (1699).
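The default index type is selected through the `compute.default_index_type` option; a brief usage sketch:

```python
>>> import databricks.koalas as ks
>>> ks.set_option("compute.default_index_type", "distributed-sequence")
>>> ks.get_option("compute.default_index_type")
'distributed-sequence'
```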

Standardize binary operations between int and str columns

Make the behaviors of binary operations (`+`, `-`, `*`, `/`, `//`, `%`) between `int` and `str` columns consistent with the corresponding pandas behaviors (1828).

It standardizes binary operations as follows (see the sketch after this list):

- `+`: raises `TypeError` between an int column and a str column (or string literal)
- `*`: acts like Spark SQL's `repeat` between an int column (or int literal) and a str column; raises `TypeError` if a string literal is involved
- `-`, `/`, `//`, `%` (modulo): raise `TypeError` if a str column (or string literal) is involved
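A minimal sketch of the standardized behavior (data illustrative; the elided `TypeError` message may vary):

```python
>>> import databricks.koalas as ks
>>> kser = ks.Series(["x", "y", "z"])
>>> kser * 3  # int * str column behaves like Spark SQL's repeat
0    xxx
1    yyy
2    zzz
dtype: object
>>> kser + 1  # str column + int raises, as in pandas
Traceback (most recent call last):
  ...
TypeError: ...
```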

Other new features and improvements

We added the following new features:

DataFrame:

- `product` (1739)
- `from_dict` (1778)
- `pad` (1786)
- `backfill` (1798)

Series:

- `reindex` (1737)
- `explode` (1777)
- `pad` (1786)
- `argmin` (1790)
- `argmax` (1790)
- `argsort` (1793)
- `backfill` (1798)

Index:

- `inferred_type` (1745)
- `item` (1744)
- `is_unique` (1766)
- `asi8` (1764)
- `is_type_compatible` (1765)
- `view` (1788)
- `insert` (1804)

MultiIndex:

- `inferred_type` (1745)
- `item` (1744)
- `is_unique` (1766)
- `asi8` (1764)
- `is_type_compatible` (1765)
- `from_frame` (1762)
- `view` (1788)
- `insert` (1804)

GroupBy:

- `get_group` (1783)

Other improvements

- Fix DataFrame.mad to work properly (1749)
- Fix Series name after binary operations. (1753)
- Fix GroupBy.cum~ to match pandas' behavior (1708)
- Fix cumprod to work properly with Integer columns. (1750)
- Fix DataFrame.join for MultiIndex (1771)
- Handle exceptions properly in from_frame (1791)
- Fix iloc for slice(None, 0) (1767)
- Fix `Series.__repr__` when `Series.name` is None. (1796)
- DataFrame.reindex supports koalas Index parameter (1741)
- Fix Series.fillna with inplace=True on non-nullable column. (1809)
- Input check in various APIs (1808, 1810, 1811, 1812, 1813, 1814, 1816, 1824)
- Fix to_list to work properly in pandas==0.23 (1823)
- Fix Series.astype to work properly (1818)
- Frame.groupby supports dropna (1815)

1.2.0

Non-named Series support

Series without a name are now supported, and `Series.name` can be set to `None`, as in pandas:

```python
>>> ks.Series([1, 2, 3])
0    1
1    2
2    3
dtype: int64

>>> kser = ks.Series([1, 2, 3], name="a")
>>> kser.name = None
>>> kser
0    1
1    2
2    3
dtype: int64
```

More stable "distributed-sequence" default index

Previously, the "distributed-sequence" default index sometimes produced wrong values or even raised an exception. For example, the code below:

```python
>>> from databricks import koalas as ks
>>> ks.options.compute.default_index_type = 'distributed-sequence'
>>> ks.range(10).reset_index()
```

failed as below:


```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
pyspark.sql.utils.PythonException:
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  ...
  File "/.../koalas/databricks/koalas/internal.py", line 620, in offset
    current_partition_offset = sums[id.iloc[0]]
KeyError: 103
```

We investigated and made the default index type more stable (1701). It is now unlikely to hit such situations and is stable enough for general use.
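For example, the snippet above now runs and produces the expected sequential index (a sketch; exact formatting may vary slightly):

```python
>>> ks.range(10).reset_index().head(3)
   index  id
0      0   0
1      1   1
2      2   2
```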

Improve testing infrastructure

We changed the testing infrastructure to use pandas' testing utilities for exact checks (1722). The checks now compare even index/column types and names, so that we can follow pandas more strictly.
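A minimal sketch of the kind of exact comparison this enables, using pandas' public testing helpers (the Koalas test suite's internals may differ):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Compares values and dtypes, and also index/column types and names.
expected = pd.DataFrame({"a": [1, 2]})
result = pd.DataFrame({"a": [1, 2]})
assert_frame_equal(result, expected, check_names=True)
```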

Other new features and improvements

We added the following new features:

DataFrame:

- `last_valid_index` (1705)

Series:

- `product` (1677)
- `last_valid_index` (1705)

GroupBy:

- `cumcount` (1702)

Other improvements

- Refine Spark I/O. (1667)
  - Set `partitionBy` explicitly in `to_parquet`.
  - Add `mode` and `partition_cols` to `to_csv` and `to_json`.
  - Fix type hints to use `Optional`.
- Make read_excel read from DFS if the underlying Spark is 3.0.0 or above. (1678, 1693, 1694, 1692)
- Support callable instances to apply as a function, and fix groupby.apply to keep the index when possible (1686)
- Fix hasnans for non-DoubleType columns. (1681)
- Support axis=1 for DataFrame.dropna(). (1689)
- Allow assigning an index as a column (1696)
- Try to read pandas metadata in read_parquet if index_col is None. (1695)
- Include pandas Index object in dataframe indexing options (1698)
- Unified `PlotAccessor` for DataFrame and Series (1662)
- Fix SeriesGroupBy.nsmallest/nlargest. (1713)
- Fix DataFrame.size to consider its number of columns. (1715)
- Fix first_valid_index() for Empty object (1704)
- Fix index name when groupby.apply returns a single row. (1719)
- Support subtraction of date/timestamp with literals. (1721)
- DataFrame.reindex(fill_value) does not fill existing NaN values (1723)

1.1.0

API extensions

We added support for API extensions (1617).

You can register your custom accessors to `DataFrame`, `Series`, and `Index`.

For example, in your library code:

```py
from databricks.koalas.extensions import register_dataframe_accessor

@register_dataframe_accessor("geo")
class GeoAccessor:

    def __init__(self, koalas_obj):
        self._obj = koalas_obj
        # other constructor logic

    @property
    def center(self):
        # return the geographic center point of this DataFrame
        lat = self._obj.latitude
        lon = self._obj.longitude
        return (float(lon.mean()), float(lat.mean()))

    def plot(self):
        # plot this array's data on a map
        pass

    ...
```

Then, in a session:

```py
>>> from my_ext_lib import GeoAccessor
>>> kdf = ks.DataFrame({"longitude": np.linspace(0, 10),
...                     "latitude": np.linspace(0, 20)})
>>> kdf.geo.center
(5.0, 10.0)

>>> kdf.geo.plot()
...
```

See also: https://koalas.readthedocs.io/en/latest/reference/extensions.html

Plotting backend

We introduced `plotting.backend` configuration (1639).

Plotly (>=4.8) or other libraries that pandas supports can be used as a plotting backend if they are installed in the environment.

```py
>>> kdf = ks.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["A", "B", "C", "D"])
>>> kdf.plot(title="Example Figure")  # defaults to backend="matplotlib"
```

![image](https://user-images.githubusercontent.com/23381512/87221509-26147700-c38a-11ea-9fc5-dd87e3031055.png)

```python
>>> fig = kdf.plot(backend="plotly", title="Example Figure", height=500, width=500)
>>> # same as:
>>> # ks.options.plotting.backend = "plotly"
>>> # fig = kdf.plot(title="Example Figure", height=500, width=500)
>>> fig.show()
```

![image](https://user-images.githubusercontent.com/23381512/87221424-91aa1480-c389-11ea-87cd-6ed81c46c5f0.png)

Each backend returns the figure in its own format, allowing further editing or customization if required.

```python
>>> fig.update_layout(template="plotly_dark")
>>> fig.show()
```

![image](https://user-images.githubusercontent.com/23381512/87221444-ab4b5c00-c389-11ea-9a0c-341104855510.png)

Koalas accessor

We introduced the `koalas` accessor and some methods specific to Koalas (1613, 1628).

`DataFrame.apply_batch`, `DataFrame.transform_batch`, and `Series.transform_batch` are deprecated and moved to the `koalas` accessor.
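The migration is a rename only; behavior is unchanged. A small sketch (the `plus_one` function is hypothetical):

```python
import databricks.koalas as ks

kdf = ks.DataFrame({"a": [1, 2, 3]})

def plus_one(pdf):  # takes and returns a pandas DataFrame
    return pdf + 1

# Before (now deprecated):
#   kdf.apply_batch(plus_one)
# After:
kdf.koalas.apply_batch(plus_one)
```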

```py
>>> kdf = ks.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> def pandas_plus(pdf):
...     return pdf + 1  # should always return the same length as input.
...
>>> kdf.koalas.transform_batch(pandas_plus)
   a  b
0  2  5
1  3  6
2  4  7
```

```py
>>> kdf = ks.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> def pandas_filter(pdf):
...     return pdf[pdf.a > 1]  # allow arbitrary length
...
>>> kdf.koalas.apply_batch(pandas_filter)
   a  b
1  2  5
2  3  6
```

or

```py
>>> kdf = ks.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> def pandas_plus(pser):
...     return pser + 1  # should always return the same length as input.
...
>>> kdf.a.koalas.transform_batch(pandas_plus)
0    2
1    3
2    4
Name: a, dtype: int64
```

See also: https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html

Other new features and improvements

We added the following new features:

DataFrame:

- `tail` (1632)
- `droplevel` (1622)

Series:

- `iteritems` (1603)
- `items` (1603)
- `tail` (1632)
- `droplevel` (1630)

Other improvements

- Simplify Series.to_frame. (1624)
- Make Window functions create a new DataFrame. (1623)
- Fix Series._with_new_scol to use alias. (1634)
- Refine concat to handle the same anchor DataFrames properly. (1627)
- Add sort parameter to concat. (1636)
- Enable assigning a list. (1644)
- Use SPARK_INDEX_NAME_FORMAT in combine_frames to avoid ambiguity. (1650)
- Rename spark columns only when index=False. (1649)
- read_csv: Implement reading of number of rows (1656)
- Fixed ks.Index.to_series() to work properly with the name parameter (1643)
- Fix fillna to handle "ffill" and "bfill" properly. (1654)

1.0.1

Critical bug fix

We fixed a critical bug introduced in Koalas 1.0.0 (1609).

If `DataFrame.rename` was called with the `columns` parameter after some operations on the DataFrame, the effects of those operations were lost:

```py
>>> kdf = ks.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["A", "B", "C", "D"])
>>> kdf1 = kdf + 1
>>> kdf1
   A  B  C  D
0  2  3  4  5
1  6  7  8  9
>>> kdf1.rename(columns={"A": "aa", "B": "bb"})
   aa  bb  C  D
0   1   2  3  4
1   5   6  7  8
```

This should be:

```py
>>> kdf1.rename(columns={"A": "aa", "B": "bb"})
   aa  bb  C  D
0   2   3  4  5
1   6   7  8  9
```

Other improvements

- Clean up InternalFrame and around anchor. (1601)
- Fix DataFrame.iteritems to return a generator (1602)
- Clean up groupby to use the anchor. (1610)

1.0

Firstly, assigning a Series created from a different DataFrame is now supported; missing rows are filled with NaN, as the example below shows.


```python
>>> import databricks.koalas as ks
>>>
>>> kdf = ks.range(5)
>>> kdf['new_col'] = ks.Series([1, 2, 3, 4])
>>> kdf
   id  new_col
0   0      1.0
1   1      2.0
3   3      4.0
2   2      3.0
4   4      NaN
```

Secondly, we also introduced a default index and internally disallowed Koalas DataFrames without an index (639, 655). For example, if you create a Koalas DataFrame from a Spark DataFrame, the default index is used. The default index implementation can be configured by setting `DEFAULT_INDEX` to one of three types:

- (default) `one-by-one`: It implements a one-by-one sequence by a Window function without
specifying a partition. This index type should be avoided when the data is large.

```python
>>> ks.range(3)
   id
0   0
1   1
2   2
```

- `distributed-one-by-one`: It implements a one-by-one sequence by a group-by and
group-map approach. It still generates a globally sequential one-by-one index.
If the default index must be a one-by-one sequence in a large dataset, this
index can be used.

```python
>>> ks.range(3)
   id
0   0
1   1
2   2
```

- `distributed`: It implements a monotonically increasing sequence simply by using
Spark's `monotonically_increasing_id` function. If the index does not have to be
a one-by-one sequence, this index can be used. Performance-wise, this index has
almost no penalty compared to the other index types.

```python
>>> ks.range(3)
             id
25769803776   0
60129542144   1
94489280512   2
```

Thirdly, we implemented many plot APIs in Series as follows:

- plot.pie() (669)
- plot.area() (670)
- plot.line() (671)
- plot.barh() (673)

See the example below:

```python
import databricks.koalas as ks

ks.range(10).to_pandas().id.plot.pie()
```

![image](https://user-images.githubusercontent.com/6477701/63404049-aa7da480-c41c-11e9-9472-f33e5c302dc6.png)

Fourthly, we continued to rapidly improve multi-index columns support. Multi-index columns are now supported in multiple APIs (a short sketch follows the list):

- `DataFrame.sort_index()`(637)
- `GroupBy.diff()`(653)
- `GroupBy.rank()`(653)
- `Series.any()`(652)
- `Series.all()`(652)
- `DataFrame.any()`(652)
- `DataFrame.all()`(652)
- `DataFrame.assign()`(657)
- `DataFrame.drop()`(658)
- `DataFrame.reindex()`(659)
- `Series.quantile()`(663)
- `Series.transform()`(663)
- `DataFrame.select_dtypes()`(662)
- `DataFrame.transpose()`(664)
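A brief sketch of multi-index columns in action (data illustrative; output formatting approximate):

```python
>>> import pandas as pd
>>> import databricks.koalas as ks
>>> kdf = ks.DataFrame({"a": [1, 2], "b": [3, 4]})
>>> kdf.columns = pd.MultiIndex.from_tuples([("x", "a"), ("x", "b")])
>>> kdf.sort_index()
   x
   a  b
0  1  3
1  2  4
```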


Lastly, we added new functionalities in the past weeks, especially groupby-related ones. We added the following features:

koalas.DataFrame:

- duplicated() (569)
- fillna() (640)
- bfill() (640)
- pad() (640)
- ffill() (640)

koalas.groupby.GroupBy:

- diff() (622)
- nunique() (617)
- nlargest() (654)
- nsmallest() (654)
- idxmax() (649)
- idxmin() (649)

Along with the following improvements:

- Add a basic infrastructure for configurations. (645)
- Always use `column_index`. (648)
- Allow omitting the type hint in GroupBy.transform, filter, and apply (646)
