Koalas

Latest version: v1.8.2


1.0.0

Better pandas API coverage

We implemented many APIs and features equivalent to pandas, such as plotting, grouping, windowing, I/O, and transformation; with Koalas 1.0.0, pandas API coverage is now close to 80%.

![](https://user-images.githubusercontent.com/6477701/85115870-59feef80-b257-11ea-9ae9-51e5ed1b54da.png)


0.33.0

`apply` and `transform` Improvements

We added support for positional/keyword arguments to `apply`, `apply_batch`, `transform`, and `transform_batch` in `DataFrame`, `Series`, and `GroupBy`. (1484, 1485, 1486)

```py
>>> import databricks.koalas as ks
>>> ks.range(10).apply(lambda a, b, c: a + b + c, args=(1,), c=3)
   id
0   4
1   5
2   6
3   7
4   8
5   9
6  10
7  11
8  12
9  13
```


```py
>>> ks.range(10).transform_batch(lambda pdf, a, b, c: pdf.id + a + b + c, 1, 2, c=3)
0     6
1     7
2     8
3     9
4    10
5    11
6    12
7    13
8    14
9    15
Name: id, dtype: int64
```


```py
>>> kdf = ks.DataFrame(
...     {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
...     columns=["a", "b", "c"])
>>> kdf.groupby(["a", "b"]).apply(lambda x, y, z: x + x.min() + y + z, 1, z=2)
    a   b   c
0   5   5   5
1   7   5  11
2   9   7  21
3  11   9  35
4  13  13  53
5  15  19  75
```


Spark Schema

We added `spark_schema` and `print_schema` to inspect the underlying Spark schema. (1446)

```py
>>> import numpy as np
>>> import pandas as pd
>>> kdf = ks.DataFrame({'a': list('abc'),
...                     'b': list(range(1, 4)),
...                     'c': np.arange(3, 6).astype('i1'),
...                     'd': np.arange(4.0, 7.0, dtype='float64'),
...                     'e': [True, False, True],
...                     'f': pd.date_range('20130101', periods=3)},
...                    columns=['a', 'b', 'c', 'd', 'e', 'f'])

>>> # Print the schema out in Spark's DDL-formatted string.
>>> kdf.spark_schema().simpleString()
'struct<a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'
>>> kdf.spark_schema(index_col='index').simpleString()
'struct<index:bigint,a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'

>>> # Print out the schema, the same as DataFrame.printSchema().
>>> kdf.print_schema()
root
 |-- a: string (nullable = false)
 |-- b: long (nullable = false)
 |-- c: byte (nullable = false)
 |-- d: double (nullable = false)
 |-- e: boolean (nullable = false)
 |-- f: timestamp (nullable = false)

>>> kdf.print_schema(index_col='index')
root
 |-- index: long (nullable = false)
 |-- a: string (nullable = false)
 |-- b: long (nullable = false)
 |-- c: byte (nullable = false)
 |-- d: double (nullable = false)
 |-- e: boolean (nullable = false)
 |-- f: timestamp (nullable = false)
```


GroupBy Improvements

We fixed many bugs in `GroupBy`, as listed below.

- Fix groupby when as_index=False. (1457)
- Make groupby.apply in pandas<0.25 run the function only once per group. (1462)
- Fix Series.groupby on the Series from different DataFrames. (1460)
- Fix GroupBy.head to recognize agg_columns. (1474)
- Fix GroupBy.filter to follow complex group keys. (1471)
- Fix GroupBy.transform to follow complex group keys. (1472)
- Fix GroupBy.apply to follow complex group keys. (1473)
- Fix GroupBy.fillna to use GroupBy._apply_series_op. (1481)
- Fix GroupBy.filter and apply to handle agg_columns. (1480)
- Fix GroupBy apply, filter, and head to ignore temp columns when ops are from different DataFrames. (1488)
- Fix GroupBy functions which need natural orderings to follow the order when ops are from different DataFrames. (1490)

Other new features and improvements

We added the following new feature:

SeriesGroupBy:

- `filter` (1483) (see the sketch below)
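
A minimal sketch of how the new `SeriesGroupBy.filter` can be used; the data here is hypothetical, and the semantics mirror pandas, keeping only the rows whose group satisfies the predicate:

```python
import databricks.koalas as ks

kdf = ks.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'], 'B': [1, 2, 3, 4]})

# Keep only values of B whose group (keyed by A) has a mean greater than 2:
# group 'foo' (values 1, 3) has mean 2.0 and is dropped;
# group 'bar' (values 2, 4) has mean 3.0 and is kept.
kdf.groupby('A')['B'].filter(lambda x: x.mean() > 2)
```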

Other improvements

- dtype for DateType should be np.dtype("object"). (1447)
- Make reset_index disallow the same name as an existing column but allow it when drop=True (see the sketch after this list). (1455)
- Fix named aggregation for MultiIndex. (1435)
- Raise ValueError where it previously was not raised. (1461)
- Fix get_dummies when it is used with a prefix parameter whose type is dict. (1478)
- Simplify DataFrame.columns setter. (1489)
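
A minimal sketch of the reset_index change above; the data is hypothetical, and the point is that resetting an index whose name collides with an existing column now fails unless the index is dropped:

```python
import pandas as pd
import databricks.koalas as ks

kdf = ks.DataFrame({'a': [1, 2]}, index=pd.Index([10, 20], name='a'))

# kdf.reset_index() would insert the index as a column named 'a', colliding
# with the existing column 'a', so it is now disallowed.

# With drop=True the index is discarded instead of inserted, so no collision:
kdf.reset_index(drop=True)
```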

0.32.0

Koalas documentation redesign

The Koalas documentation was redesigned with a better theme, [pydata-sphinx-theme](https://github.com/pandas-dev/pydata-sphinx-theme). Please check out [the new Koalas documentation site](https://koalas.readthedocs.io/en/latest/).

![](https://user-images.githubusercontent.com/6477701/80072722-97117300-8581-11ea-8739-140356df4576.png)

`transform_batch` and `apply_batch`

We added APIs that enable you to directly transform and apply a function against a Koalas Series or DataFrame. `map_in_pandas` was deprecated and renamed to `apply_batch`.

```python
import databricks.koalas as ks

kdf = ks.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def pandas_plus(pdf):
    return pdf + 1  # should always return the same length as the input.

kdf.transform_batch(pandas_plus)
```


```python
import databricks.koalas as ks

kdf = ks.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def pandas_plus(pdf):
    return pdf[pdf.a > 1]  # allow arbitrary length.

kdf.apply_batch(pandas_plus)
```


Please also check [Transform and apply a function](https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html) in Koalas documentation.


Other new features and improvements

We added the following new features:

DataFrame:

- `truncate` (1408)
- `hint` (1415)

SeriesGroupBy:

- `unique` (1426) (see the sketch below)
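
A short, hypothetical sketch of the new `SeriesGroupBy.unique`, which mirrors pandas by returning the unique values of each group:

```python
import databricks.koalas as ks

kdf = ks.DataFrame({'A': ['x', 'x', 'y', 'y'], 'B': [1, 1, 2, 3]})

# One entry per group; each value holds the unique B values in that group,
# e.g. group 'x' -> [1] and group 'y' -> [2, 3].
kdf.groupby('A')['B'].unique()
```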

Index:

- `spark_column` (1438)

Series:

- `spark_column` (1438)

MultiIndex:

- `spark_column` (1438) (see the sketch below)
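
A brief sketch of `spark_column`; the data is hypothetical, and this assumes it is exposed as a property returning the underlying PySpark `Column`:

```python
import databricks.koalas as ks

kser = ks.Series([1, 2, 3], name='x')

# The raw PySpark Column backing this Koalas Series; it can be passed to
# Spark SQL functions directly.
scol = kser.spark_column
print(type(scol))  # <class 'pyspark.sql.column.Column'>
```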


Other improvements

- Fix from_pandas to handle the same index name as a column name. (1419)
- Add documentation about non-Koalas APIs. (1420)
- Hot-fix the lack of the keyword argument 'deep' for DataFrame.copy(). (1423)
- Fix Series.div when dividing by zero. (1412)
- Support the expand parameter if n is a positive integer in Series.str.split/rsplit. (1432)
- Make Series.astype(bool) follow the concept of "truthy" and "falsey". (1431)
- Fix incompatible behaviour with pandas for floordiv with np.nan. (1429)
- Use mapInPandas for the apply_batch API in Spark 3.0. (1440)
- Use F.datediff() for subtraction of dates as a workaround. (1439)

0.31.0

PyArrow>=0.15 support is back

We added PyArrow>=0.15 support back (1110).

Note that, when working with `pyarrow>=0.15` and `pyspark<3.0`, Koalas will set the environment variable `ARROW_PRE_0_15_IPC_FORMAT=1` if it does not exist, per the instructions in [SPARK-29367](https://issues.apache.org/jira/browse/SPARK-29367). However, this will NOT work if a Spark context has already been launched; in that case, you have to manage the environment variable yourself.
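
For example, a minimal sketch of managing the variable yourself, set before anything launches a Spark context (depending on your deployment, you may also need to propagate it to executors, e.g. via `spark.executorEnv.*`):

```python
import os

# Koalas creates the Spark session lazily on first use, so setting this at
# the very top of the script, before any Spark context exists, is safest.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

import databricks.koalas as ks
```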

Spark specific improvements

Broadcast hint

We added a `broadcast` function to the top-level Koalas namespace (1360).

You can use it with `merge`, `join`, and `update`, which invoke a join operation in Spark. When you know that one of the DataFrames is small enough to fit in memory, the resulting broadcast join can be much more performant than a shuffle-based join.

For example,

```py
>>> merged = df1.merge(ks.broadcast(df2), left_on='lkey', right_on='rkey')
>>> merged.explain()
== Physical Plan ==
...
...BroadcastHashJoin...
...
```


persist function and storage level

We added a `persist` function to specify the storage level when caching (1381), and a `storage_level` property to check the current storage level (1385).

```py
>>> import pyspark

>>> with df.cache() as cached_df:
...     print(cached_df.storage_level)
...
Disk Memory Deserialized 1x Replicated

>>> with df.persist(pyspark.StorageLevel.MEMORY_ONLY) as cached_df:
...     print(cached_df.storage_level)
...
Memory Serialized 1x Replicated
```


Other new features and improvements

We added the following new features:

DataFrame:

- `to_markdown` (1377)
- `squeeze` (1389)

Series:

- `squeeze` (1389)
- `asof` (1366) (see the sketch below)
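
A brief, hypothetical sketch of `Series.asof`, which mirrors pandas by returning the last non-NaN value at or before the given index label:

```python
import numpy as np
import databricks.koalas as ks

kser = ks.Series([1.0, 2.0, np.nan, 4.0], index=[10, 20, 30, 40])

# The most recent non-NaN value at or before label 25 is at label 20.
kser.asof(25)  # -> 2.0
```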

Other improvements

- Add a way to specify the index column in I/O APIs (see the sketch after this list). (1379)
- Fix `iloc.__setitem__` with another Series from the same DataFrame. (1388)
- Add support for Series from different DataFrames in `loc/iloc.__setitem__`. (1391)
- Refine `__setitem__` for loc/iloc with DataFrame. (1394)
- Help misuse of the options argument. (1402)
- Add blog posts to the Koalas documentation. (1406)
- Fix mod & rmod to match pandas. (1399)
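
A minimal sketch of specifying an index column in the I/O APIs; the path is hypothetical, and `index_col` is the parameter added here:

```python
import databricks.koalas as ks

kdf = ks.DataFrame({'a': [1, 2, 3]})

# Write the index out as a regular column named 'index', then restore it
# as the index when reading back.
kdf.to_parquet('/tmp/koalas_example', index_col='index')
ks.read_parquet('/tmp/koalas_example', index_col='index')
```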
