Carefree-data

0.1.7

Release Notes
  
  `TabularData`
  🎉🎉 `TabularData` now supports time series datasets! 🎉🎉
  
  > For detailed examples, please refer to the time series examples in [`carefree-learn`](https://github.com/carefree0910/carefree-learn/tree/dev/examples/time_series)!

0.1.6

Miscellaneous fixes and updates.

0.1.5

Release Notes
  
  `TabularData`
  
  + `skip_first` is replaced by `has_column_names`, so `column_names` will now be extracted from `csv` files automatically.
  + `quote_char` is introduced. For instance, `"` is the default `quote_char` for `csv` files.
  
  Here's an example:
  
  `quote_test.csv`:
  
  ```text
  f1,f2,f3,f4,f5
  1,"2, 3",4","5,6
  ```
  
  Here's the unittest:
  
  ```python
  data = TabularData().read("quote_test.csv")
  self.assertDictEqual(data.column_names, {0: "f1", 1: "f2", 2: "f3", 3: "f4", 4: "f5"})
  self.assertListEqual(data.raw.x[0], ["1", '"2, 3"', '4"', '"5'])
  self.assertListEqual(data.raw.y[0], ["6"])
  ```
  
  `read_file`
  `read_file` now supports the `contains_labels` argument.
  
  Breaking changes
  
  The meaning of `label_idx` has changed. Previously, `None` meant that the current data file does not contain a label column. Now, `contains_labels` has been introduced as a substitute, and `None` means that we don't know the exact `label_idx` and have to infer it from other information.
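  
  A minimal sketch of the new semantics (the `unlabeled.csv` file name is hypothetical, and we assume `read` forwards `contains_labels` down to `read_file`):
  
  ```python
  from cfdata.tabular import TabularData
  
  # the file is known to contain no label column, so we state that explicitly
  # (assumption: `read` forwards `contains_labels` to `read_file`)
  unlabeled = TabularData().read("unlabeled.csv", contains_labels=False)
  # whereas leaving `label_idx` as `None` now only means "infer the label column",
  # not "there is no label column" (the old behaviour)
  labeled = TabularData().read("quote_test.csv")
  ```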

0.1.4

Miscellaneous fixes and updates.

0.1.3

Miscellaneous fixes and updates.

0.1.2

Release Notes
  
  `DataSplitter`
  Util class for dividing a dataset based on its task type
  > If it's a regression task, splitting the data is simple
  > If it's a classification task, we need to split the data based on its labels, because we need to ensure that the split data contains all of the available labels
  
  Examples
  ```python
  import numpy as np
  
  from cfdata.types import np_int_type
  from cfdata.tabular.types import TaskTypes
  from cfdata.tabular.wrapper import TabularDataset
  from cfdata.tabular.utils import DataSplitter
  
  x = np.arange(12).reshape([6, 2])
  # create an imbalanced dataset
  y = np.zeros(6, np_int_type)
  y[[-1, -2]] = 1
  dataset = TabularDataset.from_xy(x, y, TaskTypes.CLASSIFICATION)
  data_splitter = DataSplitter().fit(dataset)
  # labels in `result` will keep their ratio
  result = data_splitter.split(3)
  # [0 0 1]
  print(result.dataset.y.ravel())
  data_splitter.reset()
  result = data_splitter.split(0.5)
  # [0 0 1]
  print(result.dataset.y.ravel())
  # at least one sample of each class will be kept
  y[-2] = 0
  dataset = TabularDataset.from_xy(x, y, TaskTypes.CLASSIFICATION)
  data_splitter = DataSplitter().fit(dataset)
  result = data_splitter.split(2)
  # [0 0 0 0 0 1] [0 1]
  print(y, result.dataset.y.ravel())
  ```
  
  
  `KFold`
  Util class which can perform k-fold data splitting:
  1. X = {x1, x2, ..., xn} -> [X1, X2, ..., Xk]
  2. Xi ∩ Xj = ∅, ∀ i ≠ j, where i, j = 1, ..., k
  3. X1 ∪ X2 ∪ ... ∪ Xk = X
  
  > Notice that `KFold` does not always follow the principles listed above, because `DataSplitter` will ensure that at least one sample of each class is kept. Therefore, when we apply `KFold` to an imbalanced dataset, `KFold` may slightly violate principles 2 and 3.
  
  Parameters
  + k : int, number of folds
  + dataset : TabularDataset, dataset which we want to split
  + **kwargs : used to initialize `DataSplitter` instance
  
  Examples
  ```python
  import numpy as np
  
  from cfdata.types import np_int_type
  from cfdata.tabular.types import TaskTypes
  from cfdata.tabular.wrapper import TabularDataset
  from cfdata.tabular.utils import KFold
  
  x = np.arange(12).reshape([6, 2])
  # create an imbalanced dataset
  y = np.zeros(6, np_int_type)
  y[[-1, -2]] = 1
  dataset = TabularDataset.from_xy(x, y, TaskTypes.CLASSIFICATION)
  k_fold = KFold(3, dataset)
  for train_fold, test_fold in k_fold:
      print(np.vstack([train_fold.dataset.x, test_fold.dataset.x]))
      print(np.vstack([train_fold.dataset.y, test_fold.dataset.y]))
  ```
  
  
  `KRandom`
  Util class which can perform k-random data splitting:
  1. X = {x1, x2, ..., xn} -> [X1, X2, ..., Xk]
  2. idx{X1} ≠ idx{X2} ≠ ... ≠ idx{Xk}, where idx{X} = {1, 2, ..., n}
  3. X1 = X2 = ... = Xk = X
  
  Parameters
  + k : int, number of folds
  + num_test : {int, float}
    + if float and < 1 : ratio of the test dataset
    + if int and > 1 : exact number of test samples
  + dataset : TabularDataset, dataset which we want to split
  + **kwargs : used to initialize `DataSplitter` instance
  
  Examples
  ```python
  import numpy as np
  
  from cfdata.types import np_int_type
  from cfdata.tabular.types import TaskTypes
  from cfdata.tabular.wrapper import TabularDataset
  from cfdata.tabular.utils import KRandom
  
  x = np.arange(12).reshape([6, 2])
  # create an imbalanced dataset
  y = np.zeros(6, np_int_type)
  y[[-1, -2]] = 1
  dataset = TabularDataset.from_xy(x, y, TaskTypes.CLASSIFICATION)
  k_random = KRandom(3, 2, dataset)
  for train_fold, test_fold in k_random:
      print(np.vstack([train_fold.dataset.x, test_fold.dataset.x]))
      print(np.vstack([train_fold.dataset.y, test_fold.dataset.y]))
  ```
  
  
  `ImbalancedSampler`
  Util class which can sample an imbalanced dataset in a balanced way
  
  Parameters
  + data : TabularData, data which we want to sample from
  + imbalance_threshold : float
    + for binary classification, if n_pos / n_neg < threshold, we'll treat the data as imbalanced
    + for multi-class classification, if n_min_class / n_max_class < threshold, we'll treat the data as imbalanced
  + shuffle : bool, whether to shuffle the returned indices
  + sample_method : str, sampling method used in `cftool.misc.Sampler`
    + currently only 'multinomial' is supported
  + verbose_level : int, verbose level used in `LoggingMixin`
  
  Examples
  ```python
  import numpy as np
  
  from cfdata.types import np_int_type
  from cfdata.tabular import TabularData
  from cfdata.tabular.utils import ImbalancedSampler
  from cftool.misc import get_counter_from_arr
  
  n = 20
  x = np.arange(2 * n).reshape([n, 2])
  # create an imbalanced dataset
  y = np.zeros([n, 1], np_int_type)
  y[-1] = [1]
  data = TabularData().read(x, y)
  sampler = ImbalancedSampler(data)
  # Counter({1: 12, 0: 8})
  # This may vary, but the result will be rather balanced
  # You might notice that positive samples even outnumber negative samples!
  print(get_counter_from_arr(y[sampler.get_indices()]))
  ```
  
  
  `DataLoader`
  Util class which can generate batches from an `ImbalancedSampler`
  
  Examples
  ```python
  import numpy as np
  
  from cfdata.types import np_int_type
  from cfdata.tabular import TabularData
  from cfdata.tabular.utils import DataLoader
  from cfdata.tabular.utils import ImbalancedSampler
  from cftool.misc import get_counter_from_arr
  
  n = 20
  x = np.arange(2 * n).reshape([n, 2])
  y = np.zeros([n, 1], np_int_type)
  y[-1] = [1]
  data = TabularData().read(x, y)
  sampler = ImbalancedSampler(data)
  loader = DataLoader(16, sampler)
  y_batches = []
  for x_batch, y_batch in loader:
      y_batches.append(y_batch)
      # (16, 1) (16, 1)
      # (4, 1) (4, 1)
      print(x_batch.shape, y_batch.shape)
  # Counter({1: 11, 0: 9})
  print(get_counter_from_arr(np.vstack(y_batches).ravel()))
  ```

0.1.1

First release