TensorFlow Data Validation

Latest version: v1.15.1

0.21.0

Major Features and Improvements

* Started depending on the CSV parsing / type-inferring utilities provided
by `tfx-bsl` (since tfx-bsl 0.15.2). This also brings performance improvements
to the CSV decoder (~2x faster decoding; type-inference performance is
unaffected).
* Compute bytes statistics for features of BYTES type. Avoid computing topk and
uniques for such features.
* Added LiftStatsGenerator, which computes lift between one feature (typically a
label) and all other categorical features.
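
A minimal sketch of enabling lift statistics, assuming the `label_feature`
option on `StatsOptions` (together with a schema marking the other features as
categorical) is what activates the new generator; the feature name and file
paths are illustrative:

    # Hedged sketch: compute lift between a label and the categorical features.
    import tensorflow_data_validation as tfdv

    schema = tfdv.load_schema_text('schema.pbtxt')   # pre-existing schema
    stats_options = tfdv.StatsOptions(
        schema=schema,
        label_feature='label',  # lift is computed against this feature
    )
    stats = tfdv.generate_statistics_from_tfrecord(
        data_location='train.tfrecord', stats_options=stats_options)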

Bug Fixes and Other Changes

* Exclude examples in which the entire sparse feature is missing when
calculating sparse feature statistics.
* Validate min_examples_count dataset constraint.
* Document the schema fields, statistics fields, and detection condition for
each anomaly type that TFDV detects.
* Handle null array in cross feature stats generator, top-k & uniques combiner
stats generator, and sklearn mutual information generator.
* Handle infinity in basic stats generator.
* Set num_missing and num_examples correctly in the presence of sparse
features.
* Compute weighted feature stats for all weighted features declared in schema.
* Enforce that mutual information is non-negative.
* Depends on `tensorflow-metadata>=0.21.0,<0.22`.
* Depends on `pyarrow>=0.15` (removed the upper bound as it is determined by
`tfx-bsl`).
* Depends on `tfx-bsl>=0.21.0,<0.22`.
* Depends on `apache-beam>=2.17,<3`.
* Validate that float feature does not contain NaNs (if disallow_nan is True).
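
A hedged sketch of the NaN check above, assuming the schema's `FloatDomain`
carries a `disallow_nan` flag; the feature name and paths are illustrative:

    # Hedged sketch: reject NaNs in a FLOAT feature during validation.
    import tensorflow_data_validation as tfdv

    schema = tfdv.load_schema_text('schema.pbtxt')
    price = tfdv.get_feature(schema, 'price')   # a FLOAT feature (illustrative)
    price.float_domain.disallow_nan = True      # assumed tensorflow-metadata field

    stats = tfdv.load_statistics('train_stats.tfrecord')
    anomalies = tfdv.validate_statistics(stats, schema)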

Breaking Changes

* Changed the behavior regarding statistics over CSV data:

- Previously, if a CSV column contained a mix of integers and empty strings,
FLOAT statistics were collected for that column. Now INT statistics are
collected instead.

* Removed `csv_decoder.DecodeCSVToDict`, as `Dict[str, np.ndarray]` has not been
the internal data representation since 0.14.

Deprecations

0.15.0

Major Features and Improvements

* Generate statistics for sparse features.
* Directly convert a batch of tf.Examples to Arrow tables, avoiding conversion
of tf.Example to an intermediate Dict representation.

Bug Fixes and Other Changes

* Generate statistics for the weight feature.
* Support validation and schema inference from sliced statistics that include
the default slice (validation/inference will be done using the default slice
statistics).
* Avoid flattening null arrays.
* Set the `weighted_num_examples` field in the statistics proto if a weight
feature is specified (see the sketch after this list).
* Replace DecodedExamplesToTable with a Python implementation.
* Building TFDV from source no longer requires pyarrow.
* Depends on `apache-beam[gcp]>=2.16,<3`.
* Depends on `six>=1.12,<2`.
* Depends on `scikit-learn>=0.18,<0.22`.
* Depends on `tfx-bsl>=0.15,<0.16`.
* Depends on `tensorflow-metadata>=0.15,<0.16`.
* Depends on `tensorflow-transform>=0.15,<0.16`.
* Depends on `tensorflow>=1.15,<3`.
  * Starting from 1.15, the `tensorflow` package comes with GPU support, so
    users no longer need to choose between `tensorflow` and `tensorflow-gpu`.
  * Caveat: `tensorflow` 2.0.0 is an exception and does not have GPU support.
    If `tensorflow-gpu` 2.0.0 is installed before installing
    `tensorflow-data-validation`, it will be replaced with `tensorflow` 2.0.0.
    Re-install `tensorflow-gpu` 2.0.0 if needed.
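
A small sketch of the weighted statistics mentioned above, using
`StatsOptions.weight_feature`; the feature name and path are illustrative:

    # Hedged sketch: weight each example by a per-example weight feature.
    import tensorflow_data_validation as tfdv

    stats_options = tfdv.StatsOptions(weight_feature='example_weight')
    stats = tfdv.generate_statistics_from_tfrecord(
        data_location='train.tfrecord', stats_options=stats_options)
    # weighted_num_examples (and weighted per-feature stats) should now be
    # populated alongside the unweighted counts.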

Breaking Changes

Deprecations

0.14.1

Major Features and Improvements

* Add support for custom schema transformations when inferring schema.
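
A hedged sketch of the hook above, assuming `tfdv.infer_schema` accepts a
`schema_transformations` list of callables with signature
`(schema, statistics) -> schema`; the transformation shown is illustrative:

    # Hedged sketch: post-process the inferred schema with a custom transformation.
    import tensorflow_data_validation as tfdv

    def relax_presence(schema, statistics):
        # Illustrative transformation: make every feature fully optional.
        for feature in schema.feature:
            feature.presence.min_fraction = 0.0
        return schema

    stats = tfdv.load_statistics('train_stats.tfrecord')
    schema = tfdv.infer_schema(stats, schema_transformations=[relax_presence])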

Bug Fixes and Other Changes

* Fix incorrect file hashes in the TFDV wheel.
* Fix DOMException when embedding visualization in iframe.

Breaking Changes

Deprecations

0.14.0

Major Features and Improvements

* Performance improvement due to optimizing inner loops.
* Add support for time semantic domain related statistics.
* Performance improvement due to batching accumulators before merging.
* Add utility method `validate_examples_in_tfrecord`, which identifies anomalous
examples in TFRecord files containing tf.Examples and generates statistics for
those anomalous examples (see the sketch after this list).
* Add utility method `validate_examples_in_csv`, which identifies anomalous
examples in CSV files and generates statistics for those anomalous examples.
* Add fast TF example decoder written in C++.
* Make `BasicStatsGenerator` take an Arrow table as input. Example batches are
converted to Apache Arrow tables internally, which allows the use of vectorized
numpy functions. This improved the performance of BasicStatsGenerator by ~40x.
* Make `TopKUniquesStatsGenerator` and `TopKUniquesCombinerStatsGenerator` take
an Arrow table as input.
* Add `update_schema` API which updates the schema to conform to statistics.
* Add support for validating changes in the number of examples between the
current and previous spans of data (using the existing `validate_statistics`
function).
* Support building a manylinux2010 compliant wheel in docker.
* Add support for cross feature statistics.
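
A hedged sketch combining several of the utilities above: validating examples
in a TFRecord file, updating the schema to match new statistics, and comparing
against a previous span. Parameter names such as `previous_statistics`, the
file paths, and the span layout are assumptions for illustration:

    # Hedged sketch: anomalous-example validation, schema update, span comparison.
    import tensorflow_data_validation as tfdv

    schema = tfdv.load_schema_text('schema.pbtxt')
    stats_options = tfdv.StatsOptions(schema=schema)

    # Statistics computed only over the anomalous examples in the file.
    anomalous_stats = tfdv.validate_examples_in_tfrecord(
        data_location='train.tfrecord', stats_options=stats_options)

    # Relax the schema so it conforms to the statistics of a new data span.
    new_stats = tfdv.generate_statistics_from_tfrecord(data_location='span2.tfrecord')
    updated_schema = tfdv.update_schema(schema, new_stats)

    # Flag changes in the number of examples relative to the previous span.
    previous_stats = tfdv.load_statistics('span1_stats.tfrecord')
    anomalies = tfdv.validate_statistics(
        statistics=new_stats, schema=updated_schema,
        previous_statistics=previous_stats)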

Bug Fixes and Other Changes

* Expand unit test coverage.
* Update natural language stats generator to generate stats if actual ratio
equals `match_ratio`.
* Use `__slots__` in accumulators.
* Fix overflow warning when generating numeric stats for large integers.
* Set the max value count in the schema when a feature has the same valency
across examples, thereby inferring the shape of multivalent required features.
* Fix divide by zero error in natural language stats generator.
* Add `load_anomalies_text` and `write_anomalies_text` utility functions.
* Define ReasonFeatureNeeded proto.
* Add support for Windows OS.
* Make semantic domain stats generators take an Arrow column as input.
* Fix error in number of missing examples and total number of examples
computation.
* Make FeaturesNeeded serializable.
* Fix memory leak in fast example decoder.
* Add `semantic_domain_stats_sample_rate` option to compute semantic domain
statistics over a sample.
* Increment refcount of None in fast example decoder.
* Add `compression_type` option to `generate_statistics_from_*` methods.
* Add link to SysML paper describing some technical details behind TFDV.
* Add Python types to the source code.
* Make `GenerateStatistics` generate a DatasetFeatureStatisticsList containing a
dataset with num_examples == 0, instead of an empty proto, if there are no
examples in the input.
* Depends on `absl-py>=0.7,<1`.
* Depends on `apache-beam[gcp]>=2.14,<3`.
* Depends on `numpy>=1.16,<2`.
* Depends on `pandas>=0.24,<1`.
* Depends on `pyarrow>=0.14.0,<0.15.0`.
* Depends on `scikit-learn>=0.18,<0.21`.
* Depends on `tensorflow-metadata>=0.14,<0.15`.
* Depends on `tensorflow-transform>=0.14,<0.15`.

Breaking Changes

* Change `examples_threshold` to `values_threshold` and update documentation to
clarify that the thresholds count values (not examples) in semantic domain
stats generators.
* Refactor IdentifyAnomalousExamples to remove sampling and output
(anomaly reason, example) tuples.
* Rename `anomaly_proto` parameter in anomalies utilities to `anomalies` to
make it more consistent with proto and schema utilities.
* `FeatureNameStatistics` produced by `GenerateStatistics` is now identified
by its `.path` field instead of the `.name` field (see the lookup sketch after
this list). For example:

    feature {
      name: "my_feature"
    }

becomes:

    feature {
      path {
        step: "my_feature"
      }
    }

* Change `validate_instance` API to accept an Arrow table instead of a Dict.
* Change `GenerateStatistics` API to accept Arrow tables as input.
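
Since lookups by `.name` no longer apply, here is a minimal sketch of locating
a feature's statistics by its path, iterating the proto directly; the feature
name and path are illustrative:

    # Hedged sketch: find a feature's statistics via the new path field.
    import tensorflow_data_validation as tfdv

    stats = tfdv.load_statistics('train_stats.tfrecord')
    for feature_stats in stats.datasets[0].features:
        if list(feature_stats.path.step) == ['my_feature']:
            print(feature_stats)   # num_stats / string_stats, per the feature type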

Deprecations

0.13.1

Major Features and Improvements

Bug Fixes and Other Changes

* Modify validation logic to raise `SCHEMA_MISSING_COLUMN` anomaly when
observing a feature with no stats (the 0.13.0 change did not fully fix this;
it is now fixed).

Breaking Changes

Deprecations

0.13.0

Major Features and Improvements

* Use joblib to exploit multiprocessing when computing statistics over a pandas
dataframe.
* Add support for semantic domain related statistics (natural language, image),
enabled by `StatsOptions.enable_semantic_domain_stats` (see the sketch after
this list).
* Python 3.5 is supported.
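
A hedged sketch of the options above, assuming `enable_semantic_domain_stats`
is a `StatsOptions` field and `n_jobs` is the joblib knob on
`generate_statistics_from_dataframe`; the dataframe contents are illustrative:

    # Hedged sketch: semantic-domain stats over a dataframe with joblib parallelism.
    import pandas as pd
    import tensorflow_data_validation as tfdv

    df = pd.DataFrame({'review_text': ['great product', 'would not buy again']})
    stats_options = tfdv.StatsOptions(enable_semantic_domain_stats=True)
    stats = tfdv.generate_statistics_from_dataframe(
        df, stats_options=stats_options, n_jobs=4)  # n_jobs=1 (the default) avoids multiprocessing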

Bug Fixes and Other Changes

* Expand unit test coverage.
* Modify validation logic to raise `SCHEMA_MISSING_COLUMN` anomaly when
observing a feature with no stats.
* Add utility functions `write_stats_text` and `load_stats_text` to write and
load DatasetFeatureStatisticsList protos (see the sketch after this list).
* Avoid using multiprocessing by default when generating statistics over a
dataframe.
* Depends on `joblib>=0.12,<1`.
* Depends on `tensorflow-transform>=0.13,<0.14`.
* Depends on `tensorflow-metadata>=0.12.1,<0.14`.
* Requires pre-installed `tensorflow>=1.13.1,<2`.
* Depends on `apache-beam[gcp]>=2.11,<3`.
* Depends on `absl-py>=0.1.6,<1`.
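
A short sketch of the round-trip utilities mentioned above; the paths are
illustrative and the functions are assumed to be exposed at the top-level
`tfdv` namespace:

    # Hedged sketch: write statistics as a text proto and load them back.
    import tensorflow_data_validation as tfdv

    stats = tfdv.generate_statistics_from_tfrecord(data_location='train.tfrecord')
    tfdv.write_stats_text(stats, output_path='train_stats.pbtxt')
    reloaded = tfdv.load_stats_text(input_path='train_stats.pbtxt')
    assert reloaded == stats   # protobuf messages compare by value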

Breaking Changes

Deprecations
