Releases · cleanlab/cleanlab

v2.9.0 - Dependency Modernization and Simplified Maintenance

This release streamlines cleanlab's dependencies by removing TensorFlow/Keras support, which reduces the maintenance burden and improves compatibility with the modern Python ecosystem. It is a breaking release for users who rely on cleanlab.models.keras: they will need to migrate to PyTorch, an alternative model wrapper, or a custom implementation.
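
For readers weighing the "custom implementation" route, the sketch below shows the scikit-learn-style interface (fit, predict, predict_proba, plus get_params/set_params for cloning) that a replacement model would typically need to expose. The class name CentroidClassifier, its temperature parameter, and the toy data are all hypothetical and exist only for illustration; this is not cleanlab code and not a drop-in replacement for the removed Keras wrappers.

```python
import numpy as np


class CentroidClassifier:
    """Hypothetical stand-in model exposing a scikit-learn-style interface."""

    def __init__(self, temperature=1.0):
        # Store constructor arguments unchanged so the estimator can be cloned.
        self.temperature = temperature

    def get_params(self, deep=True):
        return {"temperature": self.temperature}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # One centroid per class, in the order given by classes_.
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        X = np.asarray(X, dtype=float)
        # Squared distance from every example to every centroid: shape (N, K).
        dists = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        logits = -dists / self.temperature
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        exp = np.exp(logits)
        return exp / exp.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]


# Toy data, made up for this example.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
model = CentroidClassifier().fit(X, y)
pred_probs = model.predict_proba(X)  # one row of class probabilities per example
```

An object with this interface could then be passed as the clf argument of cleanlab's CleanLearning, in the same way as any scikit-learn classifier; a real migration would wrap an actual PyTorch or other model behind the same methods.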

What's Changed

Full Changelog: v2.8.0...v2.9.0

v2.8.0 - Python 3.12-3.14 support and compatibility improvements

This release updates Python version support to 3.10-3.14, resolves compatibility issues with Datasets 4.0.0+, relaxes NumPy version constraints for better flexibility, and streamlines documentation. The update ensures cleanlab works with current and future Python releases while maintaining backward compatibility for supported versions.

What's Changed

Full Changelog: v2.7.1...v2.8.0

v2.7.1 -- New issue manager and improved docs

This release is non-breaking when upgrading from v2.7.0 and focuses mostly on documentation and testing improvements. The most notable update is:

  • New identifier column issue manager – detects sequential numerical columns that might influence your model. This feature is available as a preview and requires additional setup to use with Datalab.
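
To make "sequential numerical column" concrete, here is a minimal sketch of one way such a column could be recognized: its values, once sorted, increase in steps of exactly 1. The function name is_sequential_column is hypothetical, and this is an illustration of the idea only, not the logic of cleanlab's issue manager.

```python
import numpy as np


def is_sequential_column(column) -> bool:
    """Return True if the sorted values are integers that increase by exactly 1."""
    values = np.asarray(column, dtype=float)
    if values.size < 2 or not np.all(np.isfinite(values)):
        return False
    if not np.all(values == np.round(values)):
        return False  # non-integer values cannot be a row-id style sequence
    return bool(np.all(np.diff(np.sort(values)) == 1))


row_ids = [3, 1, 2, 4]          # shuffled, but a complete run 1..4
ages = [31, 45, 31, 52]         # ordinary numeric feature
print(is_sequential_column(row_ids), is_sequential_column(ages))
```

A column that passes a check like this often encodes row order rather than a real feature, which is why it is worth reviewing before training.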

Other Updates:

  • 📖 Docs & Readme: Improved clarity.
  • 🛠 Test suite: More stability and consistency.

What's Changed

New Contributors

Full Changelog: v2.7.0...v2.7.1

v2.7.0 -- Broadening Data Quality Checks and ML Workflows

This release introduces new features and improvements aimed at helping users detect complex dataset issues and improve their ML models' robustness. As always, we maintain backward compatibility, making this release non-breaking when upgrading from v2.6.6. We continue to support Python 3.8-3.11 in this version, but support for Python 3.8 will be dropped in a future minor release.

Introducing Spurious Correlation Detection in Datalab

With this release, Datalab now detects spurious correlations in image datasets by default, helping users identify potentially misleading patterns that may lead to overfitting or reduced model generalization.

Spurious correlations occur when models pick up on patterns in the data that are coincidental rather than meaningful. For example, a model might incorrectly associate the background color with a particular label, leading to poor generalization on new data. Identifying these correlations helps ensure more reliable models by minimizing the risk of learning from irrelevant or misleading features.
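
The idea can be illustrated with a small, fully synthetic example: if a simple image property (here a made-up brightness value) tracks the label almost perfectly, a model can learn the property instead of the content. The data, the property names, and the helper property_label_correlation are assumptions for illustration; Datalab computes its own scores, which are not reproduced here.

```python
import numpy as np

# Synthetic, hand-made data (not from a real dataset): 100 images, two classes.
idx = np.arange(100)
labels = (idx >= 50).astype(int)

# Hypothetical per-image property: class 1 images happen to be much brighter.
brightness = 0.2 + 0.6 * labels + 0.01 * (idx % 5)
# A second hypothetical property that is balanced across both classes.
blurriness = (idx % 2).astype(float)


def property_label_correlation(prop, labels):
    """Absolute Pearson correlation between an image property and a binary label."""
    return float(abs(np.corrcoef(prop, labels)[0, 1]))


brightness_corr = property_label_correlation(brightness, labels)
blurriness_corr = property_label_correlation(blurriness, labels)
print(f"brightness: {brightness_corr:.3f}, blurriness: {blurriness_corr:.3f}")
```

In this toy setup brightness is almost perfectly correlated with the label while blurriness is not, so brightness is the property that would deserve a closer look.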

Detecting spurious correlations in image datasets is straightforward:

from cleanlab import Datalab

lab = Datalab(data=image_dataset, label_name="label_column", image_key="image_column")
lab.find_issues()
lab.report()

You can find a more detailed workflow for finding spurious correlations in our documentation.

This new issue type aims to give users deeper insights into their data, enabling more robust model development.

New Tutorial: Improving ML Performance with Train and Test Set Curation

We've introduced a new tutorial that demonstrates how to carefully use cleanlab (via Datalab) for both training and test data. This approach helps ensure reliable ML model training and evaluation, particularly for noisy datasets.

You can find this tutorial in our documentation: Improving ML Performance via Data Curation with Train vs Test Splits.

Other Major Improvements

  • Optimized Internal Functions: Several internal functions have been optimized, including clip_noise_rates, remove_noise_from_class, and clip_values, improving cleanlab's overall efficiency.
  • Improved Underperforming Group Detection: Enhanced scoring for all underperforming groups, providing more accurate identification of problematic data subsets.

If you have ideas for new features or notice any bugs, we encourage you to open an Issue or Pull Request on our GitHub repository!

Change Log

Significant changes in this release include:

New Contributors

For a full list of changes, enhancements, and fixes, please refer to the Full Changelog.

v2.6.6

What's Changed

Full Changelog: v2.6.5...v2.6.6

v2.6.5

What's Changed

  • Add end-to-end tests at the end of Datalab quickstart tutorial by @allincowell in #1118
  • Centralize existing functionality for constructing and correcting knn graphs in a separate module by @elisno in #1117, #1119, #1129
  • Optimize multiannotator.py for performance by @gogetron in #1077
  • Optimize value_counts function for performance improvement with missing classes by @gogetron in #1073
  • Improve test coverage for setting confident joint in CleanLearning by @elisno in #1123
  • Switch from np.isnan to pd.isna for null value check by @gogetron in #1096
  • Update pip install instruction in object detection tutorial by @elisno in #1126
  • Refine handling of underperforming_group issue type by @gogetron in #1099
  • Improve compatibility with sklearn 1.5 by removing the deprecated multi_class argument in LogisticRegression by @elisno in #1124
  • Display exact duplicate sets dynamically in tabular tutorial by @nelsonauner in #1128
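
As an illustration of the "missing classes" case mentioned in the value_counts item above, the sketch below counts examples per class while still reporting a zero for classes that never appear in the labels. The helper count_labels_per_class is hypothetical and only shows the idea using NumPy's bincount; it is not cleanlab's internal value_counts function.

```python
import numpy as np


def count_labels_per_class(labels, num_classes):
    """Count examples per class, reporting 0 for classes absent from `labels`."""
    labels = np.asarray(labels, dtype=int)
    # minlength pads the result so every class index 0..num_classes-1 gets an entry.
    return np.bincount(labels, minlength=num_classes)


# Classes 1 and 4 never occur in these labels.
counts = count_labels_per_class([0, 2, 2, 3], num_classes=5)
print(counts.tolist())  # [1, 0, 2, 1, 0]
```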

New Contributors

Full Changelog: v2.6.4...v2.6.5

v2.6.4

What's Changed

New Contributors

Full Changelog: v2.6.3...v2.6.4

v2.6.3 - Enhanced scores for outliers and near-duplicates

This release is non-breaking when upgrading from v2.6.2.

What's Changed

  • Updated image_key documentation by @sanjanag in #1048
  • Refine Scoring and Enhance Stability for Datasets with Identical Examples by @elisno in #1056
  • Add warning message about TensorFlow compatibility to docs by @elisno in #1057

Full Changelog: v2.6.2...v2.6.3

v2.6.2

This release is non-breaking when upgrading from v2.6.1.

What's Changed

  • Convert DataFrame features to numpy arrays in null value check by @elisno in #1045

Full Changelog: v2.6.1...v2.6.2

v2.6.1 -- Refined Regression Score and Fixes

This release is non-breaking when upgrading from v2.6.0. Some noteworthy updates include:

  1. The label quality score in the cleanlab.regression module is now more human-readable.
    • The change only rescales the scores to a more human-interpretable range; it does not affect how your data points are ranked within a dataset according to these scores.
  2. Some edge cases in Datalab.get_issues() are handled better.
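
The claim in item 1 that rescaling leaves rankings untouched holds for any strictly increasing transformation. The sketch below checks this on made-up scores; the rescale function and its power parameter are hypothetical and are not the transformation cleanlab actually applies.

```python
import numpy as np

# Made-up raw scores squeezed into a narrow, hard-to-read range.
raw_scores = np.array([0.91, 0.97, 0.99, 0.95])


def rescale(scores, power=20):
    """Hypothetical strictly increasing rescaling on [0, 1]."""
    return scores ** power


rescaled = rescale(raw_scores)

# Sorting by either version of the scores yields the same order of data points.
same_ranking = np.array_equal(np.argsort(raw_scores), np.argsort(rescaled))
print(np.round(rescaled, 3), same_ranking)
```

The rescaled values are spread over a much wider range, which makes them easier to read, while the ordering of the data points is identical.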

What's Changed

New Contributors

Full Changelog: v2.6.0...v2.6.1