Releases · cleanlab/cleanlab
v2.9.0 - Dependency Modernization and Simplified Maintenance
This release streamlines cleanlab's dependencies by removing TensorFlow/Keras support, which reduces maintenance burden and improves compatibility with the modern Python ecosystem. It is a breaking release for users who rely on cleanlab.models.keras; they will need to migrate to alternative model wrappers, custom implementations, or PyTorch.
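For users replacing cleanlab.models.keras with a custom implementation, the sketch below shows one possible shape for such a wrapper. It assumes the consuming code only needs scikit-learn-style fit, predict_proba, predict, get_params, and set_params methods. The class name CallableClassifier and the toy train_fn and predict_proba_fn helpers are hypothetical and not part of cleanlab; a real wrapper would typically return NumPy arrays from predict_proba.

```python
from collections import Counter


class CallableClassifier:
    """Hypothetical scikit-learn-style wrapper around user-supplied model code."""

    def __init__(self, train_fn=None, predict_proba_fn=None):
        # Constructor arguments are stored unchanged so the estimator can be cloned.
        self.train_fn = train_fn
        self.predict_proba_fn = predict_proba_fn

    def get_params(self, deep=True):
        return {"train_fn": self.train_fn, "predict_proba_fn": self.predict_proba_fn}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        # Delegate training to the framework-specific function (PyTorch, JAX, ...).
        self.model_ = self.train_fn(X, y)
        return self

    def predict_proba(self, X):
        # One row of class probabilities per example.
        return self.predict_proba_fn(self.model_, X)

    def predict(self, X):
        # Predicted class = index of the largest probability in each row.
        return [max(range(len(row)), key=row.__getitem__) for row in self.predict_proba(X)]


# Toy stand-in for a real model: always predicts the training class frequencies.
def train_fn(X, y):
    counts = Counter(y)
    return [counts[k] / len(y) for k in range(max(y) + 1)]


def predict_proba_fn(model, X):
    return [list(model) for _ in X]


clf = CallableClassifier(train_fn, predict_proba_fn).fit([[0.1], [0.4], [0.9]], [0, 0, 1])
print(clf.predict([[0.5]]))
```

Keeping the training logic in plain functions means the same wrapper can sit in front of whichever framework replaces Keras in a given project.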
What's Changed
- Remove Tensorflow from docs + code by @ulya-tkch in #1289
- Fix stale dependencies + add CI freshness checks by @ulya-tkch in #1291
- Make codebase compatible with latest dependencies by @ulya-tkch in #1292
- Update docs by @ulya-tkch in #1293
Full Changelog: v2.8.0...v2.9.0
v2.8.0 - Python 3.12-3.14 support and compatibility improvements
This release updates Python version support to 3.10-3.14, resolves compatibility issues with Datasets 4.0.0+, relaxes NumPy version constraints for better flexibility, and streamlines documentation. The update ensures cleanlab works with current and future Python releases while maintaining backward compatibility for supported versions.
What's Changed
- Update license by @jwmueller in #1263
- Drop Python versions 3.8 & 3.9 that have reached EOL by @elisno in #1268
- Extend package support Python 3.10-3.14 by @ulya-tkch in #1276
- Handle new Column types from Datasets 4.0.0+ in Datalab by @elisno in #1267
- Remove agility chat by @maxbuchan in #1250
- Update README.md links to Cleanlab blog by @maxbuchan in #1257
- Update documentation to remove Easy Mode sections and simplify README by @jwmueller in #1265, #1262
- Shorten improving ML tutorial by @jwmueller in #1266
- Fix docs for 2.8.0 release by @ulya-tkch in #1278
- Update docs landing page to remove pointers we no longer want by @jwmueller in #1280
- Shorten FAQ and Update support CTA at end of FAQ by @jwmueller in #1264, #1281
Full Changelog: v2.7.1...v2.8.0
v2.7.1 -- New issue manager and improved docs
This release is non-breaking when upgrading from v2.7.0 and focuses mostly on documentation and testing improvements. The most notable update is:
- New identifier column issue manager – detects sequential numerical columns that might influence your model. This feature is available as a preview and requires additional setup to use with Datalab.
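To illustrate what a "sequential numerical column" can look like, here is a minimal sketch of one possible heuristic. It is an assumption for illustration only, not the issue manager's actual logic, and the function name looks_like_identifier is hypothetical.

```python
def looks_like_identifier(column):
    """Return True if the integer values, once sorted, increase by exactly 1."""
    # Only integer columns are considered (bool is excluded even though it subclasses int).
    if len(column) < 2 or not all(
        isinstance(v, int) and not isinstance(v, bool) for v in column
    ):
        return False
    ordered = sorted(column)
    # A row-ID-like column has no gaps and no repeats between consecutive values.
    return all(b - a == 1 for a, b in zip(ordered, ordered[1:]))


print(looks_like_identifier([3, 1, 2, 4]))
print(looks_like_identifier([10, 20, 30]))
```

A column that passes a check like this is usually a row index or database key, so a model that learns from it is picking up on data ordering rather than on meaningful features.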
Other Updates:
- 📖 Docs & Readme: Improved clarity.
- 🛠 Test suite: More stability and consistency.
What's Changed
- Added issue manager for detecting identifier columns by @MaxJoas in #1120
- Revised non-IID section in Datalab tutorial to show overall dataset score first before per-example insights by @gordon-lim in #1221
- Fixed numpy2 compatibility by @GaetanLepage in #1224
- Updated CI to macos-13 instead of 12 by @jwmueller in #1219
- Test improvements by @gordon-lim in #1218; @jwmueller in #1220
- General updates to docs and docs build system by @misteroh in #1212, #1213, #1216; @maxbuchan in #1226, #1227; @elisno in #1208, #1231
New Contributors
- @MaxJoas made their first contribution in #1120
- @misteroh made their first contribution in #1212
- @GaetanLepage made their first contribution in #1224
- @maxbuchan made their first contribution in #1226
Full Changelog: v2.7.0...v2.7.1
v2.7.0 -- Broadening Data Quality Checks and ML Workflows
This release introduces new features and improvements aimed at helping users detect complex dataset issues and improve their ML models' robustness. As always, we maintain backward compatibility, making this release non-breaking when upgrading from v2.6.6. We continue to support Python 3.8-3.11 in this version, but support for Python 3.8 will be dropped in a future minor release.
Introducing Spurious Correlation Detection in Datalab
With this release, Datalab now detects spurious correlations in image datasets by default, helping users identify potentially misleading patterns that may lead to overfitting or reduced model generalization.
Spurious correlations occur when models pick up on patterns in the data that are coincidental rather than meaningful. For example, a model might incorrectly associate the background color with a particular label, leading to poor generalization on new data. Identifying these correlations helps ensure more reliable models by minimizing the risk of learning from irrelevant or misleading features.
Detecting spurious correlations in image datasets is straightforward:
    from cleanlab import Datalab

    lab = Datalab(data=image_dataset, label_name="label_column", image_key="image_column")
    lab.find_issues()
    lab.report()
You can find a more detailed workflow for finding spurious correlations in our documentation.
This new issue type aims to give users deeper insights into their data, enabling more robust model development.
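To make the background-color example concrete, the sketch below measures how well a single scalar image property predicts a binary label. This is a simplified illustration of the idea, not Datalab's actual algorithm; the function best_threshold_accuracy and the toy property values are hypothetical.

```python
def best_threshold_accuracy(values, labels):
    """Best accuracy of a one-threshold rule predicting a binary label from one scalar."""
    n = len(values)
    best = 0.0
    for t in sorted(set(values)):
        # Rule: predict label 1 when the property value is at least t.
        correct = sum((v >= t) == bool(y) for v, y in zip(values, labels))
        # Also consider the flipped rule (predict 1 when the value is below t).
        best = max(best, correct / n, 1 - correct / n)
    return best


labels = [0, 0, 0, 1, 1, 1]
# Toy data: brightness separates the classes perfectly, object size does not.
background_brightness = [0.10, 0.20, 0.15, 0.90, 0.80, 0.85]
object_size = [0.3, 0.7, 0.5, 0.3, 0.7, 0.5]

print(best_threshold_accuracy(background_brightness, labels))
print(best_threshold_accuracy(object_size, labels))
```

A property that predicts the label almost perfectly on its own, as brightness does in this toy data, is a candidate spurious correlation worth inspecting before training.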
New Tutorial: Improving ML Performance with Train and Test Set Curation
We've introduced a new tutorial that demonstrates how to carefully use cleanlab (via Datalab) for both training and test data. This approach helps ensure reliable ML model training and evaluation, particularly for noisy datasets.
You can find this tutorial in our documentation: Improving ML Performance via Data Curation with Train vs Test Splits.
Other Major Improvements
- Optimized Internal Functions: Several internal optimizations have been made, including updates to the clip_noise_rates, remove_noise_from_class, and clip_values functions, improving the overall efficiency of cleanlab.
- Improved Underperforming Group Detection: Enhanced scoring for all underperforming groups, providing more accurate identification of problematic data subsets.
If you have ideas for new features or notice any bugs, we encourage you to open an Issue or Pull Request on our GitHub repository!
Change Log
Significant changes in this release include:
- Added Spurious Correlation feature by @allincowell in #1140, #1171, #1181, #1194; @elisno in #1170, #1192, #1193, #1201; @jwmueller in #1195, #1196
- Added new CLOS train test split tutorial notebook by @mturk24 in #1071; @jwmueller in #1178
- Update links to Issue Type Guide in workflows tutorials by @elisno in #1168
- Optimize internal clip_noise_rates and remove_noise_from_class functions by @gogetron in #1105
- Optimize internal clip_values function by @gogetron in #1104
- Move models.fasttext wrapper to examples repo by @jwmueller in #1173
- Mypy fixes by @elisno in #1174
- Improve tests in Datalab Quickstart tutorial by @allincowell in #1166
- Improve docs by @mturk24 in #1177; @jwmueller in #1189; @dduong1603 in #1197; @elisno in #1204
- Update Studio References by @nelsonauner in #1182
- Update README by @nelsonauner in #1188
- Improve cluster score for all underperforming groups by @tataganesh in #1180
- Improve CI test setup by @dduong1603 in #1198
New Contributors
- @dduong1603 made their first contribution in #1197
For a full list of changes, enhancements, and fixes, please refer to the Full Changelog.
v2.6.6
What's Changed
- Improvements in Issue Type guide by @elisno in #1100; @jwmueller in #1136
- Improve docstrings in token_classification/summary.py by @gogetron in #1094
- Update dictionary for deciding on omitting underperforming_group_check by @elisno in #1135
- Add notebook with miscellaneous Datalab workflows by @elisno in #1125, #1138
- Update datalab report text by @jwmueller in #1134; @elisno in #1154
- Update FAQ sections by @jwmueller in #1139; @elisno in #1152
- Pin fasttext in CI by @elisno in #1144
- Improve test setup by @elisno in #1146
- Update quickstart links that were outdated by @jwmueller in #1148
- Update knn Shapley score computation by @elisno in #1142
- Refactor KNN graph handling and outlier detection in issue managers by @elisno in #1155, #1163
Full Changelog: v2.6.5...v2.6.6
v2.6.5
What's Changed
- Add end-to-end tests at the end of Datalab quickstart tutorial by @allincowell in #1118
- Centralize existing functionality for constructing and correcting knn graphs in a separate module by @elisno in #1117, #1119, #1129
- Optimize multiannotator.py for performance by @gogetron in #1077
- Optimize value_counts function for performance improvement with missing classes by @gogetron in #1073
- Improve test coverage for setting confident joint in CleanLearning by @elisno in #1123
- Switch from np.isnan to pd.isna for null value check by @gogetron in #1096
- Update pip install instruction in object detection tutorial by @elisno in #1126
- Refine handling of underperforming_group issue type by @gogetron in #1099
- Improve compatibility with sklearn 1.5 by removing the deprecated multi_class argument in LogisticRegression by @elisno in #1124
- Display exact duplicate sets dynamically in tabular tutorial by @nelsonauner in #1128
New Contributors
- @allincowell made their first contribution in #1118
- @nelsonauner made their first contribution in #1128
Full Changelog: v2.6.4...v2.6.5
v2.6.4
What's Changed
- Various performance optimizations and test improvements by @gogetron in #1064, #1067, #1079, #1087, #1095, #1106, #1107
- Restructured text and tabular classification tutorials into CleanLear… by @mturk24 in #1066
- user-facing cleanlab.data_valuation module by @coding-famer in #1050
- fix typo in datalab issue types by @coding-famer in #1085
- Add kwargs to functions that call plt.show() by @mturk24 in #1084; by @jwmueller in #1088
- update tutorials by @jwmueller in #1089, #1090, #1091
- Refine type hints by @desboisGIT in #1101; by @elisno in #1086
- Updated datalab issue type description for non iid issue by @mturk24 in #1102
- Remove unsqueeze call in image tutorial by @elisno in #1108
- Temporarily Revert to macOS 12 in CI due to Incompatibility with Python 3.8 and 3.9 by @elisno in #1110
- Fix numerical instability with Euclidean distance metric by @elisno in #1113
- avoid sensitive divisions by @jwmueller in #1114; by @elisno in #1116
- All identical datasets tests by @elisno in #1115
New Contributors
- @gogetron made their first contribution in #1064
- @desboisGIT made their first contribution in #1101
Full Changelog: v2.6.3...v2.6.4
v2.6.3 - Enhanced scores for outliers and near-duplicates
This release is non-breaking when upgrading from v2.6.2.
What's Changed
- Updated image_key documentation by @sanjanag in #1048
- Refine Scoring and Enhance Stability for Datasets with Identical Examples by @elisno in #1056
- Add warning message about TensorFlow compatibility to docs by @elisno in #1057
Full Changelog: v2.6.2...v2.6.3
v2.6.2
This release is non-breaking when upgrading from v2.6.1.
What's Changed
Full Changelog: v2.6.1...v2.6.2
v2.6.1 -- Refined Regression Score and Fixes
This release is non-breaking when upgrading from v2.6.0. Some noteworthy updates include:
- The label quality score in the cleanlab.regression module is improved to be more human-readable.
  - This only involves rescaling the scores to display a more human-interpretable range of scores, without affecting how your data points are ranked within a dataset according to these scores.
- Better address some edge-cases in Datalab.get_issues().
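The note about rankings can be checked with a small sketch: any strictly increasing rescaling leaves the ordering of data points unchanged. The square-root transform below is a hypothetical stand-in chosen for illustration, not the rescaling cleanlab actually applies, and the score values are made up.

```python
import math


def ranking(scores):
    """Indices of the data points ordered from lowest to highest score."""
    return sorted(range(len(scores)), key=scores.__getitem__)


raw_scores = [0.02, 0.50, 0.11, 0.87]
# Hypothetical strictly increasing rescaling; it changes the values but not their order.
rescaled_scores = [math.sqrt(s) for s in raw_scores]

print(ranking(raw_scores) == ranking(rescaled_scores))
```

Because the ordering is preserved, workflows that review the lowest-scoring examples first select the same examples before and after the change; only fixed numeric thresholds on the scores would need revisiting.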
What's Changed
- Readme updates by @jwmueller in #1030, #1031, #1039; @elisno in #1040
- Adjust the range of regression label quality scores by @huiwengoh in #1032
- Misc fixes of get_issues method by @elisno in #1025, #1026, #1028
- Support features as input for data valuation check in Datalab by @elisno in #1023
- Fix/clarify docs by @mturk24 in #1029; @elisno in #1024, #1037
- CI/CD changes by @elisno in #1036
New Contributors
Full Changelog: v2.6.0...v2.6.1