MLTransform by AnandInguva · Pull Request #26795 · apache/beam

damccorm

AnandInguva changed the title from "[DRAFT] MLTransform and TFTProcessHandler" to "MLTransform and TFTProcessHandler"

Jun 1, 2023

@AnandInguva

pass types
Support Pyarrow schema
Artifact WIP

@AnandInguva

WIP on inferring types
Remove pyarrow implementation
Add MLTransformOutput
Refactor files

@AnandInguva

Fix artifacts code
Add more tests
fix lint errors
Change namespaces from ml_transform to transforms
Add doc strings
Add tests and refactor
Sort imports
Add metrics namespaces
Refactor

@AnandInguva

Make VarLenFeatureSpec the default
Refactoring

@AnandInguva

…d address PR comments

Add skip conditions for tests
Add test suite for tft tests
Try except in __init__.py
Remove imports from __init__
Add docstrings, refactor
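
For readers unfamiliar with the pattern behind "Try except in __init__.py" and "Add skip conditions for tests", here is a minimal sketch of guarding an optional dependency and skipping tests when it is missing. The test class name and its single check are hypothetical illustrations, not the code from this PR.

```python
import unittest

# Guard the optional dependency so importing this module never fails
# on machines where tensorflow_transform is not installed.
try:
    import tensorflow_transform as tft
except ImportError:
    tft = None


# Hypothetical test case: skipped as a whole when the import failed.
@unittest.skipIf(tft is None, 'tensorflow_transform is not installed')
class OptionalDependencyTest(unittest.TestCase):
    def test_module_is_available(self):
        # Runs only when the import above succeeded.
        self.assertIsNotNone(tft)
```

With this shape, a test run without the dependency reports the class as skipped instead of failing at import time.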

@AnandInguva

Mock tensorflow_transform in pydocs
fix tft pypi name

Skip a test
Add step name
Update supported versions of TFT

AnandInguva changed the title from "MLTransform and TFTProcessHandler" to "MLTransform"

Jun 26, 2023


aleksandr-dudko pushed a commit to aleksandr-dudko/beam that referenced this pull request

Jul 10, 2023
* Initial work on MLTransform and ProcessHandler

* Support for containers: List, Dict[str, np.ndarray]
pass types
Support Pyarrow schema
Artifact WIP

* Add min, max, artifacts for scale_0_to_1

* Add more transform functions and artifacts
WIP on inferring types
Remove pyarrow implementation
Add MLTransformOutput
Refactor files

* Add generic type annotations

* Add unit tests
Fix artifacts code
Add more tests
fix lint errors
Change namespaces from ml_transform to transforms
Add doc strings
Add tests and refactor

* Add support for saving intermediate results for a transform
Sort imports
Add metrics namespaces
Refactor

* Add schema to the output PCollection

* Remove MLTransformOutput and return Row instead with schema

* Convert primitive type to list using a DoFn. Remove FixedLenFeatureSpec

Make VarLenFeatureSpec the default
Refactoring

* Add append_transform to the ProcessHandler
Some more refactoring

* Remove param self.has_artifacts, add artifact_location to handler, and address PR comments
Add skip conditions for tests
Add test suite for tft tests

* Move tensorflow import into the try except catch
Try except in __init__.py
Remove imports from __init__
Add docstrings, refactor

* Add type annotations for the data transforms

* Add tft test in tox.ini

Mock tensorflow_transform in pydocs
fix tft pypi name

Skip a test
Add step name
Update supported versions of TFT

* Add step name for TFTProcessHandler

* Remove unsupported tft versions

* Fix mypy

* Refactor TFTProcessHandlerDict to TFTProcessHandlerSchema

* Update doc for data processing transforms

* Fix checking the typing container types

* Refactor code

* Fail TFTProcessHandler on a non-global window PColl

* Remove underscore

* Remove high level functions

* Add TFIDF

* Fix tests with new changes[WIP]

* Fix tests

* Refactor class name to CamelCase and remove kwargs

* use is_default instead of isinstance

* Remove falling back to staging location for artifact location

* Add TFIDF tests

* Remove __str__

* Refactor skip statement

* Add utils for fetching artifacts on compute and apply vocab

* Make ProcessHandler internal class

* Only run analyze stage when transform_fn(artifacts) is not computed before.

* Fail if pipeline has non default window during artifact producing stage

* Add support for Dict, recordbatch and introduce artifact_mode

* Hide process_handler from user. Make TFTProcessHandler the default

* Refactor a few tests

* Comment a test

* Save raw_data_meta_data so that it can be used during consume stage

* Refactor code

* Add test on artifacts

* Fix imports

* Add tensorflow_metadata to pydocs

* Fix test

* Add TFIDF to import

* Add basic example

* Remove redundant logging statements

* Add test for multiple columns on MLTransform

* Add todo about what to do when new process handler is introduced

* Add abstractmethod decorator

* Edit Error message

* Update docs, error messages

* Remove record batch input/output arg

* Modify generic types

* Fix import sort

* Fix mypy errors - best effort

* Fix tests

* Add TFTOperation doc

* Rename tft_transform to tft

* Fix handler_test

* Fix base_test

* Fix pydocs
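
Several items in the list above ("Add min, max, artifacts for scale_0_to_1", "Only run analyze stage when transform_fn(artifacts) is not computed before") describe an analyze-once, reuse-later flow. The following is a rough plain-Python sketch of that idea only. It does not use Beam or TFT, and the function name, the JSON file name, and the handling of a constant column are all assumptions made for illustration.

```python
import json
import os
import tempfile


def scale_0_to_1(values, artifact_dir):
    """Scale values to [0, 1], caching min/max as an artifact file.

    Hypothetical helper: the artifact file name and format are assumptions.
    """
    path = os.path.join(artifact_dir, 'scale_0_to_1.json')
    if os.path.exists(path):
        # Artifacts already exist: skip the analyze pass and reuse them.
        with open(path) as f:
            artifacts = json.load(f)
    else:
        # Analyze pass: one full scan of the data to find min and max.
        artifacts = {'min': min(values), 'max': max(values)}
        with open(path, 'w') as f:
            json.dump(artifacts, f)
    low, high = artifacts['min'], artifacts['max']
    span = high - low
    # Assumption: a constant column (span == 0) maps to 0.0.
    scaled = [(v - low) / span if span else 0.0 for v in values]
    return scaled, artifacts


# First call computes and stores the artifacts; second call reuses them,
# so the new data is scaled with the min/max of the first dataset.
with tempfile.TemporaryDirectory() as artifact_dir:
    first, first_artifacts = scale_0_to_1([1, 5, 9], artifact_dir)
    second, second_artifacts = scale_0_to_1([3, 7], artifact_dir)
```

Reusing the stored min and max is what keeps scaling consistent between the data used to produce the artifacts and data processed later.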

aleksandr-dudko pushed a commit to aleksandr-dudko/beam that referenced this pull request

Jul 17, 2023

cushon pushed a commit to cushon/beam that referenced this pull request

May 24, 2024