# feat: Add support for NDU storages by vdusek · Pull Request #1401 · apify/crawlee-python
…ervices (#1386)

### Description

This is a collection of closely related changes that are hard to separate from one another. The main purpose is to enable flexible storage use across the code base without unexpected limitations, and to limit unexpected side effects in global services.

#### Top-level changes:

- There can be multiple crawlers with different storage clients, configurations, or event managers. (Previously, this would cause a `ServiceConflictError`.)
- `StorageInstanceManager` allows similar but different storage instances to be used at the same time. (Previously, a similar storage instance could be incorrectly retrieved from the cache instead of a new instance being created.)
- Differently configured storages can be used at the same time, even storages that use the same `StorageClient` and differ only in their `Configuration`.
- A `Crawler` can no longer cause side effects in the global `service_locator` (apart from adding new instances to `StorageInstanceManager`).
- The global `service_locator` can be used at the same time as local instances of `ServiceLocator` (for example, each crawler has its own `ServiceLocator` instance, which does not interfere with the global `service_locator`).
- Services in a `ServiceLocator` can be set only once; any attempt to reset them raises an error. Using the services without setting them is still possible; in that case, `ServiceLocator` falls back to implicit defaults and logs warnings, since implicit services can lead to hard-to-predict code. The preferred way is to set services explicitly, either manually or through helper code such as `Actor`. [See related PR](apify/apify-sdk-python#576)

#### Implementation notes:

- Storage caching now supports all relevant ways to distinguish storage instances. Apart from generic parameters like `name`, `id`, `storage_type`, and `storage_client_type`, there is also an `additional_cache_key`.
This can be used by a `StorageClient` to define its own way of distinguishing between two similar but different instances. For example, `FileSystemStorageClient` depends on `Configuration.storage_dir`, which is therefore included in the custom cache key for `FileSystemStorageClient`; this is not the case for `MemoryStorageClient`, since `storage_dir` is not relevant for it. (This `additional_cache_key` could possibly be used for caching of NDU in #1401.) See the example:

```python
storage_client = FileSystemStorageClient()

d1 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path1"))
d2 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path2"))
d3 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path1"))

# Different storage_dir -> different instance; same storage_dir -> same cached instance.
assert d2 is not d1
assert d3 is d1

storage_client_2 = MemoryStorageClient()

d4 = await Dataset.open(storage_client=storage_client_2, configuration=Configuration(storage_dir="path1"))
d5 = await Dataset.open(storage_client=storage_client_2, configuration=Configuration(storage_dir="path2"))

# storage_dir is irrelevant for the memory client, so both calls get the same cached instance.
assert d4 is d5
```

- Each crawler creates its own instance of `ServiceLocator`. It uses either the services explicitly passed to the crawler init (configuration, storage client, event manager) or the services from the global `service_locator` as implicit defaults. This allows multiple differently configured crawlers to work in the same code.
For example:

```python
custom_configuration_1 = Configuration()
custom_event_manager_1 = LocalEventManager.from_config(custom_configuration_1)
custom_storage_client_1 = MemoryStorageClient()

custom_configuration_2 = Configuration()
custom_event_manager_2 = LocalEventManager.from_config(custom_configuration_2)
custom_storage_client_2 = MemoryStorageClient()

crawler_1 = BasicCrawler(
    configuration=custom_configuration_1,
    event_manager=custom_event_manager_1,
    storage_client=custom_storage_client_1,
)
crawler_2 = BasicCrawler(
    configuration=custom_configuration_2,
    event_manager=custom_event_manager_2,
    storage_client=custom_storage_client_2,
)

# Use the crawlers without a runtime crash...
```

- `ServiceLocator` is now much stricter about setting services. Previously, it allowed changing services until some service had its `_was_retrieved` flag set to `True`, after which it would throw a runtime error. This led to hard-to-predict code, as the global `service_locator` could be changed as a side effect from many places. Now the services in a `ServiceLocator` can be set only once, and the side effects of attempting to change them are limited as much as possible. Such attempts are also accompanied by warning messages to draw attention to code that could cause a `RuntimeError`.

### Issues

Closes: #1379

Connected to:
- #1354 (through necessary changes in `StorageInstanceManager`)
- apify/apify-sdk-python#513 (through necessary changes in `StorageInstanceManager` and the storage client/configuration related changes in `service_locator`)

### Testing

- New unit tests were added.
- Tested on the `Apify` platform together with the SDK changes in the [related PR](apify/apify-sdk-python#576).

---------

Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
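As a supplementary illustration of the caching behavior described above, the following is a minimal sketch of how a client-specific `additional_cache_key` can be combined with the generic parameters to distinguish storage instances. All class and function names here are simplified assumptions for illustration, not crawlee's actual API:

```python
from dataclasses import dataclass


# Simplified stand-in for crawlee's Configuration (assumption for this sketch).
@dataclass(frozen=True)
class Configuration:
    storage_dir: str = './storage'


class FileSystemStorageClient:
    def get_additional_cache_key(self, configuration: Configuration) -> str:
        # The backing directory distinguishes otherwise identical storages.
        return configuration.storage_dir


class MemoryStorageClient:
    def get_additional_cache_key(self, configuration: Configuration) -> str:
        # storage_dir is irrelevant for in-memory storage, so it is not part of the key.
        return ''


def compute_cache_key(client, configuration, *, storage_type: str, name=None, id=None) -> tuple:
    # Generic parameters plus the client-specific additional key form the full cache key;
    # two open() calls that produce the same tuple would share one cached instance.
    return (storage_type, type(client).__name__, name, id, client.get_additional_cache_key(configuration))
```

Under this sketch, two file-system datasets with different `storage_dir` values get different keys, while the memory client produces the same key regardless of `storage_dir`, matching the `assert` example above.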
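The "set once" semantics of `ServiceLocator` described above can be sketched roughly as follows; this is a hedged, minimal illustration for a single service (configuration), with invented names, not crawlee's actual implementation:

```python
import warnings


class ServiceConflictError(RuntimeError):
    """Raised when an already-set service is set again."""


class DefaultConfiguration:
    """Stand-in for an implicit default Configuration (assumption for this sketch)."""


class ServiceLocator:
    def __init__(self, configuration=None):
        self._configuration = configuration

    def set_configuration(self, configuration) -> None:
        # Services can be set only once; resetting raises instead of silently mutating.
        if self._configuration is not None:
            raise ServiceConflictError('Configuration has already been set.')
        self._configuration = configuration

    def get_configuration(self):
        # Using an unset service falls back to an implicit default and warns,
        # since implicit services can lead to hard-to-predict code.
        if self._configuration is None:
            warnings.warn('Implicit default configuration is being used.')
            self._configuration = DefaultConfiguration()
        return self._configuration
```

With this shape, a crawler-local `ServiceLocator` can hold explicitly passed services while the global one stays untouched, and any later attempt to swap a service fails loudly instead of changing behavior elsewhere.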