chore: refactor tp cache usage and collect metrics by Jayachand · Pull Request #4914 · rudderlabs/rudder-transformer

Conversation

@Jayachand

What are the changes introduced in this PR?

  • functionality to off tp node cache (relying on cloud front cache)
  • collect metrics on keys and size on node cache present in tracking plan flow

What is the related Linear task?

Resolves INT-XXX

Please explain the objectives of your changes below

Put down any required details on the broader aspect of your changes. If there are any dependent changes, mandatorily mention them here

Any changes to existing capabilities/behaviour, mention the reason & what are the changes ?

N/A

Any new dependencies introduced with this change?

N/A

Any new generic utility introduced or modified. Please explain the changes.

N/A

Any technical or performance related pointers to consider with the change?

N/A

@coderabbitai review


Developer checklist

  • My code follows the style guidelines of this project

  • No breaking changes are being introduced.

  • All related docs linked with the PR?

  • All changes manually tested?

  • Any documentation changes needed with this change?

  • Is the PR limited to 10 file changes?

  • Is the PR limited to one linear task?

  • Are relevant unit and component test-cases added in new readability format?

Reviewer checklist

  • Is the type of change in the PR title appropriate as per the changes?

  • Verified that there are no credentials or confidential data exposed with the changes.

@coderabbitai

Note

.coderabbit.yaml has unrecognized properties

CodeRabbit is using all valid settings from your configuration. Unrecognized properties (listed below) have been ignored and may indicate typos or deprecated fields that can be removed.

⚠️ Parsing warnings (1)
Validation error: Unrecognized key(s) in object: 'auto_resolve_threads'
⚙️ Configuration instructions
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Summary by CodeRabbit

Release Notes

  • New Features

    • Implemented event validation caching for improved validation performance
    • Added optional tracking plan caching controlled via USE_TP_CACHE feature flag
  • Monitoring

    • Introduced cache performance metrics tracking keys count and memory usage for enhanced observability

✏️ Tip: You can customize this high-level summary in your review settings.

Walkthrough

This PR implements caching optimization and observability improvements for event validation and tracking plan retrieval. It introduces a validator caching mechanism controlled by a feature flag, adds cache metrics instrumentation across multiple modules, refactors the validation error transformation into a dedicated helper function, and updates the validation flow to leverage cached validators when available.

Changes

Cohort / File(s) Summary
Event Validation Core
src/util/eventValidation.js
Adds getCachedValidatorFunction() to retrieve pre-compiled validators from cache. Extracts eventType and eventName from incoming events for consistent schema lookup. Introduces validateEventAndTransformErrors() helper to consolidate AJV validation and error normalization. Refactors validation flow to use cached validators when available, bypassing redundant schema fetches. Emits cache metrics (keys and memory usage) after cache operations.
Metrics Infrastructure
src/util/prometheus.js
Adds two new gauge metrics: node_cache_keys and node_cache_memory_usage_bytes, both with a name label, to track cache state across the application.
Tracking Plan Caching
src/util/trackingPlan.js
Introduces USE_TP_CACHE feature flag (default true) to conditionally enable tracking plan caching. When enabled, caches retrieved plans and emits cache metrics with tp_cache label. When disabled, always performs network fetch.

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggestion for improvement: Consider adding brief inline comments in eventValidation.js explaining the relationship between eventType/eventName extraction and cache key construction, as this is central to the caching strategy. Additionally, document the expected behavior when USE_TP_CACHE is disabled—specifically whether callers should expect consistent performance characteristics or if there are implications for metrics tracking.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the main changes: refactoring tp cache usage and adding metrics collection for caching instrumentation.
Description check ✅ Passed The description addresses the key changes but lacks detail on implementation specifics and leaves the Linear task reference incomplete (INT-XXX placeholder).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/util/eventValidation.js`:
- Around line 115-118: getCachedValidatorFunction currently hardcodes
isDraft4=false which causes cache key collisions between draft-04 and draft-19
schemas; update the flow so the runtime-determined isDraft4 value (computed from
the schema's $schema) is passed into eventSchemaHash and into
getCachedValidatorFunction (or remove the early cache lookup before schema is
fetched) and adjust eventSchemaCache lookups to include the isDraft4 flag in the
cache key (symbols: getCachedValidatorFunction, eventSchemaHash,
eventSchemaCache, isDraft4).
🧹 Nitpick comments (1)
src/util/eventValidation.js (1)

1-1: Consider restructuring to avoid ESLint disable.

The @typescript-eslint/no-use-before-define rule is disabled to allow validateEventAndTransformErrors to be called at line 143 before its definition at line 225. While function declarations are hoisted in JavaScript, you could alternatively move the function definition above validate() to eliminate the need for this disable directive.

Comment on lines +115 to +118

function getCachedValidatorFunction(tpId, tpVersion, eventType, eventName) {
const schemaHash = eventSchemaHash(tpId, tpVersion, eventType, eventName, false);
return eventSchemaCache.get(schemaHash);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Potential cache key collision between draft-04 and draft-19 schemas.

getCachedValidatorFunction hardcodes isDraft4 to false (line 116), and the main validation flow at line 179 also uses the default (false). However, the actual isDraft4 value is determined at runtime (lines 168-174) based on the schema's $schema property.

This means a draft-04 schema will be compiled with ajv4 but cached with a key ending in ::19. A subsequent request for the same event with a draft-19 schema (or vice versa) could retrieve the wrong validator.

Consider passing the determined isDraft4 value to the schema hash:

-    const schemaHash = eventSchemaHash(trackingPlanId, trackingPlanVersion, eventType, eventName);
+    const schemaHash = eventSchemaHash(trackingPlanId, trackingPlanVersion, eventType, eventName, isDraft4);

This would also require updating getCachedValidatorFunction to handle both draft versions or removing the early cache check since isDraft4 isn't known until the schema is fetched.

🤖 Prompt for AI Agents
In `@src/util/eventValidation.js` around lines 115 - 118,
getCachedValidatorFunction currently hardcodes isDraft4=false which causes cache
key collisions between draft-04 and draft-19 schemas; update the flow so the
runtime-determined isDraft4 value (computed from the schema's $schema) is passed
into eventSchemaHash and into getCachedValidatorFunction (or remove the early
cache lookup before schema is fetched) and adjust eventSchemaCache lookups to
include the isDraft4 flag in the cache key (symbols: getCachedValidatorFunction,
eventSchemaHash, eventSchemaCache, isDraft4).

@codecov

Codecov Report

❌ Patch coverage is 96.96970% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.26%. Comparing base (efa9455) to head (9dd3f02).
⚠️ Report is 133 commits behind head on develop.

Files with missing lines Patch % Lines
src/util/eventValidation.js 96.49% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #4914   +/-   ##
========================================
  Coverage    92.25%   92.26%           
========================================
  Files          654      654           
  Lines        35424    35447   +23     
  Branches      8328     8346   +18     
========================================
+ Hits         32682    32705   +23     
+ Misses        2526     2506   -20     
- Partials       216      236   +20     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@sonarqubecloud

@devops-github-rudderstack

This PR is considered to be stale. It has been open for 20 days with no further activity thus it is going to be closed in 7 days. To avoid such a case please consider removing the stale label manually or add a comment to the PR.

@devops-github-rudderstack

This PR is considered to be stale. It has been open for 20 days with no further activity thus it is going to be closed in 7 days. To avoid such a case please consider removing the stale label manually or add a comment to the PR.

@github-actions

This PR is considered to be stale. It has been open for 20 days with no further activity thus it is going to be closed in 7 days. To avoid such a case please consider removing the stale label manually or add a comment to the PR.

2 participants

@Jayachand @devops-github-rudderstack