feat: keys last used by Flo4604 · Pull Request #5155 · unkeyed/unkey
What does this PR do?
This PR adds a new KeyLastUsedSyncService that synchronizes key last-used timestamps from ClickHouse to MySQL, along with supporting infrastructure for tracking when API keys were last used for verification.
Key Changes:
- New Restate Services: `KeyLastUsedSyncService` orchestrator and `KeyLastUsedPartitionService` that read from the ClickHouse `key_last_used_v1` table and batch-update the MySQL `keys.last_used_at` column
- Database Schema: Added a `last_used_at` column to the `keys` table in MySQL
- ClickHouse Integration: New `GetKeyLastUsedBatchPartitioned` method for paginated retrieval of key usage data with hash partitioning
- API Response: Added a `lastUsed` field to `KeyResponseData` in the OpenAPI spec
- Performance Optimized: Supports syncing millions of keys efficiently with partitioned workers, batched operations, and resumable cursors
The service uses Restate's virtual object pattern with composite cursors for resumability across invocations, ensuring no data loss if the sync process is interrupted. The orchestrator fans out to 8 partition workers that process disjoint hash partitions of the keyspace concurrently.
Explanation of why it works the way it does:
Problem
We use a materialized view (MV) to store last-used timestamps. That data doesn't help customers at all unless they use the analytics endpoint; otherwise it just sits there.
Syncing it from ClickHouse to MySQL is a bit of a pain, though, and could have been solved in different ways:
- Write a new middleware that buffers keys' last-used timestamps during key verification and bulk-writes them to MySQL in the background. This means high write volume, especially since verifications for the same key can hit different nodes, so multiple nodes would write the same keys at the same time. Concurrent nodes taking row locks on overlapping key sets sounds like deadlocks waiting to happen.
- Write only the keys that changed into MySQL (which is what we are doing): query the ClickHouse last-used MV with a simple `WHERE time >= lastSyncedTime` and write everything it returns into MySQL.
Why make it use partitions?
Writing data sequentially in a loop sounded like a bottleneck, especially on the first sync or on large batches, so we make use of parallel writes instead.
Why do we use a hash-partitioned fan-out?
Because we don't want multiple workers looking at the same keys; that would just waste resources. Each worker processes only the keys assigned to it, so workers never step on each other's toes.
Each worker also persists its own cursor and resumes deterministically from where it left off.
Why `cityHash64(key_id) % N`? ClickHouse ships `cityHash64` as a fast built-in hash function. Hashing the `key_id` and taking it modulo N gives us:
- Deterministic behavior: the same keys always land in the same partitions
- A roughly even distribution of keys across partitions
- No coordination needed between workers

It should also help with MySQL lock contention: each key only ever appears in one partition's batches, so concurrent workers never lock overlapping rows, even while ClickHouse merges are in progress.
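The partitioning property can be illustrated in a few lines of Go. Note that `cityHash64` is a ClickHouse built-in computed server-side; the stdlib FNV-1a hash below is only a stand-in to demonstrate the same determinism:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numPartitions = 8 // the PR fans out to 8 partition workers

// partitionFor deterministically maps a key ID to one of N partitions.
// ClickHouse computes this server-side as cityHash64(key_id) % N; FNV-1a
// here merely demonstrates the hash-then-modulo pattern.
func partitionFor(keyID string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(keyID))
	return h.Sum64() % numPartitions
}

func main() {
	// The same key always lands in the same partition, so no two
	// workers ever touch overlapping key sets.
	fmt.Println(partitionFor("key_123") == partitionFor("key_123")) // true
	fmt.Println(partitionFor("key_123") < numPartitions)            // true
}
```

Since the mapping is stateless and deterministic, a worker only needs its partition index to know exactly which slice of the keyspace is its responsibility.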
Why separate Restate services instead of RunAsync closures?
So that each partition gets its own Restate virtual object and state (`KeyLastUsedPartitionService` keyed by `"0"`, `"1"`, etc.).
Each partition stores its own state, and that state persists across invocations.
Partitions can also fail independently, so we only retry whatever failed, and each partition processes its work in batches.
Why per-batch restate.Run instead of one big closure?
Because on failure we don't want to replay everything, only the batch that failed. Batches 1..N that succeeded are already journaled, so a retry skips them, reruns the failed batch, and then continues with the rest.
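The journaling behavior can be sketched without the Restate SDK. The `journal` map below stands in for Restate's durable journal of `restate.Run` results; the names and structure are illustrative, not the SDK API:

```go
package main

import (
	"errors"
	"fmt"
)

// journal records which batches have durably completed, mimicking how
// Restate journals each restate.Run result.
type journal map[int]bool

// syncBatches processes batches in order, skipping any batch already in
// the journal, and stops at the first failure so a retry resumes there.
func syncBatches(j journal, batches []string, failOn int) error {
	for i := range batches {
		if j[i] {
			continue // replay: result already journaled, no rework
		}
		if i == failOn {
			return fmt.Errorf("batch %d: %w", i, errors.New("transient failure"))
		}
		j[i] = true // durably record success
	}
	return nil
}

func main() {
	j := journal{}
	// First attempt: batches 0 and 1 succeed, batch 2 fails.
	err := syncBatches(j, []string{"a", "b", "c", "d"}, 2)
	fmt.Println(err != nil, len(j)) // true 2
	// Retry: batches 0-1 are skipped; 2 and 3 now succeed.
	err = syncBatches(j, []string{"a", "b", "c", "d"}, -1)
	fmt.Println(err == nil, len(j)) // true 4
}
```

One `restate.Run` per batch keeps the unit of replay small: the retry cost is one batch, not the whole partition.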
Idempotency
MySQL writes use `WHERE last_used_at IS NULL OR last_used_at < ?`, so replaying a batch that partially succeeded is safe: already-updated rows are simply skipped.
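The guard's effect can be simulated in memory. Here `0` stands in for SQL `NULL`, and the map for the MySQL `keys` table (both are illustrative assumptions):

```go
package main

import "fmt"

// keys maps key ID to last_used_at, with 0 standing in for NULL
// (never used), mimicking the MySQL keys table.
var keys = map[string]int64{"key_a": 0, "key_b": 2000}

// applyBatch mirrors the SQL guard
//   WHERE last_used_at IS NULL OR last_used_at < ?
// so a replayed batch never moves a timestamp backwards.
func applyBatch(batch map[string]int64) {
	for id, ts := range batch {
		if cur, ok := keys[id]; ok && (cur == 0 || cur < ts) {
			keys[id] = ts
		}
	}
}

func main() {
	batch := map[string]int64{"key_a": 1000, "key_b": 1500}
	applyBatch(batch)
	applyBatch(batch) // replay is a no-op
	fmt.Println(keys["key_a"], keys["key_b"]) // 1000 2000
}
```

Note that `key_b` keeps its newer timestamp (2000) even though the replayed batch carries an older one (1500), which is exactly why replays are safe.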
Type of change
- New feature (non-breaking change which adds functionality)
- Enhancement (small improvements)
How should this be tested?
- Integration Tests: Run the existing test suite in `keylastusedsync/run_sync_handler_test.go`, which covers:
  - Basic sync functionality with 3 keys
  - Idempotency (doesn't overwrite newer timestamps)
- Performance Tests: The performance test creates 1.5M keys with 100 events each (150M total events) and verifies the sync can handle large-scale data efficiently
- Incremental Tests: Verifies that partition cursors persist across invocations and only new data is processed
- Manual Testing:
  - Create API keys and generate verification events
  - Run the sync service via the Restate ingress client
  - Verify `last_used_at` timestamps appear in MySQL and match ClickHouse data
- API Testing: Verify the new `lastUsed` field appears in key listing API responses
Checklist
Required
- Filled out the "How to test" section in this PR
- Read Contributing Guide
- Self-reviewed my own code
- Commented on my code in hard-to-understand areas
- Ran `pnpm build`
- Ran `pnpm fmt`
- Ran `make fmt` on `/go` directory
- Checked for warnings, there are none
- Removed all `console.log`s
- Merged the latest changes from main onto my branch with `git pull origin main`
- My changes don't cause any responsiveness issues
Appreciated
- Updated the Unkey Docs if changes were necessary