Prevent blocking on queue overflow (#1809) by saiya · Pull Request #1837 · census-instrumentation/opencensus-java
I implemented queue overflow handling to prevent blocking the foreground (application) thread.
It drops the event and shows a WARN log when the queue gets full.
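The drop-and-warn behavior can be sketched as follows. This is a simplified analogue, not the PR's actual Disruptor-based code: it stands in for the ring buffer with a plain `ArrayBlockingQueue`, and the class name `NonBlockingEventQueue` is invented for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.logging.Level;
import java.util.logging.Logger;

// Simplified analogue of the PR's behavior: enqueue never blocks the
// application thread; when the queue is full, the event is dropped and a
// WARN-level message is logged.
final class NonBlockingEventQueue<E> {
  private static final Logger logger =
      Logger.getLogger(NonBlockingEventQueue.class.getName());
  private final BlockingQueue<E> queue;

  NonBlockingEventQueue(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  /** Returns true if the event was enqueued, false if it was dropped. */
  boolean enqueue(E event) {
    boolean accepted = queue.offer(event); // offer() never blocks
    if (!accepted) {
      logger.log(Level.WARNING, "Queue is full: dropping event {0}", event);
    }
    return accepted;
  }
}
```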
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
📝 Please visit https://cla.developers.google.com/ to sign.
Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.
What to do if you already signed the CLA
Individual signers
- It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.
Corporate signers
- Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
- The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
- The email used to register you as an authorized contributor must also be attached to your GitHub account.
ℹ️ Googlers: Go here for more info.
CLAs look good, thanks!
saiya
mentioned this pull request
LGTM
@@ -131,7 +136,19 @@ private static DisruptorEventQueue create() {
      new DisruptorEnqueuer() {
        @Override
        public void enqueue(Entry entry) {
We may consider tracing this method too (not suggesting doing it in this PR though).
Thank you for checking this PR and for the quick response!
I organized my thoughts about your concern and implemented some improvements to mitigate the issue in 43d7647 & 8b886b0. Could you check them? @songy23
Potential memory leak issue
I checked the subclasses of EventQueue.Entry. The problem is SpanEndEvent#process, because it calls runningSpanStore.onEnd to remove the span from runningSpans. If we skip it, we cause a memory leak because the span stays in runningSpans. In my understanding, this is the cause of your concern.
My ideas to mitigate the potential runningSpans leak:
- Idea 1-A) Call EventQueue.Entry#process() in the foreground thread when the queue gets full. This seems bad because it may cause lock contention and severe performance issues.
- Idea 1-B) Add an EventQueue.Entry#rejected() method that implements minimum cleanup code, call it in the foreground thread when the queue gets full, and call runningSpanStore.onEnd from SpanEndEvent#rejected() to prevent the memory leak.
- Idea 2) Change InProcessRunningSpanStoreImpl to use WeakReference to prevent the memory leak. This may increase memory usage because of the additional objects allocated to hold references to spans.
I feel Idea 1-B is a feasible and effective way, so I implemented it in 43d7647. It also prevents a memory leak even if events arrive after queue shutdown (the current code has that potential issue).
(InProcessSampledSpanStoreImpl.UnregisterSpanNameEvent seems to have a similar potential issue. But the Map<String, PerSpanNameSamples> inside it should not become huge, because its keys are span names, not spans themselves. So I feel it is not a serious problem.)
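Idea 1-B could take roughly the following shape. The `Entry`/`rejected()` names follow the discussion above, but the span store is a stand-in (a plain `Set` of span names) rather than the real `InProcessRunningSpanStoreImpl`, so this is a sketch of the technique, not the PR's code:

```java
// Hypothetical shape of Idea 1-B: each queue entry gets a rejected() hook
// that runs minimal cleanup on the caller's thread when the queue is full.
interface Entry {
  void process();            // normal path, runs on the worker thread
  default void rejected() {} // overflow path, runs on the application thread
}

final class SpanEndEvent implements Entry {
  private final String span; // stand-in for the real span object
  private final java.util.Set<String> runningSpans; // stand-in for the running span store

  SpanEndEvent(String span, java.util.Set<String> runningSpans) {
    this.span = span;
    this.runningSpans = runningSpans;
  }

  @Override
  public void process() {
    // Full processing would also export/sample the span; here we only
    // model the part that matters for the leak: releasing the reference.
    runningSpans.remove(span);
  }

  @Override
  public void rejected() {
    // Minimum cleanup: still release the span reference so runningSpans
    // cannot grow without bound when events are dropped.
    runningSpans.remove(span);
  }
}
```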
Skipping SpanStartEvent
When we drop a SpanStartEvent, RunningSpanStoreImpl#onStart(span) is skipped, so RunningSpanStoreImpl might receive onEnd(span) without a matching onStart(span).
The only current implementation, InProcessRunningSpanStoreImpl, raises an IllegalArgumentException in runningSpans.removeElement. In that case, doing nothing is the correct behavior in my understanding, so I implemented exception handling in 8b886b0.
Friendly ping @songy23 . Let me know if I can help in any way with getting this PR merged.
Apologies for getting back to this PR late, and thanks for investigating the issue.
Idea 1-B) Add an EventQueue.Entry#rejected() method that implements minimum cleanup code, call it in the foreground thread when the queue gets full, and call runningSpanStore.onEnd from SpanEndEvent#rejected() to prevent the memory leak.
This sounds like a reasonable approach to me. We may also extend EventQueue.Entry#rejected() to support measurement recording in Stats. @bogdandrutu @dinooliva @rghetia WDYT?
Hi @saiya, we would like to have a better understanding of the background of the issue. Could you describe a bit how the thread blocking impacted your application?
(Depending on the impact we may want to take a different approach, e.g. do some optimization to reduce the events put onto the queue.)
we would like to have a better understanding about the background of the issue. Could you describe a bit about how the thread blocking impacted your application?
Hi @songy23, here is the situation I am facing:
- There are many products/microservices in my company
  - Over 100 systems
  - There are complex dependencies between API subsystems
  - Estimated at ten billion HTTP API calls/month at least
- Some products have traffic spikes
  - They cause unpredictable spikes for related API subsystems
- Some important & huge services run in an on-premise environment
  - Elastic scaling / frequent failover is difficult, so we run large instances to handle the traffic
- All of our services are implemented concurrently (multi-threaded or event-driven)
  - Blocking is not welcome (even if it is only "possible" blocking)
  - Blocking causes service outages, which hurt profit & customer loyalty
- We had used Zipkin, but maintaining the Zipkin infrastructure is hard work, so we are considering OpenCensus + Stackdriver Trace
There are some alternative ideas I can imagine for this issue, but none of them are suitable for my case:
- Use a low sampling ratio (e.g. trace only 10%)
  - Cannot capture random performance degradation
  - Does not help debugging complex bugs, due to missing traces
- Use a RateLimiting sampler
  - I found the document today; it does not seem to be supported by opencensus-java yet
  - Because I want to avoid blocking as much as possible, I would need to set a conservative rate limit
  - That means a low sampling ratio, same as idea 1
- Implement dynamic sampling to sample slow requests only
  - Does not provide uniform sampling, so it may cause problems for statistics
  - Does not solve queue overflow, because we need to start tracing before we know the request is slow
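For reference, a rate-limiting sampler of the kind mentioned in the second idea could be sketched as below. This is an illustration only, not the opencensus-java `Sampler` API; `RateLimitingSampler` and its methods are invented names:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a rate-limiting sampler: sample at most maxTracesPerSecond,
// regardless of traffic spikes. Callers pass a monotonic timestamp
// (e.g. System.nanoTime()).
final class RateLimitingSampler {
  private final long minNanosBetweenSamples;
  private final AtomicLong lastSampleNanos = new AtomicLong(Long.MIN_VALUE);

  RateLimitingSampler(double maxTracesPerSecond) {
    this.minNanosBetweenSamples = (long) (1e9 / maxTracesPerSecond);
  }

  /** Thread-safe: at most one caller wins each sampling slot. */
  boolean shouldSample(long nowNanos) {
    while (true) {
      long last = lastSampleNanos.get();
      if (last != Long.MIN_VALUE && nowNanos - last < minNanosBetweenSamples) {
        return false; // too soon since the last sampled trace
      }
      if (lastSampleNanos.compareAndSet(last, nowNanos)) {
        return true; // we won this sampling slot
      }
      // another thread sampled concurrently; re-check
    }
  }
}
```

As the discussion notes, this trades uniform sampling for a bounded event rate, which is exactly why it caps queue pressure during spikes.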
This is why I am interested in this queue overflow issue.
(Depending on the impact we may want to take a different approach, e.g do some optimization to reduce the events put onto the queue.)
Because a large system (a large-scale microservice system) is chaotic, it is hard to determine a "safe" sampling ratio / rate limit. If there is a risk of blocking, I need to set a very low sampling ratio. Even if we reduce the events on the queue, the issue is still alive.
In my working experience, many large service providers (not only my current company) have similar needs. It is hard to accept an additional blocking risk only for tracing/instrumentation. If the tracing SDK does not have such a risk, that is awesome. This is why I made this PR.
But reducing events is also wonderful, because it keeps events from being dropped.
Thank you for discussing this PR; feel free to ask me for anything more I can share.
Thank you @saiya for adding the detailed background; now we have a much better understanding of your use case. There are some concerns from my teammates, and we also want to make sure this behavior is consistent across all OC languages, so I filed census-instrumentation/opencensus-specs#262 for a broader discussion.
Hi @songy23 !
Currently this PR drops events if the queue gets full. But it calls runningSpanStore.onEnd(span); even in rejected(), so it does not cause a memory leak. I also added corner-case handling in a55ccbc to prevent a memory leak in a rare case.
I also rebased this PR on the latest master branch (there has been a lot of refactoring of the related classes :-) ).
I know there is a discussion in census-instrumentation/opencensus-specs#262. Can such improvements be done in another branch/PR, or should we improve this PR? (That might make this PR huge.) This PR prevents the memory leak and shows a warning log on queue overflow; I think this is a minimal viable solution.
Hi @saiya, did you get a chance to try the 0.22.0 release? The new release includes the fix proposed by @bogdandrutu (census-instrumentation/opencensus-specs#262 (comment)) and should reduce the possibility of queue blocking. #1893 implemented a similar approach to dropping spans.
@bogdandrutu had some concerns about silently dropping events. Maybe the fact that the event queue got full and blocked is an indicator that the sampling rate is too high; in that case, having users explicitly reduce the sampling rate may be the better option. (IMO dropping queue events is actually equivalent to having a lower sampling rate.) Some quotes from the Dapper paper:
In practice, we have found that there is still an adequate amount of trace data for high-volume services when using a sampling rate as low as 1/1024.
The first production version of Dapper used a uniform sampling probability for all processes at Google, averaging one sampled trace for every 1024 candidates. This simple scheme was effective for our high-throughput online services since the vast majority of events of interest were still very likely to appear often enough to be captured.
Personally I think adaptive sampling (a.k.a. a rate-limited sampler) may be the silver bullet here. But until that happens, I think it's reasonable to make the queue non-blocking and log a warning to let users know they may need to use a lower sampling rate.
Hi @songy23, thank you for kind response.
IMO dropping queue events is actually equivalent to having a lower sampling rate
Yes, I think so. And I also know that dropping events causes a non-uniform sampling ratio (it sometimes drops, sometimes does not). If the user knows an adequate sampling ratio, the user should set it.
But the key point I want to emphasize is that it is hard to know a proper sampling ratio in some kinds of environments. In our case, some microservices amplify requests (generating many outgoing API requests for each incoming request). And there are random user-activity spikes. In such a situation, it is hard to know a good sampling ratio. If I set a very pessimistic sampling ratio (such as 0.001%) to avoid blocking, it can easily miss issues affecting some percentage of users.
Personally I think adaptive sampling (a.k.a rate-limited sampler) may be the silver bullet here.
Yes, adaptive sampling is the ideal solution for this matter. But, IMHO, I feel it will take some time to implement.
But until that happens, I think it's reasonable to make the queue non-blocking and log a warning to let users know they may need to use a lower sampling rate.
Absolutely agree. It helps us search for a good sampling ratio (rather than a very pessimistic one such as 0.001%). And we really need a non-blocking queue to avoid stopping our service in production.
I can try 0.22.0 in a non-critical service in our company. But to try it in our hottest service, non-blocking is important.
But the key point I want to emphasize is that it is hard to know a proper sampling ratio in some kinds of environments. In our case, some microservices amplify requests (generating many outgoing API requests for each incoming request). And there are random user-activity spikes. In such a situation, it is hard to know a good sampling ratio. If I set a very pessimistic sampling ratio (such as 0.001%) to avoid blocking, it can easily miss issues affecting some percentage of users.
Yes, I was thinking of the same reason, and I agree that in this scenario it's better to drop-and-log rather than slow down the user's whole application.
@bogdandrutu WDYT? The only alternative I see here is to always recommend a low sampling rate, but as @saiya said, that has its own problems.
Hi @saiya, I chatted with @bogdandrutu about this PR today. Bogdan mentioned that we'll provide the non-blocking option in the new OpenTelemetry client libraries. (As you may be aware, we're currently migrating the OpenCensus project to the merged OpenTelemetry project, and OpenCensus will be put into maintenance mode afterwards.) Given that,
- Do you need this feature urgently in OpenCensus, or can you wait for us to implement it in OpenTelemetry?
- If you need this feature now in OpenCensus, can you keep blocking as the default and make non-blocking configurable through a manifest/env vars/flags? Changing the default from blocking to non-blocking is a kind of breaking change, and this is something we want to avoid.
- Have you had a chance to try out the v0.22.1 release? We have another customer facing a similar issue, and the latest release fixed it (Memory settings #1767).
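The configurable-default suggestion in the second bullet could be wired up with a system property along these lines. The property name and class are hypothetical; opencensus-java does not define them:

```java
// Hypothetical opt-in switch: blocking stays the default, so existing
// behavior is unchanged unless the user explicitly enables non-blocking,
// e.g. with -Dopencensus.eventqueue.nonblocking=true on the command line.
final class EventQueueConfig {
  // Invented property name, for illustration only.
  static final String NON_BLOCKING_PROPERTY = "opencensus.eventqueue.nonblocking";

  /** Returns true only when the user has explicitly opted in. */
  static boolean nonBlockingEnabled() {
    return Boolean.parseBoolean(System.getProperty(NON_BLOCKING_PROPERTY, "false"));
  }
}
```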
@songy23, when do you expect there will be a beta or stable release of OpenTelemetry?
when do you expect there will be a beta or stable release of OpenTelemetry?
The current plan is to have the API ready and published by the end of this month. The SDK (a.k.a. the implementation) will be ready early next quarter.
Hi @songy23 , thank you for sharing current status.
Bogdan mentioned we'll provide the non-blocking option in the new OpenTelemetry client libraries.
Very good news! Is there any specification / code / PR / issue in the OpenTelemetry repositories (e.g. documentation in https://github.com/open-telemetry/opentelemetry-specification) that clarifies the non-blocking option? Or can I send a PR to one of the OpenTelemetry repositories?
Do you need this feature urgently in OpenCensus, or can you wait for us to implement it in OpenTelemetry?
Not urgent. I can wait some months (but I do not want to wait years). In my case, I have currently enabled OpenCensus only in non-heavy-traffic systems.
In my understanding, OpenTelemetry will provide an OpenCensus shim, so I can reuse my OpenCensus integration to use OpenTelemetry.
The current plan is to have the API ready and published by the end of this month. The SDK (a.k.a. the implementation) will be ready early next quarter.
Good news! I hope OpenTelemetry will provide an SDK for Stackdriver Tracing next quarter.
Have you had a chance to try out the v0.22.1 release? We have another customer facing a similar issue, and the latest release fixed it (#1767)
I am using 0.22.1 only in non-heavy systems, and it is working well. But I have not enabled OpenCensus in heavy systems yet.
Is there any specification / code / PR / issue in the OpenTelemetry repositories (e.g. documentation in https://github.com/open-telemetry/opentelemetry-specification) that clarifies the non-blocking option? Or can I send a PR to one of the OpenTelemetry repositories?
Not yet, since for now we're focusing on the API/data models. Please feel free to open an issue under https://github.com/open-telemetry/opentelemetry-specification.
I am using 0.22.1 only in non-heavy systems, and it is working well. But I have not enabled OpenCensus in heavy systems yet.
Thanks for the info!
In my understanding, OpenTelemetry will provide an OpenCensus shim, so I can reuse my OpenCensus integration to use OpenTelemetry.
Exactly. The goal is for OpenTelemetry to be 100% backwards-compatible with OpenCensus, so the existing integration will continue to work.