Fix infinite retry loops in flb_tls_net_read/write by spstack · Pull Request #11547 · fluent/fluent-bit
This set of changes addresses an issue where the `flb_tls_net_read` and `flb_tls_net_write` functions can hang and consume 100% CPU. The issue occurs when a TLS connection is lost and the underlying OpenSSL implementation repeatedly returns `SSL_ERROR_WANT_READ` or `SSL_ERROR_WANT_WRITE`. If no `io_timeout` is configured, the thread enters a tight loop, retrying the read or write indefinitely until the process is restarted.

This can be worked around by setting the `net.io_timeout` config setting to something other than the default, but this set of changes addresses the case where no timeout is configured. The solution is to fall back to a high timeout value when the setting is zero. This does not modify the `net.io_timeout` value itself, and it applies only to these two functions. The reasoning is that there should never be a case where the user wants to spin forever here.

This change also adds a small delay between retries, so that even in the timeout case it doesn't load the CPU unnecessarily while waiting for the next bit of data.

Signed-off-by: Scott Stack <scottstack14@gmail.com>
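For reviewers who want the shape of the fix without opening the diff, here is a minimal sketch of a bounded retry loop with a fallback timeout and a short pause between attempts. It is not the code from this PR: every `sketch_` name, the 60-second fallback, and the 1 ms delay are assumptions made for illustration, and a stub reader stands in for the OpenSSL call that would report `SSL_ERROR_WANT_READ`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() and nanosleep() */
#endif

#include <assert.h>
#include <stddef.h>
#include <time.h>

/* All names and values below are hypothetical, not fluent-bit identifiers. */
#define SKETCH_WANT_READ          (-2)  /* stands in for SSL_ERROR_WANT_READ */
#define SKETCH_TIMEOUT            (-3)  /* returned when the deadline passes */
#define SKETCH_DEFAULT_TIMEOUT_S  60    /* assumed fallback when timeout is 0 */
#define SKETCH_RETRY_DELAY_MS     1     /* assumed pause between retries */

typedef int (*sketch_read_fn)(void *ctx, void *buf, size_t len);

/* Monotonic time in milliseconds, unaffected by wall-clock changes. */
static long long sketch_now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long) ts.tv_sec * 1000LL + ts.tv_nsec / 1000000LL;
}

static void sketch_sleep_ms(long ms)
{
    struct timespec ts;

    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/*
 * Retry read_fn while it reports "want read", but never forever:
 * a configured timeout of 0 is replaced by a fallback for this loop only.
 * The caller's configured value is not modified.
 */
static int sketch_read_with_timeout(sketch_read_fn read_fn, void *ctx,
                                    void *buf, size_t len, int io_timeout_s)
{
    int ret;
    int effective_s;
    long long deadline_ms;

    assert(read_fn != NULL);

    effective_s = io_timeout_s > 0 ? io_timeout_s : SKETCH_DEFAULT_TIMEOUT_S;
    deadline_ms = sketch_now_ms() + (long long) effective_s * 1000LL;

    for (;;) {
        ret = read_fn(ctx, buf, len);
        if (ret != SKETCH_WANT_READ) {
            return ret;                 /* data, EOF, or a hard error */
        }
        if (sketch_now_ms() >= deadline_ms) {
            return SKETCH_TIMEOUT;      /* give up instead of spinning */
        }
        sketch_sleep_ms(SKETCH_RETRY_DELAY_MS);  /* avoid a busy loop */
    }
}

/* Stub reader: reports "want read" a set number of times, then 5 bytes. */
struct sketch_stub {
    int want_read_left;   /* negative means the connection stalls forever */
    int calls;
};

static int sketch_stub_read(void *ctx, void *buf, size_t len)
{
    struct sketch_stub *s = ctx;

    (void) buf;
    (void) len;
    s->calls++;
    if (s->want_read_left != 0) {
        if (s->want_read_left > 0) {
            s->want_read_left--;
        }
        return SKETCH_WANT_READ;
    }
    return 5;
}
```

Calling `sketch_read_with_timeout` with the stub set to stall forever and a 1-second timeout returns `SKETCH_TIMEOUT` after about a second, rather than spinning at full CPU. The same structure would apply to the write path, with `SSL_ERROR_WANT_WRITE` as the retry condition.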