We had a customer report an issue where the master would randomly stop attempting to connect to some agents.
14:24:55 critical/ApiListener: Timeout while reconnecting to endpoint 'agent-123' via host 'agent-123.example.com' and port '5665', cancelling attempt
14:24:55 information/ApiListener: New client connection for identity 'agent-123.example' to [10.20.30.40]:5665
However, there were no subsequent log messages for that agent with an "Operation canceled" error, as there should be in case of a timeout, suggesting some problem with the timeout mechanism.
My current theory is that cancel() doesn't really do what we'd need it to do here (icinga2/lib/remote/apilistener.cpp, lines 588 to 598 in 894d6aa). Note that the documentation for cancel() says:
"This function causes all outstanding asynchronous connect, send and receive operations to finish immediately, and the handlers for cancelled operations will be passed the boost::asio::error::operation_aborted error."
What happens if there's progress happening on the connection right when the timeout fires? Might there be no outstanding operation that could be cancelled, rendering the timeout ineffective?
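To make the suspected race concrete, here is a minimal toy model in plain C++. It is not Boost.Asio and not the Icinga 2 code: ToySocket, Result and RunScenario are hypothetical names made up for illustration. The only behaviour taken from the documentation quoted above is that cancelling affects outstanding operations only. A two-step connect is assumed, and the timeout fires either while step 1 is outstanding or in the gap between step 1 completing and step 2 being started.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model, NOT Boost.Asio: a "socket" that can have at most
// one outstanding asynchronous operation.
enum class Result { Ok, Aborted };

class ToySocket {
public:
    using Handler = std::function<void(Result)>;

    // Start an operation. It stays outstanding until Complete() or Cancel().
    void AsyncOp(Handler h) {
        assert(!pending_ && "the model allows only one outstanding operation");
        pending_ = std::move(h);
    }

    // Models cancel(): aborts the outstanding operation, if there is one.
    // With nothing outstanding this is a no-op and leaves no trace behind.
    void Cancel() { Finish(Result::Aborted); }

    // Models the I/O layer completing the outstanding operation normally.
    void Complete() { Finish(Result::Ok); }

    bool HasPending() const { return static_cast<bool>(pending_); }

private:
    void Finish(Result r) {
        if (!pending_) return;
        Handler h = std::move(pending_);
        pending_ = nullptr;  // clear before invoking, the handler may start a new op
        h(r);
    }

    Handler pending_;
};

// Runs an assumed two-step connect and returns the resulting event log.
// timeoutWhileOutstanding == true:  timeout fires while step 1 is pending.
// timeoutWhileOutstanding == false: timeout fires after step 1 completed but
//                                   before step 2 was started.
std::vector<std::string> RunScenario(bool timeoutWhileOutstanding) {
    std::vector<std::string> log;
    ToySocket sock;
    bool aborted = false;

    sock.AsyncOp([&](Result r) {
        aborted = (r == Result::Aborted);
        log.push_back(aborted ? "step1 aborted" : "step1 ok");
    });

    if (!timeoutWhileOutstanding) {
        sock.Complete();  // step 1 finishes just before the timeout fires
    }
    sock.Cancel();        // the timeout handler; a no-op if nothing is pending
    log.push_back("timeout fired");

    if (aborted) return log;  // the expected path: the attempt is torn down

    // The continuation of step 1 never learned about the timeout and goes on.
    sock.AsyncOp([&](Result r) {
        log.push_back(r == Result::Aborted ? "step2 aborted" : "step2 ok");
    });
    if (sock.HasPending()) log.push_back("step2 pending");
    return log;
}
```

In this model, RunScenario(true) ends with the abort one would expect, while RunScenario(false) ends with step 2 still pending and nothing left that would ever cancel it. That would match a connection attempt that hangs without ever logging "Operation canceled".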
The customer seems to be happy since increasing the related timeout and has reported no more issues, which seems to confirm that the issue is related to the timeout.
Possible fix: call shutdown() on the TCP layer instead.
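The reason shutdown() looks attractive is that it changes the state of the socket itself, so it also affects operations started after the call. The sketch below checks this at the POSIX level on Linux. It uses an AF_UNIX socketpair as a stand-in for a TCP connection to stay self-contained (an assumption of this example, as are the names ShutdownProbe and ProbeShutdown). In Boost.Asio the corresponding call would be shutdown() on the socket with boost::asio::socket_base::shutdown_both; whether that is the right fix for the code in question still needs to be verified.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper type: results of I/O attempted AFTER shutdown().
struct ShutdownProbe {
    ssize_t readResult;  // return value of read() after shutdown()
    ssize_t sendResult;  // return value of send() after shutdown()
    int sendErrno;       // errno set by that send(), 0 if it did not fail
};

ShutdownProbe ProbeShutdown() {
    int fds[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
    assert(rc == 0 && "socketpair failed");
    (void)rc;

    // Shut the socket down while no operation is outstanding on it.
    // Unlike a cancel, this leaves the socket in a shut-down state.
    rc = shutdown(fds[0], SHUT_RDWR);
    assert(rc == 0 && "shutdown failed");
    (void)rc;

    ShutdownProbe p{0, 0, 0};

    // A read started after the shutdown does not block, it reports EOF.
    char buf[1];
    p.readResult = read(fds[0], buf, sizeof buf);

    // A send started after the shutdown fails. MSG_NOSIGNAL (Linux)
    // suppresses SIGPIPE so the error is reported via errno instead.
    errno = 0;
    p.sendResult = send(fds[0], "x", 1, MSG_NOSIGNAL);
    if (p.sendResult < 0) p.sendErrno = errno;

    close(fds[0]);
    close(fds[1]);
    return p;
}
```

On Linux, ProbeShutdown() reports a read result of 0 (EOF) and a failed send with EPIPE, even though both calls were issued after the shutdown. A late-starting operation therefore cannot slip past the timeout the way it can with cancel().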
ref/IP/44784