
The data dog agent encountered this error, traces_dropped(payload_too_large:466059) #141

nqf opened this issue Jul 10, 2024 · 9 comments

nqf commented Jul 10, 2024

What options do we have to control the size of the payloads being sent?

dmehala (Collaborator) commented Jul 12, 2024

Hi @nqf

Could you provide more details, such as the runtime environment, how your setup relates to dd-trace-cpp, and how frequently this issue occurs? Additionally, to make sure we're on the same page, could you explain what you mean by the Datadog proxy?

dgoffredo (Contributor) commented

My guess is the Datadog proxy is the Datadog Agent, and "payload too large" refers to this behavior in the Agent.

Looks like the default limit is 25 MB, which is an awful lot of traces.

nqf (Author) commented Jul 12, 2024

Yes, by Datadog proxy I mean the Datadog Agent. We have a service that handles approximately 19,000 requests per second. I am currently using a single global tracer; as I understand it, that creates only one HTTP client to send spans to the Agent, right?

nqf (Author) commented Jul 12, 2024

The error occurs once our load reaches about 10,000 requests per second.

nqf changed the title from "The data dog proxy encountered this error, traces_dropped(payload_too_large:466059)" to "The data dog agent encountered this error, traces_dropped(payload_too_large:466059)" on Jul 12, 2024

dgoffredo (Contributor) commented

Damien still needs to know which integration you're using, such as NGINX, Envoy, or Istio.

As a workaround, you can tell the tracing library to send payloads to the Agent more often, but that option does not have a corresponding environment variable. So, that would apply only if you're using dd-trace-cpp manually in C++ code.
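For reference, here is a minimal sketch of that configuration in C++. It assumes the flush interval is set on the Agent sub-config as config.agent.flush_interval_milliseconds, and "my-service" is just a placeholder; exact field locations may differ between dd-trace-cpp versions:

```cpp
#include <datadog/tracer.h>
#include <datadog/tracer_config.h>

namespace dd = datadog::tracing;

int main() {
  dd::TracerConfig config;
  config.service = "my-service";  // placeholder service name

  // Flush buffered traces to the Agent every 200 ms instead of the
  // default 2000 ms, so each payload holds roughly 10x fewer traces.
  config.agent.flush_interval_milliseconds = 200;

  const auto finalized = dd::finalize_config(config);
  if (!finalized) {
    // Inspect finalized.error() for the reason the config was rejected.
    return 1;
  }

  dd::Tracer tracer{*finalized};
  // ... create spans as usual ...
}
```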

nqf (Author) commented Jul 16, 2024

Our application is based on this example; the only difference is that our HTTP framework is not httplib. By the way, we have already set flush_interval_milliseconds in the program, but the error still happens. I am now planning to use multiple dd::Tracer instances.
https://github.com/DataDog/dd-trace-cpp/blob/main/examples/http-server/server/server.cpp

dgoffredo (Contributor) commented

Somebody actually used the example! That's good to hear.

If the large payloads are due to many different traces being included in a flush interval, then reducing flush_interval_milliseconds will help. For example, set it to 200 to send payloads ten times faster than the default (which is 2000). Then payloads will be, on average, ten times smaller. It depends on the traffic pattern, of course.

On the other hand, if the large payloads are due to individual traces that have many spans, then there is nothing you can configure to remedy this. dd-trace-cpp would have to be modified to break up its payloads, which is possible but not implemented.
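As a back-of-envelope illustration only (assuming roughly one trace per request, and using the numbers already mentioned in this thread): at 10,000 traces per second with the default 2000 ms flush interval, a single payload holds on the order of 20,000 traces, so traces only need to average about 1.25 KB serialized for a payload to reach a 25 MB limit. With a 200 ms interval, the same traffic yields roughly 2,000 traces per payload, leaving about 12.5 KB of room per trace.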

nqf (Author) commented Jul 16, 2024

Would it help to use multiple dd::Tracer instances? I think multiple tracers would spread the load.

dgoffredo (Contributor) commented

> By the way, we have already set it (flush_interval_milliseconds) up in the program, but it still happens

What value for flush_interval_milliseconds did you use?

> Would it help to use multiple dd::Tracer instances? I think multiple tracers would spread the load

I doubt it. It depends on the statistical distributions your application has for "traces per second" and for "spans per trace." If the issue is "traces per second," then decreasing flush_interval_milliseconds is the workaround. If the issue is "spans per trace," then decreasing flush_interval_milliseconds may help, but if your application has individual traces that are each on the order of 25 MB when serialized, there is no present workaround.

Multiple Tracer objects would imply multiple clients sending HTTP requests to the Datadog Agent. I don't see how that would be any better than decreasing flush_interval_milliseconds, and then additionally you'd have to manage which Tracer object to use for a particular service request.

The tracing library keeps track of certain telemetry metrics, but I'm not sure they can be used to infer the "distributions" I referred to above.
