
Recursor latency increased 600% after updating from 48 to 51 #14832

Open
2 tasks done
romantomjak opened this issue Nov 6, 2024 · 20 comments

@romantomjak commented Nov 6, 2024

Short description

Hello! 👋

First of all - big thanks for making this software. I just wanted to drop a quick note on a possible performance regression after updating the Debian repo from 4.8 to 5.1.

Query latency increased from 4 ms to about 35 ms on average after the upgrade, with no changes to the config on my side.

Environment

  • Operating system: Debian 6.1.0-26-amd64
  • Software version: 5.1.2
  • Software source: PowerDNS repository

Steps to reproduce

  1. Install the Recursor from the 4.8 repo
  2. Run some queries and measure latency (e.g. with the dig spot-check below)
  3. Install the Recursor from the 5.1 repo
  4. Run the same queries; the measured latency now increases
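
A quick way to spot-check per-query latency from a client (a sketch; example.com stands in for any name and 127.0.0.1 for the Recursor's listen address):

$ dig @127.0.0.1 example.com | grep "Query time"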

Expected behaviour

I expected latency to stay roughly the same.

Actual behaviour

Latency increased by about 600%.

Other information

Before the update, latency hovers at 4 ms (graph omitted).

After the update, latency does not go below 15 ms and peaks at 40 ms (graph omitted).

Answer buckets (graph omitted).

Prometheus queries for the graphs:

avg_over_time(pdns_recursor_x_our_latency[$__rate_interval])
avg_over_time(pdns_recursor_qa_latency[$__rate_interval])

increase(pdns_recursor_answers0_1[$__rate_interval])
increase(pdns_recursor_answers1_10[$__rate_interval])
increase(pdns_recursor_answers10_100[$__rate_interval])
increase(pdns_recursor_answers100_1000[$__rate_interval])
increase(pdns_recursor_answers_slow[$__rate_interval])
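
For context, a ratio like the following (a sketch, not one of the dashboard queries above) makes the shift towards slower answers easier to spot than the raw buckets:

sum(increase(pdns_recursor_answers_slow[$__rate_interval]))
/
sum(
    increase(pdns_recursor_answers0_1[$__rate_interval])
  + increase(pdns_recursor_answers1_10[$__rate_interval])
  + increase(pdns_recursor_answers10_100[$__rate_interval])
  + increase(pdns_recursor_answers100_1000[$__rate_interval])
  + increase(pdns_recursor_answers_slow[$__rate_interval])
)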

recursor.conf

# See https://doc.powerdns.com/recursor/settings.html

local-address=...
allow-from=...

refresh-on-ttl-perc=10

max-negative-ttl=300

lua-dns-script=/var/lib/powerdns/adblock.lua

forward-zones-recurse=...

webserver=yes
webserver-address=0.0.0.0
webserver-allow-from=...
webserver-password=...

adblock.lua

driver = require "luasql.sqlite3"
env = assert( driver.sqlite3() )
con = assert(env:connect("/var/lib/powerdns/adblock.db"))

function preresolve(dq)
    domain = dq.qname:toStringNoDot()

    while domain ~= "" do
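        -- check the query name and then each parent domain in turn ("ads.example.com", "example.com", "com")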
        local sth = assert(con:execute(string.format("SELECT 1 FROM domains WHERE name = '%s'", con:escape(domain))))
        if sth:fetch() then
            dq.variable = true -- disable packet cache
            dq.rcode = pdns.NXDOMAIN
            dq:addAnswer(pdns.TXT, "\"dropped by adblock\"", 1) -- for debugging
            return true
        end
        domain = domain:gsub("^[^.]*%.?", "")
    end

    return false
end
@phonedph1 (Contributor)

What is measuring "Query latency"?

@romantomjak (Author)

Apologies, just updated the issue. This is measured by Prometheus via the /metrics endpoint exposed by the PowerDNS webserver.

@phonedph1 (Contributor)

Any clues in the pdns_recursor_answers* metrics (or the auth4/auth6 separate ones)?

@omoerbeek (Member) commented Nov 7, 2024

Please show your configuration.

It might be a case of https://docs.powerdns.com/recursor/performance.html#threading-and-distribution-of-queries (specifically the Imbalance section). Please show some lines of the reported periodic stats.

There were also changes in the way metrics are computed between 4.8 and 5.1, but we need more data to decide what is going on.

Do you have any measures of latency seen by clients?

@romantomjak (Author) commented Nov 7, 2024

@phonedph1, it looks like a lot more queries started to time out after the upgrade? I'm not sure.

@omoerbeek, I don't measure latency seen by clients. Config is fairly minimal - I've updated the issue with the config I'm using. Note that I'm also using a Lua script, but that's been there since 4.8.

I don't have any logs from the day I upgraded, but here are some from this morning:

Nov 07 07:13:30 router01 pdns-recursor[86665]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1730963610.409" cache-entries="4643" negcache-entries="67" packetcache-acquired="10949788" packetcache-contended="13" packetcache-contended-perc="0.000000" questions="68853" record-cache-acquired="7220504" record-cache-contended="2228" record-cache-contended-perc="0.000000" record-cache-hitratio-perc="46"
Nov 07 07:13:30 router01 pdns-recursor[86665]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1730963610.409" edns-entries="0" failed-host-entries="40" non-resolving-nameserver-entries="0" nsspeed-entries="351" outqueries-per-query="22" saved-parent-ns-sets-entries="1" throttle-entries="0"
Nov 07 07:13:30 router01 pdns-recursor[86665]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1730963610.411" concurrent-queries="0" dot-outqueries="0" idle-tcpout-connections="0" outgoing-timeouts="1719" tcp-outqueries="6" throttled-queries-perc="1"
Nov 07 07:13:30 router01 pdns-recursor[86665]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1730963610.411" packetcache-entries="104" packetcache-hitratio-perc="80" taskqueue-expired="33" taskqueue-pushed="1189" taskqueue-size="0"
Nov 07 07:13:30 router01 pdns-recursor[86665]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1730963610.411" count="40587" thread="0" tname="worker"
Nov 07 07:13:30 router01 pdns-recursor[86665]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1730963610.411" count="28266" thread="1" tname="worker"
Nov 07 07:13:30 router01 pdns-recursor[86665]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1730963610.411" count="0" thread="2" tname="tcpworker"
Nov 07 07:13:30 router01 pdns-recursor[86665]: msg="Periodic QPS report" subsystem="stats" level="0" prio="Info" tid="0" ts="1730963610.411" averagedOver="1800" qps="2"

@omoerbeek (Member) commented Nov 7, 2024

I edited the stats to make them more readable. They look reasonable at first sight, though failed-host-entries is a bit high and taskqueue-expired as well.

Can you push stats to our public metronome server?
Something like:

carbon-ourname=yourrecname
carbon-server=37.252.122.50

This allows us to see the most relevant stats over time; it often reveals something.
Check https://metronome1.powerdns.com to see examples and your stats once enabled.

If that does not reveal anything I'm going to try to reproduce (that will be tomorrow at the earliest). I might need your SQLite DB for a proper reproduction.

@omoerbeek (Member)

Oh, and please show rec_control top-timeouts

@romantomjak (Author)

Sure, I've set carbon-ourname to pdns.recursor.gh-issue-14832 👍

$ rec_control top-timeouts
Over last 2070 entries:
4.30%	205.251.193.96
4.06%	205.251.197.100
3.82%	205.251.193.127
3.57%	205.251.197.116
3.19%	205.251.192.166
2.80%	205.251.197.213
2.75%	205.251.192.248
2.61%	205.251.197.38
2.37%	205.251.196.72
2.32%	205.251.193.178
1.93%	205.251.192.153
1.74%	205.251.197.248
1.59%	205.251.192.159
1.55%	47.241.207.18
1.55%	205.251.196.174
1.50%	205.251.192.235
1.45%	205.251.196.48
1.40%	47.241.207.16
1.40%	205.251.197.41
1.30%	64.98.148.18

52.80%	rest

@romantomjak (Author)

I've uploaded the adblock.db to Backblaze in case you need it.

@mnordhoff (Contributor)

Almost all of those IPs — the ones in 205.251.192.0/21 — are from Amazon’s Route 53 DNS service. Do you know why it’s not responding reliably? AWS is not having a massive global DNS outage today.

For example, does “mtr -4 -r -w -c 60 205.251.193.96” work? (Note that Amazon’s network can be a bit funny and mtr or traceroute can have issues even if normal traffic works.)

Or what does “dig +norecurse slackb.com ns @205.251.193.96” output (picking a random domain hosted on that nameserver)?

I don’t think slackb.com is in that blocklist, but I might be mistaken, or maybe another domain is. This is wild speculation — and the last time I speculated wildly about something, I was wrong — but maybe some of the “timeouts” are related to how blocked queries are accounted for?

@omoerbeek (Member)

I don't see your server coming through yet on https://metronome1.powerdns.com/

@romantomjak (Author)

Hmm, interesting. I'm not getting anything back from that IP. I cancelled the mtr because there was no output:

$ mtr -4 -r -w -c 60 205.251.193.96
Start: 2024-11-07T10:49:53+0000
^C

$ dig +norecurse slackb.com ns @205.251.193.96
;; communications error to 205.251.193.96#53: timed out
;; communications error to 205.251.193.96#53: timed out
^C

$ ping 205.251.193.96
PING 205.251.193.96 (205.251.193.96) 56(84) bytes of data.
^C
--- 205.251.193.96 ping statistics ---
9 packets transmitted, 0 received, 100% packet loss, time 8190ms

@romantomjak (Author) commented Nov 7, 2024

@omoerbeek, I restarted the service but for some reason it didn't pick up the new config 🤔 I did a start/stop just now and it looks like it appears on the list now!

Edit: no, it doesn't. Let me change it to something else.

@romantomjak (Author) commented Nov 7, 2024

OK, I think it didn't like the dots in the name, which the docs warn against ("Be careful not to include any dots in this setting, unless you know what you are doing.") 😅

I changed it to just gh-issue-14832 and I can see it in the list now.
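
For anyone copying this setup, the pair that ended up working (server address from the earlier comment):

carbon-ourname=gh-issue-14832
carbon-server=37.252.122.50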

@mnordhoff (Contributor)

It's unusual to be unable to dig or ping/mtr one of the world's larger DNS services. Do only some zones and nameservers fail or all of them?

Is there a firewall blocking ICMP, or intercepting some DNS traffic?

Or a BGP routing issue with Amazon?

Hard to imagine network issues being related to a PowerDNS upgrade, but it is odd.

@omoerbeek (Member)

Fixing the reachability of Route 53 servers is important of course.

My hypothesis is that the network issues already existed with the 4.8 recursor, but are now more visible in the metrics.
A cause could be https://docs.powerdns.com/recursor/settings.html#bypass-server-throttling-probability, resulting in more attempts at contacting the unreachable auths than before. I'll spend some time tomorrow to confirm or refute that hypothesis.
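
One way to test that hypothesis (a sketch, not something tried in this thread; it assumes the 5.1 default for this setting is in effect) would be to temporarily disable the bypass and watch whether the reported latency drops:

# disable the throttle bypass so throttled (unreachable) auths are not retried early
bypass-server-throttling-probability=0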

@romantomjak (Author)

My firewall is pretty basic and I'm not blocking any hosts explicitly. I can receive replies if I ping hosts on the internet. I'll dig a bit more to see exactly which hosts are failing.

Yeah, potentially. I've raised this with my ISP so they can take a look too.

@omoerbeek added the rec label on Nov 8, 2024
@omoerbeek (Member) commented Nov 8, 2024

I did a couple of test runs with both 4.8 and 5.1:

  • With no auth IPs blocked
  • With some Route53 IPs blocked
  • With many Route53 auths blocked

I saw nothing like the increase in reported latency between 4.8 and 5.1 that you did; in fact, in all test runs 5.1 had slightly lower reported latency.

So I'm not sure what is causing your reported latency to be higher. Some guesses:

  • Your Lua code matters (I have not yet tested with your Lua script)
  • Your forward-zones-recurse=... config matters. Please show it!
  • For some other reason the unreachable auths have a bigger impact on measured latency in 5.1 than in 4.8, in a manner not covered by my test. Maybe your query traffic is quite different from my test set (I used a capture of the DNS traffic to a resolver on my local LAN as the test set)
  • There was some other change in your setup that happened at about the same time as your Recursor upgrade

It's probably best to leave this for the moment until the routing to Route53 is fixed and see how that influences the reported latency.

@romantomjak (Author)

Sure, forward-zones-recurse is set to forward-zones-recurse=home.arpa=127.0.0.2:53, which points to a PowerDNS authoritative server that handles my private zones.

I also did a quick test against the IPs in the rec_control top-timeouts output and I'm not getting any DNS replies from the whole of 205.251.192.0/21. Thanks everyone for looking into this, but I'm starting to think this one might be for my ISP. I'll report back when I hear from them.

@omoerbeek (Member)

OK, in that case (forwarding to an authoritative server) you should normally use forward-zones and not forward-zones-recurse, though in your use-case the difference probably does not matter.
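
For reference, a minimal sketch of that change, reusing the zone and address shown above:

# forward-zones forwards with the RD bit unset, which is what an authoritative
# server expects; forward-zones-recurse sets RD on the forwarded queries
forward-zones=home.arpa=127.0.0.2:53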
