
Recursor latency increased 600% after updating from 48 to 51 #14832

Open
2 tasks done
romantomjak opened this issue Nov 6, 2024 · 20 comments

@romantomjak commented Nov 6, 2024

Short description

Hello! 👋

First of all - big thanks for making this software. I just wanted to drop a quick note on a possible performance regression after updating the Debian repo from 4.8 to 5.1.

Query latency increased from 4 ms to about 35 ms on average after the upgrade, with no changes to the config on my side.

Environment

  • Operating system: Debian 6.1.0-26-amd64
  • Software version: 5.1.2
  • Software source: PowerDNS repository

Steps to reproduce

  1. Install the Recursor from the 4.8 repo
  2. Run some queries and measure latency (e.g. with the dig spot-check below)
  3. Install the Recursor from the 5.1 repo
  4. Run the same queries; the measured latency now increases
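
A quick way to spot-check per-query latency from a client (a sketch; example.com stands in for any name and 127.0.0.1 for the Recursor's listen address):

$ dig @127.0.0.1 example.com | grep "Query time"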

Expected behaviour

I expected latency to stay roughly the same.

Actual behaviour

Latency increased by about 600%.

Other information

Before the update, latency hovers at 4 ms (graph omitted).

After the update, latency does not go below 15 ms and peaks at 40 ms (graph omitted).

Answer buckets (graph omitted).

Prometheus queries for the graphs:

avg_over_time(pdns_recursor_x_our_latency[$__rate_interval])
avg_over_time(pdns_recursor_qa_latency[$__rate_interval])

increase(pdns_recursor_answers0_1[$__rate_interval])
increase(pdns_recursor_answers1_10[$__rate_interval])
increase(pdns_recursor_answers10_100[$__rate_interval])
increase(pdns_recursor_answers100_1000[$__rate_interval])
increase(pdns_recursor_answers_slow[$__rate_interval])
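
For context, a ratio like the following (a sketch, not one of the dashboard queries above) makes the shift towards slower answers easier to spot than the raw buckets:

sum(increase(pdns_recursor_answers_slow[$__rate_interval]))
/
sum(
    increase(pdns_recursor_answers0_1[$__rate_interval])
  + increase(pdns_recursor_answers1_10[$__rate_interval])
  + increase(pdns_recursor_answers10_100[$__rate_interval])
  + increase(pdns_recursor_answers100_1000[$__rate_interval])
  + increase(pdns_recursor_answers_slow[$__rate_interval])
)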

recursor.conf

# See https://doc.powerdns.com/recursor/settings.html

local-address=...
allow-from=...

refresh-on-ttl-perc=10

max-negative-ttl=300

lua-dns-script=/var/lib/powerdns/adblock.lua

forward-zones-recurse=...

webserver=yes
webserver-address=0.0.0.0
webserver-allow-from=...
webserver-password=...

adblock.lua

driver = require "luasql.sqlite3"
env = assert( driver.sqlite3() )
con = assert(env:connect("/var/lib/powerdns/adblock.db"))

function preresolve(dq)
    domain = dq.qname:toStringNoDot()

    while domain ~= "" do
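        -- check the query name and then each parent domain in turn ("ads.example.com", "example.com", "com")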
        local sth = assert(con:execute(string.format("SELECT 1 FROM domains WHERE name = '%s'", con:escape(domain))))
        if sth:fetch() then
            dq.variable = true -- disable packet cache
            dq.rcode = pdns.NXDOMAIN
            dq:addAnswer(pdns.TXT, "\"dropped by adblock\"", 1) -- for debugging
            return true
        end
        domain = domain:gsub("^[^.]*%.?", "")
    end

    return false
end
@phonedph1 (Contributor)

What is measuring "Query latency"?

@romantomjak (Author)

Apologies, just updated the issue. This is measured by Prometheus via the /metrics endpoint exposed by the PowerDNS webserver.

@phonedph1 (Contributor)

Any clues in the pdns_recursor_answers* metrics (or the auth4/auth6 separate ones)?

@omoerbeek (Member) commented Nov 7, 2024

Please show your configuration.

It might be a case of https://docs.powerdns.com/recursor/performance.html#threading-and-distribution-of-queries (specifically the Imbalance section). Please show some lines of the reported periodic stats.

There were also changes in the way metrics are computed between 4.8 and 5.1, but we need more data to decide what is going on.

Do you have any measures of latency seen by clients?

@romantomjak (Author) commented Nov 7, 2024

@phonedph1, it looks like a lot more queries started to time out after the upgrade? I'm not sure.

@omoerbeek, I don't measure latency seen by clients. Config is fairly minimal - I've updated the issue with the config I'm using. Note that I'm also using a Lua script, but that's been there since 4.8.

I don't have any logs from the day I upgraded, but here are some from this morning:

Nov 07 07:13:30 router01 pdns-recursor[86665]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1730963610.409" cache-entries="4643" negcache-entries="67" packetcache-acquired="10949788" packetcache-contended="13" packetcache-contended-perc="0.000000" questions="68853" record-cache-acquired="7220504" record-cache-contended="2228" record-cache-contended-perc="0.000000" record-cache-hitratio-perc="46"
Nov 07 07:13:30 router01 pdns-recursor[86665]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1730963610.409" edns-entries="0" failed-host-entries="40" non-resolving-nameserver-entries="0" nsspeed-entries="351" outqueries-per-query="22" saved-parent-ns-sets-entries="1" throttle-entries="0"
Nov 07 07:13:30 router01 pdns-recursor[86665]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1730963610.411" concurrent-queries="0" dot-outqueries="0" idle-tcpout-connections="0" outgoing-timeouts="1719" tcp-outqueries="6" throttled-queries-perc="1"
Nov 07 07:13:30 router01 pdns-recursor[86665]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1730963610.411" packetcache-entries="104" packetcache-hitratio-perc="80" taskqueue-expired="33" taskqueue-pushed="1189" taskqueue-size="0"
Nov 07 07:13:30 router01 pdns-recursor[86665]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1730963610.411" count="40587" thread="0" tname="worker"
Nov 07 07:13:30 router01 pdns-recursor[86665]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1730963610.411" count="28266" thread="1" tname="worker"
Nov 07 07:13:30 router01 pdns-recursor[86665]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1730963610.411" count="0" thread="2" tname="tcpworker"
Nov 07 07:13:30 router01 pdns-recursor[86665]: msg="Periodic QPS report" subsystem="stats" level="0" prio="Info" tid="0" ts="1730963610.411" averagedOver="1800" qps="2"

@omoerbeek (Member) commented Nov 7, 2024

I edited the stats to make them more readable. They look reasonable at first sight, though failed-host-entries is a bit high and taskqueue-expired as well.

Can you push stats to our public metronome server?
Something like:

carbon-ourname=yourrecname
carbon-server=37.252.122.50

This allows us to see the most relevant stats over time; it often reveals something.
Check https://metronome1.powerdns.com to see examples and your stats once enabled.

If that does not reveal anything I'm going to try to reproduce (that will be tomorrow at the earliest). I might need your SQLite DB for a proper reproduction.

@omoerbeek (Member)

Oh, and please show rec_control top-timeouts

@romantomjak (Author)

Sure, I've set carbon-ourname to pdns.recursor.gh-issue-14832 👍

$ rec_control top-timeouts
Over last 2070 entries:
4.30%	205.251.193.96
4.06%	205.251.197.100
3.82%	205.251.193.127
3.57%	205.251.197.116
3.19%	205.251.192.166
2.80%	205.251.197.213
2.75%	205.251.192.248
2.61%	205.251.197.38
2.37%	205.251.196.72
2.32%	205.251.193.178
1.93%	205.251.192.153
1.74%	205.251.197.248
1.59%	205.251.192.159
1.55%	47.241.207.18
1.55%	205.251.196.174
1.50%	205.251.192.235
1.45%	205.251.196.48
1.40%	47.241.207.16
1.40%	205.251.197.41
1.30%	64.98.148.18

52.80%	rest

@romantomjak (Author)

I've uploaded the adblock.db to Backblaze in case you need it.

@mnordhoff (Contributor)

Almost all of those IPs — the ones in 205.251.192.0/21 — are from Amazon’s Route 53 DNS service. Do you know why it’s not responding reliably? AWS is not having a massive global DNS outage today.

For example, does “mtr -4 -r -w -c 60 205.251.193.96” work? (Note that Amazon’s network can be a bit funny and mtr or traceroute can have issues even if normal traffic works.)

Or what does “dig +norecurse slackb.com ns @205.251.193.96” output (picking a random domain hosted on that nameserver)?

I don’t think slackb.com is in that blocklist, but I might be mistaken, or maybe another domain is. This is wild speculation — and the last time I speculated wildly about something, I was wrong — but maybe some of the “timeouts” are related to how blocked queries are accounted for?

@omoerbeek (Member)

I don't see your server coming through yet on https://metronome1.powerdns.com/

@romantomjak (Author)

Hmm, interesting. I'm not getting anything back from that IP. I cancelled the mtr because there was no output:

$ mtr -4 -r -w -c 60 205.251.193.96
Start: 2024-11-07T10:49:53+0000
^C

$ dig +norecurse slackb.com ns @205.251.193.96
;; communications error to 205.251.193.96#53: timed out
;; communications error to 205.251.193.96#53: timed out
^C

$ ping 205.251.193.96
PING 205.251.193.96 (205.251.193.96) 56(84) bytes of data.
^C
--- 205.251.193.96 ping statistics ---
9 packets transmitted, 0 received, 100% packet loss, time 8190ms

@romantomjak (Author) commented Nov 7, 2024

@omoerbeek, I restarted the service but for some reason it didn't pick up the new config 🤔 I did a start/stop just now and it looks like it appears on the list now!

Edit: no, it doesn't. Let me change it to something else.

@romantomjak (Author) commented Nov 7, 2024

OK, I think it didn't like the dots in the name, which the docs warn against ("Be careful not to include any dots in this setting, unless you know what you are doing.") 😅

I changed it to just gh-issue-14832 and I can see it in the list now.
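
For anyone copying this setup, the pair that ended up working (server address from the earlier comment):

carbon-ourname=gh-issue-14832
carbon-server=37.252.122.50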

@mnordhoff (Contributor)

It's unusual to be unable to dig or ping/mtr one of the world's larger DNS services. Do only some zones and nameservers fail or all of them?

Is there a firewall blocking ICMP, or intercepting some DNS traffic?

Or a BGP routing issue with Amazon?

Hard to imagine network issues being related to a PowerDNS upgrade, but it is odd.

@omoerbeek (Member)

Fixing the reachability of Route 53 servers is important of course.

My hypothesis is that the network issues already existed with the 4.8 recursor, but are now more visible in the metrics.
A cause could be https://docs.powerdns.com/recursor/settings.html#bypass-server-throttling-probability, resulting in more attempts at contacting the unreachable auths than before. I'll spend some time tomorrow to confirm or refute that hypothesis.
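
One way to test that hypothesis (a sketch, not something tried in this thread; it assumes the 5.1 default for this setting is in effect) would be to temporarily disable the bypass and watch whether the reported latency drops:

# disable the throttle bypass so throttled (unreachable) auths are not retried early
bypass-server-throttling-probability=0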

@romantomjak (Author)

My firewall is pretty basic and I'm not blocking any hosts explicitly. I can receive replies if I ping hosts on the internet. I'll dig a bit more to see exactly which hosts are failing.

Yeah, potentially. I've raised this with my ISP so they can take a look too.

@omoerbeek added the rec label on Nov 8, 2024
@omoerbeek (Member) commented Nov 8, 2024

I did a couple of test runs with both 4.8 and 5.1:

  • With no auth IPs blocked
  • With some Route53 IPs blocked
  • With many Route53 auths blocked

I saw nothing like the increase in reported latency between 4.8 and 5.1 that you did; in fact, in all test runs 5.1 had slightly lower reported latency.

So I'm not sure what is causing your reported latency to be higher. Some guesses:

  • Your Lua code matters (I have not yet tested with your Lua script)
  • Your forward-zones-recurse=... config matters. Please show it!
  • For some other reason the unreachable auths have a bigger impact on measured latency in 5.1 than in 4.8, in a manner not covered by my test. Maybe your query traffic is quite different from my test set (I used a capture of the DNS traffic to a resolver on my local LAN as the test set)
  • There was some other change in your setup that happened at about the same time as your Recursor upgrade

It's probably best to leave this for the moment until the routing to Route53 is fixed and see how that influences the reported latency.

@romantomjak (Author)

Sure, forward-zones-recurse is set to forward-zones-recurse=home.arpa=127.0.0.2:53, which points to a PowerDNS authoritative server that handles my private zones.

I also did a quick test against the IPs in the rec_control top-timeouts output and I'm not getting any DNS replies from the whole of 205.251.192.0/21. Thanks everyone for looking into this, but I'm starting to think this one might be for my ISP. I'll report back when I hear from them.

@omoerbeek (Member)

OK, in that case (forwarding to an authoritative server) you should normally use forward-zones and not forward-zones-recurse, though in your use-case the difference probably does not matter.
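
For reference, a minimal sketch of that change, reusing the zone and address shown above:

# forward-zones forwards with the RD bit unset, which is what an authoritative
# server expects; forward-zones-recurse sets RD on the forwarded queries
forward-zones=home.arpa=127.0.0.2:53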
