auth: ugly complications to the SQL code to hopefully work better with mysql #16679

Open
miodvallat wants to merge 6 commits into PowerDNS:master from miodvallat:stickyql

@miodvallat
Contributor

Short description

As reported in #16571, some mysql "looks like we lost our connection to the server" conditions are not detected by isConnectionUsable(); instead, we receive an error while attempting to execute the SQL statement.

This PR tries to cope with that situation in three steps:

  • when raising an exception because an SQL statement failed, allow it to contain a "you should reconnect and retry" flag (which defaults to false, i.e. don't bother to reconnect).
  • wrap all SQL statement execution in a routine which will try the statement again after reconnecting, if execution raises such an exception.
  • in the mysql backend, check for explicit "server gone" errors.
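
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: SqlError, shouldRetryAfterReconnect(), isServerGone(), throwForMysqlError() and executeWithRetry() are hypothetical names. The two error numbers are the MySQL client codes CR_SERVER_GONE_ERROR and CR_SERVER_LOST, copied here as constants so the sketch builds without the MySQL headers.

```cpp
#include <stdexcept>
#include <string>

// Step 1: an exception carrying a "you should reconnect and retry" flag.
// Hypothetical stand-in; the real exception class in the tree may differ.
class SqlError : public std::runtime_error
{
public:
  explicit SqlError(const std::string& reason, bool retryAfterReconnect = false) :
    std::runtime_error(reason), d_retry(retryAfterReconnect) {}
  bool shouldRetryAfterReconnect() const { return d_retry; }

private:
  bool d_retry; // defaults to false: do not bother to reconnect
};

// MySQL client error numbers for a lost connection (values as defined in
// the MySQL client's errmsg.h).
constexpr unsigned int kServerGone = 2006; // CR_SERVER_GONE_ERROR
constexpr unsigned int kServerLost = 2013; // CR_SERVER_LOST

// Step 3: classify explicit "server gone" errors.
inline bool isServerGone(unsigned int err)
{
  return err == kServerGone || err == kServerLost;
}

// Raise an SqlError whose retry flag is set only for "server gone" errors.
[[noreturn]] inline void throwForMysqlError(unsigned int err, const std::string& reason)
{
  throw SqlError(reason, isServerGone(err));
}

// Step 2: run the statement; if it fails with the retry flag set,
// reconnect and run it exactly once more.
template <typename Execute, typename Reconnect>
auto executeWithRetry(Execute execute, Reconnect reconnect) -> decltype(execute())
{
  try {
    return execute();
  }
  catch (const SqlError& e) {
    if (!e.shouldRetryAfterReconnect()) {
      throw; // ordinary failure: propagate unchanged
    }
  }
  reconnect();
  return execute(); // a second failure propagates to the caller
}
```

A caller would pass two callables, one that executes the statement and one that forces a reconnection. Errors without the flag, such as a syntax error, propagate without any reconnection attempt.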

This PR is probably better read commit-by-commit, as the plumbing for that feature is quite invasive, compared to its actual usage.

You also get a bonus bugfix in the first commit, because I couldn't help but notice this while doing the plumbing.

Not sure how to test it, though. None of the existing tests should break, but I am not sure how to set up an unreliable mysql connection to exercise this code; advice welcome.
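
One possible way to exercise the retry path without a real unreliable server is a stub connection whose kill() mimics the server dropping the session. The sketch below is an assumption-laden illustration, not existing PowerDNS test infrastructure: StubSqlError, StubConnection and queryWithRetry() are all hypothetical, and the canned record is made up.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical error type carrying the "reconnect and retry" flag.
struct StubSqlError : std::runtime_error
{
  StubSqlError(const std::string& reason, bool retry) :
    std::runtime_error(reason), retryAfterReconnect(retry) {}
  bool retryAfterReconnect;
};

// Hypothetical stub standing in for one MySQL session.
class StubConnection
{
public:
  void kill() { d_alive = false; } // mimics the server side closing the session
  void reconnect()
  {
    d_alive = true;
    ++d_reconnects;
  }
  int reconnects() const { return d_reconnects; }

  // Returns canned rows, or fails the way a dropped session would.
  std::vector<std::string> query(const std::string& name)
  {
    if (!d_alive) {
      throw StubSqlError("server has gone away", true);
    }
    std::vector<std::string> rows;
    if (name == "example.org") {
      rows.push_back("\"v=spf1 -all\"");
    }
    return rows;
  }

private:
  bool d_alive{true};
  int d_reconnects{0};
};

// Retry once after reconnecting, and only when the error asks for it.
inline std::vector<std::string> queryWithRetry(StubConnection& conn, const std::string& name)
{
  try {
    return conn.query(name);
  }
  catch (const StubSqlError& e) {
    if (!e.retryAfterReconnect) {
      throw;
    }
  }
  conn.reconnect();
  return conn.query(name);
}
```

A test built on this would run a query, call kill(), run the same query again, and check that both results are identical and that exactly one reconnection happened.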

Checklist

I have:

  • read the CONTRIBUTING.md document
  • read and accepted the Developer Certificate of Origin document, including the AI Policy, and added a "Signed-off-by" to my commits
  • compiled this code
  • tested this code
  • included documentation (including possible behaviour changes)
  • documented the code
  • added or modified regression test(s)
  • added or modified unit test(s)

Signed-off-by: Miod Vallat <miod.vallat@powerdns.com>
Signed-off-by: Miod Vallat <miod.vallat@powerdns.com>
Will be used shortly.

Signed-off-by: Miod Vallat <miod.vallat@powerdns.com>
Signed-off-by: Miod Vallat <miod.vallat@powerdns.com>
Use this in executeStatement() to retry once after forcing reconnection.

Signed-off-by: Miod Vallat <miod.vallat@powerdns.com>
…cute().

Should they occur, raise an exception with the "try again after reconnecting"
flag.

This ought to solve PowerDNS#16571.

Signed-off-by: Miod Vallat <miod.vallat@powerdns.com>
@miodvallat miodvallat force-pushed the stickyql branch 2 times, most recently from 7e2fce4 to 985c705 Compare December 24, 2025 14:46
@coveralls

coveralls commented Dec 29, 2025

Pull Request Test Coverage Report for Build 20488499056

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 6205 unchanged lines in 89 files lost coverage.
  • Overall coverage increased (+5.5%) to 64.895%

Files with Coverage Reduction   New Missed Lines   %
pdns/auth-zonecache.hh          1                  91.67%
pdns/base64.cc                  1                  80.6%
pdns/dnsseckeeper.hh            1                  0.0%
pdns/dnswriter.hh               1                  70.21%
pdns/dynlistener.hh             1                  0.0%
pdns/cachecleaner.hh            2                  93.33%
pdns/json.hh                    2                  0.0%
pdns/tsigverifier.hh            2                  0.0%
pdns/dnspacket.hh               3                  76.47%
pdns/ednssubnet.hh              3                  80.0%

Totals
Change from base Build 20511466866: 5.5%
Covered Lines: 19612
Relevant Lines: 28623

💛 - Coveralls

@edmonds
Contributor

edmonds commented Jan 9, 2026

Hi, miodvallat! I tested this branch and ran into some weird behavior.

On my gmysql backend I have a small zone with a few records:

mysql> select * from records where domain_id = 355444;
+---------+-----------+-----------------+------+-----------------------------------------------------+-------+------+----------+-----------+------+
| id      | domain_id | name            | type | content                                             | ttl   | prio | disabled | ordername | auth |
+---------+-----------+-----------------+------+-----------------------------------------------------+-------+------+----------+-----------+------+
| 2350767 |    355444 | vacon.vc        | SOA  | vacon.vc hostmaster.mycre.ws 25 7200 3600 604800 60 | 86400 |    0 |        0 | NULL      |    1 |
| 2350768 |    355444 | vacon.vc        | NS   | ns1.linode.com                                      | 86400 |    0 |        0 | NULL      |    1 |
| 2350769 |    355444 | vacon.vc        | NS   | ns2.linode.com                                      | 86400 |    0 |        0 | NULL      |    1 |
| 2350770 |    355444 | vacon.vc        | NS   | ns3.linode.com                                      | 86400 |    0 |        0 | NULL      |    1 |
| 2350771 |    355444 | vacon.vc        | NS   | ns4.linode.com                                      | 86400 |    0 |        0 | NULL      |    1 |
| 2350772 |    355444 | vacon.vc        | NS   | ns5.linode.com                                      | 86400 |    0 |        0 | NULL      |    1 |
| 2350773 |    355444 | vacon.vc        | TXT  | "v=spf1 -all"                                       | 86400 |    0 |        0 | NULL      |    1 |
| 2350774 |    355444 | _dmarc.vacon.vc | TXT  | "v=DMARC1; p=reject; aspf=s; adkim=s;"              | 86400 |    0 |        0 | NULL      |    1 |
| 2350775 |    355444 | public.vacon.vc | A    | 45.33.102.105                                       | 86400 |    0 |        0 | NULL      |    1 |
| 2350776 |    355444 | public.vacon.vc | AAAA | 2600:3c02:e000:1d3::1                               | 86400 |    0 |        0 | NULL      |    1 |
+---------+-----------+-----------------+------+-----------------------------------------------------+-------+------+----------+-----------+------+
10 rows in set (0.00 sec)

I started up pdns_server from this PR's branch (commit 985c705) with a basic pdns.conf:

loglevel=9
query-logging=yes
log-dns-details=yes
log-dns-queries=yes

local-address=127.0.0.1:53001

8bit-dns=yes
allow-axfr-ips=127.0.0.0/8
also-notify=
only-notify=
primary=yes
secondary=yes
security-poll-suffix=
xfr-cycle-interval=3600

receiver-threads=1
distributor-threads=1

cache-ttl=0
query-cache-ttl=0
negquery-cache-ttl=0
zone-cache-refresh-interval=3600

launch=gmysql
gmysql-dnssec=yes
gmysql-host=127.0.0.1
gmysql-dbname=pdns_staging_1
gmysql-user=pdns_staging_1
gmysql-password=password

I disabled the SQL query caches so that queries would always hit the backend.

Then I sent some digs and got the expected answers:

$ dig +norec -p 53001 @127.0.0.1 vacon.vc -t SOA; dig +norec -p 53001 @127.0.0.1 vacon.vc -t NS; dig +norec -p 53001 @127.0.0.1 vacon.vc -t TXT; dig +norec -p 53001 @127.0.0.1 public.vacon.vc -t ANY

; <<>> DiG 9.20.5-1-Debian <<>> +norec -p 53001 @127.0.0.1 vacon.vc -t SOA
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2059
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;vacon.vc.			IN	SOA

;; ANSWER SECTION:
vacon.vc.		86400	IN	SOA	vacon.vc. hostmaster.mycre.ws. 25 7200 3600 604800 60

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53001(127.0.0.1) (UDP)
;; WHEN: Fri Jan 09 00:57:48 UTC 2026
;; MSG SIZE  rcvd: 92


; <<>> DiG 9.20.5-1-Debian <<>> +norec -p 53001 @127.0.0.1 vacon.vc -t NS
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61976
;; flags: qr aa; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;vacon.vc.			IN	NS

;; ANSWER SECTION:
vacon.vc.		86400	IN	NS	ns5.linode.com.
vacon.vc.		86400	IN	NS	ns2.linode.com.
vacon.vc.		86400	IN	NS	ns4.linode.com.
vacon.vc.		86400	IN	NS	ns3.linode.com.
vacon.vc.		86400	IN	NS	ns1.linode.com.

;; Query time: 4 msec
;; SERVER: 127.0.0.1#53001(127.0.0.1) (UDP)
;; WHEN: Fri Jan 09 00:57:48 UTC 2026
;; MSG SIZE  rcvd: 137


; <<>> DiG 9.20.5-1-Debian <<>> +norec -p 53001 @127.0.0.1 vacon.vc -t TXT
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10801
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;vacon.vc.			IN	TXT

;; ANSWER SECTION:
vacon.vc.		86400	IN	TXT	"v=spf1 -all"

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53001(127.0.0.1) (UDP)
;; WHEN: Fri Jan 09 00:57:48 UTC 2026
;; MSG SIZE  rcvd: 61


; <<>> DiG 9.20.5-1-Debian <<>> +norec -p 53001 @127.0.0.1 public.vacon.vc -t ANY
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44066
;; flags: qr aa; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;public.vacon.vc.		IN	ANY

;; ANSWER SECTION:
public.vacon.vc.	86400	IN	A	45.33.102.105
public.vacon.vc.	86400	IN	AAAA	2600:3c02:e000:1d3::1

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53001(127.0.0.1) (TCP)
;; WHEN: Fri Jan 09 00:57:48 UTC 2026
;; MSG SIZE  rcvd: 88

Then I killed the MySQL database connections from the database server side:

mysql> show processlist;
+-----+-----------------+-----------------+----------------+---------+------+------------------------+------------------+
| Id  | User            | Host            | db             | Command | Time | State                  | Info             |
+-----+-----------------+-----------------+----------------+---------+------+------------------------+------------------+
|   5 | event_scheduler | localhost       | NULL           | Daemon  | 4332 | Waiting on empty queue | NULL             |
| 122 | root            | localhost       | pdns_staging_1 | Query   |    0 | init                   | show processlist |
| 124 | pdns_staging_1  | localhost:41332 | pdns_staging_1 | Sleep   |   53 |                        | NULL             |
| 125 | pdns_staging_1  | localhost:41340 | pdns_staging_1 | Sleep   |   14 |                        | NULL             |
| 126 | pdns_staging_1  | localhost:41354 | pdns_staging_1 | Sleep   |   14 |                        | NULL             |
+-----+-----------------+-----------------+----------------+---------+------+------------------------+------------------+
5 rows in set, 1 warning (0.00 sec)

mysql> kill 124;
Query OK, 0 rows affected (0.01 sec)

mysql> kill 125;
Query OK, 0 rows affected (0.01 sec)

mysql> kill 126;
Query OK, 0 rows affected (0.01 sec)

Then I re-ran the same digs and got some interesting answers:

$ dig +norec -p 53001 @127.0.0.1 vacon.vc -t SOA; dig +norec -p 53001 @127.0.0.1 vacon.vc -t NS; dig +norec -p 53001 @127.0.0.1 vacon.vc -t TXT; dig +norec -p 53001 @127.0.0.1 public.vacon.vc -t ANY

; <<>> DiG 9.20.5-1-Debian <<>> +norec -p 53001 @127.0.0.1 vacon.vc -t SOA
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 16142
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;vacon.vc.			IN	SOA

;; Query time: 12 msec
;; SERVER: 127.0.0.1#53001(127.0.0.1) (UDP)
;; WHEN: Fri Jan 09 00:58:15 UTC 2026
;; MSG SIZE  rcvd: 37


; <<>> DiG 9.20.5-1-Debian <<>> +norec -p 53001 @127.0.0.1 vacon.vc -t NS
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 63515
;; flags: qr aa; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;vacon.vc.			IN	NS

;; ANSWER SECTION:
vacon.vc.		86400	IN	NS	ns4.linode.com.
vacon.vc.		86400	IN	NS	ns3.linode.com.
vacon.vc.		86400	IN	NS	ns2.linode.com.
vacon.vc.		86400	IN	NS	ns1.linode.com.
vacon.vc.		86400	IN	NS	ns5.linode.com.

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53001(127.0.0.1) (UDP)
;; WHEN: Fri Jan 09 00:58:15 UTC 2026
;; MSG SIZE  rcvd: 137


; <<>> DiG 9.20.5-1-Debian <<>> +norec -p 53001 @127.0.0.1 vacon.vc -t TXT
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58656
;; flags: qr aa; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;vacon.vc.			IN	TXT

;; ANSWER SECTION:
vacon.vc.		86400	IN	TXT	"v=spf1 -all"
vacon.vc.		86400	IN	TXT	"v=DMARC1; p=reject; aspf=s; adkim=s;"

;; Query time: 4 msec
;; SERVER: 127.0.0.1#53001(127.0.0.1) (UDP)
;; WHEN: Fri Jan 09 00:58:15 UTC 2026
;; MSG SIZE  rcvd: 110


; <<>> DiG 9.20.5-1-Debian <<>> +norec -p 53001 @127.0.0.1 public.vacon.vc -t ANY
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 29411
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;public.vacon.vc.		IN	ANY

;; Query time: 12 msec
;; SERVER: 127.0.0.1#53001(127.0.0.1) (TCP)
;; WHEN: Fri Jan 09 00:58:15 UTC 2026
;; MSG SIZE  rcvd: 44

Note the REFUSED responses, and the corrupted TXT RRset answer which appears to be a combination of two different TXT RRsets in the zone.

And here is what the server logged:

Jan 09 00:57:48 dnstest pdns_server[275497]: Remote 127.0.0.1 wants 'vacon.vc|SOA', do = 0, bufsize = 1232
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747698656: SELECT content,ttl,prio,type,domain_id,disabled,name,auth FROM records WHERE disabled=0 and name=? and domain_id=?
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747698656: 288 us to execute
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747698656: 314 us total to last row
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747688944: select kind,content from domains, domainmetadata where domainmetadata.domain_id=domains.id and name=?
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747688944: 334 us to execute
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747688944: 345 us total to last row
Jan 09 00:57:48 dnstest pdns_server[275497]: Remote 127.0.0.1 wants 'vacon.vc|NS', do = 0, bufsize = 1232
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747698656: SELECT content,ttl,prio,type,domain_id,disabled,name,auth FROM records WHERE disabled=0 and name=? and domain_id=?
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747698656: 545 us to execute
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747698656: 576 us total to last row
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747698656: SELECT content,ttl,prio,type,domain_id,disabled,name,auth FROM records WHERE disabled=0 and name=? and domain_id=?
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747698656: 159 us to execute
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747698656: 182 us total to last row
Jan 09 00:57:48 dnstest pdns_server[275497]: Remote 127.0.0.1 wants 'vacon.vc|TXT', do = 0, bufsize = 1232
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747698656: SELECT content,ttl,prio,type,domain_id,disabled,name,auth FROM records WHERE disabled=0 and name=? and domain_id=?
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747698656: 352 us to execute
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747698656: 379 us total to last row
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747698656: SELECT content,ttl,prio,type,domain_id,disabled,name,auth FROM records WHERE disabled=0 and name=? and domain_id=?
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747698656: 121 us to execute
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 140695747698656: 139 us total to last row
Jan 09 00:57:48 dnstest pdns_server[275497]: TCP Remote 127.0.0.1 wants 'public.vacon.vc|ANY', do = 0, bufsize = 1232
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 94146740459376: SELECT content,ttl,prio,type,domain_id,disabled,name,auth FROM records WHERE disabled=0 and name=? and domain_id=?
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 94146740459376: 245 us to execute
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 94146740459376: 270 us total to last row
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 94146740459376: SELECT content,ttl,prio,type,domain_id,disabled,name,auth FROM records WHERE disabled=0 and name=? and domain_id=?
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 94146740459376: 109 us to execute
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 94146740459376: 119 us total to last row
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 94146740459376: SELECT content,ttl,prio,type,domain_id,disabled,name,auth FROM records WHERE disabled=0 and name=? and domain_id=?
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 94146740459376: 100 us to execute
Jan 09 00:57:48 dnstest pdns_server[275497]: Query 94146740459376: 108 us total to last row
Jan 09 00:58:15 dnstest pdns_server[275497]: Remote 127.0.0.1 wants 'vacon.vc|SOA', do = 0, bufsize = 1232
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695748751312: SELECT content,ttl,prio,type,domain_id,disabled,name,auth FROM records WHERE disabled=0 and name=? and domain_id=?
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695748751312: 277 us to execute
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695748751312: 283 us total to last row
Jan 09 00:58:15 dnstest pdns_server[275497]: Remote 127.0.0.1 wants 'vacon.vc|NS', do = 0, bufsize = 1232
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695748751312: SELECT content,ttl,prio,type,domain_id,disabled,name,auth FROM records WHERE disabled=0 and name=? and domain_id=?
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695748751312: 363 us to execute
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695748751312: 388 us total to last row
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695748751312: SELECT content,ttl,prio,type,domain_id,disabled,name,auth FROM records WHERE disabled=0 and name=? and domain_id=?
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695748751312: 148 us to execute
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695748751312: 165 us total to last row
Jan 09 00:58:15 dnstest pdns_server[275497]: Remote 127.0.0.1 wants 'vacon.vc|TXT', do = 0, bufsize = 1232
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695748751312: SELECT content,ttl,prio,type,domain_id,disabled,name,auth FROM records WHERE disabled=0 and name=? and domain_id=?
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695748751312: 346 us to execute
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695748751312: 371 us total to last row
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695748751312: SELECT content,ttl,prio,type,domain_id,disabled,name,auth FROM records WHERE disabled=0 and name=? and domain_id=?
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695748751312: 167 us to execute
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695748751312: 184 us total to last row
Jan 09 00:58:15 dnstest pdns_server[275497]: TCP Remote 127.0.0.1 wants 'public.vacon.vc|ANY', do = 0, bufsize = 1232
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695344668944: SELECT content,ttl,prio,type,domain_id,disabled,name,auth FROM records WHERE disabled=0 and name=? and domain_id=?
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695344668944: 196 us to execute
Jan 09 00:58:15 dnstest pdns_server[275497]: Query 140695344668944: 202 us total to last row

When I revert to a pdns_server binary built from the current master (commit 42eb43e), I don't see this behavior at all; I get completely normal responses after killing the database connections.

(Edit: Just to make sure, I rebased this branch on top of the current master and still got the same faulty behavior.)
