We’ve been searching for the cause of sporadic name resolution errors for quite some time now. “Sporadic” here means incorrect name resolution and dying named processes once or twice within a couple of weeks, nothing you’d track down easily. We weren’t able to reproduce the error, but we got lucky: we hit another symptom of the same root cause (without knowing it at the time) and were finally able to track things down.
For years, we’ve been running our “named” processes against LDAP, using the so-called “SDB:LDAP” interface. The DNS data is stored in a hierarchical LDAP tree, with manual and automatic updates of that data by our systems management automation processes. When using SDB, every DNS lookup leads to a lookup in the data back-end, so there’s no need to “export” data into named zone files whenever the data changes in the back-end storage. The DNS environment is spread across various servers, and each name server process has an LDAP server close by, avoiding SPOFs. Distribution of the data is handled by OpenLDAP’s replication mechanisms.
We’re using the SDB-LDAP implementation, which was the only suitable one when we set this up. It comes as “contributed software” and is not as well maintained as the BIND software itself, which didn’t look like a big problem.
We noticed spurious false replies to our DNS queries, perhaps once or twice a month. And we noticed those only when, e.g., ssh reported that the server key had changed while contacting a certain host: a false alert, obviously, as an immediate retry to set up the ssh session worked flawlessly and the key file was unchanged on the server (as was the known hosts file on the client).
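For anyone who would rather not wait for ssh to complain, here is a minimal sketch of how such sporadic false replies could be caught by repeated lookups. It is not what we ran at the time. The function names and the example host name and address in the usage note are made up for illustration, and any local resolver cache between the script and named may hide the wrong answers.

```python
import socket
from typing import Callable, FrozenSet, List, Tuple


def resolve(host: str) -> FrozenSet[str]:
    """Return the set of addresses the system resolver reports for host."""
    infos = socket.getaddrinfo(host, None)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return frozenset(info[4][0] for info in infos)


def find_mismatches(
    host: str,
    expected: FrozenSet[str],
    attempts: int,
    resolver: Callable[[str], FrozenSet[str]] = resolve,
) -> List[Tuple[int, FrozenSet[str]]]:
    """Resolve host `attempts` times and collect every deviating answer.

    Returns a list of (attempt_index, answer) pairs. Lookup failures are
    recorded as a one-element set containing the error text.
    """
    mismatches = []
    for i in range(attempts):
        try:
            answer = resolver(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            answer = frozenset({f"error: {exc}"})
        if answer != expected:
            mismatches.append((i, answer))
    return mismatches
```

Calling something like find_mismatches("host.example.com", frozenset({"192.0.2.10"}), 1000) from a cron job and logging a non-empty result would at least put a timestamp on each false reply. The resolver parameter is only there so the check can be tried out without touching real DNS.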
Sometimes, the named process dropped dead without obvious cause.
Lately, we’ve received quite a few syslog entries from the named processes reporting that they had to reconnect to the LDAP server. Checking the LDAP server manually didn’t reveal any such problems.
We had already had a few glimpses at a more modern LDAP integration (bind-dlz) for our named, and had seen that some (unstable) updates were available for SDB:LDAP, too. The change log pointed out that multi-threading issues had been fixed, which got our attention: quite obviously, our nameds were running multi-threaded (we checked that via “ps”), so we turned that off (or rather limited named to a single thread, via the “-n 1” command line option).
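We did the thread check by hand with “ps”. As an alternative, here is a minimal sketch that reads the thread count of a process from the “Threads:” field of /proc/<pid>/status. This assumes a Linux system; the function names are made up for illustration, and the PID of named has to be looked up separately.

```python
from pathlib import Path


def parse_thread_count(status_text: str) -> int:
    """Extract the thread count from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            # The line looks like "Threads:\t5"
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no Threads: field found")


def thread_count(pid: int) -> int:
    """Return the number of threads of the process with the given PID (Linux only)."""
    return parse_thread_count(Path(f"/proc/{pid}/status").read_text())
```

Calling thread_count() with the PID of a running named before and after adding “-n 1” is a quick way to see whether the option changed the thread count.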
Ever since, we’ve had no more spurious reconnects to the LDAP server and the named process looks stable. The sample period is still a bit short, but I’m confident that we won’t see any false resolutions in the coming weeks, either.
PS: No, not actually closed. We will definitely take a really close look at the newer back-end “bind-dlz” and its LDAP approach. The first “glimpse” made us feel a bit uneasy, but I’ll let you know how things went as soon as we have finished our tests.