Final Report: DNS Outage of 5 May 2026

Final assessment, based on the initial analysis of 8 May 2026 and the comprehensive follow-up investigation

Summary

On 5 May 2026, a DNS outage during a routine DNSSEC key rollover significantly restricted access to .de domains for approximately three hours. The cause was a defect in in-house software, which left the majority of the delivered DNSSEC signatures invalid. Normal operations were fully restored during the night of 5 May. The findings of the initial analysis dated 8 May 2026 are confirmed.

Background: The signing system

The DNSSEC signing process for .de uses standard software (Knot) as well as in-house developments in conjunction with Hardware Security Modules (HSMs). In April 2026, the third generation of this system since the introduction of DNSSEC in 2011 was put into operation. The systems were tested in advance and externally audited. The signing system consists of several HSMs, distributed across two data centres that are separate from one another both geographically and at the network level.

Cause of the failure

Faulty code in the rollover agent

The root cause was a defect in the in-house software that controls the rollover agent. The agent's task is to generate key material and load it into all connected HSMs.

Because of the faulty code, the agent did not generate a single key pair and then load it into all HSMs. Instead, it generated a separate key pair for each connected HSM and loaded each into exactly one of them. All three key pairs generated in this way carried the same identifiers, including the key tag 33834. This was therefore not a classic key tag collision, but rather three key pairs with different contents but identical metadata.
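For context, the key tag of a DNSKEY record is normally derived from the record's RDATA using the checksum defined in RFC 4034, Appendix B. A classic collision means that two different keys happen to yield the same 16-bit value. The sketch below shows that calculation for algorithms other than 1. It is background only: this report does not describe how the rollover agent assigned identifiers inside the HSMs, and the sample RDATA bytes are made up.

```python
def key_tag(dnskey_rdata: bytes) -> int:
    """Key tag per RFC 4034, Appendix B (not valid for algorithm 1)."""
    acc = 0
    for i, byte in enumerate(dnskey_rdata):
        # Even offsets contribute the high byte, odd offsets the low byte.
        acc += byte if i & 1 else byte << 8
    acc += (acc >> 16) & 0xFFFF  # fold the carry back in
    return acc & 0xFFFF


# Made-up 4-byte RDATA, used only to show the arithmetic.
print(key_tag(bytes([0x01, 0x01, 0x03, 0x08])))  # 1033
```

Because the result has only 16 bits, the key tag is a hint for selecting candidate keys and not a unique identifier; validators still have to verify the signature against the actual public key.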

As a result, the subsequent logic wrote one of the three ZSKs with key tag 33834 into the zone. However, as only one of the three HSMs contained the key matching the published DNSKEY record, only the RRSIGs generated by this HSM could be validated – in practice, therefore, only about a third of all signatures.
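The difference between the intended and the faulty rollover behaviour can be illustrated with a toy model. This is a minimal sketch, not the actual rollover agent: the classes `KeyPair` and `Hsm`, the function names, and the assumption that signing work is spread evenly across three HSMs are all hypothetical, and no real cryptography is performed. A "signature" simply records which key material produced it.

```python
import secrets
from dataclasses import dataclass, field

KEY_TAG = 33834  # key tag named in the report


@dataclass(frozen=True)
class KeyPair:
    """Toy stand-in for a ZSK: identifier plus key contents."""
    key_tag: int
    material: bytes


@dataclass
class Hsm:
    """Hypothetical HSM model that stores keys by key tag."""
    name: str
    keys: dict = field(default_factory=dict)

    def load(self, key: KeyPair) -> None:
        self.keys[key.key_tag] = key

    def sign(self, key_tag: int, rrset: str) -> tuple:
        # Toy signature: remembers the key material that produced it.
        return (rrset, self.keys[key_tag].material)


def generate_key() -> KeyPair:
    return KeyPair(KEY_TAG, secrets.token_bytes(16))


def rollover_faulty(hsms) -> None:
    """One fresh key pair per HSM, each loaded into exactly one HSM."""
    for hsm in hsms:
        hsm.load(generate_key())


def rollover_intended(hsms) -> None:
    """A single key pair, loaded into every HSM."""
    key = generate_key()
    for hsm in hsms:
        hsm.load(key)


def valid_fraction(hsms, n_rrsets: int = 300) -> float:
    """Publish the key of the first HSM and count validatable signatures."""
    published = hsms[0].keys[KEY_TAG]
    valid = 0
    for i in range(n_rrsets):
        hsm = hsms[i % len(hsms)]  # assumed even distribution of signing work
        _, material = hsm.sign(KEY_TAG, f"rrset-{i}")
        valid += material == published.material
    return valid / n_rrsets


def run(rollover) -> float:
    hsms = [Hsm(f"hsm-{i}") for i in range(3)]
    rollover(hsms)
    return valid_fraction(hsms)


print(run(rollover_intended))  # every signature matches the published key
print(run(rollover_faulty))    # roughly one third matches
```

In this model all three keys share the same key tag, so nothing in the metadata distinguishes them; only a comparison of the key contents reveals the mismatch.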

Because the serial number changes with every zone update, the SOA record has to be regenerated and re-signed each time. During the incident, its signature was therefore valid at some times and invalid at others, depending on which HSM had signed it.

Why was the error not detected before the system went live?

During the implementation of improvements, the faulty code was incorporated into the in-house development without the existing test scenarios covering this error case. The test environment consists of a single HSM at a single location. As a result, the software code of the rollover agent, which only exhibits its faulty behaviour when multiple HSMs are connected, was not fully executed in the test environment. The defect was therefore not detected either during test runs or in ‘cold’ parallel operation prior to commissioning.
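The following sketch shows why a single-HSM environment cannot expose this class of defect, and what a cross-HSM consistency check might look like. It is an illustration under stated assumptions: the function names and the idea of exporting each HSM's public key as bytes are hypothetical and do not describe the actual tooling.

```python
import hashlib


def fingerprint(public_key: bytes) -> str:
    """Digest of the key contents, independent of labels or key tags."""
    return hashlib.sha256(public_key).hexdigest()


def inconsistent_hsms(public_keys: dict[str, bytes]) -> list[str]:
    """Return the HSMs whose key differs from that of the first HSM.

    public_keys maps a (hypothetical) HSM name to the public key it holds
    for one key identifier.
    """
    names = sorted(public_keys)
    if not names:
        return []
    reference = fingerprint(public_keys[names[0]])
    return [n for n in names[1:] if fingerprint(public_keys[n]) != reference]


# With a single HSM there is nothing to compare, so the check always passes.
print(inconsistent_hsms({"hsm-a": b"key-1"}))  # []

# With several HSMs, diverging key contents become visible.
print(inconsistent_hsms({"hsm-a": b"key-1", "hsm-b": b"key-2", "hsm-c": b"key-3"}))
# ['hsm-b', 'hsm-c']
```

A check of this kind is only meaningful when the test environment contains at least two HSMs, which matches the observation above that the faulty code path was never exercised with one.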

Why was the non-validatable zone published?

The .de zone is regularly updated via the registration system. Due to the size of the zone, changes to the RRSets are incorporated incrementally; individual zone versions are not available as complete zone files. Three different continuously running testing and validation tools are in use to detect anomalies – including missing or non-validatable signatures. These systems detected the errors as intended; however, the generated notifications were not processed correctly, meaning that no timely intervention took place.

Ruled-out causes

The comprehensive analysis has explicitly ruled out the following possible causes:

  • No signs of compromise or attacks on the signing system or other DENIC infrastructure
  • No malfunction identified in the Knot name server used
  • No malfunction identified in the HSMs used
  • No classic key-tag collision

Impact

Technical impact

A TLD zone predominantly returns referral responses (delegation information). The validity of these responses also depends on signed NSEC3 records, particularly when a validator must verify the absence of a DS record for an unsigned child zone.

Non-validatable signatures on NSEC3 records resulted in delegation information being classified as suspicious ("bogus") by validating resolvers. Consequently, even second-level domains that do not use DNSSEC at all could not be resolved. Non-validating resolvers, on the other hand, resolved .de domains without issue.
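The decision a validating resolver makes for a referral from a signed parent zone can be reduced to the following sketch. It is a strong simplification of the validation rules in RFC 4035 and RFC 5155; the function and parameter names are hypothetical, and real validators handle many more cases (opt-out, wildcards, multiple keys and algorithms).

```python
from enum import Enum


class Status(Enum):
    SECURE = "secure"      # DS present and validly signed: the chain continues
    INSECURE = "insecure"  # absence of DS proven: child is unsigned but usable
    BOGUS = "bogus"        # a required signature failed: the answer is rejected


def classify_referral(ds_present: bool, proof_signature_valid: bool) -> Status:
    """Classify a referral received from a signed parent zone (simplified).

    proof_signature_valid refers to the RRSIG over the DS RRset if a DS
    exists, or over the NSEC3 records proving its absence otherwise.
    """
    if not proof_signature_valid:
        return Status.BOGUS
    return Status.SECURE if ds_present else Status.INSECURE


# Unsigned child zone, NSEC3 proof validates: resolution proceeds.
print(classify_referral(ds_present=False, proof_signature_valid=True))

# Unsigned child zone, NSEC3 signature cannot be validated: resolution fails.
print(classify_referral(ds_present=False, proof_signature_valid=False))
```

The second call corresponds to the situation described above: the child zone itself has nothing to validate, yet the referral is rejected because the parent's proof that no DS exists cannot be trusted.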

Impact on users and duration

The restrictions on the accessibility of .de domains lasted for approximately three hours. Some operators of large resolvers temporarily suspended DNSSEC validation for .de domains, thereby mitigating the impact on their users. DENIC would like to express its sincere thanks for this support.

Measures

As part of the comprehensive analysis and evaluation, measures have been identified that affect both the software development process and the actual DNS operations. Initial findings, such as improvements to the code review process, have already been implemented. Furthermore, the incident response process and communication during an outage are being reviewed and adjusted accordingly.

In the short term, the following measures, amongst others, will be implemented:

  • Enhanced alerting: Setting up additional alerts, based on improved visibility of potential errors (including those in the continuously running testing and validation tools) and the expansion of relevant metrics.
  • Accelerated switchover procedure: Setting up an accelerated procedure to provide a valid zone backup more quickly in an emergency.
  • Partial validation prior to deployment: Setting up a partial validation of the zone prior to deployment.
  • Suspension of further ZSK rollovers: No further ZSK rollover will take place until additional measures have been completed, including enhancing security in the software development process and improving the test environment to achieve higher test coverage.
  • External security and process analysis: Conducting an external and final security and process analysis.
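One possible shape of a partial validation before deployment is a sampling gate: verify the signatures of a random subset of RRsets against the key that is about to be published, and block the deployment if any of them fail. The sketch below is an assumption about how such a check could look, not a description of the planned implementation. The `verify` callback, the function names, and the zero-tolerance threshold are hypothetical.

```python
import random
from typing import Callable, Optional, Sequence


def sample_failure_rate(
    rrsets: Sequence[str],
    verify: Callable[[str], bool],
    sample_size: int,
    seed: Optional[int] = None,
) -> float:
    """Verify a random sample of RRsets and return the share that fails.

    verify is a placeholder for a real signature check against the
    DNSKEY that is about to be published.
    """
    if not rrsets or sample_size <= 0:
        raise ValueError("need a non-empty zone and a positive sample size")
    rng = random.Random(seed)
    sample = rng.sample(list(rrsets), min(sample_size, len(rrsets)))
    failures = sum(1 for rrset in sample if not verify(rrset))
    return failures / len(sample)


def safe_to_deploy(failure_rate: float, max_failure_rate: float = 0.0) -> bool:
    """Deployment gate: by default, any failing signature blocks the zone."""
    return failure_rate <= max_failure_rate


# Toy zone in which only every third RRset carries a validatable signature.
zone = [f"name{i}" for i in range(300)]
rate = sample_failure_rate(zone, lambda r: int(r[4:]) % 3 == 0, sample_size=300)
print(safe_to_deploy(rate))  # False
```

When a large share of signatures is invalid, as in this incident, even a small sample is very likely to contain a failure, which is why a partial check can be useful for a zone too large to validate completely before each deployment.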