Sunday, November 1, 2020

The Duct Tape Holding the Internet Together (2017)

On Sunday, July 2, starting at 4:30 PM, the ca.la domain name and all of its subdomains (site, API, and other tools) were inaccessible for almost 12 hours. We registered an alternate domain name and directed customers to it via social media, but we still saw a significant drop in site traffic, and our iPhone app was unavailable throughout.

When we have downtime, we run a postmortem to figure out what went wrong and how we can avoid it next time. We found this one pretty interesting, and we thought you might enjoy reading it too.

Root Cause

Due to a security issue with our domain registry, another registrar (not the one that manages our domain) was able to incorrectly mark our domain name as pending deletion. Per the ICANN specification, when a domain has “Pending Delete” status, it’s no longer included in the zone file and becomes inaccessible even if DNS records are configured correctly.

Diagnosis

As soon as the first alert came in, we began to investigate:

  • None of our vendors were reporting any outages.
  • All servers were happily serving requests when queried directly, but were unreachable via their *.ca.la subdomains.
  • Our nameserver records still pointed to CloudFlare, which was up and serving the correct responses.
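
The checks above come down to separating "is the server up?" from "does the name still resolve?". A minimal Python sketch of the second check follows. It is an illustration only, not the tooling used during the incident; the `resolves` helper and its injectable `resolver` parameter are hypothetical names.

```python
import socket


def resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if hostname resolves to at least one address.

    `resolver` is injectable so the check can be exercised without
    network access; by default it uses the system resolver.
    """
    try:
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        # Raised when the name cannot be resolved.
        return False
```

If a server answers when queried directly but `resolves()` returns False for its *.ca.la hostname, the fault is likely somewhere in the DNS or registry chain rather than on the server.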

On a hunch, we did a whois lookup:

$ whois ca.la
Domain Name: CA.LA
...
Registry Expiry Date: 2020-04-20T23:59:59.0Z
Registrar: Gandi SAS
Registrar IANA ID:
Domain Status: pendingDelete https://icann.org/epp#pendingDelete

Though we had renewed the domain for several years to make sure we had no expiration surprises, it was marked as “pending delete”. Everything appeared normal in our registrar’s account dashboard — the domain was active and unexpired, and the expiration date was correctly shown as 2020.
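
Since the registrar dashboard looked healthy while the registry status did not, the whois output itself is worth watching. The sketch below, in Python, pulls the EPP status codes out of whois text and flags worrying ones. The function names and the contents of `RISKY_STATUSES` are assumptions made for illustration; a real monitor would need to decide which codes matter for it.

```python
# Assumed set of EPP status codes worth alerting on: each is associated
# with a domain being withheld from, or on its way out of, the zone.
RISKY_STATUSES = {"pendingDelete", "redemptionPeriod", "serverHold", "clientHold"}


def domain_statuses(whois_text):
    """Extract EPP status codes from 'Domain Status:' lines of whois output."""
    statuses = []
    for line in whois_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "domain status":
            parts = value.split()
            if parts:
                # The first token is the code; a reference URL may follow it.
                statuses.append(parts[0])
    return statuses


def risky_statuses(whois_text):
    """Return the statuses that warrant an alert, in order of appearance."""
    return [s for s in domain_statuses(whois_text) if s in RISKY_STATUSES]
```

Run against the whois output shown above, `risky_statuses` would report `pendingDelete`, even though the expiry date on the same record looks fine.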

History

On March 1 we transferred our domain name (ca.la) from our original registrar (www.la) to Gandi. Everything went smoothly, and we renewed the domain with Gandi for several years into the future. Checking the whois results gave us just what we expected:

Registry Expiry Date: 2020-04-20T23:59:59.0Z
Registrar: Gandi SAS

A few weeks later, we started receiving weekly emails from the previous registrar warning that our domain was close to expiration. We confirmed with Gandi that the domain was fully transferred to them and renewed, and ignored these notifications going forward, assuming this was just a bug in the previous registrar’s billing system that would not affect us.

The How

Surprisingly, after some digging we found out that many of the procedures around domain name management are essentially done on the honor system.

When you transfer a domain between registrars, all the authorization codes and email confirmations are a layer of paperwork on top of one simple fact: the new registrar just announces that it owns your domain now, and promises that it has confirmed this.

The same is apparently true for EPP status codes like “pending delete”: as long as a registrar has login access to a given registry (in this case CentralNIC), there’s nothing stopping it from issuing any update to any domain of its choosing.

ICANN has little or no control over country-specific “ccTLDs”, unlike generic top-level domains (.com, .net, .fashion, etc.). This means the registry operators have full license to manage their own registry’s security, or lack thereof, however they want.

We received an email from the previous registrar at 3:00 PM on July 2 titled “Domain expiry notice”, but didn’t see it until after the domain became inaccessible at 4:30 PM. Shortly after sending that email, the registrar automatically issued a “delete” command to the registry, which was accepted without question.

Recovery

As soon as we identified the issue, we contacted both www.la and Gandi to seek assistance. Our point of contact at www.la was also able to connect us with CentralNIC, the registry for .la and several other extensions.

As our entire domain was unavailable, including MX records, we were unable to receive email at any @ca.la address during the downtime. We initially attempted to log into the www.la dashboard to look for any issues we might be able to resolve ourselves, but could not, because the login verification process required clicking on a link sent by email.

Unfortunately, the www.la engineering team were unable to assist until the next morning, as the incident happened at 12:30 AM in the UK, where they are based.

We contacted CentralNIC and were told they do not deal with customer requests directly, and any issues need to be escalated via Gandi. Gandi was very responsive, and reached out to CentralNIC on our behalf. Though Gandi followed up regularly, they didn’t get a reply until July 9 — long after the situation was resolved.

We called ICANN, who advised that they’re unable to assist with ccTLD issues, but gave us the contact information for LANIC, the manager for the .la TLD. We contacted several individuals at LANIC and have still not heard back.

Since a resolution wasn’t forthcoming, we registered a new gTLD domain name, updated our service configurations to refer to this instead, and began directing customers there via social media.

When the www.la engineering team came online, they were able to revoke the “pending delete” status manually. The domain quickly became available again.

Prevention & Improvements

Though we believe major fault lies with both the registrar and the registry itself, we take full responsibility for not following up sooner when we saw unexpected notification emails.

We’ve centralized our site and API configuration to make it easier to roll between different domains in the future, and removed unnecessary layers of aliases to several services.
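
One way to picture this kind of centralization is a single configurable base domain from which every service URL is derived. The Python sketch below is an assumption-laden illustration, not our actual configuration: the `SITE_BASE_DOMAIN` variable, the `service_url` helper, the subdomain labels, and the use of HTTPS are all hypothetical.

```python
import os

# Single source of truth for the public domain. The fallback value and the
# SITE_BASE_DOMAIN variable name are hypothetical.
DEFAULT_BASE_DOMAIN = "ca.la"


def service_url(subdomain, path="/", env=None):
    """Build an HTTPS URL for a service from one configurable base domain."""
    env = os.environ if env is None else env
    base = env.get("SITE_BASE_DOMAIN", DEFAULT_BASE_DOMAIN)
    # An empty subdomain means the bare base domain.
    host = f"{subdomain}.{base}" if subdomain else base
    return f"https://{host}/{path.lstrip('/')}"
```

With this shape, rolling to an alternate domain is a one-value change rather than an edit in every service that embeds a hostname.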

We’ve made sure the email address on our Gandi account is located on a domain name that’s not registered via Gandi, to avoid communication issues if a similar situation happens again.
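
A rule like this is easy to check mechanically. The following Python sketch tests whether a contact address depends on any domain in a given list, for example the domains held at one registrar. Both helper names are hypothetical, and the check only compares domain names; it knows nothing about where a domain is actually registered.

```python
def is_within(domain, parent):
    """True if `domain` equals `parent` or is a subdomain of it."""
    domain, parent = domain.lower().rstrip("."), parent.lower().rstrip(".")
    return domain == parent or domain.endswith("." + parent)


def contact_is_independent(email, managed_domains):
    """True if the contact address does not sit on any managed domain."""
    mail_domain = email.rsplit("@", 1)[-1]
    return not any(is_within(mail_domain, d) for d in managed_domains)
```

Here an address at ca.la, or at any subdomain of it, would fail the check when ca.la is in the list, which is exactly the situation that locked us out of email-based login verification.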

We hope the information we’ve provided to the registrars and registry involved will help them develop security measures to prevent this from happening to other customers in the future.

If you’ve ever run into a similar issue in the past, or have any ideas on how we can prevent problems like this from happening again, we’d love your comments and suggestions.



from Hacker News https://ift.tt/2uMSptf
