Let's Encrypt Acme API Outage (status.io)
153 points by fastest963 on June 15, 2023 | 75 comments


This is because I discovered that Let's Encrypt was issuing non-compliant certificates: https://bugzilla.mozilla.org/show_bug.cgi?id=1838667


So you can legitimately put "broke the internet" on your resume :D


That would be impressive except for all of the AWS us-east-1 engineers that can claim the same thing


They didn't do it all by themselves though!


Like you don't have things on your resume that your team did and not just you


One thing I like about the computing world is that people put "wrote the package that frobbed jpegs, which in production frobs over 1 million jpegs per hour". Anything less specific means "I was somewhere in the building when this was written and deployed".

When I worked in pharma, people would say something like "I joined a program targeting neurology early in the preclinical phase and developed assays, until first-in-human four years later. Started as senior lab technician and departed as a junior assistant director for preclinical QC."

Took me years to understand how the sociology, regulatory dynamics, and science of the two fields legitimately resulted in these utterly different approaches.


Do you have a blog post or writeup on how you discovered that? Thanks!


This all happened less than 2 hours ago, but a quick summary is that my Certificate Transparency monitor, Cert Spotter (https://sslmate.com/certspotter) performs various sanity checks on every certificate that it observes. At 15:41 UTC today, I started getting alerts that certificates from Let's Encrypt were failing one particular check. I quickly emailed Let's Encrypt's problem reporting address, and Let's Encrypt promptly suspended issuance so they could investigate. I've lost count of how many CAs I've detected having this particular problem, so perhaps it is time to blog about it (https://www.agwa.name/blog if you're interested).
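
For a rough picture of what such a monitor does, here is a minimal sketch (not Cert Spotter's actual code) of polling a CT log's RFC 6962 get-entries endpoint and running a check on each observed entry. The log URL and the per-entry check are placeholders; a real monitor tracks every trusted log, persists its read position, and parses the MerkleTreeLeaf structures properly.

    package main

    import (
        "encoding/base64"
        "encoding/json"
        "fmt"
        "net/http"
    )

    // getEntriesResponse mirrors the JSON returned by the RFC 6962
    // get-entries endpoint; leaf_input and extra_data are base64-encoded.
    type getEntriesResponse struct {
        Entries []struct {
            LeafInput string `json:"leaf_input"`
            ExtraData string `json:"extra_data"`
        } `json:"entries"`
    }

    // fetchEntries retrieves entries [start, end] from a CT log.
    func fetchEntries(logURL string, start, end int64) (*getEntriesResponse, error) {
        url := fmt.Sprintf("%s/ct/v1/get-entries?start=%d&end=%d", logURL, start, end)
        resp, err := http.Get(url)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        var out getEntriesResponse
        err = json.NewDecoder(resp.Body).Decode(&out)
        return &out, err
    }

    func main() {
        // Placeholder log URL; a real monitor would iterate over all trusted logs.
        entries, err := fetchEntries("https://ct.example.org/log", 0, 31)
        if err != nil {
            panic(err)
        }
        for _, e := range entries.Entries {
            leaf, _ := base64.StdEncoding.DecodeString(e.LeafInput)
            // This is where a monitor would parse the leaf and run its sanity
            // checks on the (pre)certificate it contains.
            fmt.Printf("observed leaf of %d bytes\n", len(leaf))
        }
    }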


That's awesome!! I wonder if Let's Encrypt runs sanity checks before/after issuing certs too?


They "lint" certificates before issuance, as do most CAs. However, I don't think any linters check for this problem, as it requires access to more than just the certificate (the linter would need access to either the precertificate or a database of Certificate Transparency log keys).


We will add a lint to Boulder for precertificate and certificate correspondence to ensure this class of problem never happens again.

It would be nice to add this to Zlint, but we'd need a new interface that could be given both a precertificate and certificate to co-lint. Other than this one correspondence check, I'm not sure there are any other lints that would fit that pattern.
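
Purely to illustrate the idea (this interface does not exist in zlint today, and the names here are made up), a "pair lint" that gets both artifacts might look something like this in Go:

    package pairlint

    import "crypto/x509"

    // PairLint is a hypothetical interface for lints that need to see both the
    // precertificate and the final certificate at once; zlint's real lints only
    // receive a single certificate.
    type PairLint interface {
        Name() string
        // CheckPair returns a problem description, or "" if the pair is fine.
        CheckPair(precert, final *x509.Certificate) string
    }

    // serialMatch is a trivial example: the final certificate must keep the
    // precertificate's serial number. A real correspondence lint would compare
    // every TBS field and extension (modulo poison/SCT list), not just this.
    type serialMatch struct{}

    func (serialMatch) Name() string { return "e_precert_final_serial_match" }

    func (serialMatch) CheckPair(precert, final *x509.Certificate) string {
        if precert.SerialNumber.Cmp(final.SerialNumber) != 0 {
            return "precertificate and final certificate have different serial numbers"
        }
        return ""
    }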


Are these linters open source?


Yup, the two most popular are:

https://github.com/zmap/zlint

https://github.com/certlint/certlint

They each have their strengths and weaknesses, so CAs are advised to use both.


this is why i will always love hacker news. thank you


Curious people contributing to the ongoing functioning of critical systems at scale. Thank you for your effort!

https://xkcd.com/2347/


I would love to read a blog of yours with more information.


This is so awesome. Thank you for sharing. I hope you do write about that problem. I'd love to learn something new.


I will also throw out a quick vote that I'd be interested in reading a blog post about it.


Sounds like somebody didn't properly seed their random number generator


The problem is actually WAY more subtle, and pretty hard to understand unless you really get in the weeds of Certificate Transparency and certificate policy, but I'll give a shot at providing a concise explanation.

Let's Encrypt has produced two signed artifacts with the same serial number:

1. A precertificate: https://api.certspotter.com/v1/certs/22700bd0d70ac5790e6ae5b...

2. A certificate: https://api.certspotter.com/v1/certs/c0916d24ac8844522b36950...

A precertificate is not a certificate, but it implies the existence of a corresponding certificate which can be constructed by applying an algorithm to the precertificate.

Let's Encrypt intended to create a precertificate which would result in (2) when applying the algorithm to (1). Unfortunately, applying the algorithm to (1) results in a different certificate, (3), presumably because of some bug in Let's Encrypt. Since (2) and (3) have the same serial number, it's a violation of the prohibition against duplicate serial numbers.

An easier-to-understand description of the problem is that Let's Encrypt was producing precertificates that didn't match the final certificate, but the compliance violation is duplicate serial numbers, which is why I worded my compliance bug the way I did.
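
To make the "algorithm" above concrete, here is a rough sketch (my own simplification, not anyone's actual checker) of a correspondence check: strip the RFC 6962 poison extension from the precertificate and the SCT-list extension from the final certificate, then require the serial, subject, key, and remaining extensions to line up. A real check also compares the other TBS fields and handles precertificate signing certificates.

    package correspond

    import (
        "bytes"
        "crypto/x509"
        "crypto/x509/pkix"
        "encoding/asn1"
        "fmt"
    )

    // OIDs from RFC 6962: the critical "poison" extension that marks a
    // precertificate, and the extension carrying embedded SCTs in the final cert.
    var (
        oidCTPoison = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 3}
        oidSCTList  = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 2}
    )

    func stripExt(exts []pkix.Extension, oid asn1.ObjectIdentifier) []pkix.Extension {
        var out []pkix.Extension
        for _, e := range exts {
            if !e.Id.Equal(oid) {
                out = append(out, e)
            }
        }
        return out
    }

    // Correspond roughly checks that final is what you get by applying the
    // RFC 6962 transformation to precert: same serial, subject, and key, and the
    // same extensions once the poison and SCT-list extensions are removed.
    func Correspond(precert, final *x509.Certificate) error {
        if precert.SerialNumber.Cmp(final.SerialNumber) != 0 {
            return fmt.Errorf("serial numbers differ")
        }
        if !bytes.Equal(precert.RawSubject, final.RawSubject) ||
            !bytes.Equal(precert.RawSubjectPublicKeyInfo, final.RawSubjectPublicKeyInfo) {
            return fmt.Errorf("subject or public key differs")
        }
        pe := stripExt(precert.Extensions, oidCTPoison)
        fe := stripExt(final.Extensions, oidSCTList)
        if len(pe) != len(fe) {
            return fmt.Errorf("extension counts differ after stripping poison/SCT list")
        }
        for i := range pe {
            if !pe[i].Id.Equal(fe[i].Id) || pe[i].Critical != fe[i].Critical ||
                !bytes.Equal(pe[i].Value, fe[i].Value) {
                return fmt.Errorf("extension %v differs", pe[i].Id)
            }
        }
        return nil
    }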


Thank you for sharing your knowledge here.

A few questions:

If applying the algorithm to (1) produced (3), what produced (2)?

How can "no duplicate serial numbers" be enforced by any browser without having a store of all certificates? Is it simply a best-effort? Will the browser have a mapping from <serial number> to <certificate>, and whenever it sees a certificate, it will check this map to see if it has seen that serial number on a separate certificate?


> If applying the algorithm to (1) produced (3), what produced (2)?

I believe the root of the problem is that Let's Encrypt is creating certificates and precertificates independently, instead of creating a precertificate and then applying the algorithm to create the corresponding certificate. Since their processes for certificates and precertificates got out-of-sync, they ended up producing (2) instead of (3).

> How can "no duplicate serial numbers" be enforced by any browser without having a store of all certificates?

Browser software doesn't enforce this. It can only be enforced by scanning Certificate Transparency logs looking for violations.
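
As a toy illustration of that kind of out-of-band enforcement, a scanner can simply remember the fingerprint of the first certificate it sees for each (issuer, serial) pair and flag any later, different certificate that reuses the pair (in this incident the duplicates are a precertificate and a certificate, so a real monitor compares them after applying the RFC 6962 transformation). The type names here are mine, not from any particular monitor:

    package dupserial

    import (
        "crypto/sha256"
        "crypto/x509"
        "fmt"
    )

    // key identifies a certificate by issuer DN and serial number, which the
    // Baseline Requirements expect to be unique per issuer.
    type key struct {
        issuer string
        serial string
    }

    // Checker remembers the fingerprint of the first certificate seen for each
    // (issuer, serial) pair and reports any later, different certificate that
    // reuses the pair.
    type Checker struct {
        seen map[key][32]byte
    }

    func NewChecker() *Checker { return &Checker{seen: make(map[key][32]byte)} }

    func (c *Checker) Observe(cert *x509.Certificate) error {
        k := key{issuer: string(cert.RawIssuer), serial: cert.SerialNumber.String()}
        fp := sha256.Sum256(cert.Raw)
        if prev, ok := c.seen[k]; ok && prev != fp {
            return fmt.Errorf("duplicate serial %s from issuer %q", k.serial, cert.Issuer.String())
        }
        c.seen[k] = fp
        return nil
    }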


Indeed. The de jure requirement in the policy isn't actually important per se, but the only practical way to obey the policy is to do the thing we want you to do, so that's what you'll actually do; and the way the policy is phrased makes enforcement practical.

This is different from a "Brown M&M" policy, where the purpose of the policy is to easily check that you are actually reading and obeying the policy document. Here the policy is worded in terms of something measurable, even though that isn't directly what we want (which isn't measurable), but the only practical way to achieve compliance is to do what we wanted anyway.


Firefox does incidentally error if it sees duplicate serial numbers from the same issuer, though it wouldn't detect this case since the browser won't see precertificates in a TLS handshake.

https://support.mozilla.org/en-US/kb/Certificate-contains-th...

I don't think this is intended to be a security feature, but simply an error from the depths of NSS where some code uses (issuer, serial) as a unique index.


Requiring Certificate Transparency in the browser doesn’t directly prevent this, but (as in this instance) it ensures there is public data anyone can check to see if this situation has occurred.


That's absolutely wild that they had no test to detect that. I wonder what other obvious bugs are floating around in there.


Fascinating! What kind of effects can this have in real life? Are they going to have to revoke the affected certificates?


They will indeed need to revoke the affected certificates within 5 days, per the Baseline Requirements.

Also, the affected certificates won't be accepted by Certificate Transparency-enforcing browsers (Chrome and Safari) because of the precertificate mismatch.


Interestingly, Chrome does seem to be accepting the affected certificates based on some limited testing. I'm not sure why that would be, though. Chrome certainly requires SCTs to be present, but perhaps isn't (always?) checking the SCTs' signatures over the certificates. Once we've wrapped this incident up, I'll have to follow up on that.

Safari does seem to be rejecting them as expected.


I picked 10 affected sites at random, and Chrome correctly rejected every one with ERR_CERTIFICATE_TRANSPARENCY_REQUIRED.

If Chrome isn't validating SCT signatures, that's a serious bug that would enable bypass of CT enforcement, so I'm sure the Chrome team would want to hear more about what you've seen.


It turns out my test was invalid:

I had launched a VM to test, which, it turns out, had an old Chromium installed (via the system package manager, so not self-updating).

Chromium disables CT checking if it doesn't have an updated list of CT logs (either via the component updater or by updating Chrome itself).

I see ERR_CERTIFICATE_TRANSPARENCY_REQUIRED after updating.


Why does Firefox not enforce it?


Here's the 7-year-old bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1281469

I don't know why it is taking them so long, but it makes me sad.


Firefox, the browser, only cares if the certificate is valid (not expired, not revoked, ultimately signed by a root CA it trusts). It does not keep tabs on every certificate ever issued. You wouldn't like it if Firefox did an online check with a central authority for every website you visited, nor would you like it to bundle every single certificate ever issued (or even just serial numbers).

Mozilla, the authors of the browser, are part of the CA/Browser Forum, which holds the threat of complete distrust in all web browsers over CAs; that compels CAs to be open, to provide logs of all the certificates they've issued, and to prove they're not mis-issuing certificates. All those extra checks happen here.


> You wouldn't like it if Firefox did an online check with a central authority for every website you visited

Enforcing Certificate Transparency does not require doing an online check for every website you visit.

> Mozilla, the authors of the browser, are part of the CA/Browser Forum, which holds the threat of complete distrust in all web browsers against CAs, which compels CAs to be open and provide logs of all the certificates they've issued and prove they're not mis-issuing certificates.

The CA/Browser Forum does not require CAs to log the certificates that they issue. CT is enforced entirely within the certificate validator code, and it is a major shortcoming that Firefox does not do it.


You are correct: https://github.com/google/certificate-transparency/blob/mast...

You can embed CT attestations (SCTs) in the certificate itself, so yes, provided the CA is in cooperation with CT log operators, and deliberately does the pre-certificate -> SCTs -> real certificate dance, it is possible for a browser to validate embedded SCTs without an online check.

However, that assumes that the CA actively does that; they don't have to. Neither does the server. What compels them to is _policy_, set by Google and Apple, that their respective browsers won't accept certificates _without_ CT attestations. Google's policy specifically requires that one of the SCTs on a certificate must come from a CT log run by Google. Google also controls the list of CT logs that Chrome will consider as valid CT logs, as part of deciding if an SCT is valid. Antitrust, anyone?

I was trying to make a similar point about Firefox - policy vs code. And rather than it being specifically the CA/Browser Forum setting policy (it does, but only baseline policy, which does not include CT), each org in the CA/Browser Forum has its own root cert inclusion program with its own policies, which all draw from the baseline policy and then add to it. You are right, _baseline_ policy does not require CT....

... and neither does _Mozilla's_ policy, now I've scanned through it. It actively acknowledges that CT exists (in that it mandates that if you issue a precertificate for CT, you _must_ issue the completed certificate), but it does _not_ require CAs to use CT. In stark contrast to Google and Apple.

Perhaps this is why they also don't implement CT checking in Firefox?


> It actively acknowledges that CT exists (in that it mandates that if you issue a precertificate for CT, you _must_ issue the completed certificate)

I don't think that's what the document says. I don't see a requirement to issue the final certificate. This portion is putting pre-certificates into scope of the agreement in that a mis-issued pre-certificate is evidence of intent to mis-issue a final certificate. So, before issuing a pre-certificate, a CA has to be prepared to revoke the final certificate, even if they never actually issue the final certificate; as well as prepared to defend the issuing of the final certificate.

Presumably, this is to guard against CAs claiming a pre-certificate was issued for testing only and wasn't going to be issued as a final certificate. Also, I'd presume that a CA issuing pre-certificates so it could embed SCTs would abort issuance if it were unable to get a response from the certificate log, but there's always the chance that the submission went fine and the pre-certificate is logged, yet the response didn't make it back, so the CA would abort anyway.


I was involved in the drafting of that language and you are 100% correct.


There is a major distinction between root store policy and CT policy which you are missing.

Root store policy contains requirements which are enforced by audits, and if a CA violates the root store policy it is considered misissuance requiring them to revoke the offending certificates and file an incident report. Neither Chrome nor Apple root store policies require CT.

CT policy describes what CAs must do for their certificates to be accepted by the certificate validation code. CT policy is enforced entirely by code. It is not an incident if a CA doesn't comply with CT policy; it just means their certificates won't be accepted.


Chrome also decides what CAs they will accept in Chrome in the first place, so CT doesn’t give them any extra monopoly levers.


Certificate Transparency verification only requires that the server provide stapled proof (either in the certificate itself or in an OCSP response) that the certificate was submitted to the public logs. It does not involve any extra requests from the browser to a third party; at most it involves a periodic request from the server to the CA with no client-specific data.


> You wouldn't like it if Firefox did an online check with a central authority for every website you visited

And yet OCSP stapling is still far from ubiquitous.


I wonder how much this has to do with OCSP stapling being so badly implemented in Apache 2 and nginx (don't know about the other servers). This article from 2019 [1] still seems current; at least I can attest that I still have issues with nginx. Also this Super User Q&A [2], which suggests priming OCSP with a cron job because nginx does not do its job by itself.

[1] https://blog.apnic.net/2019/01/15/is-the-web-ready-for-ocsp-...

[2] https://superuser.com/questions/1635407/ocsp-not-working-con...
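
The cron-job workaround in [2] essentially amounts to fetching the OCSP response yourself and handing the cached file to the server (nginx can read one via its ssl_stapling_file directive, if I recall correctly). A rough Go sketch of the fetch side using golang.org/x/crypto/ocsp; the file paths are hypothetical and error handling is minimal:

    package main

    import (
        "bytes"
        "crypto/x509"
        "encoding/pem"
        "fmt"
        "io"
        "net/http"
        "os"

        "golang.org/x/crypto/ocsp"
    )

    // loadCert reads a single PEM certificate from disk.
    func loadCert(path string) (*x509.Certificate, error) {
        data, err := os.ReadFile(path)
        if err != nil {
            return nil, err
        }
        block, _ := pem.Decode(data)
        if block == nil {
            return nil, fmt.Errorf("no PEM block in %s", path)
        }
        return x509.ParseCertificate(block.Bytes)
    }

    func main() {
        // Hypothetical paths; point these at your leaf and issuer certificates.
        cert, err := loadCert("/etc/ssl/leaf.pem")
        if err != nil {
            panic(err)
        }
        issuer, err := loadCert("/etc/ssl/issuer.pem")
        if err != nil {
            panic(err)
        }

        req, err := ocsp.CreateRequest(cert, issuer, nil)
        if err != nil {
            panic(err)
        }
        // cert.OCSPServer lists the CA's OCSP responder URLs.
        resp, err := http.Post(cert.OCSPServer[0], "application/ocsp-request", bytes.NewReader(req))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        der, err := io.ReadAll(resp.Body)
        if err != nil {
            panic(err)
        }
        // Validate the response before caching it for the web server to staple.
        if _, err := ocsp.ParseResponseForCert(der, cert, issuer); err != nil {
            panic(err)
        }
        if err := os.WriteFile("/etc/ssl/ocsp.der", der, 0644); err != nil {
            panic(err)
        }
    }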


Regular reminder that the best ACME clients will fall back to other CAs if one is down. For example caddy does this. (Disclosure yada yada)
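
In the abstract, the fallback logic is simple: try each configured CA in order and take the first success. A generic sketch (the Issuer interface here is made up for illustration and is not Caddy's actual API):

    package fallback

    import (
        "context"
        "errors"
    )

    // Issuer is a stand-in for an ACME CA the client is configured to use;
    // the name and shape are hypothetical.
    type Issuer interface {
        Name() string
        Issue(ctx context.Context, domain string) ([]byte, error)
    }

    // ObtainWithFallback tries each configured CA in order and returns the
    // first certificate it gets, so an outage at one CA doesn't block issuance.
    func ObtainWithFallback(ctx context.Context, issuers []Issuer, domain string) ([]byte, error) {
        var errs []error
        for _, iss := range issuers {
            cert, err := iss.Issue(ctx, domain)
            if err == nil {
                return cert, nil
            }
            errs = append(errs, err)
        }
        return nil, errors.Join(errs...)
    }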


The missing disclosure is "I'm the author of the Caddy web server". I'm not sure why you would do it halfway.


Because it's distracting, honestly. Anyone could say "Caddy does this" and they would objectively be correct. The point is about ACME client features and giving an objectively true example.


Then don't put the disclosure at all, it's just weird to do the "distracting" part but not give the information.


Especially if their next comment is a reply to someone else and states "Mine is the gold standard, here's proof".

Just make the disclosure clear but short if the comment is a throwaway, or don't mention your own service, otherwise it's a race to the bottom.


That wouldn't work for caddy if you also follow the best practice to have a CAA record pointing to the issuer and account URL, unless caddy is also managing DNS records in addition to being an HTTP server. (I don't know if it is, but I would think it's a layering violation for an HTTP server to also be a DNS server.)


This is true, if you manually configure a CAA limited to just one CA, then you lose that benefit of redundancy.

I recommend trusting multiple CAs (but not too many): https://matt.life/writing/the-acme-protocol-in-practice-and-...

> (I don't know if it is, but I would think it's a layering violation for an HTTP server to also be a DNS server.)

Caddy 2 is, at its core, a server of servers. The HTTP server is just an "app module" for Caddy. There are other servers; I don't know of a DNS server app yet. (CoreDNS is a fork of Caddy v1, though.)


> the best ACME clients will fall back

Are there others that do it, or are you just saying that yours is the best?


There are others that do it.

While we're on the topic, I will say that Caddy has independently and repeatedly been cited as the gold standard of ACME clients, "the best client experience," and "we hope to see other servers follow Caddy's lead." [0] [1] [and others I don't have links to currently].

[0]: https://www.youtube.com/watch?v=OE5UhQGg_Fo

[1]: https://jhalderm.com/pub/papers/letsencrypt-ccs19.pdf


I use Caddy and was just wondering if my stuff would be broken, so the GP comment was useful to me, snarky as your reply was.


Sure, there are a handful of useful points there (off the top of my head: it's possible to work around this failure by using multiple CAs, multiple free ACME CAs exist, caddy implements this solution). I'm just 1. slightly frustrated that caddy's author never seems to miss a chance for self-promotion (at least he's started alluding to the fact that it's his project), and 2. actually curious whether any other ACME clients are implementing that fallback.


> I'm just 1. slightly frustrated that caddy's author never seems to miss a chance for self-promotion (at least he's started alluding to the fact that it's his project),

Sorry that I happen to be the author. It's really not about that though -- it just matters that an ACME-native HTTPS server exists. We need more integrated fully-native ACME clients.

> actually curious whether any other ACME clients are implementing that fallback.

There are at least one or two others. I don't recall which ones at the moment but I think Certify the Web may be one. Edit: mod_md is another apparently!


Right, it's astounding to me that outfits like Microsoft didn't just immediately ship decent ACME implementations. People seem to have settled for third party bolt-on solutions. It's like you wake up in an alternate world where yeah, no cars come with seat belts, but of course everybody buys seatbelts for the car, there's usually a store next to the car dealer which sells them. Um. What?

Most of the "popular" software in this space is garbage. I spent the entire day today (aside from meetings and helping other people debug problems) wrestling with the fact Apache seems to be designed so heavily with a C programmer mindset that even the idea of reporting problems has never occurred to them. Just blunder on, it'll be fine, don't think about it. You can sprinkle complete nonsense into Apache configuration files and, until you trip an actual syntactical error and blow up their parser, Apache just presses on anyway with the nonsense values you provided, and if that doesn't work, no reason to report it just do whatever was the default and hope that's OK.

As far as I can tell, in the wild the result is a lot of Apache configuration is complete nonsense, but hey no errors are reported, so, copy, paste, move on.


Apache mod_md has fallback too, https://github.com/icing/mod_md#acme-failover I'm just a user, not the author, and I didn't try the fallback. I'm more worried about stuff breaking if I switch issuers than certs expiring without me noticing. I've got some embedded junk that hits my website and has weak cert validation, so better to stick with something that works.


For self promotion that was pretty light. You ever seen the Sourcehut guy on here?


Promoting went from HTTP headers to HN comments.


How would you feel if most of the times you post online, someone shows up and mentions a mistake you made years ago? Not cool.


Does mholt consider it a mistake? I'm aware that it was reverted (https://github.com/caddyserver/caddy/pull/1866) and that mholt found the whole thing difficult (which is my attempt to neutrally summarize https://caddy.community/t/the-realities-of-being-a-foss-main... accurately), but that is a somewhat different statement. If so, then yes, it's unkind and unhelpful to keep bringing it up, but if no then it's useful to keep previous behavior in mind when evaluating the product.


Yes, it was a mistake. (Source: I'm a Caddy maintainer, and it comes up in our discussions from time to time.) The reason it was done was that Caddy needed some source of revenue since Matt had made it his full-time job, and he assumed the sponsors would appreciate the extra promotion (and as you can see at the bottom of that github link, he tried to alert them but received no feedback). Of course, thinking that no feedback was an implicit "sure" was a lapse in judgment, but we're all human.

Remember, this was six years ago. That's an eternity in this industry. Caddy is a very different project than it was then, and Matt has a different and more stable revenue stream than he did then. We can promise we'll never attempt the same thing again.

But seriously, this comes up in like one in ten HN threads where Matt comments, it's exhausting to keep telling people "okay can you please forget what you remember from 6 years ago and look at the project for what it is now?"


Yes, I understand the reasons why caddy did all the things I objected to; money is actually important, telemetry can be useful to devs, and the early (non) packaging decisions were clearly meant to optimize the on-ramp. But just because I understand a decision doesn't mean I agree, and doesn't mean I'm not going to include that information in my own decision to avoid a program.

> Remember, this was six years ago. That's an eternity in this industry. Caddy is a very different project than it was then, and Matt has a different and more stable revenue stream than he did then. We can promise we'll never attempt the same thing again.

I think this was meant to be reassuring, but it really makes it sound more like it was purely a pragmatic thing. Okay, so now Caddy has stable cash flow, so no adware. Next year the economy lurches and the money goes away; is caddy going to start making awkward decisions again?

> But seriously, this comes up in like one in ten HN threads where Matt comments, it's exhausting to keep telling people "okay can you please forget what you remember from 6 years ago and look at the project for what it is now?"

You know that line about how people will forget what you do, but not how you made them feel? I remember exactly how I felt when the wonderful server software I was using decided to start shipping adware. And now, having backed off but never actually apologized, you want people to just forget about the whole thing? That's not how it works. Edit: Now that we've had this exchange, and at least you have called it a mistake and said it won't happen again, I can update my evaluation based on that. I would suggest that saying that six years ago in the announcements channel would have reduced the number of times you needed to have this conversation.

Edit2: Realized there was a much more succinct way of answering: If someone feels that you wronged them, you don't get to choose when they get over it. A lot of users felt that caddy treated them poorly. And honestly, even if the project had said then what you're saying now, some of them would still remember that.


If your certificate was issued after the start of the incident but before Let's Encrypt suspended issuance, then the certificate is currently not working in Chrome or Safari.

This incident wasn't just about downtime, it was also about issuing non-functional/non-compliant certificates.


Is this because the certs were revoked? (Revocation is broken ;P)

Caddy staples Valid OCSP responses to all certificates that have an OCSP responder, so if browsers aren't accepting that, then arguably the clients are broken, because that response is valid until a few days from now. But before the 100% valid and trusted OCSP staple expires, Caddy will get a new staple that presumably says Revoked, and replace it right away before browsers would ever see a Revoked status.

(Revocation is broken ;P)
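
The refresh behavior described above boils down to fetching a new OCSP response well before the cached one expires, e.g. once the staple is past the midpoint of its validity window. A small sketch of that decision (not Caddy's actual code), using golang.org/x/crypto/ocsp's parsed response:

    package staple

    import (
        "time"

        "golang.org/x/crypto/ocsp"
    )

    // NeedsRefresh reports whether a cached OCSP staple is past the halfway
    // point of its validity window: fetch a fresh response well before the
    // current one expires, so clients never see a stale or expired staple.
    func NeedsRefresh(resp *ocsp.Response, now time.Time) bool {
        halfway := resp.ThisUpdate.Add(resp.NextUpdate.Sub(resp.ThisUpdate) / 2)
        return now.After(halfway)
    }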


No, it's because the SCTs in the certificate have invalid signatures.


Ah, that makes sense!

I wonder if we should be doing some basic sanity checks on newly obtained certificates in Caddy, and treat this as a failure, and try the next configured CA instead.

(Obviously SCT signatures will require some external resource so we would have to weigh that a bit more, maybe make it configurable...)

Issue opened here to discuss, though it does sound troublesome/tedious: https://github.com/caddyserver/certmagic/issues/240
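
One cheap check that needs no external data is confirming that the new certificate actually carries embedded SCTs at all; verifying their signatures (the failure in this incident) additionally requires the CT logs' public keys. A standard-library-only sketch of the extraction step, with the OID and TLS framing from RFC 6962:

    package sctcheck

    import (
        "crypto/x509"
        "encoding/asn1"
        "encoding/binary"
        "fmt"
    )

    // OID of the embedded SCT list extension (RFC 6962, section 3.3).
    var oidSCTList = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 2}

    // EmbeddedSCTs returns the raw, TLS-serialized SCTs embedded in a
    // certificate. Verifying their signatures additionally requires the public
    // keys of the CT logs, which is why this is only a partial check.
    func EmbeddedSCTs(cert *x509.Certificate) ([][]byte, error) {
        var listBytes []byte
        for _, ext := range cert.Extensions {
            if ext.Id.Equal(oidSCTList) {
                // The extension value is an OCTET STRING wrapping the
                // SignedCertificateTimestampList TLS structure.
                if _, err := asn1.Unmarshal(ext.Value, &listBytes); err != nil {
                    return nil, err
                }
            }
        }
        if listBytes == nil {
            return nil, fmt.Errorf("certificate has no embedded SCT list")
        }
        if len(listBytes) < 2 {
            return nil, fmt.Errorf("truncated SCT list")
        }
        total := int(binary.BigEndian.Uint16(listBytes))
        body := listBytes[2:]
        if total != len(body) {
            return nil, fmt.Errorf("SCT list length mismatch")
        }
        var scts [][]byte
        for len(body) > 0 {
            if len(body) < 2 {
                return nil, fmt.Errorf("truncated SCT entry")
            }
            n := int(binary.BigEndian.Uint16(body))
            body = body[2:]
            if len(body) < n {
                return nil, fmt.Errorf("truncated SCT entry")
            }
            scts = append(scts, body[:n])
            body = body[n:]
        }
        return scts, nil
    }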


Yeah, the question is how far you want to go. To be safe against every possible CA screwup, you basically have to re-implement every browser's entire certificate validation engine and run certificates through each one. That would obviously be very hard, and could do more harm than good if it falls out-of-sync with browsers.


It would make sense to try to check this, if there's a reasonable way to access a database of trusted certificate logs. If not, it's going to be tricky; I wouldn't fail a certificate that had a SCT from an unknown log, because it might be valid and you don't know. Etc.


What other CAs do you recommend aside from Let's Encrypt? I'm a bit wary of trying some random CA that offers free certificates.


It's good to be wary.

You can trust the ACME CAs listed on this site: https://www.acmeisuptime.com/ (Although, I think that list could use some updating. I'll ping the author.)

Personally I would use Let's Encrypt, ZeroSSL (Sectigo), and Google Trust Services. There are, of course, others, but which ones you choose depends on your requirements and such. (Some offer business support, for example.) SSL.com and Sectigo also offer ACME, but I am not sure how performant their CA software is.


Looks like they're back online with a fix.


On the bright side, that's actually one of the lower-impact things to have an outage on, IMO; if you're using it the recommended way, an outage would only really affect new certs, with older certs just getting renewed slightly later.


If you’re onboarding new users into something like a SaaS or PaaS it could be a bigger deal.


What's the impact of an outage like this? ACME renewals should happen daily starting 30 days before expiry, so no one should have had a cert expire due to this. New certificates wouldn't have been issued, so that's some impact, although I suspect most new certs aren't taking traffic immediately (i.e. when setting up a new server).
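
For reference, the renewal rule of thumb described above is just a window check; with 90-day certificates, a 30-day window, and a daily cron run, an outage has to last weeks before anything expires. A trivial sketch:

    package renew

    import (
        "crypto/x509"
        "time"
    )

    // ShouldRenew reports whether a certificate is within `window` of expiry.
    // With 90-day Let's Encrypt certificates and a 30-day window, a daily run
    // gets roughly 30 attempts before an outage like this one would matter.
    func ShouldRenew(cert *x509.Certificate, now time.Time, window time.Duration) bool {
        return now.After(cert.NotAfter.Add(-window))
    }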



