One thing I like about the computing world is that people put things like "wrote the package that frobbed jpegs, which in production frobs over 1 million jpegs per hour". Anything less specific means "I was somewhere in the building when this was written and deployed".
When I worked in pharma people would say something like "I joined a program targeting neurology early in the preclinical phase and developed assays until first-in-human four years later. Started as a senior lab technician and departed as a junior assistant director for preclinical QC."
Took me years to understand how the sociology, regulatory dynamics, and science of the two fields legitimately resulted in these utterly different approaches.
This all happened less than 2 hours ago, but a quick summary is that my Certificate Transparency monitor, Cert Spotter (https://sslmate.com/certspotter), performs various sanity checks on every certificate that it observes. At 15:41 UTC today, I started getting alerts that certificates from Let's Encrypt were failing one particular check. I quickly emailed Let's Encrypt's problem reporting address, and Let's Encrypt promptly suspended issuance so they could investigate. I've lost count of how many CAs I've detected having this particular problem, so perhaps it is time to blog about it (https://www.agwa.name/blog if you're interested).
They "lint" certificates before issuance, as do most CAs. However, I don't think any linters check for this problem, as it requires access to more than just the certificate (the linter would need access to either the precertificate or a database of Certificate Transparency log keys).
We will add a lint to Boulder for precertificate and certificate correspondence to ensure this class of problem never happens again.
It would be nice to add this to Zlint, but we'd need a new interface that could be given both a precertificate and a certificate to co-lint. Other than this one correspondence check, I'm not sure there are any other lints that would fit that pattern.
The problem is actually WAY more subtle, and pretty hard to understand unless you really get into the weeds of Certificate Transparency and certificate policy, but I'll take a shot at providing a concise explanation.
Let's Encrypt has produced two signed artifacts with the same serial number:
1. A precertificate, which was submitted to Certificate Transparency logs.
2. A certificate, which was given to the subscriber.
A precertificate is not a certificate, but it implies the existence of a corresponding certificate which can be constructed by applying an algorithm to the precertificate.
Let's Encrypt intended to create a precertificate which would result in (2) when applying the algorithm to (1). Unfortunately, applying the algorithm to (1) results in a different certificate, (3), presumably because of some bug in Let's Encrypt. Since (2) and (3) have the same serial number, it's a violation of the prohibition against duplicate serial numbers.
An easier-to-understand description of the problem is that Let's Encrypt was producing precertificates that didn't match the final certificate, but the compliance violation is duplicate serial numbers, which is why I worded my compliance bug the way I did.
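To make the correspondence concrete, here's a rough sketch in Go (standard library only) of the kind of check a linter with access to both artifacts could run. It's illustrative rather than Boulder's or certspotter's real code; a real check would compare the DER TBSCertificates after the RFC 6962 transformation and handle precertificate signing CAs.

```go
// Sketch of a precertificate/certificate correspondence check, loosely in the
// spirit of RFC 6962 section 3.2. Illustrative only, not any CA's actual lint.
package ctcheck

import (
	"bytes"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/asn1"
	"fmt"
)

var (
	oidCTPoison = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 3} // precert poison
	oidSCTList  = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 2} // embedded SCTs
)

// stripExt returns the extension list with any extension matching oid removed.
func stripExt(exts []pkix.Extension, oid asn1.ObjectIdentifier) []pkix.Extension {
	var out []pkix.Extension
	for _, e := range exts {
		if !e.Id.Equal(oid) {
			out = append(out, e)
		}
	}
	return out
}

// correspond reports whether the final certificate plausibly corresponds to the
// precertificate: same serial, names, validity, key, and the same extensions
// once the poison and SCT-list extensions are ignored.
func correspond(pre, final *x509.Certificate) error {
	if pre.SerialNumber.Cmp(final.SerialNumber) != 0 {
		return fmt.Errorf("serial numbers differ")
	}
	if !bytes.Equal(pre.RawSubject, final.RawSubject) || !bytes.Equal(pre.RawIssuer, final.RawIssuer) {
		return fmt.Errorf("names differ")
	}
	if !pre.NotBefore.Equal(final.NotBefore) || !pre.NotAfter.Equal(final.NotAfter) {
		return fmt.Errorf("validity periods differ")
	}
	if !bytes.Equal(pre.RawSubjectPublicKeyInfo, final.RawSubjectPublicKeyInfo) {
		return fmt.Errorf("public keys differ")
	}
	preExts := stripExt(pre.Extensions, oidCTPoison)
	finalExts := stripExt(final.Extensions, oidSCTList)
	if len(preExts) != len(finalExts) {
		return fmt.Errorf("extension counts differ")
	}
	for i := range preExts {
		if !preExts[i].Id.Equal(finalExts[i].Id) ||
			preExts[i].Critical != finalExts[i].Critical ||
			!bytes.Equal(preExts[i].Value, finalExts[i].Value) {
			return fmt.Errorf("extension %v differs", preExts[i].Id)
		}
	}
	return nil
}
```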
If applying the algorithm to (1) produced (3), what produced (2)?
How can "no duplicate serial numbers" be enforced by any browser without having a store of all certificates? Is it simply a best-effort? Will the browser have a mapping from <serial number> to <certificate>, and whenever it sees a certificate, it will check this map to see if it has seen that serial number on a separate certificate?
> If applying the algorithm to (1) produced (3), what produced (2)?
I believe the root of the problem is that Let's Encrypt is creating certificates and precertificates independently, instead of creating a precertificate and then applying the algorithm to create the corresponding certificate. Since their processes for certificates and precertificates got out-of-sync, they ended up producing (2) instead of (3).
> How can "no duplicate serial numbers" be enforced by any browser without having a store of all certificates?
Browser software doesn't enforce this. It can only be enforced by scanning Certificate Transparency logs looking for violations.
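As a rough illustration (hypothetical code, not any particular monitor's), such a scanner can key everything it sees by (issuer, serial) and alert when two different certificates collide:

```go
// Hypothetical sketch of how a CT monitor might flag duplicate serials:
// remember a fingerprint per (issuer, serial) pair and report collisions.
package serialwatch

import (
	"bytes"
	"crypto/sha256"
	"crypto/x509"
	"fmt"
)

type key struct {
	issuer string
	serial string
}

type Watcher struct {
	seen map[key][32]byte // fingerprint of the first cert seen for each key
}

func NewWatcher() *Watcher {
	return &Watcher{seen: make(map[key][32]byte)}
}

// Observe records a certificate and returns an error if a different
// certificate with the same (issuer, serial) pair was seen earlier.
func (w *Watcher) Observe(cert *x509.Certificate) error {
	k := key{issuer: string(cert.RawIssuer), serial: cert.SerialNumber.String()}
	fp := sha256.Sum256(cert.Raw)
	if prev, ok := w.seen[k]; ok {
		if !bytes.Equal(prev[:], fp[:]) {
			return fmt.Errorf("duplicate serial %s from issuer %s", k.serial, cert.Issuer.String())
		}
		return nil
	}
	w.seen[k] = fp
	return nil
}
```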
Indeed. The de jure requirement in the policy isn't actually important per se, but the only practical way to obey the policy is to do the thing we want you to do, so that's what you'll actually do, and the way the policy is phrased makes enforcement practical.
This is different from a "Brown M&M" policy, where the purpose of the policy is to easily check that you are actually reading and obeying the policy document. Here the policy is worded in a way that doesn't directly describe what we want but is measurable, whereas what we actually want isn't measurable; still, the only practical way to achieve policy compliance is to do what we wanted anyway.
Firefox does incidentally error if it sees duplicate serial numbers from the same issuer, though it wouldn't detect this case since the browser won't see precertificates in a TLS handshake.
I don't think this is intended to be a security feature, but simply an error from the depths of NSS where some code uses (issuer, serial) as a unique index.
Requiring Certificate Transparency in the browser doesn’t directly prevent this, but (as in this instance) it ensures there is public data anyone can check to see if this situation has occurred.
They will indeed need to revoke the affected certificates within 5 days, per the Baseline Requirements.
Also, the affected certificates won't be accepted by Certificate Transparency-enforcing browsers (Chrome and Safari) because of the precertificate mismatch.
Interestingly, Chrome does seem to be accepting the affected certificates, based on some limited testing. I'm not sure why that would be, though. Chrome certainly requires SCTs to be present, but perhaps isn't (always?) checking the SCTs' signatures over the certificates. Once we've wrapped this incident up, I'll have to follow up on that.
Safari does seem to be rejecting them as expected.
I picked 10 affected sites at random, and Chrome correctly rejected every one with ERR_CERTIFICATE_TRANSPARENCY_REQUIRED.
If Chrome isn't validating SCT signatures, that's a serious bug that would enable bypass of CT enforcement, so I'm sure the Chrome team would want to hear more about what you've seen.
Firefox, the browser, only cares if the certificate is valid (not expired, not revoked, ultimately signed by a root CA it trusts). It does not keep tabs on every certificate ever issued. You wouldn't like it if Firefox did an online check with a central authority for every website you visited, nor would you like it to bundle every single certificate ever issued (or even just serial numbers).
Mozilla, the authors of the browser, are part of the CA/Browser Forum, which holds the threat of complete distrust in all web browsers over CAs; that threat compels CAs to be open, provide logs of all the certificates they've issued, and prove they're not mis-issuing certificates. All those extra checks happen there.
> You wouldn't like it if Firefox did an online check with a central authority for every website you visited
Enforcing Certificate Transparency does not require doing an online check for every website you visit.
> Mozilla, the authors of the browser, are part of the CA/Browser Forum, which holds the threat of complete distrust in all web browsers against CAs, which compels CAs to be open and provide logs of all the certificates they've issued and prove they're not mis-issuing certificates.
The CA/Browser Forum does not require CAs to log the certificates that they issue. CT is enforced entirely within the certificate validator code, and it is a major shortcoming that Firefox does not do it.
You can embed CT attestations (SCTs) in the certificate itself, so yes, provided the CA cooperates with CT log operators and deliberately does the pre-certificate -> SCTs -> real certificate dance, it is possible for a browser to validate embedded SCTs without an online check.
However, that assumes the CA actively does that; they don't have to. Neither does the server. What's compelling them to do so is _policy_, set by Google and Apple, that their respective browsers won't accept certificates _without_ CT attestations. Google's policy specifically requires that one of the SCTs on a certificate come from a CT log run by Google. Google also controls the list of CT logs that Chrome will consider valid, as part of deciding whether an SCT is valid. Antitrust, anyone?
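For what it's worth, embedded SCTs are just an X.509 extension (OID 1.3.6.1.4.1.11129.2.4.2), so spotting them requires no network access at all; verifying their signatures additionally needs the log public keys a CT-enforcing browser already ships. A hypothetical Go helper:

```go
// Hypothetical helper: embedded SCTs live in a normal X.509 extension, so a
// client can detect them with no network round trip. Verifying the SCT
// signatures additionally requires the logs' public keys, which CT-enforcing
// browsers ship as part of their CT policy.
package sctcheck

import (
	"crypto/x509"
	"encoding/asn1"
)

var oidEmbeddedSCTs = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 2}

func hasEmbeddedSCTs(cert *x509.Certificate) bool {
	for _, ext := range cert.Extensions {
		if ext.Id.Equal(oidEmbeddedSCTs) {
			return true
		}
	}
	return false
}
```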
I was trying to make a similar point about Firefox - policy vs code. And rather than saying that it's specifically the CA/Browser Forum setting policy (which it does, but only baseline policy, which does not include CT), I should have said that each org in the CA/Browser Forum has its own root cert inclusion program with its own policies, which all draw from the baseline policy and then add to it. You are right, _baseline_ policy does not require CT....
... and neither does _Mozilla's_ policy, now that I've scanned through it. It actively acknowledges that CT exists (in that it mandates that if you issue a precertificate for CT, you _must_ issue the completed certificate), but it does _not_ require CAs to use CT. In stark contrast to Google and Apple.
Perhaps this is why they also don't implement CT checking in Firefox?
> It actively acknowledges that CT exists (in that it mandates that if you issue a precertificate for CT, you _must_ issue the completed certificate)
I don't think that's what the document says. I don't see a requirement to issue the final certificate. This portion is putting pre-certificates into scope of the agreement, in that a mis-issued pre-certificate is evidence of intent to mis-issue a final certificate. So, before issuing a pre-certificate, a CA has to be prepared to revoke the final certificate, even if they never actually issue it, as well as be prepared to defend the issuance of the final certificate.
Presumably, this is to keep CAs from claiming a pre-certificate was issued for testing only and was never going to be issued as a final certificate. Also, I'd presume that a CA issuing pre-certificates so it could embed SCTs would abort issuance if it were unable to get a response from the certificate log, but there's always the chance that the submission went fine and the pre-certificate is logged, yet the response didn't make it back, so the CA would abort anyway.
There is a major distinction between root store policy and CT policy which you are missing.
Root store policy contains requirements which are enforced by audits, and if a CA violates the root store policy it is considered misissuance requiring them to revoke the offending certificates and file an incident report. Neither Chrome nor Apple root store policies require CT.
CT policy describes what CAs must do for their certificates to be accepted by the certificate validation code. CT policy is enforced entirely by code. It is not an incident if a CA doesn't comply with CT policy; it just means their certificates won't be accepted.
Certificate Transparency verification only requires the server to provide proof (either embedded in the certificate itself or in a stapled OCSP response) that the certificate was submitted to the public logs. It does not involve any extra requests from the browser to a third party; at most it involves a periodic request from the server to the CA with no client-specific data.
I wonder how much this has to do with OCSP stapling being so badly implemented in Apache 2 and nginx (I don't know about the other servers). This article from 2009 [1] still seems current; at least I can attest that I still have issues with nginx. See also this Super User Q&A [2], which suggests priming OCSP with a cron job because nginx does not do its job by itself.
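For reference, the fetch that a server (or a cron job priming nginx) has to do is simple; here's a minimal Go sketch using golang.org/x/crypto/ocsp, shown only to illustrate that no client-specific data is involved:

```go
// Minimal sketch of fetching a stapleable OCSP response for a certificate.
// The same response can be stapled to every handshake until it expires.
package staple

import (
	"bytes"
	"crypto/x509"
	"fmt"
	"io"
	"net/http"

	"golang.org/x/crypto/ocsp"
)

// fetchStaple asks the certificate's OCSP responder for a fresh response.
func fetchStaple(leaf, issuer *x509.Certificate) ([]byte, *ocsp.Response, error) {
	if len(leaf.OCSPServer) == 0 {
		return nil, nil, fmt.Errorf("certificate has no OCSP responder URL")
	}
	reqDER, err := ocsp.CreateRequest(leaf, issuer, nil)
	if err != nil {
		return nil, nil, err
	}
	httpResp, err := http.Post(leaf.OCSPServer[0], "application/ocsp-request", bytes.NewReader(reqDER))
	if err != nil {
		return nil, nil, err
	}
	defer httpResp.Body.Close()
	respDER, err := io.ReadAll(httpResp.Body)
	if err != nil {
		return nil, nil, err
	}
	// ParseResponseForCert also verifies the responder's signature.
	parsed, err := ocsp.ParseResponseForCert(respDER, leaf, issuer)
	if err != nil {
		return nil, nil, err
	}
	return respDER, parsed, nil
}
```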
Because it's distracting, honestly. Anyone could say "Caddy does this" and they would objectively be correct. The point is about ACME client features and giving an objectively true example.
That wouldn't work for caddy if you also follow the best practice to have a CAA record pointing to the issuer and account URL, unless caddy is also managing DNS records in addition to being an HTTP server. (I don't know if it is, but I would think it's a layering violation for an HTTP server to also be a DNS server.)
> (I don't know if it is, but I would think it's a layering violation for an HTTP server to also be a DNS server.)
Caddy 2 is, at its core, a server of servers. The HTTP server is just an "app module" for Caddy. There are other servers; I don't know of a DNS server app yet. (CoreDNS is a fork of Caddy v1, though.)
While we're on the topic, I will say that Caddy has independently and repeatedly been cited as the gold standard of ACME clients, "the best client experience," and "we hope to see other servers follow Caddy's lead." [0] [1] [and others I don't have links to currently].
Sure, there are a handful of useful points there (off the top of my head: it's possible to work around this failure by using multiple CAs, multiple free ACME CAs exist, caddy implements this solution). I'm just 1. slightly frustrated that caddy's author never seems to miss a chance for self-promotion (at least he's started alluding to the fact that it's his project), and 2. actually curious whether any other ACME clients are implementing that fallback.
> I'm just 1. slightly frustrated that caddy's author never seems to miss a chance for self-promotion (at least he's started alluding to the fact that it's his project),
Sorry that I happen to be the author. It's really not about that though -- it just matters that an ACME-native HTTPS server exists. We need more integrated fully-native ACME clients.
> actually curious whether any other ACME clients are implementing that fallback.
There are at least one or two others. I don't recall which ones at the moment but I think Certify the Web may be one. Edit: mod_md is another apparently!
Right, it's astounding to me that outfits like Microsoft didn't just immediately ship decent ACME implementations. People seem to have settled for third party bolt-on solutions. It's like you wake up in an alternate world where yeah, no cars come with seat belts, but of course everybody buys seatbelts for the car, there's usually a store next to the car dealer which sells them. Um. What?
Most of the "popular" software in this space is garbage. I spent the entire day today (aside from meetings and helping other people debug problems) wrestling with the fact Apache seems to be designed so heavily with a C programmer mindset that even the idea of reporting problems has never occurred to them. Just blunder on, it'll be fine, don't think about it. You can sprinkle complete nonsense into Apache configuration files and, until you trip an actual syntactical error and blow up their parser, Apache just presses on anyway with the nonsense values you provided, and if that doesn't work, no reason to report it just do whatever was the default and hope that's OK.
As far as I can tell, in the wild the result is a lot of Apache configuration is complete nonsense, but hey no errors are reported, so, copy, paste, move on.
Apache mod_md has fallback too (https://github.com/icing/mod_md#acme-failover). I'm just a user, not the author, and I didn't try the fallback. I'm more worried about stuff breaking if I switch issuers than about certs expiring without me noticing. I've got some embedded junk that hits my website and has weak cert validation, so better to stick with something that works.
Does mholt consider it a mistake? I'm aware that it was reverted (https://github.com/caddyserver/caddy/pull/1866) and that mholt found the whole thing difficult (which is my attempt to neutrally summarize https://caddy.community/t/the-realities-of-being-a-foss-main... accurately), but that is a somewhat different statement. If so, then yes, it's unkind and unhelpful to keep bringing it up, but if no then it's useful to keep previous behavior in mind when evaluating the product.
Yes, it was a mistake. (Source: I'm a Caddy maintainer, and it comes up in our discussions from time to time.) The reason it was done was that Caddy needed some source of revenue once Matt made it his full-time job, and he assumed the sponsors would appreciate the extra promotion (and as you can see at the bottom of that github link, he tried to alert them but received no feedback). Of course, thinking that no feedback was an implicit "sure" was a lapse in judgment, but we're all human.
Remember, this was six years ago. That's an eternity in this industry. Caddy is a very different project than it was then, and Matt has a different and more stable revenue stream than he did then. We can promise we'll never attempt the same thing again.
But seriously, this comes up in like one in ten HN threads where Matt comments, it's exhausting to keep telling people "okay can you please forget what you remember from 6 years ago and look at the project for what it is now?"
Yes, I understand the reasons why caddy did all the things I objected to; money is actually important, telemetry can be useful to devs, and the early (non) packaging decisions were clearly meant to optimize the on-ramp. But just because I understand a decision doesn't mean I agree, and doesn't mean I'm not going to include that information in my own decision to avoid a program.
> Remember, this was six years ago. That's an eternity in this industry. Caddy is a very different project than it was then, and Matt has a different and more stable revenue stream than he did then. We can promise we'll never attempt the same thing again.
I think this was meant to be reassuring, but it really makes it sound more like it was purely a pragmatic thing. Okay, so now Caddy has stable cash flow, so no adware. Next year the economy lurches and the money goes away; is caddy going to start making awkward decisions again?
> But seriously, this comes up in like one in ten HN threads where Matt comments, it's exhausting to keep telling people "okay can you please forget what you remember from 6 years ago and look at the project for what it is now?"
You know that line about how people will forget what you do, but not how you made them feel? I remember exactly how I felt when the wonderful server software I was using decided to start shipping adware. And now, having backed off but never actually apologized, you want people to just forget about the whole thing? That's not how it works. Edit: Now that we've had this exchange, and at least you have called it a mistake and said it won't happen again, I can update my evaluation based on that. I would suggest that saying that six years ago in the announcements channel would have reduced the number of times you needed to have this conversation.
Edit2: Realized there was a much more succinct way of answering: If someone feels that you wronged them, you don't get to choose when they get over it. A lot of users felt that caddy treated them poorly. And honestly, even if the project had said then what you're saying now, some of them would still remember that.
If your certificate was issued after the start of the incident but before Let's Encrypt suspended issuance, then the certificate is currently not working in Chrome or Safari.
This incident wasn't just about downtime, it was also about issuing non-functional/non-compliant certificates.
Is this because the certs were revoked? (Revocation is broken ;P)
Caddy staples valid OCSP responses to all certificates that have an OCSP responder, so if browsers aren't accepting that, then arguably the clients are broken, because that response is valid until a few days from now. But before the 100% valid and trusted OCSP staple expires, Caddy will get a new staple that presumably says Revoked, and will replace the affected certificates right away, before browsers would ever see a Revoked status.
I wonder if we should be doing some basic sanity checks on newly obtained certificates in Caddy, and treat this as a failure, and try the next configured CA instead.
(Obviously, checking SCT signatures would require some external resource, so we would have to weigh that a bit more; maybe make it configurable...)
Yeah, the question is how far you want to go. To be safe against every possible CA screwup, you basically have to re-implement every browser's entire certificate validation engine and run certificates through each one. That would obviously be very hard, and could do more harm than good if it falls out-of-sync with browsers.
It would make sense to try to check this, if there's a reasonable way to access a database of trusted certificate logs. If not, it's going to be tricky; I wouldn't fail a certificate that has an SCT from an unknown log, because it might be valid and you don't know. Etc.
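A rough sketch of that fallback idea in Go; the Issuer interface and function names here are hypothetical, not Caddy's actual API:

```go
// Sketch of "sanity-check the new cert, otherwise fall back to the next
// configured CA". Illustrative only.
package fallback

import (
	"context"
	"crypto/x509"
	"fmt"
	"time"
)

// Issuer abstracts anything that can obtain a certificate chain, e.g. an ACME CA.
type Issuer interface {
	Name() string
	Issue(ctx context.Context, csrDER []byte) ([]*x509.Certificate, error)
}

// checkLeaf runs cheap post-issuance sanity checks on the leaf certificate.
func checkLeaf(leaf *x509.Certificate, hostname string) error {
	now := time.Now()
	if now.Before(leaf.NotBefore) || now.After(leaf.NotAfter) {
		return fmt.Errorf("certificate is not currently valid")
	}
	if err := leaf.VerifyHostname(hostname); err != nil {
		return fmt.Errorf("certificate does not cover %s: %w", hostname, err)
	}
	// Further checks could go here: chain building against system roots,
	// presence of embedded SCTs, or full SCT signature verification if we
	// are willing to carry a list of trusted log keys.
	return nil
}

// obtain tries each issuer in order until one returns a chain whose leaf
// passes the sanity checks.
func obtain(ctx context.Context, issuers []Issuer, csrDER []byte, hostname string) ([]*x509.Certificate, error) {
	var lastErr error
	for _, iss := range issuers {
		chain, err := iss.Issue(ctx, csrDER)
		if err != nil || len(chain) == 0 {
			lastErr = fmt.Errorf("%s: issuance failed: %v", iss.Name(), err)
			continue
		}
		if err := checkLeaf(chain[0], hostname); err != nil {
			lastErr = fmt.Errorf("%s issued a bad certificate: %w", iss.Name(), err)
			continue
		}
		return chain, nil
	}
	return nil, fmt.Errorf("all issuers failed, last error: %v", lastErr)
}
```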
You can trust the ACME CAs listed on this site: https://www.acmeisuptime.com/ (Although, I think that list could use some updating. I'll ping the author.)
Personally I would use Let's Encrypt, ZeroSSL (Sectigo) and Google Trust Services. There are, of course, others. But which ones you choose depends on your requirements and such. (Some offer business support, for example.) SSL.com and Sectigo also offer ACME but I am not sure how performant their CA software is.
On the bright side, that's actually one of the lower-impact things to have an outage on, IMO; if you're using it the recommended way, an outage would only really affect new certs, with older certs just getting renewed slightly later.
What's the impact of an outage like this? ACME renewal attempts should happen daily starting 30 days before expiry, so no one should have had a cert expire due to this. New certificates wouldn't have been issued, so that's impact, although I suspect most new certs aren't taking traffic immediately (e.g. when setting up a new server).