I came here to post this, too :) What the thingino community managed to do with their firmware for these cameras is nothing short of amazing - if you happen to have a compatible camera, you really, really should give it a whirl!
I'd love to but... how? One alternative seems to be a programmer chip that must be purchased and then modified to not fry the camera with 5V. Another is maybe stripping a USB cable and soldering it to the wifi pads on the camera chip?
Neither of these seem like good ideas for someone like me, who is relatively hardware naïve and has small children running around making it hard to concentrate for more than 30 minutes at a time.
The question is genuine. I want to do this but don't actually know by which method.
Yeah, I can see why that is a show-stopper for people. However, the thingino project has people among them who care deeply about ease of installation - so with these security issues discovered in the TP-Link device, chances are an installation method that relies on a vulnerable stock firmware will be provided in time :)
In this case I'm asking specifically about the C200 this article is about. Sorry for not being more clear. From what I understand the C200 does not boot from SD card.
I think Thingino is great. But there are definitely still dragons lurking. I reported a bug last year and mostly forgot about it. Got a response a few months ago to check out a fix related to unexpected memory access.
I generally try not to be a huge Rust cheerleader but seriously. Yikes.
I realize this is mostly tangential to the article, but a word of warning for those who are about to mess with overcommit for the first time: In my experience, the extreme stance of "always do [thing] with overcommit" is just not defensible, because most (yes, also "server") software is just not written under the assumption that being able to deal with allocation failures in a meaningful way is a necessity. At best, there's a "malloc() or die"-like stanza in the source, and that's that.
You can and maybe even should disable overcommit this way when running postgres on the server (and only a minimum of what you would these days call sidecar processes (monitoring and backup agents, etc.) on the same host/kernel), but once you have a typical zoo of stuff using dynamic languages living there, you WILL blow someone's leg off.
I run my development VM with overcommit disabled and the way stuff fails when it runs out of memory is really confusing and mysterious sometimes. It's useful for flushing out issues that would otherwise cause system degradation w/overcommit enabled, so I keep it that way, but yeah... doing it in production with a bunch of different applications running is probably asking for trouble.
The fundamental problem is that your machine is running software from a thousand different projects or libraries just to provide the basic system, and most of them do not handle allocation failure gracefully. If program A allocates too much memory and overcommit is off, that doesn't necessarily mean that A gets an allocation failure. It might also mean that code in library B in background process C gets the failure, and fails in a way that puts the system in a state that's not easily recoverable, and is possibly very different every time it happens.
For cleanly surfacing errors, overcommit=2 is a bad choice. For most servers, it's much better to leave overcommit on, but make the OOM killer always target your primary service/container, using oom-score-adj, and/or memory.oom.group to take out the whole cgroup. This way, you get to cleanly combine your OOM condition handling with the general failure case and can restart everything from a known foundation, instead of trying to soldier on while possibly lacking some piece of support infrastructure that is necessary but usually invisible.
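To make that concrete, here's a minimal sketch of the knobs involved (the service name "myapp" and the cgroup path are assumptions, adjust for your setup; the second line assumes cgroup v2):

# Make the primary service the preferred OOM victim (valid range: -1000..1000)
echo 900 > /proc/$(pidof -s myapp)/oom_score_adj

# With memory.oom.group: if any process in the group gets OOM-killed, take out the whole group
echo 1 > /sys/fs/cgroup/system.slice/myapp.service/memory.oom.group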
There's also cgroup resource controls to separately govern max memory and swap usage. Thanks to systemd and systemd-run, you can easily apply and adjust them on arbitrary processes. The manpages you want are systemd.resource-control and systemd.exec. I haven't found any other equivalent tools that expose these cgroup features to the extent that systemd does.
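For illustration, a rough sketch of what that looks like in practice (the unit name and limit values are made up, not recommendations):

# Run a one-off job in a transient scope with hard memory/swap caps
systemd-run --scope -p MemoryMax=2G -p MemorySwapMax=512M -- ./my-batch-job

# Or adjust a running service on the fly (add --runtime to keep the change non-persistent)
systemctl set-property myapp.service MemoryHigh=1536M MemoryMax=2G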
I really dislike systemd and its monolithic mass of over-engineered, all-encompassing code. So I have to hang a comment here showing just how easy this is to manage in a simple startup script, and how these kernel features are always exposed anyway.
Taken from a SO post:
# Create a cgroup
mkdir /sys/fs/cgroup/memory/my_cgroup
# Add the process to it
echo $PID > /sys/fs/cgroup/memory/my_cgroup/cgroup.procs
# Set the limit to 40MB
echo $((40 * 1024 * 1024)) > /sys/fs/cgroup/memory/my_cgroup/memory.limit_in_bytes
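(For what it's worth, that snippet targets the legacy cgroup v1 hierarchy; on a current kernel with cgroup v2 the same idea would look roughly like this, assuming the default mount at /sys/fs/cgroup and that the memory controller is enabled in the parent's cgroup.subtree_control:)

# Create a cgroup, add the process, cap memory at 40MB
mkdir /sys/fs/cgroup/my_cgroup
echo $PID > /sys/fs/cgroup/my_cgroup/cgroup.procs
echo $((40 * 1024 * 1024)) > /sys/fs/cgroup/my_cgroup/memory.max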
Linux is so beautiful. Unix is. Systemd is like a person with makeup plastered 1" thick all over their face. It detracts, obscures the natural beauty, and is just a lot of work for no reason.
This is a better explanation and fix than others I've seen. There will be differences between desktop and server uses, but misbehaving applications and libraries exist on both.
> the way stuff fails when it runs out of memory is really confusing
Have you checked what your `vm.overcommit_ratio` is? If it's < 100, then you will get OOM kills even if plenty of RAM is free, since the default is 50, i.e. only 50% of RAM can be COMMITTED and no more.
Curious what kind of failures you are alluding to.
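For anyone wanting to check, a quick sketch of inspecting and (temporarily) changing these knobs, with example values only:

# Current overcommit policy and ratio
sysctl vm.overcommit_memory vm.overcommit_ratio
grep -i commit /proc/meminfo        # CommitLimit vs. Committed_AS

# Strict accounting, counting all of RAM toward the commit limit
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=100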
The main scenario that caused me a lot of grief is temporary RAM usage spikes, like a single process run during a build that uses ~8 GB of RAM or more for a mere few seconds and then exits. In some cases the OOM killer was reaping the wrong process, or the build was just failing cryptically, and if I examined stuff like top I wouldn't see any issue, plenty of free RAM. The tooling for examining this historical memory usage is pretty bad; my only option was to look at the OOM killer logs and hope that eventually the culprit would show up.
Thanks for the tip about vm.overcommit_ratio though, I think it's set to the default.
You can get statistics off cgroups to get an idea of what it was (assuming it's a service and not something a user ran), but that requires probing them often enough.
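A minimal sketch of what such probing could look like under cgroup v2 (assumes systemd-managed services with memory accounting enabled; memory.peak needs a reasonably recent kernel):

for cg in /sys/fs/cgroup/system.slice/*.service; do
    printf '%s current=%s peak=%s\n' "$(basename "$cg")" \
        "$(cat "$cg/memory.current")" "$(cat "$cg/memory.peak" 2>/dev/null)"
done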
> At best, there's a "malloc() or die"-like stanza in the source, and that's that.
In fairness, I don't know what else general-purpose software is supposed to do here other than die. It's not like there is a graceful way to handle insufficient memory to run the program.
In theory, a process could just return an error for that specific operation, which would propagate to a "500 internal error" for this one request but not impact other operations. Could even take the hint to free some caches.
But in practice, I agree with you. This is just not worth it: so much work to handle it properly everywhere, and it is really difficult to test every malloc failure.
So that's where an OOM killer might have a better strategy than just letting whichever program happens to allocate memory last be the one to fail.
Let new generations of Free Software orgs come along and supplant GNU with a GBIR (GNU But In Rust), but don't insist on existing, established things that are perfectly good for who and what they are to change into whatever you prefer at any given moment.
I wrote https://johannes.truschnigg.info/writing/2024-07-impending_g... in response to the CrowdStrike fallout, and was tempted to repost it for the recent CloudFlare whoopsie. It's just too bad that publishing rants won't change the darned status quo! :')
People will not do anything until something really disastrous happens. Even afterwards, memories can fade: CrowdStrike has not lost many customers.
Covid is a good parallel. A pandemic was always possible, there is always a reasonable chance of one over the course of decades. However people did not take it seriously until it actually happened.
A lot of Asian countries are a lot better prepared for a tsunami than they were before 2004.
The UK was supposed to have emergency plans for a pandemic, but they were for a flu variant, and I suspect even those plans were under-resourced and not fit for purpose. We are supposed to have plans for a solar storm, but when another Carrington event occurs I very much doubt we will deal with it smoothly.
Very cool project - hoping to see follow-up designs that can do more than 1Gbps per port!
I recently built a fully Layer2-transparent 25Gbps+ capable wireguard-based solution for LR fiber links at work based on Debian with COTS Zen4 machines and a purpose-tailored Linux kernel build - I'd be curious to know what an optimized FPGA can do compared to that.
Yes, Jumbo frames unlock a LOT of additional performance - which is exactly what we have and need on those links. Using a vanilla wg-bench[0] loopback-esque (really veths across network namespaces) setup on the machine, I get slightly more than 15Gbps sustained throughput.
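For the curious, the general shape of such a loopback-style test (this is not the actual wg-bench script; keys, addresses and MTUs below are placeholders, it needs root, bash, wireguard-tools and iperf3, and the exact jumbo MTU math is an assumption):

# Two namespaces joined by a veth pair, jumbo frames on the "wire"
ip netns add wg-a; ip netns add wg-b
ip link add va type veth peer name vb
ip link set va netns wg-a; ip link set vb netns wg-b
ip -n wg-a link set va mtu 9000 up; ip -n wg-a addr add 192.168.99.1/24 dev va
ip -n wg-b link set vb mtu 9000 up; ip -n wg-b addr add 192.168.99.2/24 dev vb

# WireGuard tunnel across the veth link
A_KEY=$(wg genkey); A_PUB=$(echo "$A_KEY" | wg pubkey)
B_KEY=$(wg genkey); B_PUB=$(echo "$B_KEY" | wg pubkey)
ip -n wg-a link add wg0 mtu 8920 type wireguard
ip -n wg-a addr add 10.99.0.1/24 dev wg0
ip netns exec wg-a wg set wg0 private-key <(echo "$A_KEY") listen-port 51820 peer "$B_PUB" allowed-ips 10.99.0.2/32 endpoint 192.168.99.2:51820
ip -n wg-a link set wg0 up
ip -n wg-b link add wg0 mtu 8920 type wireguard
ip -n wg-b addr add 10.99.0.2/24 dev wg0
ip netns exec wg-b wg set wg0 private-key <(echo "$B_KEY") listen-port 51820 peer "$A_PUB" allowed-ips 10.99.0.1/32 endpoint 192.168.99.1:51820
ip -n wg-b link set wg0 up

# Measure throughput through the tunnel
ip netns exec wg-b iperf3 -s -D
ip netns exec wg-a iperf3 -c 10.99.0.2 -t 30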
Just to elaborate for others, MACSec is a standard (IEEE 802.1AE) and runs at line rate. Something like a Juniper PTX10008 can run it at 400Gbps, and it's just a feature you turn on for the port you'd be using for the link you want to protect anyway (PTXs are routers/switches, not security devices).
If I need to provide encryption on a DCI, I’m at least somewhat likely to have gear that can just do this with vendor support instead of needing to slap together some Linux based solution.
Unless, I suppose, there’s various layer 2 domains you’re stitching together with multiple L2 hops and you don’t control the ones in the middle. In which case I’d just get a different link where that isn’t true.
I have at least one switch that's MACSec-compatible at line speed, but I haven't had time to take a look. I guess this is confined to the LAN and can't do a MACSec link over the internet, can it?
Generally it's used when you have links going between two of your sites, so you typically only need it on the switch or router that terminates that link.
I realize this has not much to do with CPU choice per se, but I'm still gonna leave this recommendation here for people who like to build PCs to get stuff done with :) Since I've been able to afford it and the market has had them available, I've been buying desktop systems with proper ECC support.
I've been chasing flimsy but very annoying stability problems (some, of course, due to overclocking during my younger years, when it still had a tangible payoff) enough times on systems I had built that taking this one BIG potential cause out of the equation is worth the few dozens of extra bucks I have to spend on ECC-capable gear many times over.
Trying to validate an ECC-less platform's stability is surprisingly hard, because memtest and friends just aren't very reliable at detecting more subtle problems. Prime95, y-cruncher and Linpack (in increasing order of effectiveness) are better than specialized memory testing software in my experience, but they are not perfect, either.
Most AMD CPUs (but not their APUs with potent iGPUs - there, you will have to buy the "PRO" variants) these days have full support for ECC UDIMMs. If your mainboard vendor also plays ball - annoyingly, only a minority of them enables ECC support in their firmware, so always check for that before buying! - there's not much that can prevent you from having that stability enhancement and reassuring peace of mind.
> only a minority of them enables ECC support in their firmware, so always check for that before buying!
This is the annoying part.
That AMD permits ECC is a truly fantastic situation, but whether it's supported by the motherboard is often a coin toss, and worse: it's not advertised even when it is available.
I have an ASUS PRIME TRX40 PRO, and the tech specs say that it can run ECC and non-ECC memory, but not whether ECC will be available to the operating system - merely that the DIMMs will work.
It's much more hit and miss in reality than it should be, though this motherboard was a pricey one: one can't use price as a proxy for features.
EDAC MC0: Giving out device to module igen6_edac controller Intel_client_SoC MC#0: DEV 0000:00:00.0 (INTERRUPT)
EDAC MC1: Giving out device to module igen6_edac controller Intel_client_SoC MC#1: DEV 0000:00:00.0 (INTERRUPT)
but `dmidecode --type 16` says:
Error Correction Type: None
Error Information Handle: Not Provided
AFAIK, I have 2x DDR5 non-ECC memory (`dmidecode --type 17` says Samsung M425R1GB4BB0-CQKOL). Your command tells about SECDED (single-bit error correction, double-bit error detection).
Usually, if a vendor's spec sheet for a (SOHO/consumer-grade) motherboard mentions ECC-UDIMM explicitly in its memory compatibility section, and (but this is a more recent development afaict) DOES NOT specify something like "operating in non-ECC mode only" at the same time, then you will have proper ECC (and therefore EDAC and RAS) support in Linux, if the kernel version you have can already deal with ECC on your platform in general.
I would assume your particular motherboard to operate with proper SECDED-level ECC if you have capable, compatible DIMMs, enable ECC mode in the firmware, and boot an OS kernel that can make sense of it all.
This is weird. I have used many ASUS MBs specified as "can run ECC and non-ECC" and this has always meant that there was an ECC enabling option in the BIOS settings, and then if the OS had an appropriate EDAC driver for the installed CPU ECC worked fine.
I am writing this message on such an ASUS MB with a Ryzen CPU and working ECC memory. You must check that you actually have a recent enough OS to know your Threadripper CPU and that you have installed any software package required for this (e.g. on Linux "edac-utils" or something with a similar name).
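For reference, a few ways to confirm that ECC reporting is actually live on Linux (a sketch; package names differ per distro - rasdaemon ships ras-mc-ctl, edac-utils ships edac-util):

ras-mc-ctl --status                            # should report that the EDAC drivers are loaded
ras-mc-ctl --errors                            # corrected/uncorrected error log
edac-util -v                                   # per-controller / per-csrow error counters
cat /sys/devices/system/edac/mc/mc*/ce_count   # raw corrected-error counts straight from sysfs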
The big problem with ECC for me is that the sticks are so much more expensive. You'd expect ECC UDIMMs to have a price premium of just over 12.5% (because there are 9 chips instead of 8), but it's usually at least 100%. I don't mind paying reasonable premium for ECC, but paying double is too hard to swallow.
Trouble with enterprise is that the people buying care about the technology, but not the cost, while the people that do care about cost don’t understand the technology.
Some businesses (and governments) try to unify their purchasing, but this seems to make things worse, with the purchasing department both not understanding the technology and being outwitted by vendors.
> Trouble with enterprise is that the people buying care about the technology, but not the cost
Enterprise also ruins it for small/medium businesses, at least those with dedicated internal IT departments who care about both the technology and the cost. We are left with unreliable consumer-grade hardware, or prohibitively expensive enterprise hardware.
There's very little in between. This market is also underserved on the software/SaaS side, with the SSO tax and whatnot. There's a huge gap between "I'm taking the owner's CC down to Best Buy" and "Enterprise" that gets screwed over.
Yeah, with that kind of markup you might as well just buy new ones IF they break, or just spend the extra budget on better quality parts. Just having to pick a very specific motherboard that probably is very much not optimal for your build will blow the costs up even more, and for what gain?
I've been building my own gaming and productivity rigs for 20 years and I don't think memory has ever been a problem. Maybe survivorship bias, but surely even budget parts aren't THIS bad.
Let's say you corrupted one bit in a blender asset 200 revisions ago and it was unnoticeable and still upgraded through five blender upgrades, but now on the sixth upgrade it fails with a corruption error and doesn't upgrade.
Without knowing how to fix that error you've lost 200 revisions of work. You can go back and find which revision had the problem, go before that, and upgrade it to the latest blender, but all your 200 revisions were made on other versions that you can't backport.
So don't upgrade it. Export it to an agnostic format and re-import it in the new version. Since it's failing to upgrade, it must be a metadata issue, not a data issue, so removing the Blender-specific bits will fix it.
What a silly hypothetical. There's a myriad freak occurrences that could make you have to redo work that you don't worry about. Now, I'm not saying single-bit errors don't happen. They just typically don't result in the sort of cascading failure you're describing.
Doing a lossy export/reimport process probably isn't going to be viable on something like a big movie scene blender file with lots of constraints, scripted modifiers and stuff that doesn't automatically come through with an export to USD.
My point is that there are scenarios where corruption in the past puts you in a bind and causes a lot of lost work, or an expensive diagnostic and recovery process, long after it first occurred. Blender was just one example; it can be much worse with proprietary binary formats, where you don't have any chance of jumping into a debugger to figure out what's going wrong with an upgrade or export. And maybe the subscription version won't even let you go back to the old version.
> There's a myriad freak occurrences that could make you have to redo work that you don't worry about.
Yes other sources of corruption are more likely from things like software errors. It's not that you wouldn't worry about them if you had unlimited budget and could have people audit the code etc., but you do have a budget and ECC is much cheaper relative to that. That doesn't mean it always makes sense for everyone to pay more for ECC. But I can see why people working on gigantic CAD files for nuclear reactor design, etc. tend to have workstations with ECC.
>a big movie scene blender file with lots of constraints, scripted modifiers and stuff
Not really what I would call an "asset", but fine.
>It's not that you wouldn't worry about them if you had unlimited budget and could have people audit the code etc.
Hell, I was thinking something way simpler, like your cat climbing on the case and throwing up through the top vents, or you tripping and dropping your ass on your desk and sending everything flying.
>But I can see why people working on gigantic CAD files for nuclear reactor design, etc. tend to have workstations with ECC.
Yeah, because those people aren't buying their own machines. If the credit card is yours and you're not doing something super critical, you're probably better served by a faster processor than by worrying against freak accidents.
>Let's say you corrupted one bit in a blender asset 200 revisions ago and it was unnoticeable and still upgraded through five blender upgrades, but now on the sixth upgrade it fails with a corruption error and doesn't upgrade.
And let's say you have archived copies of it with checksums, like I suggested, going back through all of those revisions.
What's the issue again now, that ECC would have solved? Not to mention that ECC wouldn't help at all with corruption at the disk level anyway.
You would think that competition would naturally regulate the price down, but it seems like we are dealing with some sort of a cartel that regulators have not caught up with yet.
Isn't it mostly a peace-of-mind thing? I've never seen an ECC error on my home server, which has plenty of memory in use and runs longer than my desktop. Maybe it's more common with higher-clocked, near-the-limit desktop PCs.
Also: DDR5 has some false ECC marketing due to the memory standard having an on-die error correction scheme built in. Don't fall for it.
Whether you will see ECC errors depends a lot on how much memory you have and how old it is.
A computer with 64 GB of memory is 4 times more likely to encounter memory errors than one with 16 GB of memory.
When DIMMs are new, at the usual amounts of memory for desktops, you will see at most a few errors per year, sometimes only an error after a few years. With old DIMMs, some of them will start to have frequent errors (such modules presumably had a borderline bad fabrication quality and now have become worn out, e.g. due to increased leakage leading to storing a lower amount of charge on the memory cell capacitors).
For such bad DIMMs, the frequency of errors will increase, and it may reach several errors per day, or even per hour.
For me, a very important advantage of ECC has been the ability to detect such bad memory modules (in computers that have been used for 5 years or more) and replace them before corrupting any precious data.
I also had a case with an HP laptop with ECC, where memory errors had become frequent after it was stored for a long time (more than a year) in a rather humid place. This might have caused some oxidation of the SODIMM socket contacts, because removing the SODIMMs, scrubbing the sockets and reinserting the SODIMMs made the errors disappear.
>A computer with 64 GB of memory is 4 times more likely to encounter memory errors than one with 16 GB of memory.
No. Or well, not exactly. More bits will flip randomly, but if only the total installed memory differs between the two systems, both will see the same amount of memory errors, because bit flips in the additional 48 GB will not result in errors, since that memory is not being used. Memory errors scale with memory used, not with memory installed.
The extra unused memory might even act as shielding against cosmic rays, but the extra electrical load on the memory controller might more than balance that out for unbuffered sticks.
I see a particular ECC error at least weekly on my home desktop system, because one of my DIMMs doesn't like the (out of spec) clock rate that I make it operate at. Looks like this:
94 2025-08-26 01:49:40 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x68aea758, cpuid=0x00a50f00, bank=0x00000012
95 2025-09-01 09:41:50 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x68b80667, cpuid=0x00a50f00, bank=0x00000012
(this is `sudo ras-mc-ctl --errors` output)
It's always the same address, and always a Corrected Error (obviously, otherwise my kernel would panic). However, operating my system's memory at this clock and latency boosts x265 encoding performance (just one of the benchmarks I picked when trying to figure out how to handle this particular tradeoff) by about 12%. That is an improvement I am willing to stomach the extra risk of effectively overclocking the memory module beyond its comfort zone for, given that I can fully mitigate it by virtue of properly working ECC.
"Breaks down" is a strong choice of words for a single, corrected bit error. ECC works as designed, and demonstrates that it does by detecting this re-occurring error. I take the confidence mostly from experience ;)
And no, ECC UDIMMs for the speed (3600MHz) I run mine at simply do not exist - it is outside of what JEDEC ratified for the DDR4 spec.
I would loosen the memory timings a bit and see if that resolves the ECC errors. x265 performance shouldn't fall since it generally benefits more from memory clock rate than latency.
Also, could you share some relevant info about your processor, mainboard, and UEFI? I see many internet commenters question whether their ECC is working (or ask if a particular setup would work), and far fewer that report a successful ECC consumer desktop build. So it would be nice to know some specific product combinations that really work.
There's probably many others with proper ECC support. Vendor spec sheets usually hint at properly working ECC in their firmware if they mention "ECC UDIMM" support specifically.
As for CPUs, that is even easier for AM4: everything that's not based on an APU die supports ECC, while APU-based parts do not - and beware that some SKUs marketed without an iGPU, such as the Ryzen 5 5500, are really APUs with the iGPU part disabled, so they cannot support ECC either. An exception to that rule are the "PRO"-series APUs, such as the Ryzen 5 PRO 5650G et al., which have an iGPU but also support ECC. The main differences (apart from the integrated graphics) between CPU and APU SKUs are that the latter do not support PCIe 4.0 (APUs are limited to PCIe 3.0) and have a few watts lower idle power consumption.
If I were to build an AM5 system today, I would look into mainboards from ASUS for proper ECC support - they seem to have it pretty much universally supported on their gear. (Actual out-of-band ECC with EDAC support on Linux, not the DDR5 "on-DIE" stuff.)
I think you've found a particularly weak memory cell, I would start thinking about replacing that module. The consistent memory_channel=1, csrow=0 pattern confirms it's the same physical location failing predictably.
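If it helps, rasdaemon can usually map that channel/csrow back to a physical slot label (a sketch; how useful the output is depends on the quality of the board's DMI data):

ras-mc-ctl --layout        # DIMM label per memory controller / channel / slot
ras-mc-ctl --error-count   # corrected/uncorrected counts per DIMM label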
I had a somewhat dodgy stick of used RAM (DDR4 UDIMM) in a Supermicro X11 board. This board is running my NAS, all ZFS, so RAM corruption can equal data corruption. The OS alerted me to recoverable errors on DIMM B2. Swapped it and another DIMM, rebooted, saw DIMM error on slot B1. Swapped it for a spare stick. No more errors.
This was running at like, 1866 or something. It's a pretty barebones 8th gen i3 with a beefier chipset, but ECC still came in clutch. I won't buy hardware for server purposes without it.
I saw a corrected memory error logged every few hours when my current machine was new. It seems to have gone away now, so either some burn-in effect, or ECC accidentally got switched off and all my data is now corrupted. Threadripper 7000 series, 4x64GB DDR5.
Edit: it's probably because I switched it to "energy efficiency mode" instead of "performance mode" because it would occasionally lock up in performance mode. Presumably with the same root cause.
I have a slightly older system with 128 GB of UDIMM DDR4 over four sticks. Ran just fine for quite a while but then I started having mysterious system freezes. Later discovered I had somehow disabled ECC error reporting in my system log on linux... once that was turned back on, oh, I see notices of recoverable errors. I finally found a repeatable way to trigger a freeze with a memory stress testing tool and that was from an unrecoverable error. I couldn't narrow the problem down to a single stick or RAM channel, it seemed to only happen if all 4 slots were occupied, but I eventually figured out that if I just lowered the RAM speed from standard 3200 MHz to the next officially supported (by the sticks) step of 2933 MHz, everything was fine again and no more ECC errors, recoverable or not. Been running like that since.
Last winter I was helping someone put together a new gaming machine... it was so frustrating running into the fake ecc marketing for DDR5 that you mention. The motherboard situation for whether they support it or not, or whether a bios update added support or then removed it or added it back or not, was also really sad. And even worse IMO is that you can't actually max out 4 slots on the top tier mobos unless you're willing to accept a huge drop in RAM speed. Leads to ugly 48 GB sized sticks and limiting to two of them... In the end we didn't go with ECC for that someone, but I was pretty disappointed about it. I'm hoping the next gen will be better, for my own setup running ZFS and such I'm not going to give up ECC.
You have to go pretty far down the rabbit hole to make sure you’ve actually got ECC with [LP]DDR5
Some vendors use Hamming codes with "holes" in them, and you need the CPU to also run ECC (or at least error detection) between RAM and the cache hierarchy.
Those things are optional in the spec, because we can’t have nice things.
I pick up old servers for my garage system. With EDAC it is a dream to isolate the fault and be instantly aware. It also lets you determine the severity of the issue: DIMMs can run for years with just the one error, or overnight explode into streams of corrections. I keep spares so it's fairly easy to isolate any faults. It's just how do you want to spend your time?
Excellent point. It's a shame and a travesty that data integrity is still mostly locked away inside servers, leaving most other computing devices effectively toys, the early prototype demo thing but then never finished and sold forever at inflated prices.
I wish AMD would make ECC a properly advertised feature with clear motherboard support. At least DDR5 has some level of ECC.
I wish AMD wouldn't gate APU ECC support behind unobtainium "PRO" SKUs they only give out, seemingly, to your typical "business" OEMs and the rare Chinese miniPC company.
So I'm trying to learn more about this stuff, but aren't there multiple ECC flavors and the AMD consumer CPUs only support one of them (not the one you'd have on servers?)
Does anyone maintain a list with de-facto support of amd chips and mainboards? That partlist site only shows official support IIRC, so it won't give you any results.
The difference between the "unbuffered" ECC DIMMs (ECC UDIMMs), which you must use in desktop motherboards (and in some of those advertised as "workstation" MBs) and the "registered" ECC DIMMs (ECC RDIMMs), which you must use in server motherboards (and in some of the "workstation" MBs), has existed for decades.
However, in the past there existed a very few CPU models and MBs that could accept either kind of DIMM, while today this has become completely impossible, as the mechanical and electrical differences between them have increased.
In any case, today, like also 20 years ago, when searching for ECC DIMMs you must always search only the correct type, e.g. unbuffered ECC DIMMs for desktop CPUs.
In general, registered ECC DIMMs are easier to find, because wherever "server memory" is advertised, that is what is meant. For desktop ECC memory, you must be careful to see both "ECC" and "unbuffered" mentioned in the module description.
Had you been looking for "in-band ECC", the cheap ODROID H4 PLUS ($150) or the cheaper ODROID H4 ($110) would have been fine, or for something more expensive some of the variants of Asus NUC 13 Rugged support in-band ECC.
For out-of-band ECC, e.g. with standard ECC SODIMMs, all the embedded SBCs that I have seen used only CPUs that are very obsolete nowadays, i.e. ancient versions of Intel Xeon or old AMD industrial Ryzen CPUs (AMD's series of industrial Ryzen CPUs are typically at least one or two generations behind their laptop/desktop CPUs).
Moreover all such industrial SBCs with ECC SODIMMs were rather large, i.e. either in the 3.5" form factor or in the NanoITX form factor (120 mm x 120 mm), and it might have been necessary to replace their original coolers with bigger heatsinks for fanless operation.
In-band ECC causes a significant decrease of the performance, but for most applications of such mini-PCs the performance is completely acceptable.
Now where can I get 64GB ECC UDIMM DDR5 modules so that my X870E board can have 256GB RAM? The largest I found were just 48GB ECC UDIMMs or 64GB non-ECC UDIMMs.
In my experience, it's generally unwise to push the platform you're on to the outermost of its spec'd limits. At work, we bought several 5950X-based Zen3 workstations with 128GB of 3200MT/s ECC UDIMM, and two of these boxes will only ever POST when you manually downclock memory to 3000MT/s. Past a certain point, it's silicon lottery deciding if you can make reality live up to the datasheets' promises.
I am fine with downclocking the RAM; my X870E board (ProArt) should be fine running ECC. I only use a 9800X3D to have a single CCD (maybe upgraded later to an EPYC 4585PX), and alongside it an RTX 6000 Pro and 2x NVLinked A6000s in the PCIe slots, with two M.2 SSDs. The power supply follows the latest specs as well. This build was meant to be a lightweight Threadripper replacement, and ECC is a must for my use cases (it's a build for my summer house so that I can do serious work while there).
Any specific recommendations? I am having random, OS-agnostic lockups on my Ryzen 1xxx build and thought DDR5 would be enough, but true ECC sounds good.
edit: Looks like a lot of Asus motherboards work, and the thing to look for is "unbuffered" ECC. Kingston has some, I see 32GB module for $190 on Newegg.
Do you live at a very high altitude with a significant amount of solar radiation, or at an underfunded radiology lab or perhaps near a uranium deposit or a melted down nuclear reactor? Because the average machine should never see a memory bit flip error at all during its entire lifetime.
> Furthermore, research shows that precisely targeted three-bit Rowhammer flips prevents ECC memory from noticing the modifications.
Doesn't exactly sound like a use case for ECC memory, given that it can't correct these attacks. Interesting though, I'd have thought that virtual addresses would've largely fixed this.
I've been in this business for a while now, and I continue to be surprised by the extent of how cloud customers are being milked by cloud platform providers. And, of course, their seemingly limitless tolerance for it.
It is amazing. I just left a discussion where the protagonist is moving a legacy workload to a hyperscaler to avoid some software licensing costs. Re-implemented with cloud in mind, it would probably run $10-15k/year to run. As it stands as a lift and shift, likely something like $250k. The total value of the software licensing is <$30k.
Math isn't mathing, but the salesperson implanted the idea. lol
The article touches upon an important point that applies to all complex (software/computer) and long-lived systems: "Too much outdated and inconsistent documentation (that makes learning the numerous tools needlessly hard.)"
The Debian Wiki is a great resource for many topics, but as with all documentation for very long-running projects - at least those that do big, "finished" releases from time to time - it seems tough to strike a balance between completeness and (temporal) relevance. Sure, in some weird edge case scenario, it might be helpful to know how this-and-that behaved or could be worked around in Debian 6 "Squeeze" in 2014, but information like that piling up also makes the article on the this-and-that subject VERY tedious to sift through if you are only interested in what's recent and relevant to Debian 12 "Bookworm" in 2025.
Most people contributing to documentation efforts (me included) seem very reluctant to throw out existing content in a wiki article, even though an argument could be made that its presence is sometimes objectively unhelpful for solving today's problems.
Maybe it would be worth a shot to fork the wiki for each major release (by copying all content and having a wiki.debian.org/6/ prefix for all things Squeeze, a /7/ for Wheezy, a /12/ for Bookworm, etc.) and encourage people to edit and extend release-specific pages with appropriate information, so readers and editors would have to "time-travel" through the project (and problem/solution) history in a more conscious and hopefully less confusing way, making it easier for editors to prune information that's no longer relevant for release N+1.
I'm very open to learning more about anyone's thoughts on how to solve this well: how to keep documentation in "living documents", editable not only by a small group of contributors (like many projects do with mkdocs et al. as a replacement for an actual wiki), but also keep the "historic baggage" both easily discoverable (for when it's relevant and useful, because that does happen), yet not have it stand in the way of all those who would be confused and obstructed by its presence.
Are you acquainted with the new maintainers guide rather than the wiki?
To be honest I found it an incredibly comprehensive overview of Debian packaging, all the way up to using pbuilder to ensure dependencies and sandboxed builds, onto lintian to assess the quality of the artifacts.
Building complex Debian packages is time consuming with a lot to learn, but to be honest I don't remember having many issues with this guide when I started out.
The guide I linked to used to be linked to from the main docs page I believe - I went to double check and it now has this more recent guide linked instead - it seems equally thorough.
These guides were sufficient for me to learn packaging pretty complex Debian projects, and are linked to from the docs home page on the Debian site. Guess that's all I'm saying.
I'm building a community site DevOptimize.org: The Art of Packaging[0] for this type of content. Would be glad to host with an interested editor(s). Near-term roadmap includes wiki editing, currently in git.