I came here to post this, too :) What the thingino community managed to do with their firmware for these cameras is nothing short of amazing - if you happen to have a compatible camera, you really, really should give it a whirl!
I'd love to but... how? One alternative seems to be a programmer chip that must be purchased and then modified to not fry the camera with 5V. Another is maybe stripping a USB cable and soldering it to the wifi pads on the camera chip?
Neither of these seem like good ideas for someone like me, who is relatively hardware naïve and has small children running around making it hard to concentrate for more than 30 minutes at a time.
The question is genuine. I want to do this but don't actually know by which method.
Yeah, I can see why that is a show-stopper for people. However, the thingino project has people among them who care deeply about ease of installation - so with these security issues discovered in the TP-Link device, chances are an installation method that relies on a vulnerable stock firmware will be provided in time :)
In this case I'm asking specifically about the C200 this article is about. Sorry for not being more clear. From what I understand the C200 does not boot from SD card.
I think Thingino is great. But there are definitely still dragons lurking. I reported a bug last year and mostly forgot about it. Got a response a few months ago to check out a fix related to unexpected memory access.
I generally try not to be a huge Rust cheerleader but seriously. Yikes.
I realize this is mostly tangential to the article, but a word of warning for those who are about to mess with overcommit for the first time: In my experience, the extreme stance of "always do [thing] with overcommit" is just not defensible, because most (yes, also "server") software is just not written under the assumption that being able to deal with allocation failures in a meaningful way is a necessity. At best, there's a "malloc() or die"-like stanza in the source, and that's that.
You can and maybe even should disable overcommit this way when running postgres on the server (and only a minimum of what you would these days call sidecar processes (monitoring and backup agents, etc.) on the same host/kernel), but once you have a typical zoo of stuff using dynamic languages living there, you WILL blow someone's leg off.
I run my development VM with overcommit disabled and the way stuff fails when it runs out of memory is really confusing and mysterious sometimes. It's useful for flushing out issues that would otherwise cause system degradation w/overcommit enabled, so I keep it that way, but yeah... doing it in production with a bunch of different applications running is probably asking for trouble.
The fundamental problem is that your machine is running software from a thousand different projects or libraries just to provide the basic system, and most of them do not handle allocation failure gracefully. If program A allocates too much memory and overcommit is off, that doesn't necessarily mean that A gets an allocation failure. It might also mean that code in library B in background process C gets the failure, and fails in a way that puts the system in a state that's not easily recoverable, and is possibly very different every time it happens.
For cleanly surfacing errors, overcommit=2 is a bad choice. For most servers, it's much better to leave overcommit on, but make the OOM killer always target your primary service/container, using oom-score-adj, and/or memory.oom.group to take out the whole cgroup. This way, you get to cleanly combine your OOM condition handling with the general failure case and can restart everything from a known foundation, instead of trying to soldier on while possibly lacking some piece of support infrastructure that is necessary but usually invisible.
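To make that concrete, here's a minimal sketch of the knobs involved (the service name "myapp" and the cgroup path are assumptions, adjust for your setup; the second line assumes cgroup v2):

# Make the primary service the preferred OOM victim (valid range: -1000..1000)
echo 900 > /proc/$(pidof -s myapp)/oom_score_adj

# With memory.oom.group: if any process in the group gets OOM-killed, take out the whole group
echo 1 > /sys/fs/cgroup/system.slice/myapp.service/memory.oom.group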
There's also cgroup resource controls to separately govern max memory and swap usage. Thanks to systemd and systemd-run, you can easily apply and adjust them on arbitrary processes. The manpages you want are systemd.resource-control and systemd.exec. I haven't found any other equivalent tools that expose these cgroup features to the extent that systemd does.
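For illustration, a rough sketch of what that looks like in practice (the unit name and limit values are made up, not recommendations):

# Run a one-off job in a transient scope with hard memory/swap caps
systemd-run --scope -p MemoryMax=2G -p MemorySwapMax=512M -- ./my-batch-job

# Or adjust a running service on the fly (add --runtime to keep the change non-persistent)
systemctl set-property myapp.service MemoryHigh=1536M MemoryMax=2G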
I really dislike systemd and its monolithic mass of over-engineered, all-encompassing code. So I have to hang a comment here showing just how easy this is to manage in a simple startup script, and how these kernel features are always exposed anyway.
Taken from a SO post:
# Create a cgroup
mkdir /sys/fs/cgroup/memory/my_cgroup
# Add the process to it
echo $PID > /sys/fs/cgroup/memory/my_cgroup/cgroup.procs
# Set the limit to 40MB
echo $((40 * 1024 * 1024)) > /sys/fs/cgroup/memory/my_cgroup/memory.limit_in_bytes
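(For what it's worth, that snippet targets the legacy cgroup v1 hierarchy; on a current kernel with cgroup v2 the same idea would look roughly like this, assuming the default mount at /sys/fs/cgroup and that the memory controller is enabled in the parent's cgroup.subtree_control:)

# Create a cgroup, add the process, cap memory at 40MB
mkdir /sys/fs/cgroup/my_cgroup
echo $PID > /sys/fs/cgroup/my_cgroup/cgroup.procs
echo $((40 * 1024 * 1024)) > /sys/fs/cgroup/my_cgroup/memory.max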
Linux is so beautiful. Unix is. Systemd is like a person with makeup plastered 1" thick all over their face. It detracts, obscures the natural beauty, and is just a lot of work for no reason.
This is a better explanation and fix than others I've seen. There will be differences between desktop and server uses, but misbehaving applications and libraries exist on both.
> the way stuff fails when it runs out of memory is really confusing
Have you checked what your `vm.overcommit_ratio` is? If it's < 100, then you will get OOM kills even if plenty of RAM is free, since the default is 50, i.e. only 50% of RAM can be COMMITTED and no more.
Curious what kind of failures you are alluding to.
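For anyone wanting to check, a quick sketch of inspecting and (temporarily) changing these knobs, with example values only:

# Current overcommit policy and ratio
sysctl vm.overcommit_memory vm.overcommit_ratio
grep -i commit /proc/meminfo        # CommitLimit vs. Committed_AS

# Strict accounting, counting all of RAM toward the commit limit
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=100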
The main scenario that caused me a lot of grief is temporary RAM usage spikes, like a single process run during a build that uses ~8 GB of RAM or more for a mere few seconds and then exits. In some cases the OOM killer was reaping the wrong process, or the build was just failing cryptically, and if I examined stuff like top I wouldn't see any issue, plenty of free RAM. The tooling for examining this historical memory usage is pretty bad; my only option was to look at the OOM killer logs and hope that eventually the culprit would show up.
Thanks for the tip about vm.overcommit_ratio though, I think it's set to the default.
You can get statistics off cgroups to get an idea of what it was (assuming it's a service and not something a user ran), but that requires probing them often enough.
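A minimal sketch of what such probing could look like under cgroup v2 (assumes systemd-managed services with memory accounting enabled; memory.peak needs a reasonably recent kernel):

for cg in /sys/fs/cgroup/system.slice/*.service; do
    printf '%s current=%s peak=%s\n' "$(basename "$cg")" \
        "$(cat "$cg/memory.current")" "$(cat "$cg/memory.peak" 2>/dev/null)"
done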
> At best, there's a "malloc() or die"-like stanza in the source, and that's that.
In fairness, I don't know what else general-purpose software is supposed to do here other than die. It's not like there is a graceful way to handle insufficient memory to run the program.
In theory, a process could just return an error for that specific operation, which would propagate to a "500 internal error" for this one request but not impact other operations. Could even take the hint to free some caches.
But in practice, I agree with you. This is just not worth it: so much work to handle it properly everywhere, and it is really difficult to test every malloc failure.
So that's where an OOM killer might have a better strategy than just letting whichever program happens to allocate memory last be the one to fail.
Let new generations of Free Software orgs come along and supplant GNU with a GBIR (GNU But In Rust), but don't insist on existing, established things that are perfectly good for who and what they are to change into whatever you prefer at any given moment.
I wrote https://johannes.truschnigg.info/writing/2024-07-impending_g... in response to the CrowdStrike fallout, and was tempted to repost it for the recent CloudFlare whoopsie. It's just too bad that publishing rants won't change the darned status quo! :')
People will not do anything until something really disastrous happens. Even afterwards, memories can fade: CrowdStrike has not lost many customers.
Covid is a good parallel. A pandemic was always possible, there is always a reasonable chance of one over the course of decades. However people did not take it seriously until it actually happened.
A lot of Asian countries are a lot better prepared for a tsunami than they were before 2004.
The UK was supposed to have emergency plans for a pandemic, but they were for a flu variant, and I suspect even those plans were under-resourced and not fit for purpose. We are supposed to have plans for a solar storm, but when another Carrington event occurs I very much doubt we will deal with it smoothly.
Very cool project - hoping to see follow-up designs that can do more than 1Gbps per port!
I recently built a fully Layer2-transparent 25Gbps+ capable wireguard-based solution for LR fiber links at work based on Debian with COTS Zen4 machines and a purpose-tailored Linux kernel build - I'd be curious to know what an optimized FPGA can do compared to that.
Yes, Jumbo frames unlock a LOT of additional performance - which is exactly what we have and need on those links. Using a vanilla wg-bench[0] loopback-esque (really veths across network namespaces) setup on the machine, I get slightly more than 15Gbps sustained throughput.
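For the curious, the general shape of such a loopback-style test (this is not the actual wg-bench script; keys, addresses and MTUs below are placeholders, it needs root, bash, wireguard-tools and iperf3, and the exact jumbo MTU math is an assumption):

# Two namespaces joined by a veth pair, jumbo frames on the "wire"
ip netns add wg-a; ip netns add wg-b
ip link add va type veth peer name vb
ip link set va netns wg-a; ip link set vb netns wg-b
ip -n wg-a link set va mtu 9000 up; ip -n wg-a addr add 192.168.99.1/24 dev va
ip -n wg-b link set vb mtu 9000 up; ip -n wg-b addr add 192.168.99.2/24 dev vb

# WireGuard tunnel across the veth link
A_KEY=$(wg genkey); A_PUB=$(echo "$A_KEY" | wg pubkey)
B_KEY=$(wg genkey); B_PUB=$(echo "$B_KEY" | wg pubkey)
ip -n wg-a link add wg0 mtu 8920 type wireguard
ip -n wg-a addr add 10.99.0.1/24 dev wg0
ip netns exec wg-a wg set wg0 private-key <(echo "$A_KEY") listen-port 51820 peer "$B_PUB" allowed-ips 10.99.0.2/32 endpoint 192.168.99.2:51820
ip -n wg-a link set wg0 up
ip -n wg-b link add wg0 mtu 8920 type wireguard
ip -n wg-b addr add 10.99.0.2/24 dev wg0
ip netns exec wg-b wg set wg0 private-key <(echo "$B_KEY") listen-port 51820 peer "$A_PUB" allowed-ips 10.99.0.1/32 endpoint 192.168.99.1:51820
ip -n wg-b link set wg0 up

# Measure throughput through the tunnel
ip netns exec wg-b iperf3 -s -D
ip netns exec wg-a iperf3 -c 10.99.0.2 -t 30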
Just to elaborate for others, MACSec is a standard (IEEE 802.1AE) and runs at line rate. Something like a Juniper PTX10008 can run it at 400Gbps, and it's just a feature you turn on for the port you'd be using for the link you want to protect anyway (PTXs are routers/switches, not security devices).
If I need to provide encryption on a DCI, I’m at least somewhat likely to have gear that can just do this with vendor support instead of needing to slap together some Linux based solution.
Unless, I suppose, there’s various layer 2 domains you’re stitching together with multiple L2 hops and you don’t control the ones in the middle. In which case I’d just get a different link where that isn’t true.
I have at least one switch that's MACSec-compatible at line speed, but I haven't had time to take a look. I guess this is confined to the LAN and can't do a MACSec link over the internet, can it?
Generally it's used when you have links going between two of your sites, so you typically only need it on the switch or router that terminates that link.
I realize this has not much to do with CPU choice per se, but I'm still gonna leave this recommendation here for people who like to build PCs to get stuff done with :) Since I've been able to afford it and the market has had them available, I've been buying desktop systems with proper ECC support.
I've been chasing flimsy but very annoying stability problems (some, of course, due to overclocking during my younger years, when it still had a tangible payoff) enough times on systems I had built that taking this one BIG potential cause out of the equation is worth the few dozens of extra bucks I have to spend on ECC-capable gear many times over.
Trying to validate an ECC-less platform's stability is surprisingly hard, because memtest and friends just aren't very reliable at detecting more subtle problems. Prime95, y-cruncher and Linpack (in increasing order of effectiveness) are better than specialized memory testing software in my experience, but they are not perfect, either.
Most AMD CPUs (but not their APUs with potent iGPUs - there, you will have to buy the "PRO" variants) these days have full support for ECC UDIMMs. If your mainboard vendor also plays ball - annoyingly, only a minority of them enables ECC support in their firmware, so always check for that before buying! - there's not much that can prevent you from having that stability enhancement and reassuring peace of mind.
> only a minority of them enables ECC support in their firmware, so always check for that before buying!
This is the annoying part.
That AMD permits ECC is a truly fantastic situation, but whether it's supported by the motherboard is often a coin toss, and worse: it's not advertised even when it is available.
I have an ASUS PRIME TRX40 PRO, and the tech specs say that it can run ECC and non-ECC memory, but not whether ECC will be available to the operating system - merely that the DIMMs will work.
It's much more hit and miss in reality than it should be, though this motherboard was a pricey one: one can't use price as a proxy for features.
EDAC MC0: Giving out device to module igen6_edac controller Intel_client_SoC MC#0: DEV 0000:00:00.0 (INTERRUPT)
EDAC MC1: Giving out device to module igen6_edac controller Intel_client_SoC MC#1: DEV 0000:00:00.0 (INTERRUPT)
but `dmidecode --type 16` says:
Error Correction Type: None
Error Information Handle: Not Provided
AFAIK, I have 2x DDR5 non-ECC memory (`dmidecode --type 17` says Samsung M425R1GB4BB0-CQKOL). Your command tells about SECDED (single-bit error correction, double-bit error detection).
Usually, if a vendor's spec sheet for a (SOHO/consumer-grade) motherboard mentions ECC-UDIMM explicitly in its memory compatibility section, and (but this is a more recent development afaict) DOES NOT specify something like "operating in non-ECC mode only" at the same time, then you will have proper ECC (and therefore EDAC and RAS) support in Linux, if the kernel version you have can already deal with ECC on your platform in general.
I would assume your particular motherboard to operate with proper SECDED-level ECC if you have capable, compatible DIMMs, enable ECC mode in the firmware, and boot an OS kernel that can make sense of it all.
This is weird. I have used many ASUS MBs specified as "can run ECC and non-ECC" and this has always meant that there was an ECC enabling option in the BIOS settings, and then if the OS had an appropriate EDAC driver for the installed CPU ECC worked fine.
I am writing this message on such an ASUS MB with a Ryzen CPU and working ECC memory. You must check that you actually have a recent enough OS to know your Threadripper CPU and that you have installed any software package required for this (e.g. on Linux "edac-utils" or something with a similar name).
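For reference, a few ways to confirm that ECC reporting is actually live on Linux (a sketch; package names differ per distro - rasdaemon ships ras-mc-ctl, edac-utils ships edac-util):

ras-mc-ctl --status                            # should report that the EDAC drivers are loaded
ras-mc-ctl --errors                            # corrected/uncorrected error log
edac-util -v                                   # per-controller / per-csrow error counters
cat /sys/devices/system/edac/mc/mc*/ce_count   # raw corrected-error counts straight from sysfs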
The big problem with ECC for me is that the sticks are so much more expensive. You'd expect ECC UDIMMs to have a price premium of just over 12.5% (because there are 9 chips instead of 8), but it's usually at least 100%. I don't mind paying reasonable premium for ECC, but paying double is too hard to swallow.
Trouble with enterprise is that the people buying care about the technology, but not the cost, while the people that do care about cost don’t understand the technology.
Some businesses (and governments) try to unify their purchasing, but this seems to make things worse, with the purchasing department both not understanding the technology and being outwitted by vendors.
> Trouble with enterprise is that the people buying care about the technology, but not the cost
Enterprise also ruins it for small/medium businesses, at least those with dedicated internal IT departments who care about both the technology and the cost. We are left with unreliable consumer-grade hardware, or prohibitively expensive enterprise hardware.
There's very little in between. This market is also underserved on the software/SaaS side, with the SSO tax and whatnot. There's a huge gap between "I'm taking the owner's CC down to Best Buy" and "Enterprise" that gets screwed over.
Yeah, with that kind of markup you might as well just buy new ones IF they break, or just spend the extra budget on better quality parts. Just having to pick a very specific motherboard that probably is very much not optimal for your build will blow the costs up even more, and for what gain?
I've been building my own gaming and productivity rigs for 20 years and I don't think memory has ever been a problem. Maybe survivorship bias, but surely even budget parts aren't THIS bad.
Let's say you corrupted one bit in a blender asset 200 revisions ago and it was unnoticeable and still upgraded through five blender upgrades, but now on the sixth upgrade it fails with a corruption error and doesn't upgrade.
Without knowing how to fix that error you've lost 200 revisions of work. You can go back and find which revision had the problem, go before that, and upgrade it to the latest blender, but all your 200 revisions were made on other versions that you can't backport.
So don't upgrade it. Export it to an agnostic format and re-import it in the new version. Since it's failing to upgrade, it must be a metadata issue, not a data issue, so removing the Blender-specific bits will fix it.
What a silly hypothetical. There's a myriad freak occurrences that could make you have to redo work that you don't worry about. Now, I'm not saying single-bit errors don't happen. They just typically don't result in the sort of cascading failure you're describing.
Doing a lossy export/reimport process probably isn't going to be viable on something like a big movie scene blender file with lots of constraints, scripted modifiers and stuff that doesn't automatically come through with an export to USD.
My point is that there are scenarios where corruption in the past puts you in a bind and causes a lot of lost work, or an expensive diagnostic and recovery process, long after it first occurred. Blender was just one example; it can be much worse with proprietary binary formats, where you don't have any chance of jumping into a debugger to figure out what's going wrong with an upgrade or export. And maybe the subscription version won't even let you go back to the old version.
> There's a myriad freak occurrences that could make you have to redo work that you don't worry about.
Yes other sources of corruption are more likely from things like software errors. It's not that you wouldn't worry about them if you had unlimited budget and could have people audit the code etc., but you do have a budget and ECC is much cheaper relative to that. That doesn't mean it always makes sense for everyone to pay more for ECC. But I can see why people working on gigantic CAD files for nuclear reactor design, etc. tend to have workstations with ECC.
>a big movie scene blender file with lots of constraints, scripted modifiers and stuff
Not really what I would call an "asset", but fine.
>It's not that you wouldn't worry about them if you had unlimited budget and could have people audit the code etc.
Hell, I was thinking something way simpler, like your cat climbing on the case and throwing up through the top vents, or you tripping and dropping your ass on your desk and sending everything flying.
>But I can see why people working on gigantic CAD files for nuclear reactor design, etc. tend to have workstations with ECC.
Yeah, because those people aren't buying their own machines. If the credit card is yours and you're not doing something super critical, you're probably better served by a faster processor than by worrying against freak accidents.
>Let's say you corrupted one bit in a blender asset 200 revisions ago and it was unnoticeable and still upgraded through five blender upgrades, but now on the sixth upgrade it fails with a corruption error and doesn't upgrade.
And let's say you have archived copies of it with checksums, like I suggested, going back through all of those revisions.
What's the issue again now, that ECC would have solved? Not to mention that ECC wouldn't help at all with corruption at the disk level anyway.
You would think that competition would naturally regulate the price down, but it seems like we are dealing with some sort of a cartel that regulators have not caught up with yet.
Isn't it mostly a peace-of-mind thing? I've never seen an ECC error on my home server, which has plenty of memory in use and runs longer than my desktop. Maybe it's more common with higher-clocked, near-the-limit desktop PCs.
Also: DDR5 has some false ECC marketing due to the memory standard having an on-die error correction scheme built in. Don't fall for it.
Whether you will see ECC errors depends a lot on how much memory you have and how old it is.
A computer with 64 GB of memory is 4 times more likely to encounter memory errors than one with 16 GB of memory.
When DIMMs are new, at the usual amounts of memory for desktops, you will see at most a few errors per year, sometimes only an error after a few years. With old DIMMs, some of them will start to have frequent errors (such modules presumably had a borderline bad fabrication quality and now have become worn out, e.g. due to increased leakage leading to storing a lower amount of charge on the memory cell capacitors).
For such bad DIMMs, the frequency of errors will increase, and it may reach several errors per day, or even per hour.
For me, a very important advantage of ECC has been the ability to detect such bad memory modules (in computers that have been used for 5 years or more) and replace them before corrupting any precious data.
I also had a case with an HP laptop with ECC, where memory errors had become frequent after it was stored for a long time (more than a year) in a rather humid place. This might have caused some oxidation of the SODIMM socket contacts, because removing the SODIMMs, scrubbing the sockets and reinserting the SODIMMs made the errors disappear.
>A computer with 64 GB of memory is 4 times more likely to encounter memory errors than one with 16 GB of memory.
No. Or well, not exactly. More bits will flip randomly, but if only the total installed memory differs between the two systems, both will see the same amount of memory errors, because bit flips in the additional 48 GB will not result in errors, since that memory is not being used. Memory errors scale with memory used, not with memory installed.
The extra unused memory might even act as shielding against cosmic rays, but the extra electrical load on the memory controller might more than balance that out for unbuffered sticks.
I see a particular ECC error at least weekly on my home desktop system, because one of my DIMMs doesn't like the (out of spec) clock rate that I make it operate at. Looks like this:
94 2025-08-26 01:49:40 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x68aea758, cpuid=0x00a50f00, bank=0x00000012
95 2025-09-01 09:41:50 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x68b80667, cpuid=0x00a50f00, bank=0x00000012
(this is `sudo ras-mc-ctl --errors` output)
It's always the same address, and always a Corrected Error (obviously, otherwise my kernel would panic). However, operating my system's memory at this clock and latency boosts x265 encoding performance (just one of the benchmarks I picked when trying to figure out how to handle this particular tradeoff) by about 12%. That is an improvement I am willing to stomach the extra risk of effectively overclocking the memory module beyond its comfort zone for, given that I can fully mitigate it by virtue of properly working ECC.
"Breaks down" is a strong choice of words for a single, corrected bit error. ECC works as designed, and demonstrates that it does by detecting this re-occurring error. I take the confidence mostly from experience ;)
And no, ECC UDIMMs for the speed (3600MHz) I run mine at simply do not exist - it is outside of what JEDEC ratified for the DDR4 spec.
I would loosen the memory timings a bit and see if that resolves the ECC errors. x265 performance shouldn't fall since it generally benefits more from memory clock rate than latency.
Also, could you share some relevant info about your processor, mainboard, and UEFI? I see many internet commenters question whether their ECC is working (or ask if a particular setup would work), and far fewer that report a successful ECC consumer desktop build. So it would be nice to know some specific product combinations that really work.
There's probably many others with proper ECC support. Vendor spec sheets usually hint at properly working ECC in their firmware if they mention "ECC UDIMM" support specifically.
As for CPUs, that is even easier for AM4: everything that's not based on an APU die supports ECC, while APU-based parts do not - and beware that some SKUs marketed without an iGPU, such as the Ryzen 5 5500, are really APUs with the iGPU part disabled, so they cannot support ECC either. An exception to that rule are the "PRO"-series APUs, such as the Ryzen 5 PRO 5650G et al., which have an iGPU but also support ECC. The main differences (apart from the integrated graphics) between CPU and APU SKUs are that the latter do not support PCIe 4.0 (APUs are limited to PCIe 3.0) and have a few watts lower idle power consumption.
If I were to build an AM5 system today, I would look into mainboards from ASUS for proper ECC support - they seem to have it pretty much universally supported on their gear. (Actual out-of-band ECC with EDAC support on Linux, not the DDR5 "on-DIE" stuff.)
I think you've found a particularly weak memory cell, I would start thinking about replacing that module. The consistent memory_channel=1, csrow=0 pattern confirms it's the same physical location failing predictably.
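If it helps, rasdaemon can usually map that channel/csrow back to a physical slot label (a sketch; how useful the output is depends on the quality of the board's DMI data):

ras-mc-ctl --layout        # DIMM label per memory controller / channel / slot
ras-mc-ctl --error-count   # corrected/uncorrected counts per DIMM label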
I had a somewhat dodgy stick of used RAM (DDR4 UDIMM) in a Supermicro X11 board. This board is running my NAS, all ZFS, so RAM corruption can equal data corruption. The OS alerted me to recoverable errors on DIMM B2. Swapped it and another DIMM, rebooted, saw DIMM error on slot B1. Swapped it for a spare stick. No more errors.
This was running at like, 1866 or something. It's a pretty barebones 8th gen i3 with a beefier chipset, but ECC still came in clutch. I won't buy hardware for server purposes without it.
I saw a corrected memory error logged every few hours when my current machine was new. It seems to have gone away now, so either some burn-in effect, or ECC accidentally got switched off and all my data is now corrupted. Threadripper 7000 series, 4x64GB DDR5.
Edit: it's probably because I switched it to "energy efficiency mode" instead of "performance mode" because it would occasionally lock up in performance mode. Presumably with the same root cause.
I have a slightly older system with 128 GB of UDIMM DDR4 over four sticks. Ran just fine for quite a while but then I started having mysterious system freezes. Later discovered I had somehow disabled ECC error reporting in my system log on linux... once that was turned back on, oh, I see notices of recoverable errors. I finally found a repeatable way to trigger a freeze with a memory stress testing tool and that was from an unrecoverable error. I couldn't narrow the problem down to a single stick or RAM channel, it seemed to only happen if all 4 slots were occupied, but I eventually figured out that if I just lowered the RAM speed from standard 3200 MHz to the next officially supported (by the sticks) step of 2933 MHz, everything was fine again and no more ECC errors, recoverable or not. Been running like that since.
Last winter I was helping someone put together a new gaming machine... it was so frustrating running into the fake ecc marketing for DDR5 that you mention. The motherboard situation for whether they support it or not, or whether a bios update added support or then removed it or added it back or not, was also really sad. And even worse IMO is that you can't actually max out 4 slots on the top tier mobos unless you're willing to accept a huge drop in RAM speed. Leads to ugly 48 GB sized sticks and limiting to two of them... In the end we didn't go with ECC for that someone, but I was pretty disappointed about it. I'm hoping the next gen will be better, for my own setup running ZFS and such I'm not going to give up ECC.
You have to go pretty far down the rabbit hole to make sure you’ve actually got ECC with [LP]DDR5
Some vendors use Hamming codes with "holes" in them, and you need the CPU to also run ECC (or at least error detection) between RAM and the cache hierarchy.
Those things are optional in the spec, because we can’t have nice things.
I pick up old servers for my garage system. With EDAC it is a dream to isolate the fault and be instantly aware. It also lets you determine the severity of the issue: DIMMs can run for years with just the one error, or overnight explode into streams of corrections. I keep spares so it's fairly easy to isolate any faults. It's just how do you want to spend your time?
Excellent point. It's a shame and a travesty that data integrity is still mostly locked away inside servers, leaving most other computing devices effectively toys, the early prototype demo thing but then never finished and sold forever at inflated prices.
I wish AMD would make ECC a properly advertised feature with clear motherboard support. At least DDR5 has some level of ECC.
I wish AMD wouldn't gate APU ECC support behind unobtainium "PRO" SKUs they only give out, seemingly, to your typical "business" OEMs and the rare Chinese miniPC company.
So I'm trying to learn more about this stuff, but aren't there multiple ECC flavors and the AMD consumer CPUs only support one of them (not the one you'd have on servers?)
Does anyone maintain a list with de-facto support of amd chips and mainboards? That partlist site only shows official support IIRC, so it won't give you any results.
The difference between the "unbuffered" ECC DIMMs (ECC UDIMMs), which you must use in desktop motherboards (and in some of those advertised as "workstation" MBs) and the "registered" ECC DIMMs (ECC RDIMMs), which you must use in server motherboards (and in some of the "workstation" MBs), has existed for decades.
However, in the past there existed a very few CPU models and MBs that could accept either kind of DIMM, while today this has become completely impossible, as the mechanical and electrical differences between them have increased.
In any case, today, like also 20 years ago, when searching for ECC DIMMs you must always search only the correct type, e.g. unbuffered ECC DIMMs for desktop CPUs.
In general, registered ECC DIMMs are easier to find, because wherever "server memory" is advertised, that is what is meant. For desktop ECC memory, you must be careful to see both "ECC" and "unbuffered" mentioned in the module description.
Had you been looking for "in-band ECC", the cheap ODROID H4 PLUS ($150) or the cheaper ODROID H4 ($110) would have been fine, or for something more expensive some of the variants of Asus NUC 13 Rugged support in-band ECC.
For out-of-band ECC, e.g. with standard ECC SODIMMs, all the embedded SBCs that I have seen used only CPUs that are very obsolete nowadays, i.e. ancient versions of Intel Xeon or old AMD industrial Ryzen CPUs (AMD's series of industrial Ryzen CPUs are typically at least one or two generations behind their laptop/desktop CPUs).
Moreover all such industrial SBCs with ECC SODIMMs were rather large, i.e. either in the 3.5" form factor or in the NanoITX form factor (120 mm x 120 mm), and it might have been necessary to replace their original coolers with bigger heatsinks for fanless operation.
In-band ECC causes a significant decrease of the performance, but for most applications of such mini-PCs the performance is completely acceptable.
Now where can I get 64GB ECC UDIMM DDR5 modules so that my X870E board can have 256GB RAM? The largest I found were just 48GB ECC UDIMMs or 64GB non-ECC UDIMMs.
In my experience, it's generally unwise to push the platform you're on to the outermost of its spec'd limits. At work, we bought several 5950X-based Zen3 workstations with 128GB of 3200MT/s ECC UDIMM, and two of these boxes will only ever POST when you manually downclock memory to 3000MT/s. Past a certain point, it's silicon lottery deciding if you can make reality live up to the datasheets' promises.
I am fine with downclocking the RAM; my X870E board (ProArt) should be fine running ECC. I only use a 9800X3D to have a single CCD (maybe upgraded later to an EPYC 4585PX), and alongside it an RTX 6000 Pro and 2x NVLinked A6000s in the PCIe slots, with two M.2 SSDs. The power supply follows the latest specs as well. This build was meant to be a lightweight Threadripper replacement, and ECC is a must for my use cases (it's a build for my summer house so that I can do serious work while there).
Any specific recommendations? I am having random, OS-agnostic lockups on my Ryzen 1xxx build and thought DDR5 would be enough, but true ECC sounds good.
edit: Looks like a lot of Asus motherboards work, and the thing to look for is "unbuffered" ECC. Kingston has some, I see 32GB module for $190 on Newegg.
Do you live at a very high altitude with a significant amount of solar radiation, or at an underfunded radiology lab or perhaps near a uranium deposit or a melted down nuclear reactor? Because the average machine should never see a memory bit flip error at all during its entire lifetime.
> Furthermore, research shows that precisely targeted three-bit Rowhammer flips prevents ECC memory from noticing the modifications.
Doesn't exactly sound like a use case for ECC memory, given that it can't correct these attacks. Interesting though, I'd have thought that virtual addresses would've largely fixed this.
I've been in this business for a while now, and I continue to be surprised by the extent of how cloud customers are being milked by cloud platform providers. And, of course, their seemingly limitless tolerance for it.
It is amazing. I just left a discussion where the protagonist is moving a legacy workload to a hyperscaler to avoid some software licensing costs. Re-implemented with cloud in mind, it would probably run $10-15k/year to run. As it stands as a lift and shift, likely something like $250k. The total value of the software licensing is <$30k.
Math isn't mathing, but the salesperson implanted the idea. lol
The article touches upon an important point that applies to all complex (software/computer) and long-lived systems: "Too much outdated and inconsistent documentation (that makes learning the numerous tools needlessly hard.)"
The Debian Wiki is a great resource for many topics, but as with all documentation for very long-running projects - at least those that do big, "finished" releases from time to time - it seems tough to strike a balance between completeness and (temporal) relevance. Sure, in some weird edge case scenario, it might be helpful to know how this-and-that behaved or could be worked around in Debian 6 "Squeeze" in 2014, but information like that piling up also makes the article on the this-and-that subject VERY tedious to sift through if you are only interested in what's recent and relevant to Debian 12 "Bookworm" in 2025.
Most people contributing to documentation efforts (me included) seem very reluctant to throw out existing content in a wiki article, even though an argument could be made that its presence is sometimes objectively unhelpful for solving today's problems.
Maybe it would be worth a shot to fork the wiki for each major release (by copying all content and having a wiki.debian.org/6/ prefix for all things Squeeze, a /7/ for Wheezy, a /12/ for Bookworm, etc.) and encourage people to edit and extend release-specific pages with appropriate information, so readers and editors would have to "time-travel" through the project (and problem/solution) history in a more conscious and hopefully less confusing way, making it easier for editors to prune information that's no longer relevant for release N+1.
I'm very open to learning more about anyone's thoughts on how to solve this well: how to keep documentation in "living documents", editable not only by a small group of contributors (like many projects do with mkdocs et al. as a replacement for an actual wiki), but also keep the "historic baggage" both easily discoverable (for when it's relevant and useful, because that does happen), yet not have it stand in the way of all those who would be confused and obstructed by its presence.
Are you acquainted with the new maintainers guide rather than the wiki?
To be honest I found it an incredibly comprehensive overview of Debian packaging, all the way up to using pbuilder to ensure dependencies and sandboxed builds, onto lintian to assess the quality of the artifacts.
Building complex Debian packages is time consuming with a lot to learn, but to be honest I don't remember having many issues with this guide when I started out.
The guide I linked to used to be linked to from the main docs page I believe - I went to double check and it now has this more recent guide linked instead - it seems equally thorough.
These guides were sufficient for me to learn packaging pretty complex Debian projects, and are linked to from the docs home page on the Debian site. Guess that's all I'm saying.
I'm building a community site DevOptimize.org: The Art of Packaging[0] for this type of content. Would be glad to host with an interested editor(s). Near-term roadmap includes wiki editing, currently in git.