
My friend mentioned this just before I published, and I think it's probably the fastest, largest thing you can get that would in some sense count as one machine. I haven't looked into it, but I wouldn't be surprised if they could get around the trickiest constraint, which is how many hard drives you can plug in to a non-mainframe machine for historical image storage. Definitely more expensive than just networking a few standard machines, though.

I also bet that mainframes have software solutions to a lot of the multi-tenancy and fault tolerance challenges with running systems on one machine that I mention.



> which is how many hard drives you can plug in to a non-mainframe machine for historical image storage.

You would be surprised. First off, SSDs are denser than hard drives now if you're willing to spend $$$.

Second, "plug in" doesn't necessarily mean "in the chassis". You can expand storage with external disk arrays in all sorts of ways. Everything from external PCI-e cages to SAS disk arrays, fibre channel, NVMe-over-Ethernet, etc...

It's fairly easy to get several petabytes of fast storage directly managed by one box. The only limit is the total usable PCIe bandwidth of the CPUs, which for current-gen EPYC 9004 series processors in a dual-socket configuration is something crazy like 512 GB/s. This vastly exceeds typical NIC speeds: you'd have to balance the available bandwidth between multiple 400 Gbps NICs and the disks to be able to saturate the system.
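
Rough back-of-the-envelope on that figure (the lane count and per-lane rate are my assumptions, not vendor specs):

    # Sanity check of the ~512 GB/s figure above. Assumptions: ~128 usable
    # PCIe 5.0 lanes across the 2-socket box, 32 GT/s per lane with
    # 128b/130b encoding.
    lanes = 128
    gbps_per_lane = 32e9 * (128 / 130) / 8 / 1e9   # ~3.94 GB/s per lane, per direction

    total = lanes * gbps_per_lane                  # ~504 GB/s
    nic = 400 / 8                                  # one 400 Gbps NIC = 50 GB/s

    print(f"~{total:.0f} GB/s of PCIe; enough to feed ~{total / nic:.0f} 400GbE NICs at line rate")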

People really overestimate the data volume put out by a service like Twitter while simultaneously underestimating the bandwidth capability of a single server.


> People really overestimate the data volume put out by a service like Twitter while simultaneously underestimating the bandwidth capability of a single server.

It's outright comical. Above we have people thinking that somehow the number of TLS connections a single server can handle is the problem, in a service where there would be hundreds of thousands of lines of code generating the content served over those connections, all while using numbers from what looks like 10+ year old server hardware.


That's really cool! I estimate each year of historical images at 2.8PB, so it would need to scale quite far to handle multiple years. How would you actually connect all those external drive chassis? Is there some kind of chainable SAS or PCIe that can scale arbitrarily far? I consider NVMe-over-fabrics to be cheating, just using multiple machines and calling it one machine, but "one machine" is kind of an arbitrary stunt metric anyway.


It depends on how you think of "one machine". :) You can fit 1PB in 1U without anything like NVMe-over-fabrics, so a 4U unit gives you plenty of room.

We have Zen4c 128 Core with DDR5 now. We might get a 256 Core Zen6c with PCI-E 6.0 and DDR6 by 2026.

I really like these exercises in shrinking the number of servers needed, especially for web workloads. And the mention of mainframes, which don't get enough credit. I did something similar with Netflix's 800Gbps post [1], where they could serve every single user with fewer than 50 racks by the end of this decade.

[1] https://news.ycombinator.com/item?id=33451430


Stuff like [0] exists, allowing you to fan out a single server's PCIe to quite a few PCIe JBOD chassis. Considering that SSDs can get you ~1PB in 1U these days, you can get pretty far while still technically sticking with PCIe connectivity rather than NVMeoF.

[0] https://www.liqid.com/products/liqid-elements/liqid-48-port-...
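
For a rough sense of scale, using the ~2.8PB/year figure from upthread and assuming ~1PB per 1U chassis with one switch port each (my numbers, not Liqid's):

    import math

    pb_per_year = 2.8        # image growth figure from upthread
    pb_per_chassis = 1.0     # ~1PB per 1U NVMe chassis (assumption)
    switch_ports = 48        # ports on the linked fanout switch

    chassis_per_year = math.ceil(pb_per_year / pb_per_chassis)   # 3 chassis = 3U per year
    years_per_switch = switch_ports // chassis_per_year          # ~16 years behind one switch

    print(chassis_per_year, "chassis/year;", years_per_switch, "years of images off one switch")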


I was skeptical about 1PB in 1U, but searched and learned this was showcased by Supermicro as far back as 2018:

https://www.supermicro.com/en/pressreleases/supermicro-unlea...


> I consider NVMe-over-fabrics to be cheating

Is an InfiniBand switch connected to a bunch of machines that expose NVMe targets really that different from a SAS expander connected to a bunch of JBOD enclosures? The only difference is that the former can scale beyond 256 drives per controller and fill an entire data center. You're still doing all the compute on one machine, so I think it still counts.


It's a neat thought exercise, but wrong for so many reasons (there are probably like 100s). Some jump out: spam/abuse detection, ad relevance, open graph web previews, promoted tweets that don't appear in author timelines, blocks/mutes, etc. This program is what people think Twitter is, but there's a lot more to it.

I think every big internet service uses user-space networking where required, so that part isn't new.


I think I'm pretty careful to say that this is a simplified version of Twitter. Of the features you list:

- spam detection: I agree this is a reasonably core feature and a good point. I think you could fit something here but you'd have to architect your entire spam detection approach around being able to fit, which is a pretty tricky constraint and probably would make it perform worse than a less constrained solution. Similar to ML timelines.

- ad relevance: Not a core feature if your costs are low enough. But see the ML estimates for how much throughput A100s have at dot producting ML embeddings.

- web previews: I'd do this by making it the client's responsibility (see the sketch at the end of this comment). You'd lose trustworthiness though: users with hacked clients could make troll web previews. They can already do that for a site they control, but not for a general site.

- blocks/mutes: Not a concern for the main timeline except when using ML; when looking at replies you'd need to fetch blocks/mutes and filter. Whether this costs too much depends on how frequently people look at replies.

I'm fully aware that real Twitter has bajillions of features that I don't investigate, and you couldn't fit all of them on one machine. Many of them make up such a small fraction of load that you could still fit them. Others do indeed pose challenges, but ones similar to features I'd already discussed.
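
For the web-previews point above, here's a minimal sketch of what client-side preview generation could look like, pulling OpenGraph tags on the client. The approach and all names are just an illustration, not how Twitter actually does previews:

    # Sketch: the client (not the server) fetches the linked page and pulls
    # OpenGraph tags. Illustration only; a real client would also need
    # redirect caps, caching, and stricter limits to avoid the DoS/link-swap
    # problems raised in the replies below.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class OGParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.og = {}

        def handle_starttag(self, tag, attrs):
            if tag != "meta":
                return
            a = dict(attrs)
            prop = a.get("property") or ""
            if prop.startswith("og:") and a.get("content"):
                self.og[prop] = a["content"]

    def fetch_preview(url, max_bytes=256 * 1024):
        # cap the read so a hostile page can't make the client download gigabytes
        with urlopen(url, timeout=5) as resp:
            html = resp.read(max_bytes).decode("utf-8", errors="replace")
        p = OGParser()
        p.feed(html)
        return {k: p.og.get(f"og:{k}") for k in ("title", "description", "image")}

    print(fetch_preview("https://example.com"))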


"web previews: I'd do this by making it the client's responsibility."

Actually a good example of how difficult the problem is. A very common attack is to switch a bit.ly link or something like that to a malicious destination. You would also DoS the hosts... as the Mastodon folks are discovering (https://www.jwz.org/blog/2022/11/mastodon-stampede/)

For blocks/mutes, you have to account for retweets and quotes, it's just not a fun problem.

Shipping the product is much more difficult than what's in your post. It's not realistic at all, but it is fun to think about.


Here are some pointers:

"Our approach to blocking links" https://help.twitter.com/en/safety-and-security/phishing-spa...

"The Infrastructure Behind Twitter: Scale" https://blog.twitter.com/engineering/en_us/topics/infrastruc...

"Mux" https://twitter.github.io/finagle/guide/Protocols.html#mux

I do agree that some of this could be done better a decade later (like, using Rust for some things instead of Scala), but it was all considered. A single machine is a fun thing to think about, but not close to realistic. CPU time was not usually the concern in designing these systems.


Here's the Twitter edge server from years ago: https://courses.cs.washington.edu/courses/cse551/15sp/notes/...


I'll go ahead and quote that blog post because they block HN users using the referer header.

---

"Federation" now apparently means "DDoS yourself." Every time I do a new blog post, within a second I have over a thousand simultaneous hits of that URL on my web server from unique IPs. Load goes over 100, and mariadb stops responding.

The server is basically unusable for 30 to 60 seconds until the stampede of Mastodons slows down.

Presumably each of those IPs is an instance, none of which share any caching infrastructure with each other, and this problem is going to scale with my number of followers (followers' instances).

This system is not a good system.

Update: Blocking the Mastodon user agent is a workaround for the DDoS. "(Mastodon|http\.rb)/". The side effect is that people on Mastodon who see links to my posts no longer get link previews, just the URL.

---

I personally find this absolutely hilarious. Is that blog hosted on a Raspberry Pi or something? "Over a thousand" requests per second shouldn't even show up on the utilization graphs on a modern server. The comments suggest that he's hitting the database for every request instead of caching GET responses, but even with such a weird config a normal machine should be able to do over 10k/second without breaking a sweat.
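
To be concrete about "caching GET responses": even a dumb short-TTL cache in front of the render path absorbs that kind of stampede. A toy sketch (stdlib only, numbers made up):

    # Toy illustration: memoize rendered pages for a few seconds so a stampede
    # of identical fetches hits the expensive render path once.
    import time

    _cache = {}    # url -> (expires_at, body)
    TTL = 30       # seconds; arbitrary

    def render_page(url):
        # stand-in for the expensive DB-backed render
        time.sleep(0.1)
        return f"<html>rendered {url}</html>"

    def get(url):
        now = time.monotonic()
        hit = _cache.get(url)
        if hit and hit[0] > now:
            return hit[1]
        body = render_page(url)
        _cache[url] = (now + TTL, body)
        return body

    # a thousand Mastodon instances fetching the same post hit the renderer once:
    for _ in range(1000):
        get("/blog/latest")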


> I personally find this absolutely hilarious. Is that blog hosted on a Raspberry Pi or something? "Over a thousand" requests per second shouldn't even show up on the utilization graphs on a modern server.

Mastodon is written in Ruby on Rails. That should really answer all your questions about the problem, but if you're unfamiliar: Ruby is slow compared to any compiled language, Rails is slow compared to nearly every framework on the planet, and Mastodon itself isn't written that well either.


That makes sense, but I'm pretty sure jwz was whining about his blog getting DDoSed, not a Mastodon server.


While that may be funny, the number of Mastodon instances is growing rapidly, to the point where this will eventually need to be dealt with (not least because hosting on a Pi or having a badly optimized setup both happen in real life). But more relevant to this example, it shows that passing preview responsibility to end-user clients is a far bigger problem. E.g. not many sites could handle the onslaught of being linked to from a highly viral tweet if previews weren't cached.


FWIW, jwz uses referer checking to redirect links from HN ... for "DoS" reasons.


Well, Mastodon is criminally slow


> I haven't looked into it, but I wouldn't be surprised if they could get around the trickiest constraint, which is how many hard drives you can plug in to a non-mainframe machine for historical image storage.

NetApp is at something like >300TB of storage per node IIRC, but in any case it would make more sense to use some cloud service. AWS EFS and S3 don't have any (practically reachable) limit on size.


Have you actually used EFS/S3 before?

Because both are ridiculously slow to the point where they would be completely unusable for a service such as Twitter whose current latency is based off everything largely being in memory.

And Twitter already evaluated using the cloud for their core services and it was cost-prohibitive compared to on-premise.


EFS indeed sometimes has latency issues I've never been able to track down, but S3 with Cloudfront? That is more than enough as a CDN.


> I wouldn't be surprised if they could get around the trickiest constraint, which is how many hard drives you can plug in to a non-mainframe machine for historical image storage.

Some commodity machines use external SAS to connect to more disk boxes. IMHO, there's no real reason to keep images and tweets on the same server if you're going to need an external disk box anyway. Rather than getting a 4U server with a lot of disks plus an additional 4U disk box, you may as well get two 4U servers with a lot of disks each, and use one for tweets and the other for images. Anyway, images are fairly easy to scale horizontally; there's not much simplicity gained by having them all on one host, like there is for tweets.


Yah, like I say in the post, the exactly-one-machine thing is just for fun and as an illustration of how far vertical scaling can go; practically I'd definitely scale storage with many sharded smaller storage servers.
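
For example, a sketch of the kind of static sharding I mean, hashing image IDs across storage hosts (hostnames and shard count invented for illustration):

    # Sketch of sharding image storage across many small servers: the image ID
    # alone determines which host owns the blob, so no lookup service is needed.
    import hashlib

    STORAGE_HOSTS = [f"img-store-{i:02d}.internal" for i in range(16)]

    def host_for_image(image_id: str) -> str:
        digest = hashlib.sha1(image_id.encode()).digest()
        shard = int.from_bytes(digest[:4], "big") % len(STORAGE_HOSTS)
        return STORAGE_HOSTS[shard]

    print(host_for_image("1590000000000000000"))  # deterministic host assignment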


Incidentally, a lot of people have argued that the massive datacenters used by e.g. AWS are effectively single large ("warehouse-scale") computers. In a way, it seems that the mainframe has been reinvented.


to me the line between machine and cluster is mostly about real-time and fate-sharing. multiple cores on a single machine can expect memory accesses to succeed, caches to be coherent, interrupts to trigger within a deadline, clocks not to skew, cores in a CPU not to drop out, etc.

in a cluster, communication isn't real-time. packets drop, fetches fail, clocks skew, machines reboot.

IPC is a gray area. the remote process might die, its threads might be preempted, etc. RTOSes make IPC work more like a single machine, while regular OSes make IPC more like a network call.

so to me, the datacenter-as-mainframe idea falls apart because you need massive amounts of software infrastructure to treat a cluster like a mainframe. you have to use Paxos or Raft for serializing operations, you have to shard data and handle failures, etc. etc.

but it's definitely getting closer, thanks to lots of distributed systems engineering.


I wouldn't really agree with this since those machines don't share address spaces or directly attached busses. Better to say it's a warehouse-scale "service" provided by many machines which are aggregated in various ways.


I wonder though.. could you emulate a 20k-core VM with 100 terabytes of RAM on a DC?

Ethernet is fast, you might be able to get in range of DRAM access with an RDMA setup. cache coherency would require some kind of crazy locking, but maybe you could do it with FPGAs attached to the RDMA controllers that implement something like Raft?

it'd be kind of pointless and crash the second any machine in the cluster dies, but kind of a cool idea.

it'd be fun to see what Task Manager would make of it if you could get it to last long enough to boot Windows.


I have fantasized about doing this as a startup, basically doing cache coherency protocols at the page table level with RDMA. There's some academic systems that do something like it but without the hypervisor part.

My joke fantasy startup is a cloud provider called one.computer where you just have a slider for the number of cores on your single instance, and it gives you a standard linux system that appears to have 10k cores. Most multithreaded software would absolutely trash the cache-coherency protocols and have poor performance, but it might be useful to easily turn embarrassingly parallel threaded map-reduces into multi-machine ones.
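
To make the "coherency at the page table level" idea a bit more concrete, here's a toy directory-based MSI-style protocol over pages. Purely illustrative: no real memory, no RDMA, all names invented:

    # Toy sketch of directory-based coherence at page granularity, roughly the
    # bookkeeping the hypervisor would do over RDMA.
    class Directory:
        """Tracks, for each guest page, which node owns it and who shares it."""
        def __init__(self):
            self.owner = {}      # page -> node holding the only writable copy
            self.sharers = {}    # page -> set of nodes holding read-only copies

        def read_fault(self, page, node):
            # downgrade any writer to a sharer (it writes the page back first)
            if page in self.owner:
                self.sharers.setdefault(page, set()).add(self.owner.pop(page))
            self.sharers.setdefault(page, set()).add(node)   # node maps it read-only

        def write_fault(self, page, node):
            # invalidate every other copy, then hand exclusive ownership to `node`
            for other in self.sharers.pop(page, set()) - {node}:
                pass  # would send an invalidation to `other` over the fabric
            if self.owner.get(page) not in (None, node):
                pass  # would pull the dirty copy from the old owner and invalidate it
            self.owner[page] = node

    d = Directory()
    d.read_fault(0x1000, "node-a")    # node-a maps the page read-only
    d.write_fault(0x1000, "node-b")   # node-a's copy is invalidated, node-b owns it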


You absolutely can, but the speed of light is still going to be a limiting factor for RTT latencies: acquiring and releasing locks, obtaining data from memory, etc.

It's relatively easy to have it work slowly (reducing clocks to have a period higher than max latency), but becomes very hard to do at higher freqs.

Beowulf clusters can get you there to some extent, although you can always do better with specialized hardware and software (by then you're building a supercomputer...)
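
Back-of-the-envelope on the speed-of-light point (the distance is my assumption):

    # Round trip across a big datacenter hall vs. a local DRAM access,
    # counting only propagation in fiber.
    c_fiber = 2.0e8     # m/s, light in fiber (~2/3 of c)
    distance = 300      # meters, one way (assumption)

    rtt_us = 2 * distance / c_fiber * 1e6    # ~3 microseconds of wire time alone
    dram_ns = 100                            # typical local DRAM access

    print(f"fabric RTT >= {rtt_us:.1f} us vs DRAM ~{dram_ns} ns "
          f"-> ~{rtt_us * 1000 / dram_ns:.0f}x slower before any protocol overhead")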



Oxide is basically building a minicomputer.



