Congrats! Being able to run a nice company bootstrapped seems amazing.
Turning 10, you might want to stop dissing WordPress on your homepage for being 15, though ;)
Your customers demand blazing-fast digital products, web standards are evolving at the speed of light, yet you rely on 15-year-old solutions like WordPress that force you to deliver heavy, low-quality user experiences.
AWS has good base building blocks (ALB, EC2, Fargate, RDS, IAM etc). But it takes knowledge to put the pieces together. Thus AWS tries to create services/tools that orchestrate the base blocks (Amplify, Beanstalk) for you, which in my experience always becomes a mess where you don't actually understand what you are running in your cloud setup.
I'd recommend either learning the basic building blocks (those skills also transfer well to other clouds and to self-hosting) or using a higher-level service provider than AWS (Vercel etc.) - they do it better than AWS.
I believe I was actually trying to use that. It’s been a few years so my memory is hazy, but isn’t Fargate just a special case of ECS where they handle the host machines for you?
In any case, the problem wasn't so much ECS or Fargate, beyond the complexity of their UI and config, but rather that CloudWatch was flaky. The issue that blocked the deployment was on my end - something preventing the health check from succeeding, so the container never came up healthy when deployed (it worked locally). The problem was that AWS didn't help me figure out what was wrong, and CloudWatch showed no logs about 80% of the time. I literally clicked deploy, waited for the deploy to fail, refreshed CloudWatch, saw no logs, clicked deploy again, and repeated until logs appeared. It took about five attempts to see logs - every single time I made a change. (It wasn't clear the error was on my end, so it was quite a frustrating process.)
On DigitalOcean, the logs showed up correctly every single time. I was able to determine the problem was on my end after a few attempts, add the extra logging needed to track it down, fix it, and get a working deployment in under ten minutes.
+1, but I'm not sure if the "simple is robust" saying is straightforward enough? It opens up to discussion about what "simple" means and how it applies to the system (which apparently is a complex enough question to warrant the attention of the brilliant Rich Hickey).
Maybe "dumb is robust" or "straightforward is robust" capture the sentiment better?
The usual metric is complexity, but that can be hard to measure in every instance.
Used within a team setting, what counts as simple is entirely subjective to that team's set of experiences.
Example: Redis is dead simple, but it's also an additional service. Depending on the team, the problem, and the scale, it might be best to use your existing RDBMS. A different set of circumstances may make Redis the best choice.
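To make that trade-off concrete, here is a minimal sketch (all names here are hypothetical, not from any real codebase): a small cache interface that starts out backed by an in-process map, so a Redis-backed implementation can be slotted in later only if scale actually demands the extra service.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache abstraction: start with the in-process version,
// swap in a Redis-backed implementation only if the scale demands it.
interface Cache {
    void put(String key, String value);
    String get(String key); // returns null on a miss
}

// Backed by the JVM heap; zero extra infrastructure to operate.
final class InProcessCache implements Cache {
    private final Map<String, String> map = new ConcurrentHashMap<>();
    public void put(String key, String value) { map.put(key, value); }
    public String get(String key) { return map.get(key); }
}

public class CacheDemo {
    public static void main(String[] args) {
        Cache cache = new InProcessCache();
        cache.put("session:42", "alice");
        System.out.println(cache.get("session:42")); // prints alice
    }
}
```

The point of the interface is that the calling code never learns which backing store it got, so the "additional service" decision stays reversible.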
Note: I love "dumb is robust," as it ties simplicity and straightforwardness together, but I'm concerned it may carry an unnecessarily negative connotation to both the problems and the team.
Indeed, "simple" is not a good word to qualify something technical. I have a colleague, and when he comes up with something new and "simple", it usually takes me down a rabbit hole of mind-bending and head-shaking. A matter of personal perspective?
Is my code simple if all it does is call one function (that's 50k lines long) hidden away in a dependency?
You can keep twisting this question until you realize that without the behemoths of complexity that are modern operating systems (let alone CPUs), we wouldn't be able to afford the privilege of writing "simple" code. And no code is ever "simple"; if it is, it just means that you're sitting on an adequate abstraction layer.
So we're back at square one. Abstraction is how you simplify things. Programming languages themselves are abstractions. Everything in this discipline is an abstraction over binary logic. If you end up with a mess of spaghetti, you simply chose the wrong abstractions, which led to counter-productive usage patterns.
My goal as someone who writes library code is to produce a framework that's simple to use for the end user (another developer). That means I'm hiding TONS of complexity within the walls of the infrastructure. But the result is simple-looking code.
Think about DI in C#: it's all done via reflection. Is that simple? It depends on who you ask - the end user, or the library maintainer who needs to parametrize an untyped generic with 5 different type arguments?
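For a taste of what that hidden complexity looks like, here is a toy constructor-injection container (a sketch in Java rather than C#, and nothing like a production DI framework): the user's side stays a one-liner while reflection does all the wiring behind the wall.

```java
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;

// Toy DI container: picks a class's first declared constructor and
// satisfies each parameter recursively, caching instances as singletons.
final class TinyInjector {
    private final Map<Class<?>, Object> singletons = new HashMap<>();

    @SuppressWarnings("unchecked")
    <T> T getInstance(Class<T> type) {
        Object existing = singletons.get(type);
        if (existing != null) return (T) existing;
        try {
            Constructor<?> ctor = type.getDeclaredConstructors()[0];
            Class<?>[] params = ctor.getParameterTypes();
            Object[] args = new Object[params.length];
            for (int i = 0; i < params.length; i++) {
                args[i] = getInstance(params[i]); // recurse into dependencies
            }
            Object instance = ctor.newInstance(args);
            singletons.put(type, instance);
            return (T) instance;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}

// Example object graph: Service depends on Repository.
class Repository {}
class Service {
    final Repository repo;
    Service(Repository repo) { this.repo = repo; }
}

public class DiDemo {
    public static void main(String[] args) {
        // "Simple" from the user's side: one call wires the whole graph.
        Service s = new TinyInjector().getInstance(Service.class);
        System.out.println(s.repo != null); // prints true
    }
}
```

The one-line call site is exactly the "simple-looking code" from the library user's perspective; the reflective lookup, recursion, and caching are the complexity the maintainer eats on their behalf.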
Obviously, when all one does is write business logic, these considerations fall short. There's no point in writing elegant, modular, simple code if there's no one downstream to use it. Might as well just focus on ease of readability and maintainability at that point, while you wait for the project to become legacy and die. But that's just one particular case where you're essentially an end user from the perspective of everyone who wrote the code you're depending on.
I like the house analogy, but I like to think of it as if the people building the house did not know how it was supposed to look (or function). This is mostly true, since very few developers know exactly how the end result (product/service) should look and function when they start coding.
e.g. "We did not know where to put the piping at the start, so we put it on the outside and now installing a new restroom is sort of tricky."
This is why nobody can decide if computer science is actually science, engineering, or art. It's such a vast industry that it's clearly all 3, depending on what you're doing.
> I'm currently building a skyscraper on the foundations of a bikeshed.
Not sure if you are joking or not, but I often hear similar things and I believe that it misses the point. What constitutes a good foundation in software is very subjective - and just saying "foundation bad" does not help a non-technical person understand _why_ it is bad.
It's better to point at that one small rock (some ancient Perl script that no one understands any longer) which holds up the entire thing. That might be fine until someone needs to move that rock - or something surrounding it.
I like this thinking because it's a true reflection of how things work. I strongly doubt any housebuilder goes back to the architect and says "can't do that, foundations bad." They'd explain what the problem is: maybe the design is rated to a certain weight/height, or the ground composition prevents the requested changes.
We should do the same in software engineering. What exactly in our design (e.g. that Perl script that's running half the operation that we need to investigate) is stopping us?
From the code samples it's hard to tell whether or not this has to do with de-serialization though. It would have been fun to see profiling results for tests such as these.
That's nice - I'd encourage you to play around with attaching e.g. JMC [1] to the process to better understand why things are as they are.
I tried recreating your DataInputStream + BufferedInputStream setup (wrote the 1brc data to separate output files, read using your code - I had to guess at the ResultObserver implementation though). On my machine it ran in roughly the same time frame as yours - ~1min.
According to Flight Recorder:
- ~49% of the time is spent in reading the strings (city names). Almost all of it in the DataInputStream.readUTF/readFully methods.
- ~5% of the time is spent reading temperature (readShort)
- ~41% of the time is spent doing hashmap look-ups for computeIfAbsent()
- About 50GB of memory is allocated - 99.9% of it for Strings (and the wrapped byte[] array in them). This likely causes quite a bit of GC pressure.
Hash-map lookups are not de-serialization, yet the lookup likely affected the benchmarks quite a bit. The rest of the time is mostly spent in reading and allocating strings. I would guess that that is true for some of the other implementations in the original post as well.
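For reference, the loop being profiled would look roughly like this (ResultsObserver and the on-disk format are guesses, as in the experiment above - I'm assuming each record is writeUTF(city) followed by writeShort(temperature * 100); the demo below reads from an in-memory buffer instead of a file):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Guessed observer: tracks min/max/sum like the 1brc task requires.
class ResultsObserver {
    double min = Double.MAX_VALUE, max = -Double.MAX_VALUE, sum;
    long count;
    void observe(double t) {
        min = Math.min(min, t); max = Math.max(max, t); sum += t; count++;
    }
}

public class DisLoop {
    public static void main(String[] args) throws IOException {
        // Build a tiny in-memory sample in the assumed record format.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bos)) {
            out.writeUTF("Oslo");  out.writeShort(-512);  // -5.12 degrees
            out.writeUTF("Lagos"); out.writeShort(3100);  // 31.00 degrees
        }
        int records = 2;

        Map<String, ResultsObserver> stats = new HashMap<>();
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(
                new ByteArrayInputStream(bos.toByteArray())))) {
            for (int i = 0; i < records; i++) {
                String city = in.readUTF();       // ~49% of time in the profile
                int temperature = in.readShort(); // ~5%
                stats.computeIfAbsent(city, k -> new ResultsObserver()) // ~41%
                     .observe(temperature / 100.);
            }
        }
        System.out.println(stats.get("Lagos").max); // prints 31.0
    }
}
```

Every readUTF call allocates a fresh String (plus its backing byte[]), which is where the ~50GB of allocations in the profile would come from.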
JMC is indeed a valuable tool, though what you see in any java profiler is to be taken with a grain of salt. The string parsing and hash lookups are present in most of the implementations, yet some of them are up to 10 times faster than the DataInputStream + BufferedInputStream code.
It doesn't seem like it can be true that 90% of the time is spent in string parsing and hash lookups if the same operation takes 10% of the time when reading from a filechannel and bytebuffer.
// Assumes tempFile, records, and stats (a Map<String, ResultsObserver>)
// are defined elsewhere, as in the harness above.
var buffer = ByteBuffer.allocate(4096);
try (var fc = (FileChannel) Files.newByteChannel(tempFile,
        StandardOpenOption.READ)) {
    buffer.flip(); // start empty so the first iteration triggers a read
    for (int i = 0; i < records; i++) {
        if (buffer.remaining() < 32) { // refill when a record might not fit
            buffer.compact();
            fc.read(buffer);
            buffer.flip();
        }
        int len = buffer.get();
        byte[] cityBytes = new byte[len];
        buffer.get(cityBytes);
        String city = new String(cityBytes);
        int temperature = buffer.getShort();
        stats.computeIfAbsent(city, k -> new ResultsObserver())
             .observe(temperature / 100.);
    }
}
My bad - I got confused, as the original DIS+BIS took ~60s on my machine. I reproduced the Custom 1 implementation locally (before seeing your repo) and it took ~48s on the same machine. JFR (which you honestly can trust most of the time) says that the HashMap lookup is now ~50% of the time, with the String constructor call being ~35%.
Please add https://github.com/apache/fury to the benchmark. It claims to be a drop-in replacement for the built-in serialization mechanism so it should be easy to try.
I don't actively use it, unfortunately. The main inspiration was to faster sort inputs into the https://github.com/BurntSushi/fst crate, which I in turn used to try to build a search library.
> Turning 10, you might want to stop dissing WordPress on your homepage for being 15, though ;)
After all, you'll be there in only 5 years!