We're talking about deserialising JSONs in the application server here, nobody s...

greggyb · on April 2, 2024

I work primarily in data analytics. It tends to become noticeable in my experience as soon as you're at a few million records[0] on at least one side of a relationship. Especially as we see more columnar databases in analytics, the row count accounts for more than total data size for this sort of thing.

Due to the type of aggregate queries that typify analytics workloads, almost everything turns into a scan, whether it be of the a table, field, or index. Strings occupy more space on disk, or in RAM, so scanning a whole column or table simply takes longer, because you have to shovel more bytes through the CPU. This doesn't even take into account the relative CPU time to actually do the comparisons.

I've never personally worked with a system that has string keys shorter than 10 [1][2] characters. At that point, regardless of how you pack characters into a register, you're occupying more bits with two strings of character data than you would with two 64-bit integers[3]. This shows through in join time.

[0]: Even modestly sized companies tend to have at least a few tables that get into the millions of records.

[1]: I've heard of systems with shorter string keys

[2]: Most systems with string keys I've encountered have more than 10 characters.

[3]: The vast majority of systems I've seen since the mid-2010s use 64-bit integers for keys for analytics. 32-bit integers seemed to phase out for new systems I've seen since ~2015, but were more common prior to that.

imtringued · on April 2, 2024

People use integer data types for primary keys in databases all the time. There is nothing wrong with it.

hinkley · on April 2, 2024

Mostly a deal of defaults on our stack. Tweaked a couple of things in a few places to stop the bleeding. Then had to fix all of the tests.

Dylan16807 · on April 2, 2024

> when you reach that stage, you're likely have already thrown out a lot of other common sense stuff like db normalisation to get there

Don't most databases set a length limit on ID strings?

If you're setting a length limit, and it's made out of digits with no leading zeroes, then you might as well store it as a number. Is there a downside?