
One of the first Ajax projects I worked on was multi-tenant, and someone decided to solve the industrial espionage problem by using random 64-bit identifiers for all records in the system. You have about a .1% chance of generating an ID that gets truncated in JavaScript, which is just enough that you might make it past MVP before anyone figures out it’s broken, and that’s exactly what happened to us.

So we had to go through all the code adding quotes to all the ID fields. That was a giant pain in my ass.
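
For anyone who hasn't hit this, a minimal sketch of both the failure and the fix (the ID value here is made up; any integer above 2^53 can be affected):

    // A 64-bit ID sent as a bare JSON number gets rounded to the nearest
    // double during parsing -- the low-order bits are silently lost.
    JSON.parse('{"id": 9223372036854775807}').id
    // -> 9223372036854776000 (the nearest double), not the ID that was sent

    // The fix we had to retrofit: quote the ID so it stays a string.
    JSON.parse('{"id": "9223372036854775807"}').id
    // -> "9223372036854775807"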



I've been burned by a similar issue too. The lesson here is never to use numbers for things you are not planning to do math on. IDs should always be strings.


Isn't the lesson only that ids shouldn't be floats? If they were integers everything would be fine, but JS numbers aren't integers, even if they look like them sometimes.
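
For example (every JS number is an IEEE 754 double, even when it looks like an integer):

    Number.isInteger(2 ** 53)       // true -- it looks and acts like an integer
    2 ** 53 === 2 ** 53 + 1         // also true: 2^53 + 1 has no exact double representation
    Number.MAX_SAFE_INTEGER         // 9007199254740991, i.e. 2^53 - 1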


Nah, the lesson is broader than that, because numbers as IDs have a whole bunch of problems and this is just one of them. E.g. Twitter has incrementing number IDs, and back when they had this whole ecosystem of third-party Twitter apps (that they have since ruined), half the apps failed when the IDs became too large to fit into a 32-bit int.

If it looks like a number, and it quacks like a number, sooner or later people are going to treat it like a number.


> If it looks like a number, and it quacks like a number, sooner or later people are going to treat it like a number.

Which is perfectly fine; just don't treat it like an int32. :-)


Until you want faster joins, in which case comparisons of integers tend to be much faster than string comparisons on any hardware I am aware of.


We're talking about deserialising JSON in the application server here; nobody stops you from treating ids as numbers on the database side of things.

But also, this sounds like a premature optimisation. Most applications will never reach a level where their performance is actually impacted by string comparison, and when you reach that stage, you've likely already thrown out a lot of other common-sense stuff like db normalisation to get there, so we shouldn't judge "regular people" advice by a scale that rarely applies anyway.

Out of curiosity, have you ever seen an application that was meaningfully impacted by this? How gigantic was it?

----

Scratch that. I've actually thought about it some more, and now I'm not 100% sure it's premature, I have to investigate further to be sure. Question still stands though.


I work primarily in data analytics. In my experience it tends to become noticeable as soon as you're at a few million records[0] on at least one side of a relationship. Especially as we see more columnar databases in analytics, row count matters more than total data size for this sort of thing.

Due to the type of aggregate queries that typify analytics workloads, almost everything turns into a scan, whether it be of a table, field, or index. Strings occupy more space on disk or in RAM, so scanning a whole column or table simply takes longer, because you have to shovel more bytes through the CPU. This doesn't even take into account the relative CPU time to actually do the comparisons.

I've never personally worked with a system that has string keys shorter than 10 characters[1][2]. At that point, regardless of how you pack characters into a register, you're occupying more bits with two strings of character data than you would with two 64-bit integers[3]. This shows up in join time (rough numbers sketched after the footnotes).

[0]: Even modestly sized companies tend to have at least a few tables that get into the millions of records.

[1]: I've heard of systems with shorter string keys.

[2]: Most systems with string keys I've encountered have more than 10 characters.

[3]: The vast majority of systems I've seen since the mid-2010s use 64-bit integers for keys for analytics. 32-bit integers seemed to phase out for new systems I've seen since ~2015, but were more common prior to that.
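
To put rough numbers on the "more bytes through the CPU" point, a back-of-the-envelope sketch (the figures are made up, not a benchmark):

    // Bytes scanned for a join-key column, ignoring encoding and compression.
    const rows = 50_000_000;        // hypothetical fact-table row count
    const intKeyBytes = 8;          // 64-bit integer key
    const strKeyBytes = 16;         // e.g. a 16-character string key
    console.log(`int keys:    ${(rows * intKeyBytes) / 1e6} MB`);  // 400 MB
    console.log(`string keys: ${(rows * strKeyBytes) / 1e6} MB`);  // 800 MB
    // Twice the bytes to shovel before any comparison happens, and each
    // string comparison is itself more work than a single integer compare.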


People use integer data types for primary keys in databases all the time. There is nothing wrong with it.


Mostly a matter of defaults on our stack. Tweaked a couple of things in a few places to stop the bleeding. Then had to fix all of the tests.


> when you reach that stage, you've likely already thrown out a lot of other common-sense stuff like db normalisation to get there

Don't most databases set a length limit on ID strings?

If you're setting a length limit, and it's made out of digits with no leading zeroes, then you might as well store it as a number. Is there a downside?


Don't care.

A numeric identity is an identity and so is a string.

If you want to math it, it is a number, otherwise... string.

"Will you ever want the 95th percentile PID? Then it is not a number. Move on."


Double precision floats can't represent every 64-bit integer. If you want to math it, what kind of number will you accept?


They are saying not to use numbers unless you need to do math with the thing.

If you need to do math with the thing, use an appropriate type of number, of course.


If you're using 64-bit integers because you've got some super high precision math you need to do over an enormous space of addressable numbers, like maybe you're firing unguided kinetic energy weapons at enemies on other planets... sure, use big numbers. I'm sure you've got some clever libraries able to do such things reliably, and I won't question why you're using JSON as your serialization format.

If you're using 64-bit numbers as a high-cardinality identity that can be randomly generated without concern for collision (like a MAC address with more noise) -- well, that's an identity and doesn't need to have math applied to it. For example: "What's the mean IP address that's connected to Cloudflare in the last 10 minutes" or "what's the sum of every MAC address in this subnet?" are both nonsense properties because these "numbers" are identities, not numbers, and using a data type that treats them as numbers invites surprising, sometimes unpleasantly so, results.

Of course, because these are computers, all strings are ultimately numbers but their numberness is without real meaning.


UUIDs are great for this. A UUID is really just a random 128-bit integer, which makes comparisons about as fast as variable-length integers on modern hardware. And they're serialized as strings, which means no application code or API end-user code is going to assume it's a number.
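
For instance, in modern browsers and recent Node this is one call, and the result is already a string:

    // crypto.randomUUID() returns a v4 UUID string such as
    // "36b8f84d-df4e-4d49-b662-bcde71a8764f" (122 random bits plus version/variant bits).
    const id = crypto.randomUUID();
    const body = JSON.stringify({ id });   // the ID round-trips as a string; no precision to lose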


Absolutely agreed on all points. I like UUIDs. There is still a surprising number of data processing systems which don't have support for 128-bit integers. This makes me sad.


> You have about a .1% chance of generating an ID that gets truncated in JavaScript

I don't follow. 1 - (Number.MAX_SAFE_INTEGER / 2^63) ≈ 99.9%, so don't you have a >99% chance of generating an ID that gets truncated in JS?
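
Spelling out the arithmetic:

    Number.MAX_SAFE_INTEGER / 2 ** 63       // ~2^53 / 2^63 = 2^-10, about 0.001
    1 - Number.MAX_SAFE_INTEGER / 2 ** 63   // about 0.999 -- the fraction of the space above 2^53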


IEEE 754 can represent integers larger than MAX_SAFE_INTEGER, just not all of them:

https://en.wikipedia.org/wiki/Double-precision_floating-poin...

That's still going to be a greater than 0.1% chance of hitting a non-representable value though.
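
Concretely, above 2^53 some integers still land exactly on a double and some don't:

    Number(2n ** 60n) === 2 ** 60        // true: powers of two stay exact well past 2^53
    Number(2n ** 60n + 1n) === 2 ** 60   // also true: the +1 falls between representable doubles
    Number.isSafeInteger(2 ** 60)        // false: "representable" is not the same as "safe"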


It’s been a long long time. I may be remembering the ratio wrong, or we might have been clipping the range a bit.


> or we might have been clipping the range a bit

Well, it's a pretty abrupt change. Random 53-bit values all come through fine, at 54 bits a quarter of them get truncated, at 55 it's half.
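
A quick way to sanity-check those ratios by sampling (a sketch, treating an N-bit ID as a uniformly random value in [0, 2^N)):

    // Build a random N-bit BigInt, then see if it survives a round trip through a double.
    function randomBigInt(bits) {
      let v = 0n;
      for (let remaining = bits; remaining > 0; ) {
        const chunk = Math.min(32, remaining);
        v = (v << BigInt(chunk)) | BigInt(Math.floor(Math.random() * 2 ** chunk));
        remaining -= chunk;
      }
      return v;
    }

    function truncatedFraction(bits, samples = 100_000) {
      let lost = 0;
      for (let i = 0; i < samples; i++) {
        const v = randomBigInt(bits);
        if (BigInt(Number(v)) !== v) lost++;   // lossy round trip
      }
      return lost / samples;
    }

    // truncatedFraction(53) -> ~0
    // truncatedFraction(54) -> ~0.25
    // truncatedFraction(55) -> ~0.5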


Why would the value get truncated?


FTA: JavaScript's built-in JSON implementation is limited to the range and precision of a double.

Obviously, not all int64 values are representable in float64 (double).


We have ample computing power today to be rid of floats altogether and use integers, fractions and natural numbers.


With 10 times the memory usage and 100 times the compute power, maybe you could replace floats with something that behaves more like real numbers and covers mostly the same range.

But the resulting type is still going to have its own limitations and sharp edges. Floats are not the right tool for every job but they are quite good at the jobs they are right for. Learning how they work is more useful than lamenting their existence.


It’s about 2.5 times as much memory, if you do base 10: 65k takes a little under 5 bytes of decimal digits to represent what fits in 2 bytes of binary.

But floats are not the right representation for values that need to exactly match, like an ID, to be sure.

If I’m off by half a cent it’s annoying. If I’m off by half a row I get nothing.

The thing is that almost all of the problems we had in my initial story came from choosing system defaults. All except the PK algorithm.


With densely packed decimals (3 digits in 10 bits), you can reduce the space overhead to as little as 2.4% (1024/1000). The IEEE has even standardized various base-10 floating-point formats (e.g. decimal64). I'd suspect that with dedicated hardware, you could bring down the compute difference to 2-3x binary FP.

However, I read the post I responded to as decrying all floating-point formats, regardless of base. That leaves only fixed-point (fancy integers) and rationals. To represent numbers with the same range as double precision, you'd need about 2048 bits for either alternative. And rational arithmetic is really slow due to heavy reliance on the GCD operation.


I was speaking in the context of JSON, where all numbers are decimal.


not all numbers are representable as the particular type of floating point number that js uses

nice pics here: https://en.wikipedia.org/wiki/Floating-point_arithmetic



