
> with a binary format you can transmit NaN, Infinity, -infinity, and -0. You can also create two NaN numbers that do not have the same binary representation. You have to choose single or double precision (maybe a benefit, not always). Etc.

Boy you are really stretching to make this sound complicated. It's not. You transmit 4 bytes or 8 bytes. Serialization is a memcpy().

You don't have to think about NaNs and Infinities because they Just Work -- unlike with textual formats where you need to have special representations for them and you have to worry about whether you are possibly losing data by dropping those NaN bits. If you want to drop the NaN bits in a binary format, it's another one-liner to do so.

It's funny that you choose to pick on floating-point numbers here, because converting floating-point numbers to decimal text and back is insanely complicated. One of the best-known implementations of converting FP to text is dtoa(), based on the paper (yes, a whole paper) called "How to Print Floating-Point Numbers Accurately". Here's the code:

http://www.netlib.org/fp/dtoa.c

Go take a look. I'll wait.

dtoa() is not even the state of the art anymore. Dragon4 (the algorithm from the paper above) was the earlier state of the art, and just in the last few years there have been significant advances, e.g. Grisu2 and Grisu3...

Again, in binary formats, all that is replaced by a memcpy() of 4 or 8 bytes.

(A previous rant of mine on this subject: https://news.ycombinator.com/item?id=17277560 )

> > Length-prefixed binary formats are almost trivial to parse

> They definitely are not, as displayed by the fact that binary lengths are the root cause of a huge number of security flaws. JSON mostly avoids that.

Injection (forgetting to escape embedded text) is the root cause of a huge number of security flaws for text formats. Length-prefixed formats do not suffer from this.

What "huge number of security flaws" are you referring to that affect length-delimited values? Buffer overflows? Those aren't caused by values being length-delimited, they are caused by people deserializing variable-length values into fixed-length buffers without a check. That mistake can be made just as easily with a text format as with a binary format. In fact I've seen it much more often with text.

> JSON currently dominates large parts of that ecology.

JSON wins for one simple reason: it's easy for human developers to think about, because they can see what it looks like. This is very comforting. It's wasteful and full of pitfalls ("oops, my int64 silently lost a bunch of bits because JSON numbers are all floating point"), but comforting. Even I find it comforting.

Ironically, writing a full JSON parser from scratch is much more complicated than writing a full Protobuf parser. But developers are more comfortable with the parser being a black box than with the data format itself being a black box. ¯\_(ツ)_/¯

(Disclosure: I am the author of Protobuf v2 and Cap'n Proto. In addition to binary native formats, both have text-based alternate formats for which I wrote parsers and serializers, and I've also written a few JSON parsers in my time...)



I think people undervalue clean-looking (alphabet-only, few special character) things, things that don't require people to use the symbol-parsing part of their brain. Basically easily human-parseable things. I suspect this phenomenon can be observed in the relative popularity of JSON, TOML, YAML, and Python, plus the relative unpopularity of Lisp, Haskell, Rust, and XML. And if we look at Protobuf in this context, it is not easy for humans to parse, which makes people not want to use it; developers are not

> more comfortable with the parser being a black box

they're more comfortable with the parser being a black box but the format being relatively easy to parse, compared to the parser being easy to understand but the format being basically unreadable to a human.


> I think people undervalue clean-looking (alphabet-only, few special character) things, things that don't require people to use the symbol-parsing part of their brain. Basically easily human-parseable things.

The symbol parsing part of the human brain is what parses letters and numbers, as well as other abstract symbols. The division of symbols into letters, numbers, and others is fairly arbitrary. Most people would call "&" a symbol, but its modern name, "ampersand", is a smoothing over of the way it was recited ("and per se and") back when it was considered part of the alphabet and recited with it.

> I suspect this phenomenon can be observed in the case of relative popularity of JSON, TOML, YAML and Python plus the relative unpopularity of Lisp, Haskell, Rust, XML.

I suspect not: Lisp and Haskell have less use of non-alphanumeric characters than most more-popular general purpose languages, and not significantly more than Python; also, if this was the phenomenon in play, popularity would be TOML > YAML > JSON but in reality it's closer to the reverse.


> The symbol parsing part of the human brain is what parses letters and numbers, as well as other abstract symbols.

I really don't think that's true when you compare someone using the Latin alphabet and words in that alphabet with some other "alphabet" (e.g. {}():!) and the "words" (or meanings) built from it. As a crude example, parsing "c = a - b", where equals and minus are one symbol each and have been taught for years, is different from parsing "c := a << b", where ":=" and "<<" act as separate meanings someone has to learn to understand. Similar to the difference between the Latin alphabet and, say, simplified Chinese.

> also, if this was the phenomenon in play, popularity would be TOML > YAML > JSON but in reality it's closer to the reverse.

There could be a somewhat sigmoid response to the effect: a decreased reaction at either extreme compared to deviations around the average.

I'm not a linguist, so this is just my speculation; don't take it too seriously :D


That's exactly what I was trying to say. Sorry if I wasn't clear.


> Those aren't caused by values being length-delimited, they are caused by people deserializing variable-length values into fixed-length buffers without a check. That mistake can be made just as easily with a text format as with a binary format. In fact I've seen it much more often with text.

Especially since text can be arbitrarily long. From that perspective, length-delimited text (I've seen that before in a few file formats, and more notably HTTP) is probably the worst of both worlds.



