As a person living with the CJK languages (specifically, I’m Korean), I find that some of these problems remain prominent even in the 21st century.
There is an excessive number of programs that conflate key presses with text input, and ones that don’t consider input methods at all.
I use macOS & Linux, and while the default text-handling system on macOS, the Cocoa Text System, handles input methods well, almost all applications that implement their own, including big apps like Eclipse and Firefox, don’t get this right.
On Linux, it’s terrifying; I’ve never seen any app that allows input systems to work naturally, and after a week of use you get used to pressing space & backspace after finishing every Hangul word. The Unix-style composability they want (apps should work whether or not input methods are used), and the fact that Linux users that use Latin characters don’t seem to use any input methods (as opposed to macOS, where Latin characters are also input through a Latin input system), mean this state will probably persist.
About the emoticons, I’m not that concerned, since most (if not all) users won’t really input the color modifier separately (or even encounter files that have a separate one), so you can just pick a sensible behavior like #2, 3, or 4. Users who understand the color modifiers and the rest of the Unicode fiasco will understand what is happening under the hood, and those who don’t will just think the file is broken; none of the behaviors will make sense to them, whatever you do.
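(For concreteness, the skin tone "color" modifier really is its own code point that rides along with the base emoji; a quick TypeScript illustration:)

    // U+1F44D THUMBS UP SIGN followed by U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4
    const s = "👍🏽";
    console.log([...s]);    // ["👍", "🏽"]: the modifier is a separate code point
    console.log(s.length);  // 4, because each code point is a surrogate pair in UTF-16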
For decades I've been saying that all text input should go through OS-level IMEs (input method editors). For more complex scripts, the need is obvious, but even writing in English, we can get great benefits from a system that expands abbreviations, replaces easy-to-type sequences with proper Unicode chars, runs little scripts and inserts the output, gives you quick dictionary/thesaurus lookups, gives you emmet-style powers, etc., whenever you're writing and in any app.
Of course, everyone's preferences will be different, so you'll get default IMEs with default configs, but with the idea that you can reconfigure or replace them entirely with systems that work the way you want to work everywhere you input text. There have been utilities that do this sort of thing for decades, but they've always been treated as clever hacks rather than standard text input.
In other words, instead of powerful text input methods being the exception, they would be the rule, and apps that didn't use them would be the exception.
> I've been saying that all text input should go through OS-level IMEs
This is so true; all commercial OSes that take i18n seriously do this, while most open-source OS communities (whose communication revolves around English) decided that IMEs are an add-on for CJK people.
It's a pity, and this reason alone is enough to keep Linux from ever being adopted by ordinary users in the non-Western world.
That's not really fair, though. Linux users still overwhelmingly use X Windows. There is no standard IME for X. This is just a historical issue. X Windows is OLD -- far older than Windows or the Mac (even the old Mac), for instance.
My biggest problem with IMEs in free software land is that we've had groups like Gnome lean on IME developers and push through their vision of how it should work -- even though the people pushing their vision don't use IMEs. I ended up migrating to FCITX just because it was the last hold out not to cave to pressure.
Old really means there's been more time to identify and solve the problem, and the fact it hasn't been cleanly solved is a lot more indicative of priorities than time or tech.
Microsoft and Apple have the incentive of selling to billion-user markets. The open source community, on average, appears to have demonstrated a lack of interest in opening up the user-base further (and expecting that user-base to just roll their own solution creates multiple catch-22 and tragedy of the commons problems).
Why one of the commercial open-source vendors hasn't taken this on as a core challenge, I do not know.
Technically true, but for practical purposes IBus is the standard. That's what Fedora and Ubuntu use out of the box. Firefox beta telemetry shows about 89% IBus vs. 11% FCITX.
> My biggest problem with IMEs in free software land is that we've had groups like Gnome lean on IME developers and push through their vision of how it should work -- even though the people pushing their vision don't use IMEs. I ended up migrating to FCITX just because it was the last hold out not to cave to pressure.
Insisting that there can only be one input type for the entire session rather than one per window. If you are using multiple languages which each require an IME, Gnome's interpretation is completely broken. I had to stop using IBus because of it. It is possible they have changed their mind since then (several years back), but I haven't followed it. Incidentally, it was also the thing that meant I had to stop using Gnome. Before that I was a happy Gnome Shell user :-(
Pretty much. The W window system for the V distributed system predates the Macintosh (as does the Apple Lisa/1983), and W was ported to Unix in 1983. Their immediate successors - X and the Macintosh - came out in 1984, and Windows 1.0 in 1985.
Windows 1.0 was fairly primitive, but Windows 2.0 supported overlapping windows (!) in 1987, coincidentally the year that X11 was released.
> This is so true, all commercial OSes that take i18n seriously do this,
Does Windows really?
> while most open source OSes' community (which communication revolves around English) decided that IMEs are an add-on for CJK people.
IIRC, Fedora ships an iOS/Android-like Latin-script IME with autocomplete. It's not mandatory, though.
> It's a pity, and this one reason is enough for Linux to be never adopted for ordinary users in the non-western world.
The Korean situation doesn't really generalize. Among CJK, AFAICT, the Korean IBus IME on Ubuntu 18.04 is pretty broken but the Japanese IME and the various Chinese IMEs appear to be at least OK. (It's a rather surprising situation considering that the Hangul part of a Korean IME should be much simpler than the Japanese and Chinese IMEs.)
Yes, the signs that even English passes through Windows' IME infrastructure are pretty minimal by default in Windows 10 (in Desktop mode; Tablet mode immediately turns on a couple more), but at this point Windows 10 makes almost all of it available as opt-in. Some of it is referred to as accessibility tooling from an English perspective, because IMEs are also useful for accessibility.
The "big one" IME for most English users is the Emoji Keyboard accessible with Win+. or Win+; (whichever you prefer). It's really interesting how well emoji have helped Latin script users with further understanding the complexities of Unicode, fixing old ugly bugs in Unicode handling, and even introducing some such users to an IME that they want to use (sometimes every day).
Under Settings > Devices > Typing > Hardware Keyboard you can turn on the IME "Show text suggestions as I type" even on a hardware keyboard in Windows 10, which gives you mobile-style auto-suggestions (you can also turn on mobile-style autocorrect even on a hardware keyboard).
> The "big one" IME for most English users is the Emoji Keyboard accessible with Win+. or Win+; (whichever you prefer).
No; API-wise the on-screen keyboard generates keystrokes for emoji (astral keystrokes, and multiple keystrokes for multi-scalar-value emoji) even to an IME-aware app. In contrast, the emoji picker built into the Windows 10 Pinyin IME enters emoji via the IME API.
> Under Settings > Devices > Typing > Hardware Keyboard you can turn on the IME "Show text suggestions as I type" even on a hardware keyboard in Windows 10, which gives you mobile-style auto-suggestions (you can also turn on mobile-style autocorrect even on a hardware keyboard).
The emoji keyboard may not be entirely using the IME API, but it does do some IME-like things even in English. The big thing I'm thinking of is how it works when you type English words to search the emoji. I think it still most often defaults to passing the keys along to the application as well and then replacing them via keystrokes or selection APIs, but I have seen it sometimes do the IME thing where the text you are typing is shown underlined and not sent to the application. That said, I don't recall the exact combination of app and emoji where I saw that happen, or know precisely why it would vary, well enough to reproduce it just now; maybe it was just a difference between early Insider versions of the emoji keyboard and current behavior, or something similar I'm misremembering.
That said, even if it isn't using the IME APIs directly in most cases, it's still useful as a teaching/analogy/example tool for showing English speakers what an IME can be like to use, even if it's a nice-to-have for an English writer versus a necessary tool for other languages.
On Debian 9/10, IBus works for the Korean IME (both Hangeul and Hanja input).
Windows (10 at least) has native support for these IMEs; setting them up on Windows was much easier than on Linux (which doesn't even have a standard IME). Windows also comes with basic CJK fonts, which different distros may or may not have. On Debian I have to install noto-cjk or Adobe's Source Han fonts.
I originally tried fcitx but had to switch to ibus. Definitely not a great experience.
There isn't a single IME setup experience on Linux.
In my experience, Debian is worse than Fedora, Ubuntu, and openSUSE. Fedora is better than Windows 10 when it comes to IMEs: Fedora installs the IMEs by default. Ubuntu, like Windows 10, installs IMEs when you request the addition of an IME-requiring language. OpenSUSE gives you an IME for the language you use at install time if you install openSUSE using an IME-requiring language.
I haven't tried Debian 10, but when I installed Debian 9 _in Japanese_ with a _Japanese keyboard layout_ chosen, the installer didn't bother to set up a Japanese IME!
Fedora comes with an OK set of Noto CJK fonts by default. Ubuntu comes with a minimal set of Noto CJK fonts by default, but when enabling Chinese, Japanese, or Korean, Ubuntu drops more language-appropriate fonts on the system, like Windows 10.
> For decades I've been saying that all text input should go through OS-level IMEs (input method editors).
Which is why it's infuriating when the built-in widgets of a platform don't include simple operations for obvious workflows like filtering characters, masked input, etc. That's what leads developers to have to roll their own implementation using keydown/keyup.
I can't tell you how many GUI frameworks have forced me to roll my own numeric input widget.
Wouldn't it be enough if there was an obvious event for actual text input? WPF for example has that and it's the obvious and IME-friendly choice, since of course not every keystroke results in a character (dead keys exist too, after all).
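The web analogue would be beforeinput/input plus the composition events rather than keydown. A rough TypeScript sketch of IME-friendly numeric filtering along those lines (the element id and the exact policy are just for illustration):

    // Validate the text being inserted instead of second-guessing key codes,
    // which breaks dead keys and IME composition.
    const field = document.querySelector<HTMLInputElement>("#amount")!; // hypothetical field

    let composing = false;
    field.addEventListener("compositionstart", () => { composing = true; });
    field.addEventListener("compositionend", () => { composing = false; sanitize(); });

    // beforeinput carries the text about to be inserted, whatever produced it
    // (keyboard, IME commit, paste, handwriting); keydown tells you none of that.
    field.addEventListener("beforeinput", (e) => {
      const ev = e as InputEvent;
      if (composing || ev.isComposing) return;                // never fight the IME mid-composition
      if (ev.data && /\D/.test(ev.data)) e.preventDefault();  // reject non-digit insertions
    });

    // Safety net for anything that slipped through (e.g. text committed by the IME).
    field.addEventListener("input", () => { if (!composing) sanitize(); });

    function sanitize() {
      field.value = field.value.replace(/\D+/g, "");
    }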
Totally agree. The best use case here is hooking up your keyboard to a personal database to autocomplete notes you saved, like locations, bookmarks, and so on.
I have a note-taking system that I use to store all this data, and after I set it up years ago I looked at implementing an IME to search my notes and get a link to an entry, but every platform has its own byzantine IME API and I never got anything working. I've been thinking it might be easier to just make a cross-platform program that interacts with the clipboard rather than being a true IME.
In general, one effect I'd like to see from flexible IMEs is that logging and chat could be left to separate apps. Imagine a birdwatching (or whatever) keyboard that lets you pick the birds you saw, their age and behavior, and punch it all in while you're in the field, and have it output in structured form into whatever app is handy.
So that sort of makes sense, but then I realize I am looking for all of those behaviors to be different in different contexts (such as writing code vs. writing an essay).
> we can get great benefits from a system that expands abbreviations, ... emmet-style powers
I can't agree here (except perhaps for the Unicode char replacement, where a non-English speaker would need to comment on the viability of that versus actually having a non-English keyboard, and excepting huge writing systems like Chinese where that approach is already used, AFAIK).
Writing is fundamentally a thinking process; the text entry for a touch typist is relatively quick. If you propose expanding abbreviations, how much time do you expect to save? I mean, have you actually measured it?
> runs little scripts and inserts the output
What is the purpose of this?
> gives you quick dictionary/thesaurus lookups
I use those perhaps once a week or less. If you use it say 3X a day, you'd save little time, perhaps a couple of minutes.
Dunno what emmet powers are though.
MS Word & LibreOffice do some of what you want, and the first time I install them I spend several minutes tracking down each setting and turning them off - they drive me bonkers. They think they know what I want but they don't. Touch typists can hit many keys a second and kind of pipeline their typing. Having the input modified automatically is rarely useful IMO.
Your idea may be good, but like many other ideas such as graphical programming, it doesn't work except in restricted cases. Perhaps if you measured it I'd be convinced, but I can't accept it now as an obviously great benefit.
> Writing is fundamentally a thinking process, the text entry for a touch typist is relatively quick. If you are to propose expanding abbreviations, how much time do you expect to save? I mean, actually measured it?
Not quick enough. I think faster than I type, and I type fast. It's fine until I start getting impatient with myself. Call it micro-impatience, a flash of irritation in which you're suddenly conscious of not having finished typing in the thought. It's distracting.
Honestly, I like parent's idea; a good chunk of Emacs's awesomeness, and the reason people like me use it as an operating system, is exactly that: a unified, fully configurable and extensible text-based interface. I often wish I had something like it system-wide, because standard UIs are very far from optimal, ergonomics-wise. But then again I wouldn't trust Apple or Microsoft to do it right; they'd quickly find a way to dumb it down, or restrict the extensibility in the name of security.
> the reason people like me use [Emacs] as an operating system is exactly that: a unified, fully configurable and extensible text-based interface. I often wish I had something like it system-wide...
I can totally imagine that. Underneath all the GUI layers, every operating system and application has (or has the potential of) a fully text-based interface. There's just no standard or integration, and tools that allow that (like a system-wide middleware) haven't caught on, I guess. Maybe in an alternate historical timeline, such a feature could have been a fundamental layer of an OS.
From the grandparent comment:
> a system [with powerful text input methods] that expands abbreviations, replaces easy-to-type sequences with proper Unicode chars, runs little scripts and inserts the output, gives you quick dictionary/thesaurus lookups, gives you emmet-style powers, etc., whenever you're writing and in any app.
Yes, yes - and the last point: in any app. I picture it like how TCL can script other programs, even ones that weren't designed to be "remote controlled".
Yeah, I had some vague doubts while I was writing that comment. I guess I meant "text-input based", or maybe better to say "keyboard based" with a system-wide/application-agnostic middleware of some kind.
Not that. More like a UX paradigm that's more fixed and forced on applications, but also customizable and user-programmable externally to any given application. So that e.g. you could have a system-wide autocomplete/code completion, whether you're in a code editor or a text editor or in a dialog box of some other program somewhere; that system-wide autocomplete would be configurable and trivial to extend or replace wholesale with another widget.
This is a reality within Emacs (which really is a 2D text OS running lots of applications inside, including a text editor), and being text-based does play a role. When it's very hard to draw arbitrary pixels on screen and most of all apps deal with text, it's easy to make a large set of very powerful interface tools, and it's easy to pull data out of an app and put data into it, whether the app intended it to happen or not.
In the back of my mind, I sometimes wonder how something like Emacs could be made with modern browser canvas, to enable cheap rich multimedia, while retaining the ability for inspection and user-programmability. Introducing arbitrary GUIs is hard, because next thing you know, half of the stuff is drawing to canvas directly and it's all sandboxed away from you.
> user-programmable externally to any given application
I think this is why it reminded me of TCL, specifically the "expect" command that can script apps that know nothing about it. From the Wikipedia page, the TCL Expect extension: "automates interactions with programs that expose a text terminal interface".
So how I imagine this "Emacs as an OS" paradigm you're describing, is that it mediates interactions with any and all apps that expose a text input/edit interface, to allow programmatic customizations.
Like I'd love to script my own shortcuts for Firefox (or other apps) - possibly with multiple steps, taking input from some config file, or sending a link to another app.. Or, as you mentioned, Emmet-style expansions that work in any input field or textarea..
To address just one point: in Emacs you've got dabbrev-expand (bound to M-/). I like it and use it but it is not automatic. I have to invoke it myself which means it can't get in the way.
If you want larger clumps of code then you have various options such as skeleton mode, but again that's something the user has to ensure happens - they remain in control.
> But then again I wouldn't trust Apple or Microsoft to do it right
> I have to invoke it myself which means it can't get in the way.
For the sake of completeness, you can always make it automatic. All it takes is to add a function to post-self-insert-hook, and make it e.g. call dabbrev-expand if you pressed space twice. So you can have it any way you like - manual, automatic, semi-automatic. You're in full control.
> If you want larger clumps of code then you have various options such as skeleton mode
Yes. I currently use yasnippets for code. Still, my favourite yasnippet is one I use in comments - it expands "todo" into: "TODO: | -- [my name], 2019-10-29.", and similarly for "note", "hack" and "fixme". | is where the caret ends after expansion.
That's the kind of flexibility I wish my OS had. Unfortunately, it goes against the commercial interest of mainstream OS providers.
IIUC the parent was suggesting an OS-level system that _supports_ these features natively, as the foundation layer for any number of userland tools to sit atop... vs your compelling arg for why said features must be straightforward to disable. I don't see a conflict.
True, upvoted. My point was that these facilities are of questionable value (I'd like to see how much time they really save, or indeed even lose when triggered accidentally), and that they have to be in easy control of the user. With MS there's too much "we invented it so you're getting it", and bad designers (who always outnumber good) will do the same.
Actual example: I was working in Visual Studio with another guy. Open a bracket and VS automatically added a closing bracket. That is fucking annoying; it saves you a whole keystroke while breaking muscle memory and interfering with our work. We had trouble turning that off.
This seems an awful lot like arguing that, since you are happy at the CLI, there's no need for a GUI. Widespread availability of rich OS-level IMEs wouldn't hurt anyone, and could help everyone. (Even for someone like you who wants to pass through raw hex, it's easier to tell that once to the OS rather than to have to argue with each app individually.)
I was unclear. I'm not against anything; I'm just saying let the user control it, and make sure some great idea actually works (user testing shows many great things aren't; people are complex and so are their mental models) rather than forcing it on users.
I think the point is to get native behavior, you need to use the native functionality. When you try to emulate it, you'll always be missing something.
We see this in every corner of Firefox: the text editing, the toolbar, the context menus, the form controls, etc. Yet Firefox seems to be all about writing everything from scratch. There are many bugs filed for using native functionality, like [1], that have been open for decades with no activity. 20 years ago, it wasn't done because "it's a time thing", and since then it's racked up other bugs as dependencies because stuff's just broken.
This isn't a situation where you can tweak a couple little problems and call it done. This is a fundamental change in the Firefox architecture. Asking for more bug reports is not going to help.
Firefox uses native IMEs. Most of the time they aren't broken in obvious ways. To the extent they are broken in non-obvious ways, it's unhelpful not to say how on the level that would allow the problem to be reproduced.
Firefox can't just use native text edit controls. First, Firefox needs to support contenteditable, which doesn't map to an OS-supplied text box. Second, the multiprocess architecture leads to a situation where the UI process talks to the native IME API but the Web content process hosts the text being edited, so native text boxes don't work even for things that look similar to OS-supplied text boxes.
Hi. Sent in feedback yesterday concerning addon privacy. Specifically, the option to grant or restrict addon access on per site basis.
I'd want to be able to right click an addon icon in the toolbar and click, "Don't run on this site." And have even more options in the extension detail page.
> On Linux, it’s terrifying; I’ve never seen any app that allows input systems to work naturally, and after a week of use you get used to pressing space & backspace after finishing every Hangul word. The Unix-style composability they want (apps should work whether or not input methods are used), and the fact that Linux users that use Latin characters don’t seem to use any input methods (as opposed to macOS, where Latin characters are also input through a Latin input system), mean this state will probably persist.
Living in Japan and using Linux almost all the time, I never remember having any problem with typing in Japanese whatsoever.
The Korean case is totally different from Japanese. I guess Korean is fairly unique in this area, because Korean Hangul characters consist of multiple smaller characters. Input systems need to keep state to allow typing multiple small characters to complete one character. The problems usually come from improper state handling.
Korean is different from Japanese, but as far as IME complexity goes, a Hangul IME should be extremely simple to develop compared to a Japanese IME. (The Hangul part of a Korean IME doesn't need any pop-ups like a Japanese IME does. As far as UI requirements go, the Hangul part of Korean IME can be as UIless as a Vietnamese Telex IME.) That e.g. on Ubuntu 18.04 the Korean IME is broken is not due to anything intrinsic to the writing system.
Yes, Korean (specifically Hangul) additionally suffers from NFC/NFD issues which is not experienced by Chinese or Japanese. I’ve had the privilege (/s) to work with Korean file names in the past and it was a nightmare.
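(A quick TypeScript illustration of the trap; macOS's older HFS+ filesystem famously stores filenames in a decomposed, NFD-style form, which is one common way the two spellings end up being compared against each other:)

    const nfc = "각";                  // one precomposed syllable, U+AC01
    const nfd = nfc.normalize("NFD");  // ᄀ + ᅡ + ᆨ (U+1100 U+1161 U+11A8)
    console.log(nfc === nfd);          // false: visually identical, different code points
    console.log(nfc.length, nfd.length);                         // 1 3
    console.log(nfc.normalize("NFC") === nfd.normalize("NFC"));  // true: compare after normalizing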
Hangeul is not an alphabet. It's an alphabetic syllabary. ㄱ is one character but it needs to be composed into a final glyph like 각, which requires 2+ characters. 가, 각, 갉. 뷁.
Kanji (not to be confused with Hiragana or Katakana) are different because the characters are already composed.
Yeah I had a brain-fart earlier. I totally forgot about that. Typing hangeul is basically the same except there's no need to possibly choose different hanja/kanji. ...unless you press the hanja key after the glyph is typed but before moving onto the next glyph. (Usually F9 or F10, iirc Windows IME defaults to ctrl+space).
Kanji is logographic: each symbol is a complete word, phrase, or idea. Hangul is alphabetic-syllabic: each symbol is a segment (vowel-or-consonant), except that they're written in two- or three-letter blocks each representing a syllable.
I'm not sure it's that different IME wise. I only know Chinese pinyin IME, but I assume since Kanji in contrast to Chinese can map to multiple syllables you probably need to keep the state of the last few syllables as well and then let the user choose the appropriate Kanji (if available).
With pinyin input it's the same in Chinese. You enter the Latin characters and the IME gives you options to select from. Also even abstracting from the single syllables you often can narrow down the selection of compound words in Chinese IME if you continue typing. So again state is important.
Correct, this is why I'm wondering. Input is done through hiragana such as "わたし" (or even a step further removed, transliterated from "watashi") and then the IME is triggered to convert it to "私" (or other matches).
On a laptop, my experience is that romaji -> hiragana is as-you-type due to being unambiguous, while the default trigger for hiragana -> kanji is the spacebar, same as described for Korean. Hence my confusion as to how it's different - the individual characters certainly are, but it sounds functionally identical.
On my phone just now, typing this comment, the experience was switching to a Japanese keyboard and inputting the hiragana directly, then the kanji suggestions appeared where autocorrect suggestions normally would.
Typing Hangul is just like typing Latin, Greek, Cyrillic, Hebrew, Arabic, etc., text: one alphabetic unit at a time. The only thing a Hangul IME does is it groups the typed jamo into syllables. The grouping is unambiguous, so there's no need for popups or space presses to guide the grouping.
(If there had been the kind of rendering technology that is used for Indic text today back when Korean text processing on computers started, chances are that the syllable grouping would be handled as rendering-time shaping and not as an input-time IME issue.)
Additionally, Korean IMEs have a feature to convert a word into Hanja, but it's something you need to take action to invoke as opposed to Japanese IMEs offering to convert to Kanji by default.
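(The grouping is also pure arithmetic: Unicode lays out the precomposed syllables algorithmically starting at U+AC00, so the composition step is just the standard formula. A small TypeScript sketch:)

    // Unicode Hangul syllable composition: S = 0xAC00 + (L*21 + V)*28 + T, where
    // L = leading consonant index (0..18), V = vowel index (0..20), T = trailing index (0..27, 0 = none).
    function composeSyllable(L: number, V: number, T = 0): string {
      return String.fromCodePoint(0xac00 + (L * 21 + V) * 28 + T);
    }

    console.log(composeSyllable(0, 0));     // "가" (ㄱ + ㅏ)
    console.log(composeSyllable(0, 0, 1));  // "각" (ㄱ + ㅏ + ㄱ)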
> The grouping is unambiguous, so there's no need for popups or space presses to guide the grouping.
...that's the opposite of the comments that triggered my question. Multiple people said space is needed to trigger it, then backspace to remove an erroneously-added space character.
Is the answer that they're using it wrong, and are actually inputting a space directly because the IME already acted for them?
This is interesting. Not doubting your story, but my personal experience, as a Cyrillic user, is opposite: at least in early versions of Windows apps I constantly struggled with a random mix of hardcoded assumptions for encodings, key presses, characters and data stored which often produced gibberish on screen.
When I switched to Linux everything just works out of the box: I can copy and paste text between gvim, xterm, etc. with no issues. I admit that this is likely due to app writers, not underlying OS. And my experience is only with single-byte characters. Just my 2c.
Hadn't fully registered till this comment, the degree to which the modern web is anchored to horizontal (usually left-to-right) writing and the design patterns of vertical scrolling that come with that assumption.
It's not just the web, it's all of computing, even down to the hardware design. A vertical scroll wheel is standard on all mice. A horizontal scrolling method of some sort is not, and even on mice that include one, it's usually not as good as the vertical one (e.g. leaning the wheel left and right).
> I use macOS & Linux, and while the default text-handling system on macOS, the Cocoa Text System, handles input methods well, almost all applications that implement their own, including big apps like Eclipse and Firefox, don’t get this right.
What specific IME problems do you have with Firefox on Mac?
> On Linux, it’s terrifying; I’ve never seen any app that allows input systems to work naturally, and after a week of use you get used to pressing space & backspace after finishing every Hangul word.
Do you mean you have to press space twice and erase the second space? With IBus?
There are no spaces between words in Chinese or Japanese.
Pressing space confirms the current selection in the Japanese IME, which is expected behavior. Where some Linux implementations get it wrong is they also insert a space after the word, meaning the user has to select the desired word in the IME with the space bar and then remove the erroneous inserted space.
Edit: Correction based on feedback below. Previously stated that Hangul does not have spaces.
Wrong, there are spaces between words in Korean. It’s in Japanese and Chinese that there aren’t. And in Vietnamese there are spaces between every syllable, even within words.
Not sure your comment on Vietnamese is accurate though. I work in a company with ~35% native Vietnamese speakers and I’ve seen plenty of multi-syllable words.
Are you talking about traditional Vietnamese (when it still used Chinese characters) or modern Vietnamese (post-French-colonialism) which uses the Latin alphabet with accents?
It is accurate, there are spaces between each syllable in modern written Vietnamese, except in foreign words. The syllables can have as many as 7 characters, and you need an IME to type the tone marks. The written language looks like this: https://vi.wikipedia.org/wiki/Vi%E1%BB%87t_Nam
That was not my point. I mentioned Vietnamese because the spacing it uses is interesting.
Also like dfcowell said, Vietnamese used to be written with Hán and chữ Nôm characters (respectively Chinese characters and Chinese-style characters created by the Vietnamese), a lot of which are encoded in Unicode. Hence the existence of the CJKV acronym.
It is actually CJKV to deal with historical Vietnamese.
From Unicode spec:
> Although the term “CJK”—Chinese, Japanese, and Korean is used throughout this text to describe the languages that currently use Han ideographic characters, it should be noted that earlier Vietnamese writing systems were based on Han ideographs. Consequently, the term “CJKV” would be more accurate in a historical sense. Han ideographs are still used for historical, religious, and pedagogical purposes in Vietnam.
Are you using IBus? I'm on Debian and once a character is finished, it automatically moves on to the next character (unless you press space, enter, etc.). All I have to do is ctrl+space to switch input methods.
I've had issues with terminals, like character composition not working in Alacritty. The only major annoyance I've found is having to install and configure ibus/ibus-daemon and CJK fonts.
> On Linux, it’s terrifying; I’ve never seen any app that allows input systems to work naturally
Kind of interested to hear what sort of input method you're using. Even my terminal emulator supports Japanese input methods well, through IBus. Maybe it's just that, as a non-native, I'm more often going to convert chunk by chunk anyway; I have noticed that some input methods do not do bulk conversion well. I must say I never thought Hangeul would even warrant conversion other than composing the syllables -- is it because you mix in Hanja conversion? I think your problems are mostly a matter of the quality of the input method; IBus and the input method system in GTK+ at least don't seem to be preventing anyone from writing better input methods.
I feel it with Firefox though; then again, Firefox is very poor quality software in my experience -- almost everything is at least subtly wrong, and there seems to be more interest in niche feature work than basic work on product quality. I could in some ways say the same for Eclipse: every time an SDK I want to use is only documented in terms of their special Eclipse frontend, I get a bit depressed.
> Linux users that use Latin characters don’t seem to use any input methods
Right, I prefer slim systems and I typically uninstall everything input method related that my distro has chosen to preinstall. I cannot read or memorize a single CJK character, so why would I need that.
Additionally, programming and computers for me mean English, although that is not my mother tongue. I would never install anything in my mother tongue.
Basically I use my mother tongue (and a couple of other European languages I speak) only in Email, chat or maybe some web form. I can feel your pain though, because 10+ years ago we had the same problem with the couple of non-ASCII characters you need in most European languages.
In order to have the situation in Linux improve there just need to be enough CJK contributors to fix existing bugs. And reviewers / unit test cases to make sure we Westerners don't break it again with our next commit.
> Right, I prefer slim systems and I typically uninstall everything input method related that my distro has chosen to preinstall. I cannot read or memorize a single CJK character, so why would I need that.
Yes, I'm exactly talking about this mindset.
This is basically why Linux has such poor input method support.
Because English has the special privilege of not needing an input method, combined with the fact that the majority of Linux application programmers use English only, basically all apps that don't take i18n seriously are 'wrong' by default, as opposed to apps running on Windows/macOS, which are 'right' by default.
> Basically I use my mother tongue (and a couple of other European languages I speak) only in Email, chat or maybe some web form.
Does that mean European languages can be input without special input methods?
> I can feel your pain though, because 10+ years ago we had the same problem with the couple of non-ASCII characters you need in most European languages.
The non-ASCII characters fit in the character-array model that most Western people think in, and as a plus they fit in the upper half of an 8-bit character set.
CJK languages require a different model from the Western ones.
> In order to have the situation in Linux improve there just need to be enough CJK contributors to fix existing bugs.
It's a losing fight. That only works in an ideal world where every program has enough contributors. That's not the case.
>> Right, I prefer slim systems and I typically uninstall everything input method related that my distro has chosen to preinstall. I cannot read or memorize a single CJK character, so why would I need that.
>Yes, I'm exactly talking about this mindset. This is basically why Linux has such poor input method support.
Don't get me wrong. The fact that input methods are useless for me, because I know zero CJK characters, does not mean I think they are useless for everybody or for Linux in general. How would I help CJK users by having something installed that I never use and have no knowledge to use?
> I mean, there is a reason why Windows & macOS both select a similar architecture for text input
Yes, there is a reason. Microsoft and Apple want to make money in CJK countries. And they have architects that make system-wide decisions.
That is not how Linux works. Companies contribute where their business is; that is servers or embedded. After Canonical closed bug number 1 https://bugs.launchpad.net/ubuntu/+bug/1 no big player is interested in Linux on the desktop anymore. Individuals contribute what they are interested in and what they know best. I fear most Westerners don't understand the challenges of CJK and other more "complicated" scripts. I myself "blame" Americans if they do it wrong and something accepts only 7-bit ASCII. I can fully understand if CJK or right-to-left people blame us "8 bit people" (character set, not coding) for doing it wrong. We just don't get it, that's a fact. But I don't think studying Korean etc. is a realistic solution. The only way to change it is to have more people and companies contribute that a) need it and b) really understand the user needs.
> Does that mean European languages can be input without special input methods
I cannot talk about East European languages or really small languages. But for the bigger Western and Central European languages the answer is yes.
Every character is either on the national keyboard or (if typing another language) can be typed using a dead-key accent or AltGr. Sometimes the Compose key is needed, but I need it so rarely that I forget the combinations...
>Additionally, programming and computers for me mean English...
Linux isn't just for programming. I personally hate that I have to dual boot because Ubuntu can't do what Windows or macOS can. Anyhow, English has plenty of words containing non-keyboard characters; they're just infrequent.
This post reminds me of one of Jon Skeet's seminal SO answers[1] on subtracting dates. Things we take for granted are hard. Dates are hard, text is hard, graphics are even harder. We stand on the shoulders of giants. It's important to stop and smell the roses every now and then.
> Things we take for granted are hard. Dates are hard, text is hard, graphics are even harder. We stand on the shoulders of giants.
More like we stand on piles of broken shit. A lot of "hard" things in programming have very little fundamental complexity. They are hard because at some point in time someone made a compromise that no longer is relevant or makes any sense. It is possible to redesign those things to be simple, but modern developers are conditioned to fight this at every step. The idea that the first thing you should try to do when faced with extraordinary complexity is to bypass it is completely alien to people whose daily jobs consist of continuously banging their heads against convoluted frameworks and deficient APIs.
Time handling is a great example. Until you get into general relativity, time itself is extremely simple and uniform. But human representations of time are extremely convoluted, because historically they are based on astronomy. Here is the kicker: 99.99% of all software has nothing to do with astronomy, and yet people chose to inject those complexities in core structures that track time. (For example, Unix timestamps have leap seconds, which is a source of innumerable bugs and ridiculous convolutions in code to prevent those bugs.)
If by 'based on astronomy' you mean 'based on the sun' because the sun is a star, then I agree. But, plenty of software has lots to do with the sun because they have to interact with people, and people like to be synchronized with the sun.
If you don't want software to interact with humans, or you only care about duration rather than human datetime, then go ahead and use UT1. However, as long as you're interacting with people, you're going to need to fix people first. You should start with these primers if you're going to fix how humans interact with time:
> Until you get into general relativity, time itself is extremely simple and uniform.
Just a little niggle, but I think you meant to say special relativity; special relativity is about the relationship between space and time, whereas general relativity (I believe) is more about gravitation and other things that I definitely don't understand.
I remember trying to debug this exact “issue” in a java project years ago. I poked around the library to figure out what was going on, which only raised more questions, before I eventually stumbled upon that exact SO thread. Turns out dates have an enormous amount of complexity.
Date and time are complex because they are social and historical constructs. Everything about them is arbitrary and exists because of arcane historical reasons. No regular system of organizing time maps cleanly onto the movement of heavenly bodies because those movements are not regular (hence: leap seconds). You can describe time in a way that is precise, or in a way that is intuitive, but not both, and woe to anyone trying to convert between the representations.
Time zones are the worst because they are political tools, changed for political reasons, often with very little lead time. I think of them as a one way function mapping universal time onto the local political reality.
It's easy for me to write something in Russian: I just hire a Russian translator to do it for me.
That is, "convert time into time in seconds since the epoch" is the hard part. All you've done is make it someone else's problem, while the question Skeet answered asked specifically about the time difference between 1927-12-31 23:54:08 and 1927-12-31 23:54:09 in Shanghai.
Yes. This is beautifully expressed. And 'hiring a Russian interpreter' is almost always the right way to fix this kind of problem. POSIX would seem to me to be the Russian interpreter of choice here. Is Java a language which has no word for the colour green, in the sense where the eponymous Russian interpreter has to say "concept which cannot adequately be expressed in Russian" or something?
The POSIX spec exists for this reason. This feels like a variation of "omg UTF-8 is hard", which is why people embedded UTF handling in the language (Python 2 -> Python 3).
It seems less than graceful to respond to articles which point out the existence and difficulty of problems by saying it's fixed. The entire point is to show how it's more difficult than many think, not elide the issues.
What does POSIX have to do with anything? Eg, the tz database isn't POSIX.
The POSIX spec of date and time specifies conversions into a neutral form. Maybe it's C-centric thinking, but it felt like a problem which has existed and been solved by writing a 'russian interpreter' solution: convert the values at the edge to a canonical form and use functions which manipulate the canonical form. I only mention the TZ files because, within the limits of modern epoch times and dates, they do a pretty good job of handling the minutiae of what a local time means in any given period.
I'm sorry if I came across as churlish. Do you feel this is a problem of concept in the abstract, or in Java, or in a Java implementation, or what? I feel it's a problem which I solve by standing on the shoulders of smarter people and using the code they wrote. I don't reimplement the wheel because I distrust my own wheelmaking skills.
I'm sure there are heaps of non-trivial corner cases I don't have to deal with and I do not mean to minimise work done to explain a problem.
The problem would be, I think, when your business logic is concerned with some subtle aspect of Russian, you translate into canonical English and the detail you're interested in disappears.
When it comes to timestamps, how do you handle midnight for example? 24:00 today represents the same instant as 00:00 tomorrow. The difference might matter to the user but disappears when converted to Unix timestamps.
> I think it was in Godel, Escher, Bach where he talks about the problem of translating a line from Russian, along the line of "He lived on B____ street." It's possible from the story, which takes place in a real city, to figure out which street that was. Let's say it was "Main Street". Does the English translation keep the original Russian word and initial? Does it translate to "Main Street" and replace the "B" with an "M"?
You can use a Russian-English translator to get a translation into English. But that doesn't help understand the hard problems involved, which was the point of Hofstadter's essay. (See, for example, his "The Shallowness of Google Translate" in The Atlantic; HN comments at https://news.ycombinator.com/item?id=16267363 and in dupes.)
This reminds me of Milan Kundera's "Testaments Betrayed", which expresses the profound challenge (near impossibility) of translations doing justice to original works.
> I always convert time into time in seconds since the epoch
You should still be careful how you define a _second_. A POSIX second is not the same as an SI second. A POSIX second corresponds to 1/(24 * 60 * 60) of a day. An SI second is defined using atomic time measurement. It means that the duration of a POSIX second depends on the length of the day, and some days are longer than others due to leap seconds.
The worst part is that most tools rely on POSIX timestamps instead of TAI timestamps (based on the SI definition). One of the consequences of using POSIX timestamps is that taking the difference between two timestamps does not safely return the number of elapsed seconds.
> A POSIX second corresponds to 1/(24 * 60 * 60) of a day. An SI second is defined using atomic time measurement.
I don't think that is quite accurate. Traditionally Unix Time seconds are SI seconds, that tick in sync with UTC and atomic clocks (modulo clock error). To account for leap seconds, Unix Time has traditionally been discontinuous when a leap second occurs. Each second that ticks is still an SI second, but the sequence of Unix Time around a leap second will be X-2, X-1, X, X, X+1, X+2, etc.
This discontinuity means that every Unix Time day has exactly 86,400 seconds cumulatively, even though some days tick 86,401 elapsed SI seconds. So it is true that computing a difference between two Unix Time values does not give you a reliable measure of how many SI seconds have elapsed, as you say. But the individual seconds are still all SI seconds.
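For instance, with ordinary Date arithmetic (which follows this leap-second-free count), a day that actually contained a leap second still measures 86,400 seconds; a tiny TypeScript illustration:

    // A leap second was inserted at 2016-12-31T23:59:60Z, so 86,401 SI seconds elapsed
    // between these two instants, but the Unix-time difference hides it.
    const start = Date.UTC(2016, 11, 31);  // 2016-12-31T00:00:00Z (months are 0-based)
    const end = Date.UTC(2017, 0, 1);      // 2017-01-01T00:00:00Z
    console.log((end - start) / 1000);     // 86400, not 86401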
More recently it has become popular to smear the leap second instead. Google pioneered this approach in 2008 (https://googleblog.blogspot.com/2011/09/time-technology-and-...) and now advocates making it standard (https://developers.google.com/time/smear). This approach does vary the actual length of a second during the 24 hours surrounding a leap second. But this doesn't come from POSIX, it's a departure from previous practice that is designed to avoid a discontinuity.
Yes, I definitely agree that the reality is even more complicated and there are different implementations. The recent time-smearing approach is closer to the POSIX standard [0]:
> How any changes to the value of seconds since the Epoch are made to align to a desired relationship with the current actual time is implementation-defined. As represented in seconds since the Epoch, each and every day shall be accounted for by exactly 86400 seconds.
Another fact that I find interesting is that the use of discontinuities or time-smearing around leap seconds is actually a modern approach. Nowadays, POSIX and TAI differ by an integer number of seconds on regular days, and by a fractional amount on a smeared leap-second day. At the beginning, the leap second duration between POSIX and TAI was smeared over years from one leap second to the other. So the difference between both timestamps would change continually over these periods!
My main takeaway is that POSIX timestamps are based on calendar days while TAI timestamps are based on physics constants. POSIX timestamps are a tradeoff: they're not suitable for time computations (durations, or even uniquely identifying an instant when discontinuities are involved) but they don't require a leap seconds lookup-table to display time for humans... in UTC... Which means you still need lookup tables for timezones, countries, DST, calendar changes, etc. Basically my opinion is that POSIX time is neither the best to precisely keep track of time nor to format it for humans. It just sits in between these two use cases because it was good enough at the time and then too hard to change.
> At the beginning, the leap second duration between POSIX and TAI was smeared over years from one leap second to the other. So the difference between both timestamps would change continually over these periods!
Do you have any reference to a system that was implemented this way? I have never heard of such a thing and I don't see how it could be implemented.
The insertion of leap seconds is unpredictable and occurs when the International Earth Rotation and Reference Systems Service determines that the difference between UTC and UT1 are approaching 0.9 seconds. Leap seconds are generally only decided six months or so before they occur. So you don't know at the beginning of the year how many leap seconds the year will have. So it doesn't seem possible to smear over a year.
I remember reading about it on Wikipedia some time ago. I looked it up again and I was wrong: this was not applied to POSIX but to UTC itself. The arrival of international networks seems to actually be the reason why they stopped smearing it continually and switched to a linear version with discontinuities. [0]
> UTC is a discontinuous time scale. From its beginning in 1961 through December 1971 the adjustments were made regularly in fractional leap seconds so that UTC approximated UT2. Afterwards these adjustments were made only in whole seconds to approximate UT1. This was a compromise arrangement in order to enable a publicly broadcast time scale; the post-1971 more linear transformation of the BIH's atomic time meant that the time scale would be more stable and easier to synchronize internationally.
[1]: McCarthy, Dennis D.; Seidelmann, Kenneth P. (2009). Time: From Earth Rotation to Atomic Physics. Weinheim: Wiley-VCH Verlag GmbH & Co. KGaA. ISBN 978-3-527-40780-4.
Obviously you would set TZ before up-converting. I spoke in UTC about an input time I specified in UTC. POSIX date-time models understand timezones and, on well configured systems, understand prior states of time zone behaviour at boundaries like summer time, going back into historical past states. The 1990s USENET discussions did at one point veer off into the rotational speed of the earth millions of years ago; I'm not sure many people cared, but astronomers maybe have to. (I don't think dinosaurs used summer time.)
Historically correct timezone handling is one thing, but you also have to consider future timezone changes unknown at this time. For example some EU countries might decide to cancel daylight savings. Now your conversion from UTC is off by an hour.
How much is poor text input eroding the use of non-latin script worldwide? I.e. how many just give up instead of using poor input?
As an example: I use a Swedish keyboard. If I'm in a situation where ö is missing from the input (because the app is stuck in an en-US keyboard layout say, so I have to hit a modifier ¨+o to type an ö) then there is a very good chance my communication with my colleagues would just naturally be in english instead. I'd sigh and just give up using it. Rather than using an o for my ö I'd just type it in english instead.
Is this a thing in e.g. Asia, Israel, or the Arab world? Do kids that speak English communicate more in English in cases where the input doesn't let them communicate easily using their preferred script? Are there new "hybrid" languages popping up in electronic communication where languages that use non-Latin script are written in Latin script in e.g. text messages? (You could argue that emoji is just that but the other way around, I suppose.)
It's the same for me when writing Swiss-style German. It's not only cumbersome to type special characters like ö on a cell phone; spell-checking is also a pain in the ass. For example, spell-checkers are not aware that in Switzerland we do not use the ugly German ß and just write ss instead, so I get false corrections all the time.

Furthermore, spell-checkers usually seem to be designed for languages like English with only a limited number of inflections. English only has singular and plural, but in German the ending of a word can be bent in many more ways, making it rather unlikely that the spell-checker suggests the one you want. I suspect that most spell-checkers do not know that "car" and "cars" are the same word in different forms; instead, they add each form to the dictionary individually. This works well for English, but for many other languages awareness of this would be helpful.

Another detail is that expressions that are short in one language can be long in another and vice versa, so many direct translations of English words are much longer in German, taking more space on buttons and other UI elements and thereby screwing up the layout.
So yes, even when "localized", hardware and software is usually not fully adjusted to the language. And for German, which is relatively close to English, the problems are probably relatively small. I cannot even fathom how big all these issues for more distant languages must be.
And to answer your questions: yes, these pain points lead me to sometimes prefer English and to avoid special characters. I even avoid spaces in file names, as too many programs I worked with in the past struggle with that.
> Furthermore, spell-checkers usually seem to be designed for languages like English with only a limited number of inflections. English only has singular and plural, but in German the ending of a word can be bent in many more ways
Oh, and I'd say German still works great. In Finnish, words have many hundreds of different endings (several concatenated endings plus combinatorics explain how this is possible) and predictive input just doesn't work. It could probably be improved with massive semantic support; I am not an expert in that field.
For Hungarian, people stuck with an English layout usually just leave off the diacritical marks (áéíóöőúüű -> aeiooouuu). While this theoretically leaves some of the meaning ambiguous, and pedants can craft examples that may be ambiguous even with context, it works well enough in practice.
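(Incidentally, that stripping is a one-liner once you decompose the characters; a quick TypeScript sketch:)

    // NFD splits á, é, ő, ... into a base letter plus combining marks;
    // dropping the marks leaves the plain ASCII spelling people type on an English layout.
    const strip = (s: string) => s.normalize("NFD").replace(/\p{Mn}/gu, "");
    console.log(strip("áéíóöőúüű"));  // "aeiooouuu"
    console.log(strip("Győr"));       // "Gyor"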
Switching to English is way overblown a reaction. Two Hungarians chatting in English (unless there are non-Hungarian speakers involved) seems extremely weird to me. It may be partially that English is really foreign for us, while it's pretty close linguistically to Swedish, both being Germanic.
Hungarian is really foreign to EVERYONE :)
Me switching to English is in the context of work, not chatting with peers in my free time. At work in tech, everyone* (few exceptions) communicates in English (emails, bug reports etc.), also between Swedish colleagues, so it's very natural.
I wouldn't send a text message to my wife in English if I happened to struggle with the diacritics on the device I'm on.
> Switching to English is way overblown a reaction. Two Hungarians chatting in English (unless there are non-Hungarian speakers involved) seems extremely weird to me. It may be partially that English is really foreign for us, while it's pretty close linguistically to Swedish, both being Germanic.
For what it's worth, I have a Swedish friend who almost always posts in English, even when he's talking to his family and other people he knows in real life. He'll switch to Swedish occasionally (and Facebook's autotranslation is very understandable), but 90% of the time he uses English.
Yeah, in Swedish leaving them off doesn't work. The diacritics aren't accents; they are distinct letters. An ö is as different from o as u is from e.
Same in Hungarian. One funny example is "főkábel" (fő+kábel, main cable) vs. "fókabél" (fóka+bél, seal intestine). Still, the intended meaning is almost always easy to guess.
Hungarian is also redundant enough that you can even replace all vowels with just one and still be understandable. Most of the meaning is carried by the consonants. E.g. "Szia én vagyok Péter, hogy vagy?" -> Szii, in vigyik Pitir, higy vigy. (sounds obviously wrong, but very understandable) Retaining the vowels but collapsing all consonants to one would be more destructive to the meaning.
Uh, what do you mean? Leaving them off and letting the reader guess the intended word works pretty well in practice and is what many organisations in Sweden resorted to when getting a domain name (ex: riksgälden, åhlens, företagarna).
I can only say about CJK with an example: the web version of Twitter has frequently omitted the last character of CJK messages yet to be committed, and people seem to complain a lot but have adapted somehow. I think this is partly why it doesn't get fixed in time---it is frustrating but not a blocker. [1]
[1] And when the affected user base is small enough, even larger problem can take a lot to fix. I use a third-party IME that is relatively known among programmers, and early this year Google Chrome began to crash seemingly at random when you type Hangul. I tracked the root cause and I believe it is ultimately Chrome's fault, but I was too busy to file an issue at that time and the IME has adapted to Chrome to avoid triggering the crash next month. I believe Chrome still crashes when older versions of the IME are in use.
> I use a third-party IME that is relatively known among programmers, and early this year Google Chrome began to crash seemingly at random when you type Hangul
I'm aware that third-party Chinese and Japanese IMEs are a thing, but what user-facing difference does the third-party IME you mention provide in the Hangul context?
- Support for less-popular or customized keyboard layouts. From time to time, popular OSes have even shipped a wrong version of a supported layout.
- Deterministic autocorrection (sketched below). For example, normally an initial jamo (e.g. ㄱ) should be followed by a medial jamo (e.g. ㅏ) and not vice versa, but many IMEs offer a feature that automatically swaps them. With a carefully designed layout this can catch lots of transposition typos.
- Custom candidates for special characters, as Korean IMEs show special characters as candidates when a jamo is being "converted" to Hanja (popularized by MS IME).
The IME in question [1] also allows an extremely customizable input system, to the extent that it forms a sort of wholesale DSL for Hangul IMEs.
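For the curious, the swap rule from the autocorrection bullet above is roughly this (a toy sketch in Python, assuming keystrokes arrive as Hangul compatibility jamo and ignoring everything a real composition engine has to track):

    # Toy illustration of the "swap transposed jamo" idea described above.
    # Assumption: keystrokes arrive as Hangul compatibility jamo (U+3131..U+3163).

    def is_consonant(ch: str) -> bool:
        return '\u3131' <= ch <= '\u314e'   # ㄱ..ㅎ

    def is_vowel(ch: str) -> bool:
        return '\u314f' <= ch <= '\u3163'   # ㅏ..ㅣ

    def fix_transpositions(jamo: str) -> str:
        out = list(jamo)
        for i in range(len(out) - 1):
            follows_consonant = i > 0 and is_consonant(out[i - 1])
            # A syllable can't start with a vowel; a lone vowel immediately
            # followed by a consonant is most likely a transposition, so swap.
            if is_vowel(out[i]) and not follows_consonant and is_consonant(out[i + 1]):
                out[i], out[i + 1] = out[i + 1], out[i]
        return ''.join(out)

    print(fix_transpositions('\u314f\u3131'))  # 'ㅏㄱ' -> 'ㄱㅏ'

A real IME does this on keystrokes inside its composition buffer and knows the exact layout, so it can be far more precise; the sketch only shows the shape of the rule.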
In Israel I've found that Hebrew speakers exclusively write in Hebrew. Aside from letters, there's nothing else needed. One CAN add vowels, but only children's books have vowels so we're used to reading without them.
I can speak a little on Chinese and Japanese. In neither of these countries is English (or even Latin-script) literacy high enough to replace the domestic language. (And anyway, maybe it's just me, but reading romanized Chinese or Japanese is exhausting.)
With China, the market's so big that there's an entire ecosystem of homegrown (or Chinese-modified) software and websites, everything from office suites to web browsers, and I have to imagine those support Chinese input just fine.
With Japanese, generally the same software and sites are used, but Japanese support is also pretty good across common software.
Some niche or open source software struggles with IMEs in general, but in those cases the solution is just to not use that software.
Not that this addresses your main question, but for most Latin-script languages you'll be fine with always using the altgr-intl variant of en-us.
As for mobile, you can generally find the language-specific ones when pressing and holding some character when you're on a normal qwerty keyboard.
For Scandinavian:
å: altgr + w
ä: altgr + q
ö: altgr + p
æ: altgr + z
ø: altgr + l
Should be available on all OSs. I never have to switch layouts when switching between languages. Still struggling with finding an actually nice IME setup for Japanese and Chinese, though, and I've spent some time...
On Linux: setxkbmap -layout us -variant altgr-intl
Regardless, I really think input methods should be kept out of scope for all apps. There's nothing I hate more than app developers trying to "solve" it for me. There are already people who work full-time on this and users who have their systems set up as they want. There's no way an app developer is going to help more users than they break things for.
The only exception I've seen is Google's web CJK input for Translate, but they fall under the "huge amount of resources" category.
Context: I write in Swedish and Japanese on a daily basis, spent some time learning Chinese and sometimes have to retype things from other languages. Using vanilla iOS and Android on mobile, Linux on desktop. CJK has been the only struggle so far. There's enough tech-literate people with those native languages that there definitely are ways to get good configuration for any OS, for Linux you just have to figure out how, which can be tricky to find if you don't read the language.
I don't even have an AltGr button on my keyboard (!). It would be very cumbersome to write Swedish with AltGr anyway, as those characters are quite common.
Not even a right Alt key - should be the same if you set the layout..? That's a really minimal keyboard, all 60% keyboards I've seen and even some 40% keyboards have that... I'd actually be curious to see your bastard keyboard!
YMMV, but I got used to it pretty quickly and I type Swedish all the time, way less annoying than doing the Alt-shift dance to switch layouts whenever I go from coding to chatting at least ;)
I always have an AltGr key, because I tend to program on Finnish keyboards. (Identical layout to Swedish, but because nobody came up with a term to group the two together, there always seem to be two options to choose from that make no difference...) What I really hate about AltGr / this layout is the curly braces and square brackets. It makes programming so much harder to need AltGr for each of them.
That said, in Linux/X11 you could configure everything to your liking. I must admit I prefer complaining...
Yeah, this is exactly the reason I stopped using the Swedish/Finnish layout altogether. It must be hundreds or even thousands of times that I accidentally typed :( instead of :) in IMs because someone at some point decided the parentheses had to be shifted one step...
Right-alting for åäö is just way less cumbersome than doing the same for @${}[]\~| or having to switch layouts when switching contexts.
What are the cases when your keyboard is stuck in English?
In Ukraine, no, we don’t. Kids don’t generally know English enough that they would want to speak it with peers, and there are simply no reasons to do so.
“Translit” (writing Ukrainian/Russian with English characters) was a thing in SMS days, when typing in Cyrillic would shrink your SMS length in half (because that’s how SMS encoding worked, I guess), but now practically nobody uses it anymore.
> “Translit” (writing Ukrainian/Russian with English characters) was a thing in SMS days, when typing in Cyrillic would shrink your SMS length in half (because that’s how SMS encoding worked, I guess), but now practically nobody uses it anymore.
Ok, that's exactly the type of thing I was thinking about. Funny that it was born from the SMS length limit rather than from actual input difficulty, though.
A dude is still sending me messages in translit over the web every time he's in foreign lands, even though adding a language to the OS's input-switching set takes a few clicks these days.
But back in the 90s to early 2000s, translit was pretty big on the internets due to the dozen different Cyrillic code pages and some software supporting only 7-bit characters (specifically Fidonet exchange software and newsgroup nodes).
Just the fact that I still run into mis-encoded diacritics in movie subtitles to this day shows how ‘US first’ ASCII caused a lot of headaches for many years.
(BTW: you probably have been told by now, but your nickname is a common slang word in Russia.)
In India, it was definitely born because of input difficulty. It's such a big thing that it has given rise to Hinglish (mixing Hindi and English), and translit is the only way 99% of people type here.
As a Swede I use a US keyboard layout and a compose key as you've described. alt gr followed by o and " for example. It was cumbersome to write Swedish fluently for a few days, but after that it felt natural. A good boon is that now I also don't have a problem typing a bunch of characters common in other Latin-like scripts like ç, ø or ß.
That said, I type mostly in English anyway. My reason for using a US layout is that it is more ergonomic for bracket/brace/semicolon heavy programming languages. Braces in particular are an annoying chord with Swedish keyboard layouts. Especially on OS X, which for whatever reason uses three part chords (alt+shift+8 or 9).
> Are there new "hybrid" languages popping up in electronic communication where languages that use non-latin script are written in latin in e.g. text messages?
Yeah, transliterating Arabic to English is a thing:
> With smartphones, the relationship between informal, chatty Arabic and formal written forms has become more complicated, giving rise to a hybrid known as “Arabeezy” – or Arabic written with Latin characters and numbers to represent letters that have no English equivalent.
In Sweden and Finland people typically prefer to omit the diaeresis without adding a trailing e. Due to Germany getting their way, though, the machine-readable part of passports uses the German-style conversion to ASCII.
> Do kids that speak english communicate more in english in cases where the input doesn't let them communicate easily using their preferred script?
As the input methods for CJK languages are significantly different from English input (compared to the relatively small difference for a language like Swedish), every app just defers to the locale's input method.
Nobody tries to communicate with only the English keyboard, because that is outright impossible (while in Swedish it's possible in the worst case).
As I said, I wouldn't write Swedish without a Swedish keyboard, even though it's quite possible (just three missing characters, basically). What I'd do is switch to English instead.
> What I'd do is switch to English instead.
Hmm, I've never encountered a situation where only an English keyboard is available for communication (except when someone is setting up a new Linux machine).
But even if there were such a situation, I don't think that would happen. I'm not sure how similar Swedish is to English, but for us English is a cognitive load; even with people who are proficient in English, I cannot imagine myself communicating in English.
I do remember resorting to online keyboards, though (while setting up Linux).
In industry in Sweden, especially in tech, English is the language used. If you have a team of 10 Swedish developers working on an app, they will almost certainly write all user stories/bug reports/specs/code comments etc. in 100% English. The 11th person to join might be an English speaker, so you can't afford to have your backlog or code in Swedish. The step to using English for email/chat is then very small.
> Do kids that speak english communicate more in english in cases where the input doesn't let them communicate easily using their preferred script? Are there new "hybrid" languages popping up in electronic communication where languages that use non-latin script are written in latin in e.g. text messages?
Both, in my part of India. The latter used to be much more common in the feature-phone era, writing in Tanglish (Tamil words+English script), but seems to have become much less popular now. Not sure why, perhaps swype-type keyboards have made English so much easier to type.
The former certainly still happens a lot - I know a bunch of people (including myself) who would prefer to communicate in Tamil, but there's enough friction (not having a Tamil keyboard, having to get used to new keyboard layouts, the Tamil keyboard being buggy often enough, ...) to make it more appealing to go with English for normal conversation.
AFAICT, a Swedish keyboard has a dedicated key for ö. You just press it. I don't understand how an app can get stuck on keyboard layout if you set the system keyboard layout to Swedish. Could you give an example?
Or are you actually using a US keyboard with a Swedish layout, and there is no ö key on it?
There are multiple things in play here: the physical keyboard, which is slightly different for a Swedish keyboard https://en.wikipedia.org/wiki/QWERTY#Swedish and of course the keyboard layout setting, which maps the key pressed to a character. You can mix and match those.
Lacking the Ö key doesn't matter so long as pressing the key where the Ö key should be actually produces an Ö (which is the case e.g. with a US keyboard but a Swedish layout chosen in Windows).
> Could you give an example?
It was a hypothetical, but in Windows you have a per-app layout (you can choose multiple keyboard layouts in Windows and an icon in the systray lets you pick the active one), and it often spuriously switches, so suddenly you hit the Ö key and it produces a ';'.
A better example is perhaps when I'm travelling and composing an email from someone else's computer at my office in Britain back to my office in Sweden. I'm not going to bother switching the Windows settings just to write that one email on that machine.
Or if I'm logging on to a VM with only an English layout, so hitting my Ö key produces a ';'.
If I needed to type anything inside that VM desktop for whatever reason, I wouldn't bother trying to write Swedish even if the person I was typing to was Swedish.
> Or are you actually using a US keyboard with a Swedish layout, and there is no ö key on it?
Personally I actually do use a US keyboard with a Swedish layout, which works fine. If I hit the key labelled ';' it will show an Ö on screen. The reason is I wanted a specific type of keyboard that didn't exist with a Swedish layout, and the layouts are almost equal. It's just the one key between left Shift and Z that doesn't exist in the US layout.
I can second that particular grief in Windows. I use English input for most things but Spanish for a few; it will randomly switch. This is doubly aggravating when I am in, say, vim and it messes everything up before I figure out why things aren't responding. Or, I go to put in a `[`, get nothing, press again and get `''`, or press a key and get `á`. Let alone if I'm trying to type programming-related chars in my Spanish keymap (taking notes in markup syntax etc.); I then have to remember the modified positions of keys (parentheses are the worst). Dealing with essentially all non-char keys getting hijacked is very confusing; just writing this, I have made several mistakes.
Honestly, English is uniquely suited to computer input. A narrowly defined charset maps well to a situation where you can only fit so many keys on a keyboard or on a screen. I also sympathize with the difficulties of implementing languages that have ligatures, right-to-left reading, or other significant differences. Non-discrete characters just don't map well to a world of ones and zeroes, because representing every combination is the simplest option but has O(n!) complexity in such languages (assuming you can combine everything, which is probably not possible, but you get the point). I have a great deal of sympathy for those maintaining such complexity for what can be a very small part of your user base (that is, users who can communicate only in one, foreign, language).
I assume this is about having to use other people’s devices, which aren’t configured how you’d configure your own.
I personally have muscle-memory for my own programming-optimised adaptation of Type 2 Dvorak, regular German layout, and UK/US/International variants of English keyboards, each in Mac and PC variants. And, I suppose, in mobile phone and tablet screen-keyboard variants.
It always takes a few seconds to adapt to whatever I’m sitting in front of, and I have a lot of practice with this compared to the average person.
At least in German, there are standard substitutions for when umlauts and sharp S aren’t available for technical reasons:
ä -> ae
ö -> oe
ü -> ue
ß -> ss (historically “sz“, which you still occasionally see, particularly where there is Hungarian influence; Swiss German tends to forego ß altogether)
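If you ever need to apply these substitutions mechanically (say, for a filename or a system that only accepts ASCII), it is a plain character-to-string mapping; a minimal sketch:

    # Standard German ASCII fallbacks, as listed above (plus the uppercase forms).
    FALLBACK = {
        'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
        'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue',
        'ß': 'ss',
    }

    def to_ascii_german(text: str) -> str:
        return ''.join(FALLBACK.get(ch, ch) for ch in text)

    print(to_ascii_german('Straße in Köln'))  # 'Strasse in Koeln'

Note that, as pointed out elsewhere in the thread, Swedish and Finnish conventionally just drop the dots instead, so the table is language-specific.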
It happens to my wife's German USB keyboard in Windows 10. One day it's working fine, the next day pressing Z outputs a Y and Ä outputs a semicolon. I've had tons of problems with Japanese input in Linux and Mac (but not in Windows).
Multilanguage support is a function of how many customers demand it, and aside from English and Japanese, there hasn't been enough economic incentive to take it seriously.
The keyboard sends a scancode, and something, either the text input layer of the OS or the app itself, translates that to an ö. Second, it is possible that the system just does not load the correct keyboard layout and defaults to US, so you lose your familiar keyboard layout precisely when you already have other problems.
Parts of the system hardware might also do scancode-to-text transcoding. When PS/2-to-USB keyboard converter cables were common, I saw quite a few having trouble with keys not being correct for non-ASCII characters, especially when using AltGr.
There is no text encoding in the hardware. The problem stems from the Microsoft-centric design of USB modifiers: there is no true AltGr code, so the convention (at the OS level) is that left Alt is always Alt and right Alt is sometimes AltGr. A few early converters didn't grasp that the left and right versions of the same modifier key could have different meanings, because nobody in their right mind would do that.
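For reference, the modifier state in a standard USB HID boot-protocol keyboard report is a single byte with one bit per modifier, left and right kept separate; whether bit 6 (right Alt) is treated as AltGr is purely a host-side convention. A small sketch of decoding it (my own illustration, not taken from any particular driver):

    # Decode the modifier byte of a USB HID boot-protocol keyboard report.
    # Bit order follows the HID modifier usages E0..E7.
    MODIFIER_BITS = [
        'LeftCtrl', 'LeftShift', 'LeftAlt', 'LeftGUI',
        'RightCtrl', 'RightShift', 'RightAlt', 'RightGUI',
    ]

    def decode_modifiers(byte0: int) -> list[str]:
        return [name for i, name in enumerate(MODIFIER_BITS) if byte0 & (1 << i)]

    print(decode_modifiers(0x40))  # ['RightAlt']
    # Whether 'RightAlt' means plain Alt or AltGr is decided by the OS layout,
    # not by the keyboard or the converter cable.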
I know in Arabic people sometimes write in a way intended for telegraphy when they use phones and don’t want to or can’t switch IME (the main thing I remember seeing is 3 being used as a letter.)
While spending a lot of time with monospace fonts and mostly ASCII characters (programming and writing, terminal emulators, IRC/mail/MUDs/feeds in Emacs) and working on a hobby project involving text rendering and selection (with potentially proportional fonts and Unicode), I keep wondering whether it's even worth all the trouble.
In addition to what's mentioned in those articles, on larger documents rendering speed also matters; a word cache (in addition to the mentioned glyph cache) helps somewhat, but complicates other things even further, and calculating the total document height (for scrolling) still requires calculating all the positions, wrapping the lines, etc., which in turn requires laying out all the text first. Perhaps one can introduce yet another mechanism to only estimate that at first, and then refine the estimate in the background, but that's yet another opportunity for bugs to creep in (and additional complication, of course). By contrast, programs that just use character grids (and perhaps don't handle bidirectional text well) can be both faster and simpler, while still being quite capable of rendering text, even with various styles/decorations and occasional images, in many languages.
There seems to be plenty of accidental complexity even in handwriting, but once it is used for computing and text manipulations are involved, the task appears to be much harder than it has to be.
> While spending a lot of time with monospace fonts and mostly ASCII characters (programming and writing, terminal emulators, IRC/mail/MUDs/feeds in Emacs) and working on a hobby project involving text rendering and selection (with potentially proportional fonts and Unicode), I keep wondering whether it's even worth all the trouble.
Yeah... and that's why Firefox & Eclipse still don't get CJK character input right.
The percentage of people who must use characters not included in ASCII is much bigger than the percentage of those who don't.
It is worth the trouble, please consider users outside of the US & England.
I think it would have been better to have everybody just learn English as their second (computer usage) language.
I'm German and my English was horrible until I switched my computer and my websites to English ones.
Imagine a world where everybody speaks at least one common tongue, we could have so much more peace and understanding.
English becoming the Esperanto of the world is an accident, but I don't care which language it is, as long as it has fewer than 255 letters. (Hawaiian would have fit into a nibble, which is cool.)
> better to have everybody just learn English as their second (computer usage) language
Why not Chinese? It would have many advantages:
- Text layout would be much easier (Chinese fonts are mostly monospace).
- In addition to the first point, Chinese characters can easily be used for vertical text. Looks much nicer in narrow, tall buttons than Latin letters.
- Chinese characters are (mostly) pronunciation independent. No more they/there/their.
- It has been shown that dyslexic children have fewer problems recognizing Chinese characters than Latin letters.
- Chinese characters are part of the cultural heritage of billions of people (both Chinas, Japan, Korea) and not just something they learn at university.
- Old-school C enthusiasts can now have meaningful variable names while keeping their self-imposed three-character limit (or was it four?)
- Finally, everybody will use wchar_t
- Edit: But there is one disadvantage. There is no Chinese character for "/s" which some people here seem to need
Logographic writing systems are just a pain in the ass. You have to create huge fonts, you waste ridiculous amounts of space, and you need phonetic descriptions of new words anyways.
Like I said, Hawaiian would be cool since its alphabet fits inside a nibble's 16 values, although the words tend to get long.
Considering how the brain seems to process words when reading ('glancing in context' more than analyzing each letter), I'm pretty sure the length of words does not make reading much slower. Writing is another matter, but with auto-prediction you could probably shave off half the typing on average.
As a software engineer and cousin of Stitch, Hawaiian gets my vote! :handclap:
> I think it would have been better to have everybody just learn English as their second (computer usage) language.
Hmm... it might not be well known in the Western world, but AFAIK most parts of the world learn English as a second language.
At least in East Asia, English is taught pervasively throughout your life. We start learning English when we're five.
But still, English is not the only language, and we want to communicate with other people.
Just because people can use English doesn't mean they prefer English & Latin characters for communication.
You can't just, er... force people to use Latin characters for the sake of programmers' comfort.
Would you use your terminal if it demanded that you press space & backspace every time you type English characters?
I know that many parts of the world including Asia learn English; that's why I think it should be used as the lingua franca, and that's why I said that I don't care which language it is, as long as it doesn't have weird ligatures or a large number of characters.
For what it's worth, I'd remove all characters that are not from that "computer language", including the German ones.
This is more than programmer convenience; this is about system simplicity, absolute correctness (too many bugs because of Unicode and complex layout), and above all about _forcing_ people to use said lingua franca to become comfortable speaking it.
If you want your own writing system, fine, but put the effort in and build it yourself, apart from the common bug-free code base that the world shares. It should be less buggy overall anyway than a system which tries to speak every language and support every writing system.
> This is more than programmer convenience; this is about system simplicity, absolute correctness (too many bugs because of Unicode and complex layout), and above all about _forcing_ people to use said lingua franca to become comfortable speaking it.
The reason the writing system in computers is simple is that Latin characters are simple, and the majority of early computer users (most of whom were English speakers) didn't feel the need for sophisticated systems.
I (often) find that complexity inside Western culture is treated as warranted, while complexity outside Western culture is not.
While not a great analogy, think of TTS. English TTS is (at least from what I have heard) not that easy (encodable in logic, but not as simple as mapping characters to audio). That resulted in some sophisticated TTS systems, where it can handle lots of different phonetic structures.
Compare this to Hangul (the Korean writing system), where every character is composed of several mini-characters which have a unique mapping to audio. Basically, for a minimal viable product you can just convert text to decomposed form (NFD) and map the code points to audio.
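The decomposition step really is that mechanical; a minimal sketch (Python's unicodedata suffices, since NFD splits precomposed Hangul syllables into their conjoining jamo):

    import unicodedata

    def to_jamo(text: str) -> list[str]:
        # NFD decomposes each precomposed syllable (U+AC00..U+D7A3)
        # into its conjoining jamo.
        return list(unicodedata.normalize('NFD', text))

    for jamo in to_jamo('한글'):
        print(f'U+{ord(jamo):04X}', unicodedata.name(jamo))
    # 한 -> U+1112, U+1161, U+11AB;  글 -> U+1100, U+1173, U+11AF

Each jamo can then be looked up in a pronunciation table, which is the minimal viable product described above; a real Korean TTS still has to handle sound-change rules between syllables, but the text side stays simple.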
Now, let's say that some company (an arbitrary choice, Facebook) decided to make a global TTS product, first in Korea. They just decided to map the code points to audio and post-process the audio. This TTS architecture obviously can't handle English without major architectural changes. Then Facebook decides this is a problem with English, whose phonetic structure is unnecessarily complex, and won't provide TTS to English users.
This is something similar to what CJK users face, that people just won't make products that work reliably with CJK.
As I already argued in a different comment, I find Hangul to be more elegant than Latin, even though I think the syllabic block notation would add unnecessary complexity to an implementation.
The thing with Korean is that "nobody" speaks it. Had early computers been developed in Korea and had everybody spoken Korean, I would now argue that everybody should just learn Korean. But tough luck, it didn't happen. Our best shot at a lingua franca and simple computer systems (simple in the sense of as little complexity as possible) is with English.
I can understand (but don't fully agree with) a claim that F/OSS software should be primarily in English to maximize its community, as many programmers already speak English to some extent out of necessity anyway. But forcing English on ordinary people who don't use English? Plainly absurd.
For starters, while young people may feel more or less comfortable with occasional English, older people don't, and your proposal will make them much more uneasy. Smartphone penetration around the world [1] can easily reach 50% in many countries, and that includes many people unable to understand English---I live in Korea and older people are technologically not inferior to younger people in that regard. That's impossible if you force them to use English.
> I live in Korea and older people are technologically not inferior to younger people in that regard.
You guys truly are wonderful unicorns. How do you do it? (serious question in that regard)
On topic, I think the ultimate solution is to have whatever language input/output be essentially a placeholder, and then some general "switching" system which puts whatever language a user prefers. It should be able to switch on-the-fly totally independently of application state. (same thing for numbers, outputting decimal or hexadecimal should happen on-the-fly, same model value but different user view)
Commercial OSes tend to be pretty invasive these days, but the one thing where we need more of them, and better tooling, is definitely text I/O.
Adults do learn languages well when required, sure. I question whether we should. Making an additional [1] billion or two people learn a new language is not cost-effective compared to a (comparably) small number of software engineers wrestling with i18n (haha). It is sad that English knowledge is highly valued in, for example, several non-English [EDIT] workplaces even when it is not required at all.
> Hangul would also be a nice lingua franca writing system, too bad its very local
By the way, Hangul was tailor-made for Korean with its relatively simple CGVC syllable structure. I doubt it can be generalized much---Hangul is important because it is one of the first featural alphabets, not because it is a universal (or even adaptable) featural alphabet, which it isn't.
[1] English has about 2 billion speakers (400M native + 750M L2 + 700M foreign), so you need another billion speakers or two to make your proposal real.
[EDIT] was "CJK", but I'm not sure about Chinese and Japanese.
Instead of the total number of speakers the better metric would be the diversity of the speakers.
A language is a better candidate for a lingua franca if it's spoken everywhere a little rather than a lot in one place, e.g. Mandarin.
And this is not about cost, but about Bugs in critical infrastructure.
I'd get rid of smartphones and unicode in a heartbeat if it meant bug free command line applications, no malware, and medical and infrastructure that doesn't crash.
I think it's great when non-English workplaces require English. It gives people an incentive, because humans are a lazy species. And once you've learned it you can use it on your next vacation, your next online encounter, and who knows where else.
Having computers only do English would have given a similar incentive, and I'm sad we didn't use that opportunity.
English is the lingua franca, but not more than that. Most communication is still done and will always be done in local languages, and computers should work for them.
They wouldn't be forced; if they want to use a non-ASCII language, they can develop it themselves or pay the absurd amount of money needed to support their language. We have no obligation to give hundreds of thousands of dollars of our time to people for free.
English is a Germanic language. You say "gut" instead of "good", "zu" instead of "to", "wir" instead of "we", "Wetter" instead of "weather", "nichts zu tun" instead of "nothing to do".
You will not be making as much effort as someone from a totally different native language. The CIA calculated that it takes over four times the effort for an English speaker to learn Chinese than to learn French. It also applies in reverse: a Chinese speaker will find it very hard to learn proper English.
Learning English for a German student is as easy as taking the train and spending some days in England.
Now for most Chinese, Indian (they have more than 20 languages), or Arabic speakers, it is not so easy.
The fact is that they are at a disadvantage, not on a level playing field, and they don't like it.
They are the majority of the population in the world, so if they develop their economies they will force everybody else to learn their native language, not the other way around.
By then everybody will speak enough English that it won't matter. The economics of the network effect are quite simple.
And of course I'm at an advantage; the very fact that I'm German would put me at an advantage even if English were as far from German as possible, because I have the resources, I have free education and free healthcare, plus 400 years of the subjugation of the third world at my back. So what. Life isn't fair; the world's top 30 people own as much as the bottom 3 billion, and that's unfair.
Instead of bickering about which language would be fairest to adopt, you should help everyone adopt SOME language so that they can organize and exchange. So that they can understand for themselves what others are saying instead of having to rely on potentially biased news. So that they can talk to each other, for sympathy and empathy and understanding.
That would mean throwing away a heap of literature (books, songs, etc.) and legal documents, which are hard to translate respectfully. And English isn't an excessive language itself, for better or worse. Also, the idea that switching to an Esperanto-English subset would lead to more peace and understanding seems at least debatable to me.
Bottom line, I find it amazing that humans have managed to draw symbols in columns or rows in any direction for centuries, but computers still have trouble with it. Maybe we should blame computer makers, not languages?
Not everything needs to be digitized. My life was happier when bookstores were still around. And when I had paper forms to fill for bureaucracy, because it meant that there was human compassion and wit involved, something that is required for fairness. A digital application of laws does the law no justice.
As for writing direction:
Simple tools allow for complex creation.
Complex tools allow you to only create simple things.
> I think it would have been better to have everybody just learn English as their second (computer usage) language.
That seems a little extreme to me. It seems like it might have been better to just adapt writing systems for computer usage. Originally a lot of computer systems didn't have lowercase letters, Japanese tended to just use katakana, etc. Sure, it kind of forces everyone into an alphabet or syllabary, but language adapting to its medium is hardly a new thing.
While it sounds nice to me too (and neither am I a native English speaker), apparently there was a bit of misunderstanding in another comment, and possibly here too: I rather wonder about a hypothetical more efficient (and computing-oriented) writing system, not just/necessarily/only using monospace latin characters. It doesn't seem very realistic that something like that would happen (judging by the mess that are date formats and units of measurement, which seem much easier to fix), but observing the complexity that doesn't come from fundamental constraints, it's hard to not wonder about that.
The story of line wrapping is mainly that you:
· Break the text into runs.
· Calculate BiDi at the paragraph level.
· Shape every run to get its length and possible line breaking points.
· Arrange the runs (which now have only length, height, the text inside them, and some breaking properties; glyphs and glyph positions are dropped) into lines.
This process is called “measuring”. The real text display is delayed until a line becomes visible: you fetch the text runs in the line and do the actual display.
In most cases (like appending text) you do not need to do much recalculation.
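A very rough sketch of that measuring pass (toy Python: "shaping" is faked as one run per word with width equal to its character count, and BiDi is ignored, which is exactly the hard part a real engine can't skip):

    from dataclasses import dataclass

    @dataclass
    class MeasuredRun:
        text: str
        width: int          # a real engine gets this from the shaper

    def measure(text: str) -> list[MeasuredRun]:
        # Fake shaping: one run per word, width = character count.
        return [MeasuredRun(word + ' ', len(word) + 1) for word in text.split(' ')]

    def break_into_lines(runs: list[MeasuredRun], max_width: int) -> list[list[MeasuredRun]]:
        lines, current, used = [], [], 0
        for run in runs:
            if current and used + run.width > max_width:
                lines.append(current)
                current, used = [], 0
            current.append(run)
            used += run.width
        if current:
            lines.append(current)
        return lines

    for line in break_into_lines(measure('the quick brown fox jumps over the lazy dog'), 16):
        print(''.join(r.text for r in line))
    # Glyphs and positions would only be computed for the lines that become visible.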
Have fun in your anglocentric world, then. I don't think you have sympathy for users for whom monospaced fonts do not work and whose languages' code points ASCII can't encode.
Rendering speed hasn't mattered since years ago. Browsers can render megabytes of pure text (hours of reading material) in seconds. LaTeX can typeset hundreds of pages in seconds.
> I don't think you have sympathy for users for whom monospaced fonts do not work
I do, that's one of the reasons I'm trying to handle Unicode and proportional fonts in the first place. Given the circumstances, it is indeed preferable to do so, but given that it's basically encoding of approximated sounds (well, potentially information in general) on a 2D plane, it is a rather complicated (both code-wise and computations-wise) way to achieve such a task.
> Rendering speed hasn't mattered since years ago. Browsers can render megabytes of pure text (hours of reading material) in seconds.
Chrome only started using the "complex path" for English text a few years ago, employing the word cache mentioned above. Rendering megabytes of text in seconds is indeed achievable (especially if the words are cached and it's not megabytes of integer sequences), but it still complicates the rendering, and it is still slower than opening and navigating the same megabytes of text in, say, less or vim, or even links -g. It's not an issue if you don't have to do it (or don't have to use a program where it wasn't optimized well), but if you do, it is.
I've lightly advocated for a while that emoji shouldn't be part of the Unicode standard at all. I'm sure there are some advantages, and I'm sure there are other considerations I'm not thinking of, but it just seems like a really bad idea to stuff them into the Unicode standard.
I don't know the official name or who came up with it, but I use Slack's entry format (e.g. :cat:) exclusively in every application.
If the application can detect that as an emoji and swap it out, fine. If it can't, I don't change my format. My preference would be if applications left emoji in that format, and just rendered them differently at display time.
The advantage of having emoji just be a purely clientside rendering feature, and behind the scenes all fall back to normal text is:
a) they can be easily aliased across multiple languages (:cat: :gato:)
b) if you paste an emoji into an application that doesn't support them, you don't get an unrecognizable character. Progressive enhancement!
c) it's accessible when copied and pasted as plain text. It's just more accessible to blind users in general.
d) it's more forward compatible. I can use :cthulhu: right now without waiting for it to get added to the standard.
e) get rid of modifiers. Like, seriously, just get rid of them. Emoji aren't programmatically generated, you still need to draw one image for each modifier combination, and you still need to program support for each one. So, what's the advantage of using modifiers over just adding multiple glyphs? They're just there to save space in the character list, which is only a problem because emoji are in Unicode. :smile: :fake_smile:
f) better support for custom emoji in general. Basically every platform has custom emoji, and it's weird because half of your emoji are standardized and half aren't. And then whenever new emoji get added to the standard, if they conflict with your custom emoji your app breaks.
I would hesitate to standardize emoji at all, beyond having a consortium that says, "this is what :thumbsup: means, you can extend on top of this as you see fit."
It feels like extra complexity for no benefit other than, "we need a standard".
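For what it's worth, the display-time substitution being described is tiny; a minimal sketch (the alias table here is made up, and a real client would render an image rather than a codepoint):

    import re

    # Hypothetical alias table; a client could ship one per language and let
    # users add custom entries like :cthulhu:.
    ALIASES = {
        'cat': '\U0001F431',       # 🐱
        'gato': '\U0001F431',
        'thumbsup': '\U0001F44D',  # 👍
    }

    def render(text: str) -> str:
        # Replace :name: when known, otherwise leave the literal text alone
        # (the "progressive enhancement" from point b above).
        return re.sub(r':([A-Za-z_]+):',
                      lambda m: ALIASES.get(m.group(1), m.group(0)),
                      text)

    print(render('nice :cat:, and :cthulhu: stays readable'))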
I think this works great for apps like Slack, in user-land so to speak, but isn't realistic for the Unicode standard, not only because these entry formats are in English.
Modifiers and combinators aren't exclusive to emojis, but apply to all kinds of glyphs in other languages and writing systems as well. Arabic script even has some common ligatures for common expressions.
A lot of complexity simply doesn't stem from emoji in Unicode, a lot of the complexity comes from all the writing systems that Unicode supports. Admittedly, emoji are kind of an oddball addition to Unicode, but they're by far not the most complex part of it.
And even in Slack/IM apps, custom emoji codels only “work” because people aren’t often trying to 1. interoperate with external services using, or 2. parse archived logs of, arbitrary message text.
If either of these were common (e.g. slack bots that tried to parse semantic meaning from regular text rather than responding to commands; or Slack logs of OSS communities being public-access on the web) then you’d see a lot of people up-in-arms around the fact that these custom codels are used.
But since text in these group-chat systems is private, ephemeral, and mostly a closed garden, it never bubbles up into becoming an issue anyone else has to deal with.
(Though, on a personal note, I wrote my own VoIP-SMSC-to-Slack forwarder because Slack is a much better SMS client than any of the ones built into VoIP softphone apps, and I’m irritated every day that Slack auto-translates even Unicode-codepoint-encoded emoji from a source postMessage call, into its own codels in the canonical message stream. I don’t want to send my SMS contact “:thumbs_up:”, I want to send them U+1F44D!)
Think of Unicode like HTML. What’s better for interoperation and machine-readability: a custom SGML entity (like you could use up through HTML4); a custom HTML tag; or a normal HTML tag with an id/class attribute that applies custom CSS styling?
One way to encode a ‘custom emoji’ would be encoding it as a variation of some existing emoji. Use an as-yet-unused variation-selector on top of an existing emoji codepoint, and then “render” that codepoint-sequence on receipt by the client back to an image (but in a way where, if you copy-and-paste, you get the codepoint-sequence, not the image. In HTML, you’d use a <span> with inline-block styling, a background-image, and invisible content.) This is pretty much what Slack was doing with the flesh-tone variation-selectors, before Unicode standardized those. But you can do it for more than just “sub”-emoji of a “parent” emoji; you can do it to create “relatives” of an emoji too, as long as it’d be semantically fine in context to potentially discard the variation selector and just render the base emoji.
Or, if your emoji could be described as a graphical (or more graphical) depiction of an existing character codepoint, you could just use the “as an emoji” variation-selector on that codepoint.
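Concretely, the two standardized presentation selectors are U+FE0E (text style) and U+FE0F (emoji style); at the codepoint level, "using the variation selector" is just appending it to the base character (whether anything renders differently is up to the font):

    BASE = '\u263A'                 # ☺ WHITE SMILING FACE
    TEXT_STYLE  = BASE + '\uFE0E'   # request text presentation
    EMOJI_STYLE = BASE + '\uFE0F'   # request emoji presentation

    for s in (BASE, TEXT_STYLE, EMOJI_STYLE):
        print(s, [f'U+{ord(c):04X}' for c in s])
    # A renderer that doesn't understand the selector should just show the base
    # character; one that does can pick the style, or, for a private selector,
    # a custom image.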
Or, rather than a variation-selector, if you have a whole range of “things to combine with” (i.e. the possibilities are N^2), you could come up with your own private emoji combining character for use with existing base characters. The “cat grinning” emoji U+1F639 could totally have been (IMHO should have been) just a novel “face on a cat head” combining-character codepoint, tacked onto the regular “face grinning” emoji codepoint. Then you could have one such combining-character for any “head” you like! (And this would also have finally allowed clients to explicitly encode “face floating in the void” emoji vs. “face on a solid featureless sphere” emoji, where currently OSes decide this feature arbitrarily based on the design language of their emoji font.)
And, I guess, if all else fails, you could do what Unicode did for flags (ahem, “region selectors”), and reserve some private-use space for an alphabet of combiner-characters to spell out your emoji in. That way, it’s at least clear to the program, at the text-parsing level, that all those codepoints make up one semantic glyph, and that they are “some kinda emoji.” Custom-emoji-aware programs (like your own client) could look up which one in a table of some kind; while unaware programs would just render a single unknown-glyph glyph.
I don’t suggest this approach, though—and there’s a reason the Unicode standards body hasn’t already added it: it’d be much better to just take your set of emoji that you’re about to have millions of people using (and thus millions of archivable text documents containing!) and just send them to the Unicode standards body for inclusion as codepoints. Reserving emoji codepoints is very quick, because the Unicode folks know that the alternative is vendors getting impatient and doing their own proprietary thing. Sure, OSes won’t catch up and add your codepoint to their emoji fonts for a while—but the goal isn’t to have a default rendering for that character, the point is to encode your emoji using the “correct” codepoint, such that text-renderers 100 years from now will be able to know what it was.
So, please, just get your novel emoji registered, then polyfill your client renderer to display them until OSes catch up. Ensure your glyph is getting sent over the network, and copy-pasted into other apps, as the new Unicode codepoint. Those documents will be correct, even if the character doesn’t render as anything right now; if the OS manufacturers think the character is common (i.e. if it ever gets used in text on the web or in mobile chat apps), they’ll provide a glyph for it soon enough. And, even if the OS makers never bother, and you’re stuck polyfilling those codepoints forever, there’ll still be entries in the Unicode standard describing the registered codepoints, for any future Internet archaeologists trying to figure out what the heck the texts in your app were trying to communicate, and for any future engineers trying to build a compatible renderer. (Consider what Pidgin’s developers went through to render ICQ/AIM emoji codels. You don’t want to put engineers through that.)
> A lot of complexity simply doesn't stem from emoji in Unicode, a lot of the complexity comes from all the writing systems that Unicode supports.
Yes, it's not that emoji are doing anything odd compared to lots of real-world languages; it's that emoji are latin-script writers' "first"/"only"/"most likely" interaction with that sort of stuff. The fascinating bit is that if it weren't for emoji, a lot of these problems would still go unfixed for a lot of real languages; but because emoji are fun and everyone wants to use them, we've seen a lot of Unicode fixes brought about by emoji, a rising tide that lifts other Unicode boats.
> Unicode was designed to provide code-point-by-code-point round-trip format conversion to and from any preexisting character encodings, so that text files in older character sets can be converted to Unicode and then back and get back the same file, without employing context-dependent interpretation [1]
The problem is that emoji were already part of a major text encoding, so Unicode needed to adopt them. Once those were added, everyone and their mother suddenly wanted their emoji in Unicode as well.
> It feels like extra complexity for no benefit other than, "we need a standard".
That sums up the whole standard for better and for worse.
That’s exactly the benefit. Think of Unicode as “Archive.org for the semantics of text codels.” Every time someone invents a text codel (like those examples you gave, where Slack invented their own text codels), Unicode takes the semantics behind that codel and standardizes their own codepoint equivalent to it, so that Unicode documents will have a way of encoding that text codel at the Unicode-text level.
If Unicode doesn’t do this, then people have to use other encodings on top of Unicode to specify their text; they come up with incompatible encodings of the same semantic characters; and suddenly we’re back to having to create code-pages to specify what set of incompatible codels each text stream is using. It also means that we go back to having to create OS-specific, or GUI-toolkit-specific rendering encodings to translate those code-pages into a text-layout-system specific “normalized” encoding; and thus, we also go back to having OS/GUI-toolkit specific fonts.
> Emoji aren’t programmatically generated.
No, not usually, but they can totally be programmatically consumed. They’re machine-readable! The modifiers allow emoji to be “structured text” in the ML sense. Since there’s one codepoint that always means “sad face”, an algorithm can attach a meaning to that codepoint apart from its modifiers, to do e.g. sentiment analysis. It’s much harder to learn when you have 100 different “sad face” codepoints; let alone when different documents use different incompatible encodings with different codels to refer to the same “sad face.”
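As a toy example of that machine-readability: stripping the skin-tone modifiers (U+1F3FB..U+1F3FF) and presentation selectors before counting base emoji takes a couple of lines, which is exactly what makes the modifier scheme friendly to this kind of analysis:

    from collections import Counter

    # Skin-tone modifiers plus the text/emoji presentation selectors.
    MODIFIERS = set(range(0x1F3FB, 0x1F400)) | {0xFE0E, 0xFE0F}

    def base_codepoints(text: str) -> list[str]:
        return [ch for ch in text if ord(ch) not in MODIFIERS]

    msg = '\U0001F44D\U0001F3FD \U0001F44D\U0001F3FF \U0001F44D'  # three thumbs-up, two with tones
    print(Counter(base_codepoints(msg))['\U0001F44D'])  # 3: all collapse to the same base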
> better support for custom emoji
That’d be like saying a dictionary should better support private in-jokes. Unicode, like a dictionary, watches for when things become common or important to define, and then defines them. Until then, it considers them “someone’s attempt at injecting a gibberish in-joke into language.” In-jokes don’t need an entry; but as soon as an in-joke becomes a word, it does. Because people don’t look up in-jokes, but they do look up slang (= ascended in-jokes), so if you want your dictionary to be useful, you’d better give them a definition for slang terms.
Actually, that’s a very good equivalence: emoji are to Unicode as slang terms are to dictionaries. In both cases, people think it’s ridiculous that the authors of the text would include them; but in both cases, the usefulness of the work would be hampered if they didn’t.
> having emoji just be a purely clientside rendering feature
Please, please, PLEASE don't do that.
Have you ever tried pasting any kind of code (especially C++) into Skype? It always turns into a fiasco of stupid emojis. And even if you could disable it on your client side (which you can't even do), it would turn up as emoji garbage on the receiving end - and you might not even know/realize!
Just thinking about "let's show everything that might be an emoji as an emoji" makes me shudder...
I'd also note that "you can alias it for different languages" is bound not to work. The same word means different things in different languages - which ones do you accept as aliases? How are native speakers supposed to know when their word for something can't be used because it means something else in any other language? I mean, look up the German adjective that means "fat"...
The major issue I find with it is there aren't limits to how many are added. Emoji are growing at a fantastical pace, much like a tumor, because "if X got added, why didn't Y?", and so on.
As the other child reply mentioned, this is latin-centric. Totally useless for people that can't or don't want to speak English, and having to duplicate it for every language defeats the entire purpose.
> and having to duplicate it for every language defeats the entire purpose.
Does it?
If you want accessible emoji, you need a display label for each emoji in every language that the client supports. Whether or not the emoji is represented in Unicode doesn't change that.
Also when users are entering emoji into a client application, they need a way to quickly filter and get to the emoji that they want -- that requires having a label in their native language, and putting emoji in Unicode doesn't fix that either.
In any practical setting, accessibility/input means you need multiple labels for different languages anyway. So why are we trying so hard to avoid them in the final text representation? If :cat: :gato: :貓: were part of the emoji definition in a big list somewhere, it would only make it easier to support multiple languages, since I wouldn't need to compile my own list of translations/labels.
> Also when users are entering emoji into a client application, they need a way to quickly filter and get to the emoji that they want -- that requires having a label in their native language, and putting emoji in Unicode doesn't fix that either.
This is different from -forcing- people to memorize the specific identifier in order to input it.
How do I find a specific face emoji if I don't know the name? I use the system's emoji picker tool and simply scroll through it. On OSX, it shows me recently used ones, which suffices. I don't really have to think about the names at all.
> use the system's emoji picker tool and simply scroll through it.
I don't see how this would change. You'd pick the cat emoji exactly the same way you do right now, and since your system language is set to English, iOS would insert :cat:, then immediately render it to an image representation.
Users wouldn't need to memorize a label any more than they currently need to memorize the Unicode positions.
Until very recently, I worked on Microsoft Word. The whole problem gets even more complicated when you add support for richer content like formatting, images, comments, etcetera. Sprinkle in requirements for things like three-way merge, simultaneous editing from multiple authors, and undo behavior on top of that, and the amount of cross-cutting complexity for something seemingly simple can be absolutely astonishing.
Hey Nick, I'm glad to hear that you were on the Word team :)
I'm working on a document manager and static site builder based on Microsoft Word (https://docxmanager.com/) (the upcoming version will have a more user-friendly tabbed UI).
· Advanced typography, including...
· About line breaking...
· Hyphenation
· Optimized line breaking (Knuth-Plass, etc.)
· About microtypography
· OpenType features (ligatures, etc.)
· OpenType variations
· Multiple master fonts
· Proper font fallback ← It is not a simple lookup-at-each-character process (see the sketch after this list)
· Advanced Middle East features, including...
· Kashida
· Advanced Far East features, including...
· Kinsoku Shori
· Auto space insertion
· Kumimoji
· Warichu
· Ruby
· Proper vertical layout, including...
· Yoko-in-Tate
· Inline objects and paragraph-like objects, including...
· Images
· Hyperlinks
· Math equations ← This is really hard.
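On the font-fallback point flagged above: the reason it isn't a per-character lookup is that you want to keep runs in a single font for as long as possible, and cluster/script boundaries constrain where you may switch. A toy sketch with made-up coverage sets (real systems consult cmap tables, script itemization, and user preferences):

    # Toy fallback: fonts are just (name, set of covered characters) here.
    FONTS = [
        ('LatinFont', set('abcdefghijklmnopqrstuvwxyz ')),
        ('CJKFont',   set('漢字かな')),
    ]

    def fallback_runs(text: str):
        runs, current_font, start = [], None, 0
        for i, ch in enumerate(text):
            font = next((name for name, cov in FONTS if ch in cov), 'LastResort')
            if font != current_font:
                if current_font is not None:
                    runs.append((current_font, text[start:i]))
                current_font, start = font, i
        if text:
            runs.append((current_font, text[start:]))
        return runs

    print(fallback_runs('abc 漢字 def'))
    # [('LatinFont', 'abc '), ('CJKFont', '漢字'), ('LatinFont', ' def')]
    # A real implementation must also avoid splitting grapheme clusters and
    # pick fallbacks per script, which is where the simplicity ends.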
I once wrote a simple LaTeX renderer for a project, and boy was it hard. And all I had to do was support a subset of the whole thing! Even with clear biases towards left-to-right languages and just a subset of LaTeX, it was a nightmare: you'd do something, then realize it wouldn't render correctly in a certain situation, rewrite the code to include more context or do another layout pass, and hit another issue somewhere else. It was slow work, complicated to debug, and it utterly stymied many refactoring efforts. I can't even imagine how much work must go into doing this quickly and correctly for a language that's much more complex…
I did quite a lot of hacking around the text selection when extending CKEditor for XML editing.
In my opinion there are several reasons why text input and selection are so difficult:
1. The behavior is complex.
2. The behavior resists being formally defined. It differs across the user interfaces, such as URL bar, rich text editor, terminal, etc.
3. The requirements are evolving. The recent big change - the emojis - is still not universally supported.
4. The reasons above tend to add accidental complexity to the API that will only grow over time.
I feel like the best approach would be to create a mathematical theory describing the text input semantics. This approach worked quite well for other complicated areas in CS, such as concurrency or memory management.
This is the first time I've ever seen it explained how input methods allow users to write in languages like Chinese using the Latin letters A-Z: phonetics. I had always wondered about this; it seemed like some arcane magic to me lol, how people knew what to type.
It's now clear that for someone like me - who only understands a few Latin-based languages - a super comprehensive tutorial would be required if one wanted to understand enough about other languages to be able to work on a text editor etc. I guess ideally all teams working on such projects would have an experienced team member whose native language isn't Latin-based.
You can try out how it works with Google Translate, for languages like Chinese, Japanese etc. they have a button in the bottom right corner of the text field where you can toggle to an IME input system.
It's pretty intuitive, basically you just write the word phonetically and then press space, and it'll automatically replace the characters and let you choose alternatives. Smartphones have similar systems but with different layouts.
Chinese here. To be a little picky, phonetics is just one of several methods to input Chinese. (Pinyin is the dominant one, and yes, it is phonetic; other methods use the shapes of the characters and can be a lot faster, see https://en.wikipedia.org/wiki/Wubi_method for example.)
Speaking of input methods, I always wish there were good English input methods; they would be useful too! For example, if the user enters "compre" the input method suggests
1. comprehension 2. comprehensive (with an order depending on the conditional probability -- there is some interesting math behind input prediction, and with its help input can be a lot smoother; however, I only see it used on phones, not computers.)
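The ranking alluded to is essentially P(word | typed prefix), which with a plain frequency list reduces to sorting the matching words by count; a minimal sketch (the frequencies are made up):

    # Hypothetical frequency counts; a real system would use a corpus plus
    # per-user history, and condition on previous words too.
    FREQ = {'comprehension': 120, 'comprehensive': 310, 'compress': 95, 'comprehend': 60}

    def predict(prefix: str, k: int = 3) -> list[str]:
        candidates = [w for w in FREQ if w.startswith(prefix)]
        # Sorting by raw count equals sorting by P(word | prefix), since the
        # normalizing constant is shared by every candidate.
        return sorted(candidates, key=FREQ.get, reverse=True)[:k]

    print(predict('compre'))  # ['comprehensive', 'comprehension', 'compress']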
>however I only see it used on phones not computers //
IDEs do prediction for input.
OpenOffice used to predict words longer than, say, 5 characters, IIRC; it might have been back in the StarOffice iteration. Come to think of it, I write in a word processor so little now that I've only just realised this feature might have been disabled as a default (or removed).
This barely scratches the complexity of IMEs. For example, people in Hong Kong mainly speak a different variety of Chinese, Cantonese, so Pinyin (Mandarin phonetics) is useless there.
Tangential: Microsoft's automatic selection of whole words and sentences is the single biggest productivity drain in my professional career (all my attempts to disable it failed so far).
The setting does not apply to Outlook when writing/editing emails, only when reading emails. Then, from my experience, the setting will be reverted with the next Word update :(.
Gah, I've just started with MS Word (after a long hiatus) and was wondering why it always grabs surrounding punctuation into the selection (even when the punctuation 'points' the wrong way, it seems).
I remember being annoyed by that years ago the few times I used Windows. I am used to triple-click+drag to achieve this when I need it. When I actually don't want to select words, it gets in the way. One can probably use the keyboard for precise things like that though.
I can see triple-clicking+drag being tricky and that most people probably want to select entire words most of the time anyway, so I don't know what the best solution is. I'd argue that selections need handles like on mobile operating systems so it can be edited afterwards.
Not mentioned in the article, but an interesting point of difficulty in editors is the character transposition command (Ctrl-T in most macOS applications, stemming from Emacs, I imagine):
TextEdit and VSCode are the only editors I know of to handle transposing emoji in a reasonable way. VSCode will separate the emoji from the color modifier. TextEdit keeps them together, other editors I know of corrupt the data by slicing surrogate pairs or just no-op the command.
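To make the failure mode concrete: a thumbs-up with a skin tone is two code points (four UTF-16 code units), so any transpose that swaps "the last two characters" without grouping them splits the pair. A small demonstration, with a crude stand-in for real grapheme segmentation (a proper implementation would follow UAX #29 cluster boundaries):

    s = 'a\U0001F44D\U0001F3FD'             # 'a' + thumbs-up + medium skin tone
    print([f'U+{ord(c):04X}' for c in s])   # 3 code points, 2 user-visible characters

    # Transposing the last two *code points* strands the modifier on its own:
    print(s[:-2] + s[-1] + s[-2])

    # A correct transpose must operate on grapheme clusters; here we merely
    # glue skin-tone modifiers (U+1F3FB..U+1F3FF) onto the preceding code point.
    def clusters(text: str) -> list[str]:
        out = []
        for ch in text:
            if out and 0x1F3FB <= ord(ch) <= 0x1F3FF:
                out[-1] += ch
            else:
                out.append(ch)
        return out

    cs = clusters(s)
    cs[-2], cs[-1] = cs[-1], cs[-2]
    print(''.join(cs))                      # thumbs-up (with tone) followed by 'a'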
Wait, there are people who use the ctrl-T command? That thing is the bane of my existence because I have muscle-memory for "new tab" in firefox but when I do it in a text editor for "new file" out of habit it just jumbles text.
VSCode does the "correct" behavior of Bad #3, but doesn't even need to do the "bad" part about pushing the bytewise caret position around, as it logically maintains two characters but visually coalesces both the middle position and the front together. I wonder why it wasn't mentioned.
It's probably bad because there isn't an additional kludge added: decomposition of combined character entities for editing. This would involve a concept of sub-character (code-point) 'tabs' (which ideally would be distinct UTF-8 entities).
Another option would be that deleting the 'a' doesn't completely delete it but instead replaces it with a zero-width space or zero-width non-joiner, so it looks like Bad #1 (but is Unicode-compliant) and hitting delete again gives Bad #3.
The whole example is a bit contrived, though; nobody is going to enter a skin-tone modifier character by hand in daily use. They'll select an appropriately-colored emoji.
IMO, having delete trigger a zero-width-space insertion would be the worst option. I mention in a sibling comment that VSCode gets around this by having two separate logical caret positions combined into a single visual position. So the byte offset of the cursor changes as expected, while still maintaining "Unicode correctness", for whatever that's worth.
Every time I hear someone is implementing even a part of text editing, I begin to snigger. (Ahem Workflowy cough.)
I'll gladly put these two articles in my reading list, to go over some evening while reclined comfortably and sipping wine—so I then can laugh and slap my knees even harder when a new text editor is mentioned.
Some years ago I was working on a project to autocorrect text with high-level written-text recognition. I used a standard text editor control. At one point I thought about rewriting the text field myself, but the angels stopped me. People don't know how hard it is!
This is incredible, thanks for sharing. I had the exact same thoughts the author describes at the beginning: how hard can it be? It turns out: super hard. And one shouldn't forget that it is one of the few input methods we have for our computers. With voice not working properly, and drawing on the mouse pad / touchscreen being too slow, I would argue it is still the number one input method. So the expectations for the UX are extremely high and there should be no faults. I admire anybody who works on any kind of text input mechanism from this day on.
The problem the author describes with Vim is overstated. All you need to do is write two functions that switch the input method to English when entering normal mode, and back to your original input method when entering insert/replace mode. It's only 24 lines of Vimscript, including whitespace and comments.
Out of all the TUI programs out there, Vim, with its modal design, is probably the one program least affected by a conflation between key presses and text input.
This is not really cross-platform. It's also not cross-application, so it's often necessary to do it at the OS level, hence the problems discussed above. At least for Spanish there are a limited number of non-ASCII characters, so I just bind them directly to a modifier.
A common issue in browsers is pages listening for the Esc key. I'm in some modal dialog on a page, typing CJK with the IME. I press Esc to exit the IME or to cancel a conversion; the IME exits, but the page also gets the Esc key and closes the modal dialog. Any text I had entered up to that point is lost.
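One mitigation on the page author's side (my own sketch, not something from the comment above): keydown handlers can check KeyboardEvent.isComposing, plus the legacy keyCode 229 that some browsers report while the IME owns the keystroke, before treating Esc as "close the dialog". Exact IME behaviour still varies across browsers, and the .modal-dialog selector here is just a placeholder:

    // Close a dialog on Esc, but ignore key events that belong to an active
    // IME composition. keyCode 229 is the legacy value some browsers report
    // while the IME is still handling the keystroke.
    const dialog = document.querySelector<HTMLElement>(".modal-dialog");
    if (dialog) {
      dialog.addEventListener("keydown", (event: KeyboardEvent) => {
        if (event.isComposing || event.keyCode === 229) {
          return; // the Esc (or other key) went to the IME, not to the page
        }
        if (event.key === "Escape") {
          dialog.hidden = true; // placeholder "close the dialog" action
        }
      });
    }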
Personally I've always thought RTL switches in an editor were a mistake. An editor displays and lets you edit a character stream, and those characters do not need to be in their final spatially rendered position for editing.
Really appreciate this post. But I’ve gotta admit that I CTRL-Fed it for ligatures in bidirectional text and Czech accents and found no result...and couldn’t figure out whether I was happy or sad about that.
This is one of the best things I have seen on HN in a long time! It really makes me want to go work for a company with a text editor product or to teach a course on implementing simple text editors.
> Our carets will need an extra bit that tells them which line to tend towards. Most systems call this bit “affinity”.
Is it just me, or does the author disprove themselves with the figure in the same section? I feel like the best possible solution is the one depicted: that you just see the cursor split between both lines.
I know, but two cursors sort of looks like a split cursor, and you could visually tweak cursor rendering to make “one cursor split between two lines” a visually-distinct case from having actual multiple cursors (because some text editors do indeed support multiple cursors.) What I’m saying is that I’d prefer a text-edit control that gives you a visual indicator for “one cursor split between two lines”, to one that pretends the cursor is on one line or the other, when it really will act with the navigation semantics of being split between two lines.
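To make that "extra bit" concrete, here is a tiny TypeScript model of an affinity-carrying caret; the type names and the lineStarts representation of soft wraps are invented for illustration, not taken from the article or any particular editor:

    // A caret at a text offset, plus an affinity bit saying which side of a
    // soft wrap it sticks to when the offset sits exactly on a wrap boundary.
    type Affinity = "upstream" | "downstream"; // end of previous line vs. start of next

    interface Caret {
      offset: number; // index into the text, in code units
      affinity: Affinity;
    }

    // lineStarts[i] is the offset where visual line i begins (soft wraps included).
    function visualPosition(caret: Caret, lineStarts: number[]): { line: number; column: number } {
      let line = 0;
      for (let i = 0; i < lineStarts.length; i++) {
        if (lineStarts[i] < caret.offset) line = i;
        // Exactly on a wrap boundary: affinity decides which line we render on.
        if (lineStarts[i] === caret.offset && caret.affinity === "downstream") line = i;
      }
      return { line, column: caret.offset - lineStarts[line] };
    }

    // Text wrapped as "hello " | "world": offset 6 is both end-of-line-0 and
    // start-of-line-1; the affinity bit picks one without changing the offset.
    const lineStarts = [0, 6];
    console.log(visualPosition({ offset: 6, affinity: "upstream" }, lineStarts));   // { line: 0, column: 6 }
    console.log(visualPosition({ offset: 6, affinity: "downstream" }, lineStarts)); // { line: 1, column: 0 }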
To be honest, I can't help but think that we're overcomplicating things for ourselves. I mean: think about a typewriter, hell, even typesetting. And now let's consider modern text editing. It's crazy! And yet, the only thing I need is something even simpler than vi: ability to move between the lines when doing cat - > file.
That being said, I do appreciate the effort the good people of text editing make.
I specifically mentioned Korean typewriters because they had an ingenious design to implement Hangul's combinatorial system: they placed the final jamo back at the caret position without advancing. It even works flawlessly in modern computer typography, provided that you type the correct key for each jamo. To my knowledge, widespread Arabic and Japanese typewriters were more limited (Arabic generally had only initial and isolated forms [1], and Japanese was generally Katakana only; a full Kanji typewriter was more like a mini typesetter).
Even back in the typewriter era, it was not simple. Every time I see people screaming that Unicode and internationalization are overcomplicated, I like to challenge them with counterexamples.
> Right, but you probably mostly deal with documents written in a language with a Latin script. Those who don't or can't do this need better.
So, I absolutely love different scripts, and I think it's awesome how many different ways to write mankind has invented over the centuries.
But if Latin-style scripts are so much easier to input into computers, and if computers are not just the wave of the future but the present as well, maybe it would make sense for more languages to adopt Latin-style scripts. After all, we stopped using fractions everywhere in favour of decimals, because decimals are in many cases (not all!) better. Ditto our switch from Roman to Arabic numerals.
> But if Latin-style scripts are so much easier to input into computers, and if computers are not just the wave of the future but the present as well
But the hard part isn't supporting _a_ script, it's supporting _all_ scripts (or at least, all relevant scripts, in our modern interconnected age). It's only moderately more complicated, even with existing Latin-centric tools, to support only a single script. Heck, if you only support Chinese + Japanese, for example: you can always assume monospacing, which makes layout a breeze; line-wrapping rules are extremely simple [0]; and word-breaking (and associated headaches, like hyphenation and inter-word spacing) is practically a non-factor.
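To illustrate how simple the CJK-only case can get, here is a minimal TypeScript sketch of fixed-width wrapping with a tiny kinsoku-style prohibition list; the rule set is deliberately incomplete and the names are made up, so treat it as a toy rather than a real layout engine:

    // Characters that must not start a line (closing punctuation, etc.) and
    // characters that must not end a line (opening brackets/quotes) -- a
    // deliberately tiny subset for illustration.
    const NO_LINE_START = new Set(["。", "、", "」", "』", "）", "！", "？", "ー"]);
    const NO_LINE_END = new Set(["「", "『", "（"]);

    // Wrap text at `width` characters per line, assuming every character has
    // the same (full) width and that any position is a legal break point
    // unless the prohibition sets say otherwise. A line may run slightly long
    // when a break is forbidden.
    function wrapCJK(text: string, width: number): string[] {
      const lines: string[] = [];
      let line: string[] = [];
      for (const ch of [...text]) {
        const atLimit = line.length >= width;
        const prev = line[line.length - 1] ?? "";
        if (atLimit && !NO_LINE_START.has(ch) && !NO_LINE_END.has(prev)) {
          lines.push(line.join(""));
          line = [];
        }
        line.push(ch);
      }
      if (line.length > 0) lines.push(line.join(""));
      return lines;
    }

    console.log(wrapCJK("あいうえおかきくけこさしすせそ", 8));
    // -> ["あいうえおかきく", "けこさしすせそ"]
    console.log(wrapCJK("あいうえおかきく。けこさしすせそ", 8));
    // -> ["あいうえおかきく。", "けこさしすせそ"]  (the "。" is pulled back, not pushed to a new line)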
"Just drop your language's writing system in favor of the latin alphabet!"
"It's a hard problem (and doesn't affect me) so let's ignore it altogether!"
Must be easy to make remarks like this when your native language happens to use the Latin alphabet...
How about we use our knowledge to build computer systems that respect different writing systems, even if it's hard, instead of enforcing a Latin hegemony upon the rest of the world?
Comparing scripts to numeral notation is comparing apples to oranges. The Latin alphabet does not work for languages that have more sounds than its 26 characters can represent!
I agree with the ironic sentiment. We've pushed up the complexity of a lot of features because we can, not because the value is there. Unicode itself is a great example of this: it's a dumping ground for "everything that resembles text", and while it's easy to consume and there is some benefit to be had, it's also hard to comprehend on any level beyond the very basics. It's quite a long way from the early telegraphy encoding systems.
With respect to selection: selection is key to all forms of interactive editing, because without it you don't have interactivity; you have a linear workflow. As such it does deserve some respect as a fundamental feature, even if it's not critical to fix every last bug. Even if you are working via screen reading and voice-to-text, you can still benefit from having selection tools available to you.
If your goal is, for example, to edit a book, you often need much more than a typewriter can give you. You need figures, equations, centered text, footnotes, rich text styles, tables.
If your goal is even more ambitious, say, publishing the same book online, such that others can use the contents of the book and select it, look it up, share it, edit it etc., you need many of the above to be understood by the text engine, so you end up with shaping, support for all human scripts with all their accumulated idiosyncrasies etc.
Without any of this, you can barely have an academic life on the internet, for example, or support the engineering disciplines, and so on.
"Hate" is a strange choice of word to describe what is merely a complex subject. Lots of different models (emoji + skin tone modifiers is not a bad model in isolation, very much like a letter + combining accent) are developed independently, and then, when we test them with the rest of the players on the field, they don't integrate too well. Yes, they don't, but what would you expect? :) And of course decisions made by different developers will vary in logic, consistency, and completeness. This is the normal state of affairs for the integration part.
It's a phrase. You approach a seemingly simple problem that reveals itself to be a fractal of complexity - at some point you start to feel as if the problem itself had agency and it wanted to make you miserable.
I've naively implemented text rendering, selection, etc. from scratch in a text editor, and if you only support monospace, it's pretty simple. The hardest part in software is to not implement features. One trick is to implement the complex features only for the users who need them, in another VCS branch. Or you make an abstraction layer. But abstraction layers are hard, and if they are not airtight (non-leaky) you are worse off, because then the developer needs to know both layers. Then there is the third strategy, which I would call an anti-pattern: when enough features have built up, in order to get rid of the relics you rewrite it from scratch, preferably in a new framework. I also believe most advancements in science are accidental, like penicillin and lexically scoped modules, so if we did not do these things (new frameworks, reinventing the wheel) the CS field wouldn't advance.
As an almost exclusively Vim user (large-scale Java doesn't work very well in Vim), I find it funny when text editing is a huge pain; if there was pain for me, it was about a month of overall time sunk into getting really good at Vim. After that transition, text editing has mostly been a joy.