God bless maths

The answer to the inline styling debate is maths, specifically the Mathematical Alphanumeric Symbols Unicode block. This beast contains every Latin letter in bold, italic, bold italic, and other nonsense. Here is an example: 𝗪𝗵𝘆 𝖎𝖘 𝘶𝘯𝘪𝘤𝘰𝘥𝘦 𝕥𝕙𝕚𝕤 𝓯𝓪𝓷𝓬𝔂 𝚊𝚗𝚍 𝚘𝚛 ｂｌｏａｔｅｄ. That last one is not in the maths letters block but in the Halfwidth and Fullwidth Forms block.
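For anyone who wants to try this, here is a minimal sketch of the substitution in Python. It only handles ASCII letters and only the plain bold style, which I picked because its 52 letters sit in one unbroken run starting at U+1D400. The function name `to_math_bold` is made up for this example.

```python
BOLD_UPPER_A = 0x1D400  # MATHEMATICAL BOLD CAPITAL A
BOLD_LOWER_A = 0x1D41A  # MATHEMATICAL BOLD SMALL A


def to_math_bold(text: str) -> str:
    """Replace ASCII letters with their Mathematical Bold look-alikes."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(BOLD_UPPER_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(BOLD_LOWER_A + ord(ch) - ord("a")))
        else:
            # Digits, punctuation, accented letters etc. pass through unchanged.
            out.append(ch)
    return "".join(out)


print(to_math_bold("Why is Unicode this fancy?"))
```

Other styles need more care: some of their letters live outside this block, so a plain offset does not work for every style.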

I acknowledge that doing this is a mess in terms of accessibility and an affront to users' preferences, but…

Previewing this post shows that the text just gets normalized to unstyled Latin characters in the feed view but does not in the normal view. So I pray that you folks are able to see what I can.

Posted in: s/Gemini

🌙 manat

Apr 28 · 9 days ago

19 Comments ↓

🚀 lars_the_bear · Apr 28 at 12:18:

Yes, yes, oh.... YES!

This works on the Linux and Android versions of Lagrange, but I had to install additional fonts. It works on Alhena without any additional stuff, and also with my own "Caztor" client. It even works in Amfora at the Linux terminal.

Given how many clients support this, I don't see many downsides to using it. I'm going to extend my Markdown-to-Gemtext converter to be able to output these codes automatically.

🐑 zipsegv · Apr 28 at 12:27:

the downside is that screenreaders can not cope with these at all, which is terrible if you care about accessibility at all.

I used to use these for my site before someone pointed this out.

🎲 lab6 · Apr 28 at 13:23:

Another usability issue is that searching for the apparent text won’t find it unless you type the actual text, which will need characters not readily available on most keyboards.

🌆 skyjake [mod...] · Apr 28 at 14:02:

Abusing Unicode like this is a bad idea. Here are some previous comments for reference:

— /s/Gemini/31392

— /s/Geminispace/3238

🚀 lars_the_bear · Apr 28 at 14:52:

It's not an abuse -- these characters are part of the Unicode standard. If I use the math characters to represent, say, a variable name in an expression, I'd expect it to be read as a word. If your screen reader can't read these words out, log a bug.

Having said that, I agree that it's expecting a lot for a Gemini client to recognize these characters in a text search.

🌙 manat [OP] · Apr 28 at 14:54:

@lab6 I knew about the accessibility issue but never thought about the search engine one. Big web search engines like ddg seem to normalize the letters, so copy pasting the “fancy” from my post yields the definition of fancy. But expecting gemini search engines to do the same for such a small use case is stupid.

🌙 manat [OP] · Apr 28 at 15:30:

@lars_the_bear The issue is those characters get read as their unicode names, or if you get lucky, they get read one letter at a time.

— https://blog.iconfactory.com/2018/07/listening-to-poo-your-emoji-and-you/
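To see which names are meant here, Python's standard `unicodedata` module can list them per character. This only prints the official character names; what a particular screen reader actually says is not something the snippet can show. `character_names` is a made-up helper name.

```python
import unicodedata


def character_names(text: str) -> list[str]:
    """Return the Unicode character name of every code point in text."""
    return [unicodedata.name(ch, "UNNAMED") for ch in text]


# "fa" written with Mathematical Bold letters (U+1D41F, U+1D41A)
for name in character_names("\U0001D41F\U0001D41A"):
    print(name)
```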

🚀 lars_the_bear · Apr 28 at 15:44:

@manat : "...if you get lucky, they get read one letter at a time."

Yeah. That's why you should log a bug :)

But, despite my initial enthusiasm, I'm going off this idea, in general -- although I can see how it might be useful in some situations.

The problem is that Gemtext is too broken to fix with band-aids like this. Better to use something different entirely.

🌲 Half_Elf_Monk · Apr 28 at 15:53:

It seems like clients and search engines could both do a dynamic substitution of characters. So every time one sees weird emphasis-unicode, it could 'translate' and substitute that for whatever the 'normal' unicode character would be. Likewise, clients (if not editors) could implement some kind of dynamic substitution when entering text, such that a user could "bold" the text and the client substitutes characters appropriately.
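A rough sketch of the 'translate' direction, assuming a client is willing to consult the Unicode character database: build a lookup table once from the `<font>` decomposition mappings in the Mathematical Alphanumeric Symbols block (U+1D400 to U+1D7FF) and run incoming text through it. The names `build_fold_table` and `fold_styled` are invented for this example.

```python
import unicodedata


def build_fold_table() -> dict[int, str]:
    """Map each styled maths letter or digit to its plain counterpart."""
    table = {}
    for cp in range(0x1D400, 0x1D800):
        # Assigned characters here decompose as e.g. "<font> 0041";
        # unassigned code points return an empty string and are skipped.
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[0] == "<font>":
            table[cp] = chr(int(parts[1], 16))
    return table


FOLD_TABLE = build_fold_table()


def fold_styled(text: str) -> str:
    """Replace styled maths characters; everything else is left alone."""
    return text.translate(FOLD_TABLE)


print(fold_styled("\U0001D552\U0001D55F\U0001D565"))  # double-struck "ant"
```

This table misses the handful of styled letters that were encoded earlier in the Letterlike Symbols block (double-struck ℂ at U+2102, for instance), so a real client would need to cover those as well.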

That said, if this sort of thing starts to see widespread use, it seems like it'd work against one of the major things I like about gemini: the ease of access. I don't just mean for a retrocomputing crowd, but also the sort of person who wants to write a simple gemini client. Added unicode fanciness seems like it complicates, and thus raises, the barrier to entry. Keep that low.

𝕆𝕥𝕙𝕖𝕣𝕨𝕚𝕤𝕖, 𝕥𝕙𝕚𝕤 𝕚𝕤 𝕡𝕣𝕖𝕥𝕥𝕪 𝕔𝕠𝕠𝕝, and there are lots of neat possibilities.

— A site where you can substitute text for text

🚀 lars_the_bear · Apr 28 at 16:20:

@Half_Elf_Monk : "I don't just mean for a retrocomputing crowd..."

I'm not sure I see Gemini getting that much interest from retrocomputing enthusiasts, although I can certainly see why it might initially appeal (as it did to me).

The reliance on TLS is a problem, although I suppose there's Spartan if you don't like that. My concern is that there's too much reliance on Unicode already, even leaving aside the math glyph thing. We have emojis, line drawing characters, non-Latin alphabets, symbols...

None of these features is essential to Gemini, but they have become very widespread. I don't think my CP/M machine will handle this kind of thing very well.

🚀 fstfabi · Apr 28 at 17:50:

accessibility shouldn't be as big of a deal when unicode normalization exists for parsing through ＡＥＳＴＨＥＴＩＣ words.

at the very least it shouldn't be a dealbreaker because by that metric anything but unformatted plain text would be an issue.

there are more technical problems:

1. not every client supports all unicode characters. e.g. the non-BMP characters in the post don't render for me.

2. not every client renders all unicode correctly. like how do you render √(5+1)? technically √5̅+̅1̅ but depending on your client the line may not be continuous, or the square root will have a bar, etc.

3. they are latin only and thus exclude other languages.

you could use ANSI escape sequences but please don't. it's insane that some clients support that.

🎲 lab6 · Apr 28 at 19:56:

@fstfabi Is there actually a Unicode standard for this type of normalisation? I know there are libraries that have opinions about equivalences but I could not find a Unicode standard for it. There are standards for normalising sequences of combining diacritics into characters like Å but not for normalising 𝐴 into A (presumably because a semantic difference is intended, so this would be a lossy transformation rather than normalisation).

Still, I do think browsers and search engines should engage with the world as they find it, not how they would like it to be. These bold/italic-looking characters are widely (mis-)used for presentational purposes, so the horse has bolted.

If a user wants to search text in a page, and they enter the search string “ant”, then the user-friendly thing to do would be to find “𝕒𝕟𝕥” as well as “ant”.
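As a sketch of what that could look like in a client's find-in-page, assuming Python's standard `unicodedata` module and its compatibility normalization (the "NFKC" form), both the page text and the query get folded before comparing. `fold` and `page_contains` are made-up names.

```python
import unicodedata


def fold(text: str) -> str:
    """Compatibility-normalize, then casefold, for loose matching."""
    return unicodedata.normalize("NFKC", text).casefold()


def page_contains(page: str, query: str) -> bool:
    """True if query occurs in page once both have been folded."""
    return fold(query) in fold(page)


page = "There is an \U0001D552\U0001D55F\U0001D565 on the hill."
print(page_contains(page, "ant"))
```

This only answers yes or no. Highlighting the match in the original text is harder, because normalization can change the length of the string, so offsets in the folded text do not always line up with the original.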

I don’t see this as a reason to encourage (mis-)use of these characters. More of a case of being liberal in what you accept and conservative in what you do.

Trouble is, if we do the user-friendly thing, it opens the stable door completely, and maybe there are yet unbolted horses that may be saved.

🚀 fstfabi · Apr 28 at 20:22:

> Still, I do think browsers and search engines should engage with the world as they find it, not how they would like it to be.

Fully agree. It is just the unfortunate and understandable complexity of Unicode. Frankly, I don't expect every GUI to support all the weird edge-cases.

As for the normalization, it's called "Compatibility Equivalence" and the forms are NFKD and NFKC.

— https://www.unicode.org/reports/tr15/
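A small check of the difference, using Python's standard `unicodedata` module: the canonical form NFC composes A plus a combining ring into Å but leaves 𝐴 alone, while the compatibility form NFKC maps 𝐴 to a plain A. The helper name `normal_forms` is made up for this example.

```python
import unicodedata

MATH_ITALIC_A = "\U0001D434"   # MATHEMATICAL ITALIC CAPITAL A
A_PLUS_RING = "A\u030A"        # LATIN CAPITAL LETTER A + COMBINING RING ABOVE


def normal_forms(text: str) -> dict[str, str]:
    """Return text under the canonical and compatibility composed forms."""
    return {form: unicodedata.normalize(form, text) for form in ("NFC", "NFKC")}


print(normal_forms(A_PLUS_RING))
print(normal_forms(MATH_ITALIC_A))
print(unicodedata.decomposition(MATH_ITALIC_A))
```

So the styled letter survives canonical normalization and is only flattened by the compatibility forms, which fits the point that this is a lossy mapping rather than an equivalence of identical characters.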

Here's all the properties of the 𝐴 in your example:

— https://util.unicode.org/UnicodeJsps/character.jsp?a=1D434

"Decomposition_Mapping", "NFKC_Casefold", and many other properties I barely understand myself.

🚬 sy · Apr 28 at 20:39:

@lab6 IIRC, it was called something like compatibility equivalence or compatibility mapping. But it would defeat its purpose in this particular use case.

📻 eugene · Apr 28 at 21:16:

...and people whose languages don't use Latin letters are second class citizens and should never expect their needs to be met by anything, right. :)

🚀 lars_the_bear · Apr 29 at 07:20:

@eugene : "...and people whose languages don't use Latin letters are second class citizens and should never expect their needs to be met by anything, right. :)"

I think that's really the knock-down argument against the approach we're discussing here. Even common French and German letters aren't easily supported, so far as I can see.

But, frankly, the more I think about this, the more I realize that we shouldn't be doing _anything_ to extend the life of Gemtext. I'm not worried that the horse has bolted, but that it's lame, and needs to be put out of its misery.

📻 eugene · Apr 29 at 08:56:

@lars_the_bear

I ended up settling for u̲n̲d̲e̲r̲l̲i̲n̲e̲ for my blog (which is generated from markdown) which is better supported than you'd think. But it's still just a bandaid.
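The underline trick can be sketched like this, assuming it is done by putting U+0332 COMBINING LOW LINE after every character. `decorate` and `undecorate` are made-up names, and the same helper with U+0305 COMBINING OVERLINE gives the overline variant.

```python
COMBINING_LOW_LINE = "\u0332"
COMBINING_OVERLINE = "\u0305"


def decorate(text: str, mark: str = COMBINING_LOW_LINE) -> str:
    """Append a combining mark to every non-whitespace character."""
    return "".join(ch if ch.isspace() else ch + mark for ch in text)


def undecorate(text: str) -> str:
    """Strip both marks again, e.g. before indexing text for search."""
    return text.replace(COMBINING_LOW_LINE, "").replace(COMBINING_OVERLINE, "")


print(decorate("underline"))
print(undecorate(decorate("underline")))
```

Unlike the maths letters, the base characters stay plain, so this also works for letters outside the Latin alphabet. How the marks get drawn is still up to the client's font.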

The problem with using markdown is "which variant?" While browsers that support markdown do exist, none of them even publicize what they actually support when they call it markdown - and no, it doesn't look like you can rely on CommonMark. I think something new is needed, some kind of gemtext-plus, which would:

* Be *actually* easy to parse. If you think you can parse gemtext by using a switch on `split(" ")[0]`, no, carefully reading the spec implies you can't, and writing an elegant and uniform gemtext parser is impossible.
* Support *some* inline markup, in a well-defined, unambiguous way.
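To illustrate the first point, here is a sketch comparing the naive switch with a prefix check. It assumes my reading of the spec, where the space after `=>`, `#` and `>` is optional but a list item needs `* ` with the space. Preformatted toggles are left out, and both function names are made up.

```python
def naive_type(line: str) -> str:
    """Dispatch on the first space-separated token, as in split(" ")[0]."""
    token = line.split(" ")[0]
    kinds = {"=>": "link", "#": "h1", "##": "h2", "###": "h3",
             "*": "list", ">": "quote"}
    return kinds.get(token, "text")


def prefix_type(line: str) -> str:
    """Check prefixes in a fixed order; longer heading prefixes go first."""
    if line.startswith("=>"):
        return "link"
    if line.startswith("###"):
        return "h3"
    if line.startswith("##"):
        return "h2"
    if line.startswith("#"):
        return "h1"
    if line.startswith("* "):  # the only type where the space is required
        return "list"
    if line.startswith(">"):
        return "quote"
    return "text"


for sample in ("=>gemini://example.org/", "#Title", "* item", "*not a list"):
    print(repr(sample), naive_type(sample), prefix_type(sample))
```

The two disagree on the first two samples, which is exactly the case where the separator has been left out.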

The problem is that getting people to support it is effectively a lost cause; this topic has been raised repeatedly since Gemini became a thing.

☕️ tenno-seremel · Apr 29 at 13:08:

Let’s go with org markup (but without the Lisp parts) [pokerface]

📻 eugene · Apr 29 at 14:33:

@tenno-seremel

Org is *a bit* much.

But if we're talking seriously about this...

The idea that the parser works in terms of lines, each of which is one of text, header, list item or link (and otherwise exists in a vacuum) and line type can be determined based on the first characters of the line is, I think, the right thing to keep things simple. Gemtext only has one exception that switches the parsing mode, the preformatted text blocks. Ironically, the problem with writing a gemtext parser is that it lets you write all of these in a very lax way - whitespace separators are optional for everything *except* lists, for example, where they are suddenly required - which makes it a requirement to check for line types in a very specific order, otherwise you're going to miss something, etc. This bit has to go. The type of line is determined by the first word, where the first word is separated from the rest of the line by whitespace.

One of the few things I would add at this stage are multi-level lists, but, a list item would still be a line in a vacuum. I.e. you would have `** foo` for a second level list item, which, when alone without any surrounding lines, would still be rendered as a second level list item. Similarly, `* 1. foo` would be a numbered top level list item -- the "1." would be the list item marker to be rendered optionally instead of (rather than after) the bullet, and the same logic would work for, e.g., `* A. foo`. Lists would not be a structural element of the document, because that requires the parser to consider all lines together instead of simply run through them one by one.
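A minimal sketch of such a line parser, to make the idea concrete. Everything here is an assumption layered on the description above: the `Line` record, the `parse_line` name, reusing `=>` and `#` from Gemtext, and the rule that a marker is digits or a single letter followed by a period. Preformatted blocks are left out.

```python
import re
from dataclasses import dataclass

MARKER = re.compile(r"(?:\d+|[A-Za-z])\.")  # "1.", "12.", "A.", "b."


@dataclass
class Line:
    kind: str         # "text", "heading", "link" or "list"
    level: int = 0    # heading depth or list nesting depth
    marker: str = ""  # optional list marker shown instead of the bullet
    body: str = ""


def parse_line(raw: str) -> Line:
    """Classify one line purely by its first whitespace-delimited word."""
    parts = raw.split(maxsplit=1)
    if not parts or raw[0].isspace():
        return Line("text", body=raw)
    first = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    if first == "=>":
        return Line("link", body=rest)
    if first in ("#", "##", "###"):
        return Line("heading", level=len(first), body=rest)
    if set(first) == {"*"}:
        sub = rest.split(maxsplit=1)
        if sub and MARKER.fullmatch(sub[0]):
            return Line("list", len(first), sub[0], sub[1] if len(sub) > 1 else "")
        return Line("list", level=len(first), body=rest)
    return Line("text", body=raw)


for sample in ("** foo", "* 1. foo", "* A. foo", "#NoSpace"):
    print(parse_line(sample))
```

Each line is handled without looking at its neighbours, so `** foo` is a second-level item even when it stands alone. One ambiguity this sketch does not resolve: an ordinary item whose text starts with a single letter and a period would be read as having a marker.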

Any inline emphasis markers would be matching pairs. I.e. never use the same symbol/token to start and end a mode, use one to start it, one to end it, and all modes implicitly end when the line ends. Inline markers should only be available in text lines and nowhere else. Perhaps, use syntax like `{/italic/}`, `{*bold*}`, or something...
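Here is a rough sketch of how a client could scan a text line for such markers, assuming the tokens are `{/` and `/}` for italic and `{*` and `*}` for bold. The token choice, the `parse_inline` name and the handling of stray markers (kept as literal text) are all my own guesses at one possible design.

```python
import re

OPEN = {"{/": "italic", "{*": "bold"}
CLOSE = {"/}": "italic", "*}": "bold"}
TOKEN = re.compile(r"\{/|/\}|\{\*|\*\}")


def parse_inline(line: str) -> list[tuple[str, frozenset]]:
    """Split a text line into (text, active styles) spans."""
    spans = []
    active = set()
    pos = 0
    for match in TOKEN.finditer(line):
        token = match.group()
        opening = token in OPEN
        mode = OPEN[token] if opening else CLOSE[token]
        # Opening an already open mode, or closing one that is not open,
        # changes nothing, so the marker stays in the text as-is.
        if opening == (mode in active):
            continue
        if match.start() > pos:
            spans.append((line[pos:match.start()], frozenset(active)))
        if opening:
            active.add(mode)
        else:
            active.discard(mode)
        pos = match.end()
    # Whatever is still open simply ends with the line.
    if pos < len(line):
        spans.append((line[pos:], frozenset(active)))
    return spans


for span in parse_inline("plain {*bold*} and {/ital"):
    print(span)
```

Because the start and end tokens differ, the scanner never has to guess whether a marker opens or closes a mode, and no state carries over to the next line.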

I don't think inline links are feasible without breaking this paradigm and I'm not sure they're really all that critical.