Comment by 🎲 lab6

@fstfabi Is there actually a Unicode standard for this type of normalisation? I know there are libraries that have opinions about equivalences but I could not find a Unicode standard for it. There are standards for normalising sequences of combining diacritics into characters like Å but not for normalising 𝐴 into A (presumably because a semantic difference is intended, so this would be a lossy transformation rather than normalisation).

Still, I do think browsers and search engines should engage with the world as they find it, not how they would like it to be. These bold/italic-looking characters are widely (mis-)used for presentational purposes, so the horse has bolted.

If a user wants to search text in a page, and they enter the search string “ant”, then the user-friendly thing to do would be to find “𝕒𝕟𝕥” as well as “ant”.

I don’t see this as a reason to encourage (mis-)use of these characters. More of a case of being liberal in what you accept and conservative in what you do.

Trouble is, if we do the user-friendly thing, it opens the stable door completely, and maybe there are yet unbolted horses that may be saved.

🎲 lab6

Apr 28 · 9 days ago

🚀 fstfabi · Apr 28 at 20:22:

> Still, I do think browsers and search engines should engage with the world as they find it, not how they would like it to be.

Fully agree. It is just the unfortunate and understandable complexity of Unicode. Frankly, I don't expect every GUI to support all the weird edge-cases.

As for the normalization, it's called "Compatibility Equivalence" and the forms are NFKD and NFKC.

— https://www.unicode.org/reports/tr15/
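For what it's worth, here is a minimal Python sketch of what that looks like in practice, using only the standard `unicodedata` module. The helper names `fold` and `loose_find` are made up for illustration, and folding with NFKC plus `casefold()` is just one plausible way a client could do in-page search, not something any browser is known to do.

```python
import unicodedata


def fold(text: str) -> str:
    """Apply compatibility normalisation (NFKC), then case-fold, for loose matching."""
    return unicodedata.normalize("NFKC", text).casefold()


def loose_find(needle: str, haystack: str) -> bool:
    """Return True if needle occurs in haystack once both have been folded."""
    return fold(needle) in fold(haystack)


# Canonical equivalence: A + COMBINING RING ABOVE composes to the single character Å.
composed = unicodedata.normalize("NFC", "A\u030A")

# Compatibility equivalence: MATHEMATICAL ITALIC CAPITAL A (U+1D434) maps to plain A.
plain = unicodedata.normalize("NFKC", "\U0001D434")

# Searching for "ant" now also finds the double-struck spelling.
found = loose_find("ant", "an \U0001D552\U0001D55F\U0001D565 hill")
```

The mapping is one-way: after folding there is no telling whether the page said 𝐴 or A, so this probably fits a search index better than stored text.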

Here are all the properties of the 𝐴 in your example:

— https://util.unicode.org/UnicodeJsps/character.jsp?a=1D434

"Decomposition_Mapping", "NFKC_Casefold", and many other properties I barely understand myself.

🚬 sy · Apr 28 at 20:39:

@lab6 IIRC, it was called something like compatibility equivalence or compatibility mapping. But it would defeat its purpose in this particular use case.

📻 eugene · Apr 28 at 21:16:

...and people whose languages don't use Latin letters are second class citizens and should never expect their needs to be met by anything, right. :)

🚀 lars_the_bear · Apr 29 at 07:20:

@eugene : "...and people whose languages don't use Latin letters are second class citizens and should never expect their needs to be met by anything, right. :)"

I think that's really the knock-down argument against the approach we're discussing here. Even common French and German letters aren't easily supported, so far as I can see.

But, frankly, the more I think about this, the more I realize that we shouldn't be doing _anything_ to extend the life of Gemtext. I'm not worried that the horse has bolted, but that it's lame, and needs to be put out of its misery.

📻 eugene · Apr 29 at 08:56:

@lars_the_bear

I ended up settling for u̲n̲d̲e̲r̲l̲i̲n̲e̲ on my blog (which is generated from markdown); it is better supported than you'd think. But it's still just a band-aid.
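A rough sketch of how that kind of underline can be generated, assuming it is done with U+0332 COMBINING LOW LINE after every character (that is an assumption about the generator, and the function names here are invented):

```python
COMBINING_LOW_LINE = "\u0332"  # assumed to be the code point behind the underline


def underline(text: str) -> str:
    """Append a combining low line to every character of text."""
    return "".join(ch + COMBINING_LOW_LINE for ch in text)


def strip_underline(text: str) -> str:
    """Remove the combining low lines again, e.g. before indexing for search."""
    return text.replace(COMBINING_LOW_LINE, "")


fancy = underline("underline")
```

Unlike the maths letters, the base characters stay plain ASCII here, so a client only has to drop one combining mark to get the original text back.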

The problem with using markdown is "which variant?" While browsers that support markdown do exist, none of them even publicize what they actually support when they call it markdown - and no, it doesn't look like you can rely on CommonMark. I think something new is needed, some kind of gemtext-plus, which would:

* Be *actually* easy to parse. If you think you can parse gemtext with a switch on `split(" ")[0]`, you can't: a careful reading of the spec shows otherwise, and writing an elegant and uniform gemtext parser is impossible.
* Support *some* inline markup, in a well-defined, unambiguous way.

The problem is that getting people to support it is effectively a lost cause; this topic has been raised repeatedly since Gemini became a thing.

☕️ tenno-seremel · Apr 29 at 13:08:

Let’s go with org markup (but without the Lisp parts) [pokerface]

📻 eugene · Apr 29 at 14:33:

@tenno-seremel

Org is *a bit* much.

But if we're talking seriously about this...

The idea that the parser works in terms of lines, each of which is one of text, header, list item or link (and otherwise exists in a vacuum), and that the line type can be determined from the first characters of the line is, I think, the right way to keep things simple. Gemtext has only one exception that switches the parsing mode: preformatted text blocks. Ironically, the problem with writing a gemtext parser is that the spec lets you write all of these in a very lax way - whitespace separators are optional for everything *except* lists, for example, where they are suddenly required - so you have to check for line types in a very specific order or you're going to miss something. This bit has to go. Instead, the type of line should be determined by the first word, where the first word is separated from the rest of the line by whitespace.
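To make the "first word decides" rule concrete, here is a small Python sketch. It is not a gemtext parser: the marker table, the type names and the `parse` function are all hypothetical, and it assumes lines arrive without their trailing newline (e.g. from `str.splitlines()`).

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical marker table for the stricter dialect: the first word must match exactly.
MARKERS = {
    "#": "heading1",
    "##": "heading2",
    "###": "heading3",
    "*": "list",
    "=>": "link",
    ">": "quote",
}
FENCE = "```"  # the one marker that switches the parsing mode

# First word, then (optionally) whitespace and the rest of the line.
_LINE = re.compile(r"(\S+)(?:\s+(.*))?$")


def parse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (line_type, content) pairs, looking at each line in isolation."""
    preformatted = False
    for line in lines:
        if line.startswith(FENCE):
            preformatted = not preformatted  # toggle lines produce no output
            continue
        if preformatted:
            yield ("pre", line)
            continue
        match = _LINE.match(line)
        if match and match.group(1) in MARKERS:
            yield (MARKERS[match.group(1)], match.group(2) or "")
        else:
            yield ("text", line)  # anything else, including "#NoSpace", is plain text
```

With one lookup per line there is no ordering of checks to get wrong, which is the whole point of making the whitespace mandatory.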

One of the few things I would add at this stage is multi-level lists, but a list item would still be a line in a vacuum. I.e. you would have `** foo` for a second level list item, which, when alone without any surrounding lines, would still be rendered as a second level list item. Similarly, `* 1. foo` would be a numbered top level list item -- the "1." would be the list item marker to be rendered optionally instead of (rather than after) the bullet, and the same logic would work for, e.g., `* A. foo`. Lists would not be a structural element of the document, because that requires the parser to consider all lines together instead of simply running through them one by one.
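A sketch of how a single list line could be read under that scheme, in Python. The `ListItem` type and `parse_list_item` are invented names, and restricting the optional marker to digits or a single letter followed by a dot is an assumption made here to keep ordinary sentences from being mistaken for markers.

```python
import re
from typing import NamedTuple, Optional


class ListItem(NamedTuple):
    level: int             # number of leading asterisks
    marker: Optional[str]  # e.g. "1." or "A.", rendered instead of the bullet
    text: str


# Asterisks, mandatory whitespace, optional marker plus whitespace, then the text.
_ITEM = re.compile(r"(\*+)\s+(?:((?:\d+|[A-Za-z])\.)\s+)?(.*)$")


def parse_list_item(line: str) -> Optional[ListItem]:
    """Parse one list line in isolation; return None if it is not a list item."""
    match = _ITEM.match(line)
    if match is None:
        return None
    stars, marker, text = match.groups()
    return ListItem(len(stars), marker, text)
```

Nothing here looks at neighbouring lines, so `** foo` on its own still comes out as a level 2 item.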

Any inline emphasis markers would be matching pairs. I.e. never use the same symbol/token to start and end a mode, use one to start it, one to end it, and all modes implicitly end when the line ends. Inline markers should only be available in text lines and nowhere else. Perhaps, use syntax like `{/italic/}`, `{*bold*}`, or something...
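Roughly what that could look like as code, again in Python and purely as a sketch: the `{/ ... /}` and `{* ... *}` tokens are just the placeholder syntax floated above, and `parse_inline` is a hypothetical helper that would only ever be called on text lines.

```python
import re
from typing import FrozenSet, List, Tuple

# Distinct tokens open and close each mode, so there is never any guessing.
OPEN = {"{/": "italic", "{*": "bold"}
CLOSE = {"/}": "italic", "*}": "bold"}
_TOKEN = re.compile(r"\{/|/\}|\{\*|\*\}")


def parse_inline(line: str) -> List[Tuple[str, FrozenSet[str]]]:
    """Split one text line into (text, active_styles) spans."""
    spans: List[Tuple[str, FrozenSet[str]]] = []
    active = set()
    pos = 0
    for match in _TOKEN.finditer(line):
        if match.start() > pos:
            spans.append((line[pos:match.start()], frozenset(active)))
        token = match.group()
        if token in OPEN:
            active.add(OPEN[token])
        else:
            active.discard(CLOSE[token])  # a stray closer is simply ignored
        pos = match.end()
    if pos < len(line):
        spans.append((line[pos:], frozenset(active)))
    # All modes end implicitly here: no state is carried over to the next line.
    return spans
```

For example, `parse_inline("a {*b*} c")` gives three spans with only the middle one marked bold, and an unclosed `{*bold` just stays bold until the line ends.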

I don't think inline links are feasible without breaking this paradigm and I'm not sure they're really all that critical.

🌒 s/Gemini

🌙 manat:

God bless maths — The answer to the inline styling debate is maths, specifically the Mathematical Alphanumeric Symbols Unicode block. This beast contains every Latin letter in bold, italic, bold italic, and other nonsense. Here is an example: Why is Unicode this fancy and/or ｂｌｏａｔｅｄ. That last one was not in the math letters block but in the full and halfwidth block. I acknowledge that doing this is a mess in terms of accessibility and an affront to users' preferences, but… Previewing this post…

💬 19 comments · Apr 28 · 9 days ago