It seems that CJK (Chinese-Japanese-Korean) posts are effectively limited to about 100 characters, a sentence or two, due to the 1024-byte limit on URIs in Gemini (each character is 9 bytes after percent-encoding). Has there been discussion on this matter?

Posted in: s/AskGemini

🍵 tacomanator

Mar 11 · 8 weeks ago · 😭 1

10 Comments ↓

🌆 skyjake [mod...] · Mar 11 at 12:31:

It comes up occasionally. Per the specification, query strings are percent-encoded UTF-8 and the entire query is max 1024 bytes, so unfortunately yes, CJK characters, Emoji, and other characters that have a long multibyte encoding will quickly fill up the request.
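As a rough illustration of that arithmetic, here is a minimal Python sketch. The 1024-byte figure comes from this thread; the URL prefix `gemini://example.org/post?` is a made-up placeholder, and `encoded_len` and `max_chars` are hypothetical helper names.

```python
from urllib.parse import quote

MAX_REQUEST_BYTES = 1024  # limit cited in this thread


def encoded_len(text: str) -> int:
    """Length of text after UTF-8 encoding and percent-encoding."""
    # safe="" makes quote() escape every reserved character as well.
    return len(quote(text, safe=""))


def max_chars(sample_char: str, overhead: int = 0) -> int:
    """How many copies of sample_char fit once `overhead` bytes are used."""
    return (MAX_REQUEST_BYTES - overhead) // encoded_len(sample_char)


# Hypothetical URL prefix that also counts against the limit.
prefix = "gemini://example.org/post?"

for ch in ("a", "日", "😀"):
    print(ch, encoded_len(ch), max_chars(ch, overhead=len(prefix)))
```

A kanji costs 9 bytes once encoded, so roughly 110 of them fit after the prefix is subtracted, which matches the "about 100 characters" estimate above.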

The only workaround is to compose longer messages using multiple requests. That is a server-side app implementation issue.
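One way a client or server-side app might split a long message for that workaround is sketched below. This is only an assumption about how it could be done, not how Bubble or any other server does it; the function name `split_for_queries` and the 900-byte budget are hypothetical.

```python
from urllib.parse import quote


def split_for_queries(text: str, budget: int = 900) -> list[str]:
    """Split text into chunks whose percent-encoded size fits in `budget` bytes.

    Splits between code points, so it may separate a base character from a
    combining mark that follows it; a real implementation may want to be
    more careful about that.
    """
    chunks: list[str] = []
    current = ""
    used = 0
    for ch in text:
        cost = len(quote(ch, safe=""))
        if cost > budget:
            raise ValueError("budget is smaller than a single encoded character")
        # Start a new chunk when this character would overflow the budget.
        if current and used + cost > budget:
            chunks.append(current)
            current, used = "", 0
        current += ch
        used += cost
    if current:
        chunks.append(current)
    return chunks


message = "あ" * 250  # 250 hiragana, 9 bytes each once percent-encoded
parts = split_for_queries(message)
print([len(p) for p in parts])
```

Each part would then be sent as its own request, and the server appends it to the draft.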

(Using a different protocol like Titan is the alternative solution.)

🚂 MrSVCD · Mar 11 at 22:07:

Since UTF-8 encoded Unicode is one of my fixations, I want to add this:

UTF-8 uses at most 4 bytes per character.

A Korean hangul character combines up to 3 sounds, a Chinese character is a whole word in 4 bytes or less, and a Japanese character, if I remember correctly, is a word, a sound or two, or a foreign sound, since Japanese has 3 alphabets.

🍵 tacomanator [OP] · Mar 11 at 22:56:

@skyjake Thank you — I had the same thought about multiple requests. Do you know of anyone doing this already?

I looked into Titan. From what I can tell, it essentially bolts a PUT-like operation onto Gemini. I could not find much evidence of it being used anywhere, beyond perhaps a wiki server implementation. Do you know if it has been applied to sites like Station or a BBS?

🍵 tacomanator [OP] · Mar 11 at 23:03:

@MrSVCD thank you for the clarification. Japanese has three alphabets, and most of the characters across all three are 3 bytes each in UTF-8, which becomes 9 bytes each when percent-encoded.

Japanese, and to a degree Korean, are the most affected by the byte limit as they use more characters than Chinese to express the same thing due to particles, conjugations, etc.
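A quick way to check the 3-byte and 9-byte figures is sketched below. The sample characters are arbitrary picks, one per script, and `byte_costs` is a hypothetical helper name.

```python
from urllib.parse import quote

# One arbitrary sample character per script.
samples = {
    "hiragana": "あ",
    "katakana": "ア",
    "kanji": "漢",
    "hangul": "한",
    "latin": "a",
    "emoji": "😀",
}


def byte_costs(ch: str) -> tuple[int, int]:
    """Return (raw UTF-8 bytes, bytes after percent-encoding) for ch."""
    raw = len(ch.encode("utf-8"))
    # Every non-ASCII byte becomes a three-character %HH escape.
    encoded = len(quote(ch, safe=""))
    return raw, encoded


for script, ch in samples.items():
    raw, encoded = byte_costs(ch)
    print(f"{script:9} {ch} utf-8={raw} percent-encoded={encoded}")
```

Percent-encoding triples the size of every non-ASCII byte, so a 3-byte character costs 9 bytes and a 4-byte emoji costs 12.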

🚀 ColonelThirtyTwo · Mar 12 at 02:25:

@MrSVCD UTF-8 is max 4 bytes per character, but those bytes then get percent-encoded, further driving up the bytes per character.

🌆 skyjake [mod...] · Mar 12 at 08:00:

@tacomanator Bubble (that runs this site) supports Titan for making and editing long posts. This is documented in the Help:

— /help

Using the Bubble draft composer, you can effectively submit long posts and comments as multiple Gemini requests as well.

Station does not support Titan nor does it allow appending text to previously submitted entries.

Titan is used by some to edit their capsules, gemlogs, and/or tinylogs. I have no examples off the top of my head apart from my own skyjake.fi, where I've got a private Titan edit feature.

🚂 MrSVCD · Mar 12 at 10:39:

@ColonelThirtyTwo That is true, but the most common C&K characters have their own entries in Unicode.

I think that Unicode is trying to go percent-encoded to not go to 5 bytes of UTF-8.

🍵 tacomanator [OP] · Mar 12 at 23:58:

@skyjake thank you for your help. From there I found a way to post long text from the draft page after enabling Titan in the BBS settings.

The help mentions a ":" command to enter long text mode. I haven't figured out how to get that to work yet, but for now I'm happy to have at least one working method!

🚬 sy · Mar 13 at 15:47:

Maybe this (RFC 2718 §2.2.5) should be explicitly allowed in the Gemini specification:

Unless there is some compelling reason for a particular scheme to do otherwise, translating character sequences into UTF-8 and then subsequently using the %HH encoding for *unsafe* octets is recommended.

Apparently most servers, including BBS and Station, already allow it.

— Test with more than 300 kanji characters

🚂 MrSVCD · Mar 13 at 18:04:

Thanks @sy, that explains the difference between what I thought and what OP said.