Are combining diacritics counted as letters? As far as I know, in well-formed Unicode they can only be combined with other letters.
The ICU function that determines whether a Unicode code point is a letter takes only a single code point, so for any given code point it has no way of knowing whether that code point has been combined with a diacritic or, if it is itself a diacritic, what it is combined with. I am trying to implement something Unicode-aware using a construct similar to the regular expression stuff, like
while(is_letter(codepoint))
However, I am quite concerned about what happens if the code point is actually a diacritic. It should be grouped with the preceding code point, along with any other combining marks.
Is it safe to do this? Or do I have to explicitly find and ignore diacritics and other combining marks?
Edit: What I really need to do is iterate over characters, not code points.
This question fell victim to the XY problem. I need to ask a question about my actual problem.
I don’t know exactly what you are trying to do, so I apologize in advance if this is not the answer you want, but:
For combining diacritics, are they counted as letters?
Broadly speaking, diacritics are regarded as “marks” rather than “letters”. For example, U+0301 COMBINING ACUTE ACCENT, as in <ś>, is a “non-spacing mark”, which is one of three kinds of “marks”. However, “modifier letters”, which are classified as “letters”, may also be considered diacritics; for example, U+02C0 MODIFIER LETTER GLOTTAL STOP, as in , is a “modifier letter”.
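To see the distinction concretely, here is a minimal sketch that looks up the general category of the code points mentioned above. It uses Python's standard unicodedata module as a stand-in for ICU, purely for convenience; the helper names is_letter and is_mark are hypothetical and are not ICU functions.

```python
import unicodedata


def is_letter(ch):
    # Hypothetical helper: true when the general category is one of the
    # letter categories (Lu, Ll, Lt, Lm, Lo).
    return unicodedata.category(ch).startswith("L")


def is_mark(ch):
    # Hypothetical helper: true when the general category is one of the
    # mark categories.
    return unicodedata.category(ch).startswith("M")


acute = "\u0301"          # COMBINING ACUTE ACCENT
glottal = "\u02c0"        # MODIFIER LETTER GLOTTAL STOP
precomposed_s = "\u015b"  # LATIN SMALL LETTER S WITH ACUTE

for ch in (acute, glottal, precomposed_s):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
```

Under this per-code-point test, the combining accent is a mark and not a letter, while the modifier letter and the precomposed <ś> both count as letters.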
If you look at the main file of the Unicode Character Database (warning: it is a 1.3 MB text file), you can see which characters are classified as “modifier letters” (Lm) and which are classified as “non-spacing marks” (Mn), “spacing marks” (Mc), or “enclosing marks” (Me).
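Given those categories, one possible way to approach the while(is_letter(codepoint)) loop from the question is to attach every mark (Mn, Mc, or Me) to the code point before it, and then test only the first code point of each group. The following is a rough sketch under that assumption, again in Python with the standard unicodedata module rather than ICU; group_with_marks and leading_letters are hypothetical names. It is a simplification and not a full grapheme cluster segmentation as defined in Unicode Standard Annex #29.

```python
import unicodedata

MARK_CATEGORIES = ("Mn", "Mc", "Me")


def group_with_marks(text):
    """Split text into groups of one base code point plus trailing marks."""
    groups = []
    for ch in text:
        if groups and unicodedata.category(ch) in MARK_CATEGORIES:
            # A mark is attached to whatever group precedes it.
            groups[-1] += ch
        else:
            # Anything else (or a mark with nothing before it) starts a group.
            groups.append(ch)
    return groups


def leading_letters(text):
    """Collect leading groups whose base code point is a letter."""
    result = []
    for group in group_with_marks(text):
        if not unicodedata.category(group[0]).startswith("L"):
            break
        result.append(group)
    return result


# "s" followed by U+0301, then "a", then a digit and another letter.
print(leading_letters("s\u0301a1b"))
```

With this arrangement a combining mark never reaches the letter test on its own unless it appears at the very start of the text, in which case it forms its own group and is rejected as a non-letter.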