I am trying to count the characters in a string that contains diacritics, but I cannot get the correct result:
> x <- "n̥ala"
> nchar(x)
[1] 5
What I want is 4, because n̥ should be counted as a single character (that is, diacritics should not count as characters of their own, even when multiple diacritics are stacked on the base character).
How can I get this result?
This is my solution. The idea is that the phonetic alphabet has a Unicode representation. Use the Unicode package; it provides the Unicode_alphabetic_tokenizer function:
Tokenization first replaces the elements of x by their Unicode
character sequences. Then, the non-alphabetic characters (i.e., the
ones which do not have the Alphabetic property) are replaced by
blanks, and the corresponding strings are split according to the
blanks.
After that, I used nchar, but because the tokenizer splits the string into several substrings, I summed the results:
sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 4
I believe this package is very useful in this case, but I am not an expert, and I don't know whether my solution applies to all problems involving phonetic alphabets. Perhaps further examples can help illustrate its effectiveness. It works well; here is another example:
> x <- "e̯ ʊ̯"
> x
[1] "e̯ ʊ̯"
> nchar(x)
[1] 5
> sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 2
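As an alternative sketch that avoids the extra package, base R's Perl-style regexes can strip all Unicode combining marks (the \p{M} property class) before counting. Note that, unlike the tokenizer approach above, the result keeps spaces, so remove them too if you want to match the tokenizer's counts; the strings below are written with explicit \u escapes for the combining marks so the example is copy-paste safe.

```r
# Count characters while ignoring combining diacritics:
# delete every Unicode mark (\p{M}) and then apply nchar.
x <- "n\u0325ala"            # "n̥ala": n + combining ring below + ala
nchar(gsub("\\p{M}", "", x, perl = TRUE))
# [1] 4

# For strings with spaces, drop the spaces as well to count
# only the letters (matching the tokenizer-based result above).
y <- "e\u032f \u028a\u032f"  # "e̯ ʊ̯"
nchar(gsub("[\\p{M} ]", "", y, perl = TRUE))
# [1] 2
```

This relies only on base R's PCRE engine (perl = TRUE), so it needs no additional dependencies, but it counts base characters rather than true grapheme clusters.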
Note: there is only one quotation mark in the code, but when I copy and paste it a second one appears. I don't know why this happens.