unicode - Handling count of characters with diacritics in R
I'm trying to get the number of characters in strings that contain characters with diacritics, but I can't manage to get the right result.
> x <- "n̥ala"
> nchar(x)
[1] 5
What I want is 4, since n̥ should be considered one character (i.e. diacritics shouldn't be considered characters on their own, even when more than one diacritic is stacked on a base character).
How can I get this kind of result?
Here is my solution. The idea is that phonetic alphabets can have a Unicode representation, and then:
Use the Unicode package; it provides the function Unicode_alphabetic_tokenizer, which works as follows:
Tokenization first replaces the elements of x by their Unicode character sequences. Then, the non-alphabetic characters (i.e., the ones which do not have the Alphabetic property) are replaced by blanks, and the corresponding strings are split according to the blanks.
After that I used nchar, and because the previous function splits the string into 2 substrings, I used sum.
> library(Unicode)
> sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 4
I believe this package can be useful in such cases, but I'm not an expert and I do not know if my solution works for all problems that involve phonetic alphabets. Maybe other examples would be useful to establish the validity of my solution.
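For comparison, the same count can be sketched in base R without the package, by deleting combining marks before calling nchar. This is only a sketch under assumptions: the helper name count_without_marks is made up for illustration, the input is assumed to be UTF-8, and the diacritics are assumed to be encoded as separate combining characters (Unicode general category M) rather than as precomposed letters.

```r
# Hypothetical helper (not part of any package): count characters
# after removing combining marks (Unicode general category M).
count_without_marks <- function(s) {
  s <- enc2utf8(s)                                # assumption: input is representable as UTF-8
  stripped <- gsub("\\p{M}", "", s, perl = TRUE)  # drop diacritics, keep base characters
  nchar(stripped, type = "chars")
}

x <- "n\u0325ala"       # n + COMBINING RING BELOW, followed by "ala"
nchar(x)                # 5: the combining mark is counted separately
count_without_marks(x)  # 4
```

Unlike the tokenizer approach, this sketch still counts spaces, digits and punctuation, so the two can disagree on strings that contain more than letters and diacritics.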
It works well.
Here is an example:
> x <- "e̯ ʊ̯"
> x
[1] "e̯ ʊ̯"
> nchar(x)
[1] 5
> sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 2
P.S. There is only one " in the code, but when copying and pasting it, a second one appears. I do not know why this happens.
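The result of 2 in this example shows that the space is not counted either, since it does not have the Alphabetic property. A rough base R analogue, assuming UTF-8 input, is to count only the letters matched by the regular expression class \p{L}. The helper name count_letters is hypothetical, and \p{L} (general category Letter) is a stand-in that is close to, but not identical with, the Alphabetic property the tokenizer relies on.

```r
# Hypothetical helper (not part of any package): count only letters,
# ignoring combining marks, spaces, digits and punctuation.
count_letters <- function(s) {
  s <- enc2utf8(s)                               # assumption: input is representable as UTF-8
  matches <- gregexpr("\\p{L}", s, perl = TRUE)  # positions of all Unicode letters
  lengths(regmatches(s, matches))                # number of letters per input string
}

x <- "e\u032f \u028a\u032f"  # e and U+028A, each followed by COMBINING INVERTED BREVE BELOW
nchar(x)                     # 5: two letters, two combining marks, one space
count_letters(x)             # 2
```

Whether the space should count depends on the use case, so it is worth checking a few strings with spaces or punctuation before relying on either approach.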