unicode - Handling count of characters with diacritics in R

January 15, 2011

I'm trying to get the number of characters in strings that contain characters with diacritics, but I can't manage to get the right result.

> x <- "n̥ala"
> nchar(x)
[1] 5

What I want is 4, since n̥ should be considered one character (i.e. diacritics shouldn't be considered characters on their own, even with more than one diacritic stacked on a base character).

How can I get this kind of result?
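To see where the 5 comes from, it can help to list the code points that make up the string. The snippet below is only a sketch in base R; the string is rebuilt from escapes, on the assumption that the diacritic in the example is U+0325 (COMBINING RING BELOW).

```r
# Build the string from explicit escapes so the combining mark survives copy/paste.
# Assumption: the diacritic under "n" is U+0325 COMBINING RING BELOW.
x <- "n\u0325ala"

# utf8ToInt() returns one integer per Unicode code point.
code_points <- utf8ToInt(x)
print(sprintf("U+%04X", code_points))

# nchar() counts code points by default (type = "chars"),
# so the combining mark is counted as a character of its own.
print(nchar(x))
```

The second code point is the combining mark, which is why nchar reports one more than the number of visible letters.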

Here is my solution. The idea is that phonetic alphabets can have a Unicode representation, and then:

Use the Unicode package; it provides the function Unicode_alphabetic_tokenizer, which works as follows:

Tokenization first replaces the elements of x by their Unicode character sequences. Then, the non-alphabetic characters (i.e., the ones which do not have the Alphabetic property) are replaced by blanks, and the corresponding strings are split according to the blanks.

After that I used nchar, and because the previous function splits the string into 2 substrings (the diacritic is replaced by a blank), I used sum to add up their lengths.

> library(Unicode)
> sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 4
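As a cross-check that does not depend on the package, here is a minimal sketch in base R. It is a different technique from the tokenizer: it deletes every combining mark with a regular expression and then counts what is left. The helper name count_base_chars is made up for illustration, and the sketch assumes R's PCRE library was built with Unicode property support (the usual case) and that the diacritic is U+0325.

```r
# Hypothetical helper, not part of any package: count characters after
# dropping combining marks.
count_base_chars <- function(s) {
  # \p{M} matches any Unicode combining mark; perl = TRUE selects the PCRE engine.
  stripped <- gsub("\\p{M}", "", s, perl = TRUE)
  nchar(stripped, type = "chars")
}

x <- "n\u0325ala"  # "n" + U+0325 COMBINING RING BELOW + "ala"
print(nchar(x))             # counts the diacritic as its own character
print(count_base_chars(x))  # counts only the base characters
```

Unlike the tokenizer, this keeps spaces, digits and punctuation in the count, so the two approaches can disagree on strings that contain more than letters and diacritics.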

I believe this package can be useful in such cases, but I am not an expert and I do not know whether my solution works for all problems that involve phonetic alphabets. Maybe other examples would be useful to establish the validity of this solution.

It works well.

Here is an example:

> x <- "e̯ ʊ̯"
> x
[1] "e̯ ʊ̯"
> nchar(x)
[1] 5
> sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 2
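In this example the result is 2 rather than 3 because the tokenizer also splits on the blank between the two symbols, so the space is never counted. A rough base R sketch of the same idea, assuming both diacritics are U+032F (COMBINING INVERTED BREVE BELOW) and that PCRE has Unicode property support, would strip the combining marks and then split on whitespace:

```r
# Assumption: both diacritics are U+032F COMBINING INVERTED BREVE BELOW;
# \u028A is the IPA letter upsilon.
x <- "e\u032F \u028A\u032F"

# Drop the combining marks first (\p{M} needs perl = TRUE).
no_marks <- gsub("\\p{M}", "", x, perl = TRUE)

# Split on whitespace so the blank is not counted, then add up the pieces.
tokens <- strsplit(no_marks, "[[:space:]]+")[[1]]
total <- sum(nchar(tokens))
print(total)
```

This only mirrors the tokenizer for letters, diacritics and blanks; other non-alphabetic characters would still be counted here.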

P.S. There is one " in the code, but when copying and pasting it, a second one appears. I do not know why this happens.
