r - Turn parsed corpus into data frame using stringr and regex


I'm trying to transform a parsed corpus into a data frame in R using stringr and regular expressions (I've since read that maybe I shouldn't be using regular expressions for this kind of work, but I've spent enough time on it that I'd like to know whether there is a solution). The corpus looks like this:

text <- paste("<w type=\"np0\" lemma=\"dorothy\">dorothy</w><c type=\"pun\">, </c><w type=\"prp\" lemma=\"in\">in </w><w type=\"dps\" lemma=\"she\">her </w><w type=\"nn1\" lemma=\"time\">time</w><c type=\"pun\">, </c><w type=\"vhd\" lemma=\"have\">had </w><w type=\"vbn\" lemma=\"be\">been </w><w type=\"at0\" lemma=\"an\">an </w><w type=\"aj0\" lemma=\"active\">active </w><w type=\"nn1\" lemma=\"member\">member </w><w type=\"prf\" lemma=\"of\">of </w><w type=\"at0\" lemma=\"an\">an </w><w type=\"nn1\" lemma=\"organisation\">organisation </w><w type=\"vvn-vvd\" lemma=\"call\">called </w><w type=\"at0\" lemma=\"the\">the </w><w type=\"nn1\" lemma=\"noise\">noise </w><w type=\"nn1\" lemma=\"reduction\">reduction </w><w type=\"nn1\" lemma=\"society\">society</w><c type=\"pun\">, </c>") 

I've got close to what I want using this:

library("stringr")

# extract type
type <- str_extract_all(text, "<. type=\\\"(.*?)\\\"") %>%
    unlist()

# extract word
word <- str_extract_all(text, ">(.*?)<\\/.>") %>%
    unlist()

# convert to data frame
df <- data.frame(
    type = type,
    word = word)

The problem is that I only want the things that appear between <w type=\" and \", etc., not those characters themselves, i.e. (for the first 2 words):

df2 <- data.frame(type = c("np0", "pun"), word = c("dorothy", ",")) 

Again, my understanding is that I should learn, say, the XML package for this type of data, but can I do what I want with regular expressions?

You can use lookarounds (lookbehind and lookahead) to extract only the strings in between the delimiters. I've added str_trim to remove the unwanted spaces around the words.

data.frame(
  type = str_extract_all(text, '(?<=type=\\")(.*?)(?=\\")')[[1]],
  word = str_trim(str_extract_all(text, '(?<=\\">)(.*?)(?=<)')[[1]], side = "both")
)

#       type         word
# 1      np0      dorothy
# 2      pun            ,
# 3      prp           in
# 4      dps          her
# 5      nn1         time
# 6      pun            ,
# 7      vhd          had
# 8      vbn         been
# 9      at0           an
# 10     aj0       active
# 11     nn1       member
# 12     prf           of
# 13     at0           an
# 14     nn1 organisation
# 15 vvn-vvd       called
# 16     at0          the
# 17     nn1        noise
# 18     nn1    reduction
# 19     nn1      society
# 20     pun            ,
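As a possible variation on the lookaround approach above, the type and the word can be captured together in one pass with stringr's str_match_all, so each row of the data frame comes from a single match. This is only a sketch: the names sample_text, token_pattern and tokens are made up for illustration, the sample is a shortened copy of the corpus from the question, and the pattern assumes every element is a <w> or <c> tag whose first attribute is type.

```r
library(stringr)

# Shortened sample of the corpus from the question (first three tokens only).
sample_text <- paste0(
  '<w type="np0" lemma="dorothy">dorothy</w>',
  '<c type="pun">, </c>',
  '<w type="prp" lemma="in">in </w>'
)

# One pattern with two capture groups:
#   group 1 = value of the type attribute, group 2 = text inside the element.
# Assumption: tags are <w> or <c> and type is their first attribute.
token_pattern <- '<[wc] type="([^"]*)"[^>]*>([^<]*)</[wc]>'

# str_match_all() returns a list with one character matrix per input string;
# column 1 holds the full match, columns 2 and 3 hold the capture groups.
m <- str_match_all(sample_text, token_pattern)[[1]]

tokens <- data.frame(
  type = m[, 2],
  word = str_trim(m[, 3]),  # drop the trailing space kept inside each element
  stringsAsFactors = FALSE
)
print(tokens)
```

Because both columns are read from the same match, a tag that fails to match is dropped from both columns at once, rather than leaving the two vectors with different lengths.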
