R - Turn parsed corpus into data frame using stringr and regex
I'm trying to transform a parsed corpus into a data frame in R using stringr and regular expressions (I've since read that maybe I shouldn't be using regular expressions for this kind of work, but I've spent enough time on it that I'd like to know whether there is a solution). The corpus looks like this:
text <- paste("<w type=\"np0\" lemma=\"dorothy\">dorothy</w><c type=\"pun\">, </c><w type=\"prp\" lemma=\"in\">in </w><w type=\"dps\" lemma=\"she\">her </w><w type=\"nn1\" lemma=\"time\">time</w><c type=\"pun\">, </c><w type=\"vhd\" lemma=\"have\">had </w><w type=\"vbn\" lemma=\"be\">been </w><w type=\"at0\" lemma=\"an\">an </w><w type=\"aj0\" lemma=\"active\">active </w><w type=\"nn1\" lemma=\"member\">member </w><w type=\"prf\" lemma=\"of\">of </w><w type=\"at0\" lemma=\"an\">an </w><w type=\"nn1\" lemma=\"organisation\">organisation </w><w type=\"vvn-vvd\" lemma=\"call\">called </w><w type=\"at0\" lemma=\"the\">the </w><w type=\"nn1\" lemma=\"noise\">noise </w><w type=\"nn1\" lemma=\"reduction\">reduction </w><w type=\"nn1\" lemma=\"society\">society</w><c type=\"pun\">, </c>")
I've got close to what I want using this:
library("stringr")

# extract type
type <- str_extract_all(text, "<. type=\\\"(.*?)\\\"") %>%
  unlist()

# extract word
word <- str_extract_all(text, ">(.*?)<\\/.>") %>%
  unlist()

# convert to data frame
df <- data.frame(type = type, word = word)
The problem is that I want only the things that appear between <w type=\" and \", etc., not those characters themselves, like this (for the first 2 words):
df2 <- data.frame(type = c("np0", "pun"), word = c("dorothy", ","))
Again, my understanding is that I should learn, say, the XML package for this type of data, but can I do what I want with regular expressions?
You can use lookarounds to extract the strings in between. I've added str_trim in order to remove the unwanted spaces around the words.
data.frame(
  # lookbehind for type=" and lookahead for the closing quote
  type = str_extract_all(text, '(?<=type=\\")(.*?)(?=\\")')[[1]],
  # lookbehind for "> and lookahead for the next <
  word = str_trim(str_extract_all(text, '(?<=\\">)(.*?)(?=<)')[[1]], side = "both")
)
#        type         word
# 1       np0      dorothy
# 2       pun            ,
# 3       prp           in
# 4       dps          her
# 5       nn1         time
# 6       pun            ,
# 7       vhd          had
# 8       vbn         been
# 9       at0           an
# 10      aj0       active
# 11      nn1       member
# 12      prf           of
# 13      at0           an
# 14      nn1 organisation
# 15  vvn-vvd       called
# 16      at0          the
# 17      nn1        noise
# 18      nn1    reduction
# 19      nn1      society
# 20      pun            ,
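As a possible alternative to two separate lookaround extractions, the type and the word can be captured together in one pass with capture groups, which keeps each type paired with its own word. This is only a sketch: sample_text, pattern, m and df_pairs are illustrative names, the sample is a shortened copy of the corpus above, and the pattern assumes every element is a <w> or <c> tag whose first attribute is type.

```r
library(stringr)

# Shortened sample in the same format as the corpus above (assumption:
# every element is <w ...> or <c ...> with type as its first attribute)
sample_text <- paste0(
  "<w type=\"np0\" lemma=\"dorothy\">dorothy</w>",
  "<c type=\"pun\">, </c>",
  "<w type=\"prp\" lemma=\"in\">in </w>"
)

# Group 1 captures the value of the type attribute,
# group 2 captures the text between the opening and closing tags
pattern <- "<[wc] type=\"([^\"]*)\"[^>]*>([^<]*)</[wc]>"

# str_match_all() returns one matrix per input string:
# column 1 is the full match, columns 2 and 3 are the capture groups
m <- str_match_all(sample_text, pattern)[[1]]

df_pairs <- data.frame(
  type = m[, 2],
  word = str_trim(m[, 3]),  # drop the trailing spaces kept in the corpus
  stringsAsFactors = FALSE
)
df_pairs
```

Because both values come from the same match, the two columns cannot drift out of step if one element happens to be malformed and is skipped.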