regex - How can I split a street address in the formats below in Unix, R, grep, or awk?
I have a file with Italian street addresses, and I have to split the address column into street name and street number. The catch is that some addresses have 2 or 3 strings, and the number sometimes has a character attached, e.g. 15/a. Some of them have an address like 12-maggio 23, where the split should put 12-maggio in the first column and 23 in the second.
Below is the format of the file:
street.adress
falcone n. 1
fortunato giustino 2
pisacane 3
fabrizio de andre' 8
s. satta 7
agnesi 16
volturno cigni 80
montepenice 6
cucchiari 15
molinetto di lorenteggio 15/t 7
don minzoni 15
senigallia 4
milano 38/a
l. da vinci 13/a
27-novembre 9
The output should be in 2 separate columns:
falcone n. 1
fortunato giustino 2
pisacane 3
fabrizio de andre' 8
s. satta 7
agnesi 16
volturno cigni 80
montepenice 6 6
cucchiari 15
molinetto di lorenteggio 15/t 7
don minzoni 15
senigallia 4
milano 38/a
l. da vinci 13/a
27-novembre 9
How can I achieve this? I have tried Excel formulas, but they do not work. I have also tried the R code below, but it fails. How can I do this?
for (i in 1:nrow(df)) {
  new_df[i, "street.name"] <- unlist(strsplit(df[["street.addresses"]], " ")[i])[1]
  new_df[i, "street.number"] <- paste(unlist(strsplit(df[["street.addresses"]], " ")[i])[-1], collapse = " ")
}
I also tried:
df <- gsub("$([0-9]+ +)?(.*)", "\\1\t\\2", df)
Nothing works. Any leads?
This regular expression, combined with gsub() and strsplit(), works on the data provided.
The trick here is to first insert a \t at the location where you want to split the string, and then use strsplit() with \t as the separator.
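# Read one address per line; quote = "\"" keeps the apostrophe from being treated as a quote
x <- read.table(sep = "\n", header = TRUE, quote = "\"", text = "street.adress
falcone n. 1
fortunato giustino 2
pisacane 3
fabrizio de andre' 8
s. satta 7
agnesi 16
volturno cigni 80
montepenice 6
cucchiari 15
molinetto di lorenteggio 15/t 7
don minzoni 15
senigallia 4
milano 38/a
l. da vinci 13/a
27-novembre 9"
)

# Lazy name, one or more spaces, then everything from the first digit that follows a space
pattern <- "(.*?) +(\\d+.*)"
# Insert a tab between the two groups, then split on it
z <- gsub(pattern, "\\1\t\\2", x[[1]])
unlist(strsplit(z, "\t"))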
the results:
 [1] "falcone n."               "1"
 [3] "fortunato giustino"       "2"
 [5] "pisacane"                 "3"
 [7] "fabrizio de andre'"       "8"
 [9] "s. satta"                 "7"
[11] "agnesi"                   "16"
[13] "volturno cigni"           "80"
[15] "montepenice"              "6"
[17] "cucchiari"                "15"
[19] "molinetto di lorenteggio" "15/t 7"
[21] "don minzoni"              "15"
[23] "senigallia"               "4"
[25] "milano"                   "38/a"
[27] "l. da vinci"              "13/a"
[29] "27-novembre"              "9"
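The result above is a flat character vector, while the question asks for two columns. As a minimal sketch (not part of the original answer), the same idea can be anchored with ^ and $ and applied with sub() to build a data frame directly. The name street_df and the four-address sample are assumptions chosen for illustration, and perl = TRUE is used only to make the lazy quantifier explicit.

```r
# A few sample addresses taken from the question.
addr <- c("falcone n. 1", "molinetto di lorenteggio 15/t 7",
          "milano 38/a", "27-novembre 9")

# Same idea as the pattern above, anchored so that sub() consumes the whole string.
pattern <- "^(.*?) +(\\d.*)$"
matched <- grepl(pattern, addr, perl = TRUE)

# street_df is a hypothetical name. Rows that do not match keep their full
# text in the name column and get NA as the number.
street_df <- data.frame(
  street.name   = ifelse(matched, sub(pattern, "\\1", addr, perl = TRUE), addr),
  street.number = ifelse(matched, sub(pattern, "\\2", addr, perl = TRUE), NA_character_),
  stringsAsFactors = FALSE
)
print(street_df)
```

With this split, "27-novembre 9" yields the name "27-novembre" and the number "9", because the number group only starts at a digit that follows a space.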
PS. The answer was edited to deal with the fact that there is a quote character ' in the input data. To deal with this, I had to set the quote = "\"" argument in read.table(); otherwise lines were skipped.
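As an alternative sketch (an assumption of this edit, not something the original answer uses), readLines() returns each line as a plain string and does no quote processing, so the apostrophe needs no special handling. The sample text is two lines from the question; for a real file, the file path would be passed to readLines() instead of a text connection.

```r
# Header plus two of the question's lines, including the one with the apostrophe.
raw <- "street.adress
fabrizio de andre' 8
s. satta 7"

con <- textConnection(raw)
lines <- readLines(con)
close(con)

# Drop the header line; the remaining lines are kept exactly as written.
addr <- lines[-1]
print(addr)
```

The resulting character vector can then be passed to gsub() or sub() in the same way as x[[1]].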