text mining - Detecting two consecutive "Proper Case" words in a string using R
I've been scratching my head over this one for a while now. I'm attempting some text mining in R, and I'm looking to classify names, places and organisations that are made up of multiple words. For the purposes of this task, I'm only looking at consecutive words in a string that begin with capital letters.
Example string:
origstring <- 'The current president of United States Donald Trump'
Is there a way of finding the words starting with a capital letter within the string and grouping them together to return something like this?
newstring <- 'The current president of UnitedStates DonaldTrump'
Any help you can give is appreciated.
The following solution will work for groups of two words at a time:
origstring <- 'The current president of United States Donald Trump'
gsub('([A-Z]\\w*?)\\s+([A-Z]\\w*)', '\\1\\2', origstring)
Output:
[1] "The current president of UnitedStates DonaldTrump"
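To illustrate the two-words-at-a-time limit, here is a small sketch that applies the same pattern to a run of three capitalized words. The example string and the variable names `three_words` and `pairwise` are made up for illustration and are not part of the original question.

```r
# Hypothetical three-word example (not from the original question)
three_words <- 'I live in New York City'

# Same pairwise pattern as above: join one capitalized word to the next
pairwise <- gsub('([A-Z]\\w*?)\\s+([A-Z]\\w*)', '\\1\\2', three_words)

print(pairwise)
# [1] "I live in NewYork City"
```

Each match consumes both words of a pair, so the third word ("City") has no capitalized partner left to join and stays separate.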
Update:
The following script should work for any number of clustered capitalized words. It required a workaround/hack because the regex flavor that gsub() uses, even in Perl mode, does not support variable-length lookbehinds. The strategy here is instead to selectively remove the whitespace between capitalized words that appear in groups of two or more.
origstring <- 'The current president of United States Donald Trump'
# Tag the end of every capitalized word with a temporary marker
temp <- gsub('([A-Z]\\w*)', '\\1\\$mark\\$', origstring)
# Remove whitespace that sits between a marker and the next capitalized word
output <- gsub('(?<=\\$mark\\$)\\s+(?=[A-Z])', '', temp, perl=TRUE)
# Strip the markers again
output <- gsub('\\$mark\\$', '', output)
output
[1] "The current president of UnitedStatesDonaldTrump"
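As a possible alternative to the marker approach, the sketch below avoids the lookbehind altogether: it captures each capitalized word and uses only a lookahead to check that another capitalized word follows. This is not part of the answer above, the function name `collapse_capitalized` is hypothetical, and the pattern only recognises ASCII capitals (`[A-Z]`).

```r
# Hypothetical helper: join runs of consecutive capitalized words in one pass
collapse_capitalized <- function(x) {
  # \\b([A-Z]\\w*)  a whole word starting with a capital letter (kept)
  # \\s+            the whitespace after it (dropped)
  # (?=[A-Z])       only if the next word also starts with a capital
  gsub('\\b([A-Z]\\w*)\\s+(?=[A-Z])', '\\1', x, perl = TRUE)
}

origstring <- 'The current president of United States Donald Trump'
collapse_capitalized(origstring)
# [1] "The current president of UnitedStatesDonaldTrump"
```

Because the lookahead does not consume the next word, that word can itself start the following match, so runs of any length are joined.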