text mining - Detecting two consecutive "Proper Case" words in a string using R


I've been scratching my head over this one for a while now. I'm attempting some text mining in R, and I'm looking to classify names, places and organisations that are made up of multiple words. For the purposes of this task, I'm only looking at consecutive words in the string that begin with capital letters.

Example string:

origstring <- 'The current President of United States Donald Trump'

Is there a way of finding the words that start with a capital letter within the string and grouping them, to return something like this?

newstring <- 'The current President of UnitedStates DonaldTrump'

Any help you can give would be appreciated.

The following solution will work for groups of two words at a time:

origstring <- 'The current President of United States Donald Trump'
# Join each pair of adjacent words that both start with a capital letter
gsub('([A-Z]\\w*?)\\s+([A-Z]\\w*)', '\\1\\2', origstring)

Output:

[1] "The current President of UnitedStates DonaldTrump"

Demo here: Rextester
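To illustrate why this pattern only covers pairs, here is a minimal sketch that applies the same regex to a three-word name. The input sentence and the variable names `pair_pattern`, `three_words` and `result` are made up for illustration and are not part of the original question.

```r
# Pair-only pattern from the answer above.
pair_pattern <- '([A-Z]\\w*?)\\s+([A-Z]\\w*)'

# Illustrative input (an assumption, not from the question): a three-word name.
three_words <- 'She moved to New York City'

# Each match consumes two capitalized words, so the third one is left over.
result <- gsub(pair_pattern, '\\1\\2', three_words)
print(result)
# [1] "She moved to NewYork City"
```

"New York" is consumed as one match, and the scan resumes at " City", which has no capitalized partner left to pair with.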

Update:

The following script should work for any number of clustered capitalized words. It required a workaround/hack, because the regex flavor that gsub() uses, even in Perl mode, does not support variable-length lookbehinds. The strategy here is instead to selectively remove the whitespace in between capitalized words that appear in groups of two or more.

origstring <- 'The current President of United States Donald Trump'
# 1. Append a $mark$ tag to every word that starts with a capital letter
temp <- gsub('([A-Z]\\w*)', '\\1\\$mark\\$', origstring)
# 2. Remove whitespace between a tag and the next capital letter
output <- gsub('(?<=\\$mark\\$)\\s+(?=[A-Z])', '', temp, perl=TRUE)
# 3. Strip the tags again
output <- gsub('\\$mark\\$', '', output)
output

[1] "The current President of UnitedStatesDonaldTrump"
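To make the marker strategy easier to follow, the sketch below traces the three steps on a three-word name and prints each intermediate string. The input sentence and the names `step1`, `step2` and `step3` are hypothetical and chosen only for this illustration; the patterns are the ones from the script above.

```r
# Illustrative input (an assumption, not from the question).
x <- 'She moved to New York City'

# Step 1: tag the end of every capitalized word with a literal $mark$.
step1 <- gsub('([A-Z]\\w*)', '\\1\\$mark\\$', x)
print(step1)
# [1] "She$mark$ moved to New$mark$ York$mark$ City$mark$"

# Step 2: drop whitespace that sits between a tag and a following capital letter.
# The lookbehind is fixed-length (the tag itself), which PCRE accepts.
step2 <- gsub('(?<=\\$mark\\$)\\s+(?=[A-Z])', '', step1, perl = TRUE)
print(step2)
# [1] "She$mark$ moved to New$mark$York$mark$City$mark$"

# Step 3: remove the tags; fixed = TRUE treats $mark$ as a plain string.
step3 <- gsub('$mark$', '', step2, fixed = TRUE)
print(step3)
# [1] "She moved to NewYorkCity"
```

Because every run of adjacent capitalized words is merged, two names that sit next to each other with nothing in between (as in the example string) end up as a single token. The approach also assumes the text does not already contain the literal string `$mark$`.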


