text mining - Detecting two consecutive "Proper Case" words in a string using R

May 15, 2012

I've been scratching my head over this one for a while now. I'm attempting some text mining in R and am looking to classify names, places and organisations that are made up of multiple words. For the purposes of this task, I'm only looking at consecutive words in a string that begin with capital letters.

Example string:

origstring <- 'The current president of United States Donald Trump'

Is there a way of finding the words that start with a capital letter within a string and grouping them, to return something like this?

newstring <- 'The current president of UnitedStates DonaldTrump'

Any help you can give would be appreciated.

The following solution will work for groups of two words at a time:

origstring <- 'The current president of United States Donald Trump'
gsub('([A-Z]\\w*?)\\s+([A-Z]\\w*)', '\\1\\2', origstring)

Output:

[1] "The current president of UnitedStates DonaldTrump"

Demo here: Rextester
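To make the "two words at a time" limitation concrete, here is a small sketch that applies the same pattern to a run of three capitalized words. The sentence is a made-up input, not one from the question, and `perl = TRUE` is an added assumption so that the lazy quantifier follows PCRE rules.

```r
# Pair-wise pattern from the answer above: one capitalized word,
# whitespace, then another capitalized word.
pair_pattern <- '([A-Z]\\w*?)\\s+([A-Z]\\w*)'

# Hypothetical input with three consecutive capitalized words.
example <- 'We flew to New York City'

# gsub() consumes "New York" as one match, so "City" has no partner left
# to pair with and stays separate.
merged <- gsub(pair_pattern, '\\1\\2', example, perl = TRUE)
print(merged)
# [1] "We flew to NewYork City"
```

Only the first two words of the run are joined, which is why a different approach is needed for longer runs.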

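Update:

The following script should work for any number of clustered capitalized words. It requires a workaround/hack because the regex flavor that gsub() uses, even in Perl mode, does not support variable-length lookbehinds. The strategy here is instead to selectively remove the whitespace in between capitalized words that appear in groups of two or more: tag every capitalized word with a temporary marker, delete any whitespace that sits between a marker and the next capital letter, then strip the markers.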

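origstring <- 'The current president of United States Donald Trump'
# Step 1: append a temporary $MARK$ tag to every capitalized word
temp <- gsub('([A-Z]\\w*)', '\\1\\$MARK\\$', origstring)
# Step 2: drop whitespace that sits between a tag and the next capital letter
output <- gsub('(?<=\\$MARK\\$)\\s+(?=[A-Z])', '', temp, perl=TRUE)
# Step 3: remove the temporary tags
output <- gsub('\\$MARK\\$', '', output)
output

[1] "The current president of UnitedStatesDonaldTrump"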
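As an alternative to the marker workaround, the same grouping can probably be done by matching each whole run of capitalized words and stripping the whitespace inside it. This is a sketch of a different technique, not the answer's method: it uses base R's `gregexpr()` and `regmatches<-`, and the function name `merge_capitalized` and the extra test sentences are made up for illustration.

```r
# Hypothetical helper: joins every run of two or more consecutive
# capitalized words in each element of a character vector.
merge_capitalized <- function(x) {
  # One capitalized word followed by at least one more capitalized word.
  run_pattern <- '\\b[A-Z]\\w*(?:\\s+[A-Z]\\w*)+'
  m <- gregexpr(run_pattern, x, perl = TRUE)
  # Replace each matched run with a copy that has its whitespace removed.
  regmatches(x, m) <- lapply(regmatches(x, m),
                             function(runs) gsub('\\s+', '', runs))
  x
}

inputs <- c('The current president of United States Donald Trump',
            'We flew to New York City',   # made-up example
            'no capitals here')           # made-up example
result <- merge_capitalized(inputs)
print(result)
```

Because the run is matched as a single unit, no lookbehind is needed, and the function is vectorised over its input in the same way as `gsub()`.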
