python - How to make casefold() work on certain Arabic unicodes


I'm having trouble detecting "equality" in Python 2.7 for the following pairs of Arabic words:

  1. أكثر vs اكثر
  2. قائمة vs قائمه
  3. إنشاء vs انشاء

The elements of each pair are not identical, but they are written in different "cases". A useful analogy for me (I don't know Arabic) is Word vs word. They are not identical, but if I lowercase both of them, I'll obtain word vs word, which are identical. That's what I want to obtain for these 3 pairs of Arabic words.

I'm going to exemplify what I tried using the first pair (1. أكثر vs اكثر). By the way, the meaning of both Arabic words from the first pair is "more" (corrected from "menu"), but they have different cases (as a parallel: More vs more). I don't know Arabic at all, nor Arabic rules, so if anyone who knows Arabic can confirm that these words are "identical", that would be great.

    str1 = u'أكثر'
    str2 = u'اكثر'

So I'm trying to bring str1 and str2 to the same form (if possible). I want a function that produces the same output for both strings:

transform(str1) == transform(str2) 

In English this can be achieved easily:

    a = u'More'
    b = u'more'

    def transform(text):
        return text.lower()

    >>> transform(a) == transform(b)
    True

But, of course, this doesn't work for Arabic, as there are no such things as lower case or upper case.

    >>> str1
    u'\u0623\u0643\u062b\u0631'
    >>> str2
    u'\u0627\u0643\u062b\u0631'

Note that the first character differs in its Unicode representation.
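To see exactly which characters differ, one option is to ask `unicodedata` for their official names. The following is only a minimal sketch; the variable name `differences` is made up for illustration, and the strings are written with escapes so the snippet runs the same on Python 2.7 and Python 3.

```python
import unicodedata

str1 = u'\u0623\u0643\u062b\u0631'
str2 = u'\u0627\u0643\u062b\u0631'

# Walk both strings in parallel and keep the Unicode names of the
# characters at every position where the two strings disagree.
differences = [
    (unicodedata.name(c1), unicodedata.name(c2))
    for c1, c2 in zip(str1, str2)
    if c1 != c2
]
```

For this pair, `differences` holds a single entry: `ARABIC LETTER ALEF WITH HAMZA ABOVE` against `ARABIC LETTER ALEF`.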

I normalized the strings using:

    >>> import unicodedata
    >>> n_str1 = unicodedata.normalize('NFKD', str1)
    >>> n_str2 = unicodedata.normalize('NFKD', str2)
    >>> n_str1
    u'\u0627\u0654\u0643\u062b\u0631'
    >>> n_str2
    u'\u0627\u0643\u062b\u0631'

As you can see, the normalized strings are still not equal:

    >>> n_str1 == n_str2
    False
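The reason the normalized strings still differ is visible in the output above: NFKD splits U+0623 into a plain alif (U+0627) followed by a combining mark (U+0654), and that mark has no counterpart in the second string. A small sketch that makes this explicit, using `unicodedata.combining()` to pick out combining marks (the names `marks` and `base_only` are illustrative only):

```python
import unicodedata

n_str1 = unicodedata.normalize('NFKD', u'\u0623\u0643\u062b\u0631')
n_str2 = unicodedata.normalize('NFKD', u'\u0627\u0643\u062b\u0631')

# Names of the characters in n_str1 that are combining marks,
# i.e. that have a non-zero canonical combining class.
marks = [unicodedata.name(ch) for ch in n_str1 if unicodedata.combining(ch)]

# Dropping those marks leaves only the base letters.
base_only = u''.join(ch for ch in n_str1 if not unicodedata.combining(ch))
```

Here `marks` contains only `ARABIC HAMZA ABOVE`, and `base_only` compares equal to `n_str2`. Note that removing marks this way discards information, so it is a matching heuristic rather than a spelling-preserving normalization.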

After that, I tried to use unicode.casefold(), but it isn't available in Python 2. I installed the py2casefold library, but didn't manage to obtain equality between the strings. I also tried Python 3's str.casefold(), without luck:

    >>> str1.casefold() == str2.casefold()
    False
    >>> n_str1.casefold() == n_str2.casefold()
    False
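Case folding only affects characters that have case mappings, which is why it cannot help here. The sketch below (Python 3 only, since `str.casefold()` does not exist in Python 2) contrasts a string where folding does something with the two Arabic strings, which come back unchanged. The variable names are illustrative.

```python
str1 = '\u0623\u0643\u062b\u0631'
str2 = '\u0627\u0643\u062b\u0631'

# German sharp s is a classic case where folding changes the text.
folded_german = 'Stra\u00dfe'.casefold()

# Arabic letters have no case, so folding leaves both strings as they were.
unchanged = (str1.casefold() == str1, str2.casefold() == str2)
```

`folded_german` becomes `'strasse'`, while both entries of `unchanged` are `True`.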

A solution in Python 2 would be perfect, but it would be great in Python 3 too.

Thank you.

These words are not identical: u'أكثر' and u'اكثر' are not the same. The first letter in the first word is an alif with a hamza on top of it; perhaps you couldn't notice it due to the small size of the glyph:

أ (alif with hamza above, U+0623)

The first letter in the second word, however, is a plain alif (reading from right to left):

ا (alif, U+0627)

And hence they don't compare equal. Each of these letters is represented by its own Unicode code point. They don't compare equal from the perspective of the language either:

    >>> u'أكثر'; u'اكثر'
    u'\u0623\u0643\u062b\u0631'
    u'\u0627\u0643\u062b\u0631'

"They are not identical, but if I lowercase both of them, I'll obtain word vs word, which are identical. That's what I want to obtain for these 3 pairs of Arabic words."

There's no lower or upper case in Arabic. The words you have in your hands are not the same; they have different letters. Some of the words have the correct spelling while others have an incorrect spelling. They may seem the same, and some Arabic readers may consider them the same, but to language freaks they're not the same. To convey the meaning, your list of Arabic words would look like this in English:

1- more, moore

2- menu, manu

3- establish, estblish

"I'm going to exemplify what I tried using the first pair (1. أكثر vs اكثر). By the way, the meaning of both Arabic words from the first pair is 'menu', but they have different cases (as a parallel: Menu vs menu)."

No, أكثر means "more". The second pair means "menu", but there's no such thing as Menu or menu in Arabic. I can't delve into the details, because that would be off topic.
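If the goal is nonetheless to treat such spelling variants as equivalent for matching purposes, one possible approach is a deliberately lossy fold rather than case folding. The sketch below is an assumption-laden illustration, not an established Arabic normalization: it decomposes with NFKD, drops all combining marks (which also removes any vowel diacritics), and then applies a hypothetical extra table, `EXTRA_FOLDS`, that maps teh marbuta (U+0629) to heh (U+0647) to cover the second pair. It reuses the `transform` name from the question and runs on both Python 2.7 and Python 3.

```python
import unicodedata

# Hypothetical, lossy folding table: treat teh marbuta (U+0629) as heh (U+0647).
# Extend or shrink this to match whichever variants should count as equal.
EXTRA_FOLDS = {0x0629: u'\u0647'}

def transform(text):
    """Reduce text to a loose matching key; this is NOT a spelling-preserving form."""
    # Split precomposed letters such as U+0623 into base letter + combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop every combining mark (hamza above/below, vowel diacritics, ...).
    stripped = u''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Apply the extra letter-level folds.
    return stripped.translate(EXTRA_FOLDS)

# The three pairs from the question, written as escapes.
pairs = [
    (u'\u0623\u0643\u062b\u0631', u'\u0627\u0643\u062b\u0631'),
    (u'\u0642\u0627\u0626\u0645\u0629', u'\u0642\u0627\u0626\u0645\u0647'),
    (u'\u0625\u0646\u0634\u0627\u0621', u'\u0627\u0646\u0634\u0627\u0621'),
]
results = [transform(a) == transform(b) for a, b in pairs]
```

With these assumptions every entry of `results` is `True`. Because the fold merges letters that are genuinely different, it can also equate words that are not mere spelling variants, so it is better suited to search and deduplication than to anything that must preserve the original text.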

