How Do I Compare Characters With Combining Diacritic Marks Ɔ̃, Ɛ̃ And Ɑ̃ To Unaccented Ones In Python (imported From A Utf-8 Encoded Text File)?
Summary: I want to compare ɔ̃, ɛ̃ and ɑ̃ to ɔ, ɛ and a, which are all different, but my text file has ɔ̃, ɛ̃ and ɑ̃ written as ɔ~, ɛ~ and a~. I wrote a script whi
Solution 1:
Unicode normalization does not help for these particular character combinations, because Unicode has no precomposed characters for them. Filtering the Unicode database file UnicodeData.txt with the simple regex "Latin.*Letter.*with tilde$" gives ÃÑÕãñõĨĩŨũṼṽẼẽỸỹ (no Latin letters Open O, Open E or Alpha). So you need to step through both compared strings separately, keeping each base letter together with a combining mark that follows it, as shown here (most of your code is omitted, leaving a Minimal, Reproducible Example):
import unicodedata

def lens(word):
    return len(word)

input_lines = ['alyʁ/alɔʁ', 'ɑ̃bisjø/ɑ̃bisjɔ̃ ', 'osi/ɛ̃si', 'bɛ̃ /bɔ̃ ', 'bo/ba', 'bjɛ/bjɛ̃ ']
print(len(input_lines))
for line in input_lines:
    print('')
    # find word ipa transcripts
    line = unicodedata.normalize('NFKC', line.rstrip('\n'))
    line = line.split("/")
    line.sort(key = lens)
    word1, word2 = line[0:2] # the shortest two strings after splitting are the ipa words
    index = i1 = i2 = 0
    while i1 < len(word1) and i2 < len(word2):
        # take one base letter from word1, plus a following combining mark if any
        letter1 = word1[i1]
        i1 += 1
        if i1 < len(word1) and unicodedata.category(word1[i1]) == 'Mn':
            letter1 += word1[i1]
            i1 += 1
        # same for word2
        letter2 = word2[i2]
        i2 += 1
        if i2 < len(word2) and unicodedata.category(word2[i2]) == 'Mn':
            letter2 += word2[i2]
            i2 += 1
        # no-break space marks equal letters, '#' marks a difference
        same = chr(0xA0) if letter1 == letter2 else '#'
        print(index, same, word1, word2, letter1, letter2)
        index += 1
        #if same != chr(0xA0):
        #    break
Output: .\SO\67335977.py
6

0   alyʁ alɔʁ a a
1   alyʁ alɔʁ l l
2 # alyʁ alɔʁ y ɔ
3   alyʁ alɔʁ ʁ ʁ

0   ɑ̃bisjø ɑ̃bisjɔ̃ ɑ̃ ɑ̃
1   ɑ̃bisjø ɑ̃bisjɔ̃ b b
2   ɑ̃bisjø ɑ̃bisjɔ̃ i i
3   ɑ̃bisjø ɑ̃bisjɔ̃ s s
4   ɑ̃bisjø ɑ̃bisjɔ̃ j j
5 # ɑ̃bisjø ɑ̃bisjɔ̃ ø ɔ̃

0 # osi ɛ̃si o ɛ̃
1   osi ɛ̃si s s
2   osi ɛ̃si i i

0   bɛ̃ bɔ̃ b b
1 # bɛ̃ bɔ̃ ɛ̃ ɔ̃
2   bɛ̃ bɔ̃

0   bo ba b b
1 # bo ba o a

0   bjɛ bjɛ̃ b b
1   bjɛ bjɛ̃ j j
2 # bjɛ bjɛ̃ ɛ ɛ̃
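The statement at the top of this answer, that normalization cannot merge these letters with their tilde, can be checked directly. The following is only a sketch: it searches the character names known to Python's own unicodedata module (rather than reading UnicodeData.txt) and then tests whether NFC composes a base letter with U+0303 COMBINING TILDE. The names with_tilde and composes are made up for this illustration.

```python
import re
import unicodedata

# Collect every code point whose Unicode name matches the regex from the answer.
pattern = re.compile(r"LATIN.*LETTER.*WITH TILDE$")
with_tilde = [
    chr(cp)
    for cp in range(0x110000)
    if pattern.match(unicodedata.name(chr(cp), ""))
]
print("".join(with_tilde))

TILDE = "\u0303"  # COMBINING TILDE


def composes(base):
    """Return True if NFC merges base + combining tilde into one code point."""
    return len(unicodedata.normalize("NFC", base + TILDE)) == 1


# a and o have precomposed forms; open o, open e and alpha do not.
for base in ["a", "o", "\u0254", "\u025B", "\u0251"]:
    print(f"U+{ord(base):04X}", unicodedata.name(base), composes(base))
```

Because ɔ̃, ɛ̃ and ɑ̃ always stay as two code points, comparing the strings index by index misaligns them, which is why the loop above consumes the combining mark together with its base letter.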
Note: the diacritic is tested as Unicode category Mn; you can test against another condition (e.g. from the following list):
Mn Nonspacing_Mark: a nonspacing combining mark (zero advance width)
Mc Spacing_Mark: a spacing combining mark (positive advance width)
Me Enclosing_Mark: an enclosing combining mark
M Mark: Mn | Mc | Me
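If you want to accept any of the mark categories listed above, one possible variation (a sketch, not the code used to produce the output shown earlier) is to first split each word into clusters of a base character plus all marks that follow it, and then compare the clusters pairwise. The helper name clusters is hypothetical, and unlike the loop above it attaches more than one mark to a base letter.

```python
import unicodedata


def clusters(word):
    """Split word into base characters, each followed by its combining marks."""
    result = []
    for ch in word:
        # Categories Mn, Mc and Me all start with "M".
        if result and unicodedata.category(ch).startswith("M"):
            result[-1] += ch
        else:
            result.append(ch)
    return result


word1 = clusters("bj\u025B")        # bjɛ
word2 = clusters("bj\u025B\u0303")  # bjɛ̃
for index, (c1, c2) in enumerate(zip(word1, word2)):
    print(index, "same" if c1 == c2 else "diff", c1, c2)
```

Note that zip stops at the shorter word, just as the while loop above stops when either index reaches the end of its string.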
Solution 2:
I am in the process of solving this by doing a find and replace on these characters before processing the text, and a reverse find and replace when I'm done.
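A minimal sketch of what such a find and replace could look like is given here. The answer does not say which replacement characters were used, so the Private Use Area code points, the dictionary names and the encode/decode helpers are all assumptions made for illustration.

```python
# Hypothetical placeholders: one Private Use Area code point per nasal vowel.
TO_PLACEHOLDER = {
    "\u0254\u0303": "\uE000",  # ɔ̃
    "\u025B\u0303": "\uE001",  # ɛ̃
    "\u0251\u0303": "\uE002",  # ɑ̃
}
FROM_PLACEHOLDER = {v: k for k, v in TO_PLACEHOLDER.items()}


def encode(text):
    """Replace each letter + combining tilde pair with a single placeholder."""
    for seq, placeholder in TO_PLACEHOLDER.items():
        text = text.replace(seq, placeholder)
    return text


def decode(text):
    """Reverse encode(): turn placeholders back into letter + combining tilde."""
    for placeholder, seq in FROM_PLACEHOLDER.items():
        text = text.replace(placeholder, seq)
    return text


word1 = encode("b\u025B\u0303")  # bɛ̃
word2 = encode("b\u0254\u0303")  # bɔ̃
# After encoding, every sound is one code point, so indexes line up.
diffs = [i for i, (a, b) in enumerate(zip(word1, word2)) if a != b]
print(diffs, decode(word1), decode(word2))
```

This only works if the placeholder characters never occur in the input text, which is why unassigned private-use code points are picked here.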