python - In what 8-bit character set is 0x9d meaningful? -

July 15, 2015

In what 8-bit ASCII-like character set for English is 0x9d meaningful? I'm cleaning up some old data files and finding 0x9d in otherwise-ASCII text. (No, it's not UTF-8.)

It's not valid in Windows-1252. The Python "latin-1" codec translates it to Unicode U+009D, "OPERATING SYSTEM COMMAND", which makes little sense. In Unicode it displays as a box containing [009D]. (In Python, you can convert anything from Latin-1 without errors being raised, but that doesn't mean it's meaningful to do so.)
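The difference between the two codecs can be checked directly. This is a minimal sketch using only Python's built-in `cp1252` and `latin-1` codecs; the variable names are illustrative.

```python
import unicodedata

raw = b"\x9d"

# Windows-1252 leaves 0x9d unassigned, so Python's strict decoder rejects it.
try:
    raw.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

# Latin-1 maps every byte 0x00-0xff to the code point with the same value,
# so decoding never fails, whether or not the result is meaningful.
latin1_char = raw.decode("latin-1")
latin1_category = unicodedata.category(latin1_char)  # "Cc" means control character

print(cp1252_ok, hex(ord(latin1_char)), latin1_category)
# False 0x9d Cc
```

So Latin-1 "succeeds" only in the sense that it yields a C1 control character, not printable text.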

Examples, with Python-type escapes, from the messy database I'm cleaning, which combines text from many sources:

guitar pro, jamplay, redbana\\\'s audition,\x9d doppleganger\x99s lounge\x9d or heatwave interactive\x99s platinum life country,\\"  example \\"i\\\'ve seen bull run in pamplona, spain\x9d.\\"  netwise depot  \\"one stop web shop\\"\x9d provides sustainable \\"green\\"\x9d living  looking \\"do me\\"\x9d solution

From context, I'd suspect ™ or ®. What 8-bit code had those?

Here's a wild hypothesis:

Some prior (really broken) system working on this data attempted to write each character as UTF-8, but only wrote the last byte of each sequence (maybe it had a weird one-byte-long buffer somewhere). Alternatively, the data was in UTF-8 in the past, and someone viewing it in a different encoding did a search-and-replace to remove the bytes 0xe2 0x80 because they "didn't belong", and didn't realize that the remaining "special character" wasn't the one they wanted either.

ASCII, of course, would have passed through unchanged, since its UTF-8 encoding is only one byte long.
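To make the first scenario concrete, here is a small simulation. It is only a model of the hypothesis: `keep_last_utf8_byte` is a made-up name, not code from any real system, and the sample string is reconstructed from the question's data with curly quotes assumed.

```python
def keep_last_utf8_byte(text: str) -> bytes:
    """Hypothetical broken writer: encode each character as UTF-8,
    but keep only the final byte of its sequence."""
    return bytes(ch.encode("utf-8")[-1] for ch in text)


# Assumed original text, with a curly apostrophe and a curly closing quote.
sample = "doppleganger\u2019s lounge\u201d"
mangled = keep_last_utf8_byte(sample)

print(mangled)
# b'doppleganger\x99s lounge\x9d'
```

The ASCII characters survive untouched, and only the multi-byte characters are reduced to a single stray byte.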

The 'RIGHT SINGLE QUOTATION MARK' (U+2019) ’ is encoded in UTF-8 as the bytes 0xe2 0x80 0x99. The places where you have \x99s made me go down this path, since an apostrophe before an s is often translated into a right curly quotation mark by popular word processing software. If only the last byte of that character was saved, you'd have 0x99 there.

The 'RIGHT DOUBLE QUOTATION MARK' (U+201D) ” is encoded in UTF-8 as the bytes 0xe2 0x80 0x9d. The 0x9d you have in the text is at the end of a double-quoted string, and it's right next to a regular straight " double-quote. I wonder if someone had tried some sort of prior clean-up pass on the data and managed to put in the closing quote, but left the "weird" 0x9d in there.
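The byte sequences quoted above are easy to verify. The sketch below also includes the two opening curly quotes for comparison; those are an addition here, not something found in the sample data.

```python
# Curly quotes from the General Punctuation block (U+2018 to U+201D).
quotes = ["\u2018", "\u2019", "\u201c", "\u201d"]

# Map each character to its UTF-8 encoding.
encodings = {ch: ch.encode("utf-8") for ch in quotes}

for ch, encoded in encodings.items():
    byte_list = " ".join(f"0x{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {byte_list}")
# U+2018 -> 0xe2 0x80 0x98
# U+2019 -> 0xe2 0x80 0x99
# U+201C -> 0xe2 0x80 0x9c
# U+201D -> 0xe2 0x80 0x9d
```

All four share the prefix 0xe2 0x80, so only the last byte distinguishes them.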

As I said, it's a wild hypothesis, but if this is a conglomeration of data from a variety of old systems, it's hard to know what may have happened to it. The last byte of UTF-8 was the closest thing to a "normal" English encoding I could find that would produce something reasonable in English text and included the bytes you're looking for.
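If the hypothesis holds, a repair pass could put the missing prefix back. This is a sketch under that assumption only: `restore_orphans` is a hypothetical helper, it assumes the input is otherwise pure ASCII, and it assumes every stray byte in the range 0x80 to 0xbf lost the prefix 0xe2 0x80. Any other history for the data would give wrong results, so the output should be reviewed rather than trusted.

```python
import re

# Assumption: each stray byte 0x80-0xbf is the orphaned final byte of a
# three-byte UTF-8 sequence that originally began with 0xe2 0x80.
ORPHAN = re.compile(rb"[\x80-\xbf]")


def restore_orphans(data: bytes) -> str:
    """Re-attach the assumed 0xe2 0x80 prefix, then decode as UTF-8."""
    fixed = ORPHAN.sub(lambda m: b"\xe2\x80" + m.group(0), data)
    # Strict decoding: bytes outside the assumed pattern raise an error
    # instead of being silently mangled.
    return fixed.decode("utf-8")


repaired = restore_orphans(b"doppleganger\x99s lounge\x9d")
print(repaired)
# doppleganger’s lounge”
```

Strict decoding is a deliberate choice here: a byte such as 0xe9 would raise `UnicodeDecodeError`, which signals that the record came from some other encoding and needs separate handling.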
