python - Certain Korean characters are showing up as question marks/diamonds when scraping, how can I fix this?
I'm scraping text in Korean, and about 99.9% of the characters come through fine. The rest show up like this:
�z
For example, when scraping "고소를해줫어", the output gives me "고소를해�z어".
I know it's an encoding issue, but I don't know how to fix it. I've read that you can use .encode('utf-8'), but that did not solve it.
Any help appreciated!
Full code added for context (beginner programmer, so please excuse the messy code!):
import bs4 as bs
import requests

raw_link = input("Enter the article's URL: ")
article_id = raw_link[26:40]
source = "http://comm.news.nate.com/comment/articlecomment/list?artc_sq=" + article_id + "&prebest=0&order=o&mid=n1008&domain=&arglist=0"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
r = requests.get(source, headers=headers)
html = r.text
soup = bs.BeautifulSoup(html, 'lxml')

upvotes = []
downvotes = []
comment_list = []
user_list = []
numbered_list = [1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20.]
raw_numbered_list = list(map(int, numbered_list))

for url in soup.select('strong[name*="cmt_o_cnt_"]')[3:]:
    raw_numbers_up = url.text.strip()
    upvotes.append(raw_numbers_up)

for url in soup.select('strong[name*="cmt_x_cnt_"]')[3:]:
    raw_numbers_down = url.text.strip()
    downvotes.append(raw_numbers_down)

for url in soup.find_all('dd', class_="usertxt")[3:]:
    comments = url.text.strip()
    comment_list.append(comments)

for url in soup.find_all('span', {'class': ['nameui', 't']})[3:]:
    user_id = url.text.strip()
    user_list.append(user_id)

results = list(zip(raw_numbered_list, upvotes, downvotes, user_list, comment_list))

for number, upvote, downvote, user, comment in results:
    replies = "\n{}. [+{}, -{}] {}:\n{}".format(number, upvote, downvote, user, comment)
    print(replies)
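A minimal offline sketch of what is likely happening inside html = r.text: requests decodes r.content with the charset it believes the page uses (with errors='replace'), and if that charset cannot represent some syllables, they come out as '�' plus stray bytes. Here hard-coded bytes stand in for r.content, and cp949 is assumed to be the page's true encoding (an assumption at this point in the post, though the chardet result in edit 4 points the same way):

```python
# Offline stand-in for r.content: the sample string as cp949 bytes.
# (cp949 as the page's true encoding is an assumption here.)
page_bytes = "고소를해줫어".encode("cp949")

# What requests effectively does for r.text when it believes the page
# is euc-kr: decode with that charset, replacing undecodable bytes.
mangled = page_bytes.decode("euc-kr", errors="replace")

# Decoding with the true encoding instead recovers the text.
fixed = page_bytes.decode("cp949")

print(mangled)  # the common syllables survive; the rare one is mangled
print(fixed)
```

The common syllables decode identically under both charsets, which is why only a tiny fraction of the text is affected.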
Edit 1: I've tested the same code on my laptop, and I'm still running into the same issue! If anyone else wants to check whether they get the same result, replace the entire string in the source variable near the top of the code with "http://comm.news.nate.com/comment/articlecomment/list?artc_sq=20170818n20195&prebest=0&order=o&mid=n1008&domain=&arglist=0" and see if it reproduces.
Edit 2: Could it possibly be the user-agent I'm using?
Edit 3: I'm fairly sure it's a euc-kr vs. utf-8 issue. The page I'm scraping is encoded in euc-kr, and I have a feeling something in my code is conflicting with how the text is being read.
Edit 4: I ran the chardet module on the page I'm scraping, and it said the encoding is cp949, not euc-kr as I had thought. Also, I tested the code in Spyder instead of PyCharm: the same issue occurs.
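Given the chardet result, a minimal sketch of the fix (assuming cp949 is correct and that r is the Response object from the code above): tell requests the real encoding before reading r.text, or decode the raw bytes yourself.

```python
# Two ways to apply the cp949 finding (r is the requests Response
# from the code above; these commented lines are the sketch, not run here):
#
#     r.encoding = "cp949"          # set before the first access to r.text
#     html = r.text
#
# or, bypassing requests' charset guess entirely:
#
#     html = r.content.decode("cp949")
#
# Offline check that cp949 really round-trips the syllable that was
# being mangled:
roundtrip = "고소를해줫어".encode("cp949").decode("cp949")
print(roundtrip)
```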
It looks like a weird error. As most of the hangul characters are correctly displayed, it cannot be a simple encoding problem. It looks even stranger: for example, it replaces the syllable jweos ('줫', U+C92B) with a replacement character ('�') followed by a Latin letter Z (U+005A). I cannot imagine where that Z can come from, as none of the encodings I know of convert U+C92B into anything followed by 0x5A.
I can only imagine data corruption.