python - Certain Korean characters are showing up as question marks/diamonds when scraping, how can I fix this?
I'm scraping text in Korean, and about 99.9% of the characters come through fine. The rest show up like this:
�z
For example, when scraping "고소를해줫어", the output gives me "고소를해�z어".
I know it's an encoding issue, but I don't know how to fix it. I've read that you can use .encode('utf-8'), but that did not solve it.
Any help appreciated!
Full code added for context (beginner programmer, so please excuse the messy code!):
import bs4 as bs
import requests

raw_link = input("Enter the article's URL: ")
article_id = raw_link[26:40]
source = "http://comm.news.nate.com/comment/articlecomment/list?artc_sq=" + article_id + "&prebest=0&order=o&mid=n1008&domain=&arglist=0"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
r = requests.get(source, headers=headers)
html = r.text
soup = bs.BeautifulSoup(html, 'lxml')

upvotes = []
downvotes = []
comment_list = []
user_list = []
numbered_list = [1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20.]
raw_numbered_list = list(map(int, numbered_list))

for url in soup.select('strong[name*="cmt_o_cnt_"]')[3:]:
    raw_numbers_up = url.text.strip()
    upvotes.append(raw_numbers_up)

for url in soup.select('strong[name*="cmt_x_cnt_"]')[3:]:
    raw_numbers_down = url.text.strip()
    downvotes.append(raw_numbers_down)

for url in soup.find_all('dd', class_="usertxt")[3:]:
    comments = url.text.strip()
    comment_list.append(comments)

for url in soup.find_all('span', {'class': ['nameui', 't']})[3:]:
    user_id = url.text.strip()
    user_list.append(user_id)

results = list(zip(raw_numbered_list, upvotes, downvotes, user_list, comment_list))

for number, upvote, downvote, user, comment in results:
    replies = "\n{}. [+{}, -{}] {}:\n{}".format(number, upvote, downvote, user, comment)
    print(replies)
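A minimal offline sketch of what is likely happening inside html = r.text: requests decodes r.content with the charset it believes the page uses (with errors='replace'), and if that charset cannot represent some syllables, they come out as '�' plus stray bytes. Here hard-coded bytes stand in for r.content, and cp949 is assumed to be the page's true encoding (an assumption at this point in the post, though the chardet result in edit 4 points the same way):

```python
# Offline stand-in for r.content: the sample string as cp949 bytes.
# (cp949 as the page's true encoding is an assumption here.)
page_bytes = "고소를해줫어".encode("cp949")

# What requests effectively does for r.text when it believes the page
# is euc-kr: decode with that charset, replacing undecodable bytes.
mangled = page_bytes.decode("euc-kr", errors="replace")

# Decoding with the true encoding instead recovers the text.
fixed = page_bytes.decode("cp949")

print(mangled)  # the common syllables survive; the rare one is mangled
print(fixed)
```

The common syllables decode identically under both charsets, which is why only a tiny fraction of the text is affected.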
Edit 1: I've tested the same code on my laptop, and I'm still running into the same issue! If anyone else wants to check whether they get the same result, replace the entire string in the source variable near the top of the code with "http://comm.news.nate.com/comment/articlecomment/list?artc_sq=20170818n20195&prebest=0&order=o&mid=n1008&domain=&arglist=0" and see if it reproduces.
Edit 2: Could it possibly be the user-agent I'm using?
Edit 3: I'm fairly sure it's a euc-kr vs. utf-8 issue. The page I'm scraping is encoded in euc-kr, and I have a feeling something in my code is conflicting with how the text is being read.
Edit 4: I ran the chardet module on the page I'm scraping, and it said the encoding is cp949, not euc-kr as I had thought. Also, I tested the code in Spyder instead of PyCharm: the same issue occurs.
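Given the chardet result, a minimal sketch of the fix (assuming cp949 is correct and that r is the Response object from the code above): tell requests the real encoding before reading r.text, or decode the raw bytes yourself.

```python
# Two ways to apply the cp949 finding (r is the requests Response
# from the code above; these commented lines are the sketch, not run here):
#
#     r.encoding = "cp949"          # set before the first access to r.text
#     html = r.text
#
# or, bypassing requests' charset guess entirely:
#
#     html = r.content.decode("cp949")
#
# Offline check that cp949 really round-trips the syllable that was
# being mangled:
roundtrip = "고소를해줫어".encode("cp949").decode("cp949")
print(roundtrip)
```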
It looks like a weird error. As most of the hangul characters are correctly displayed, it cannot be a simple encoding problem. It looks even stranger: for example, it replaces the syllable jweos ('줫', U+C92B) with a replacement character ('�') followed by a Latin letter Z (U+005A). I cannot imagine where that Z can come from, as none of the encodings I know of convert U+C92B into anything followed by 0x5A.
I can only imagine data corruption.