python - Error 403: HTTP status code is not handled or not allowed in Scrapy
This is the code I have written to scrape the Justdial website:
```python
import scrapy
from scrapy.http.request import Request


class JustdialSpider(scrapy.Spider):
    name = 'justdial'
    # handle_httpstatus_list = [400]
    # headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
    # handle_httpstatus_list = [403, 404]
    allowed_domains = ['justdial.com']
    start_urls = ['https://www.justdial.com/delhi-ncr/chemists/page-1']

    # def start_requests(self):
    #     headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
    #     for url in self.start_urls:
    #         self.log("I visited: " + url)
    #         yield Request(url, headers=headers)

    def parse(self, response):
        self.log("I visited the site: " + response.url)
        urls = response.xpath('//a/@href').extract()
        self.log("urls: " + str(urls))
```
This is the error shown in the terminal:
```
2017-08-18 18:32:25 [scrapy.core.engine] INFO: Spider opened
2017-08-18 18:32:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-18 18:32:25 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in D:\Scrapy\justdial\.scrapy\httpcache
2017-08-18 18:32:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/robots.txt> (referer: None) ['cached']
2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/delhi-ncr/chemists/page-1> (referer: None) ['cached']
2017-08-18 18:32:25 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.justdial.com/delhi-ncr/chemists/page-1>: HTTP status code is not handled or not allowed
```
I have seen similar questions on Stack Overflow and tried the suggestions, as you can see in the commented-out code above:

- changing user agents
- setting `handle_httpstatus_list = [400]`
Note: the website (https://www.justdial.com/delhi-ncr/chemists/page-1) is not blocked on my system; it opens fine in Chrome/Mozilla. I get the same error with another site (https://www.practo.com/bangalore#doctor-search) as well.
When I set the user agent via the `user_agent` spider attribute, the spider starts to work. Setting the headers on the request is not enough, because they get overridden by the default user agent string. Set the spider attribute (the same way you set `start_urls`):

```
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
```

Try it.