python - Error 403: HTTP status code is not handled or not allowed in Scrapy
This is the code I have written to scrape the Justdial website:
```python
import scrapy
from scrapy.http.request import Request


class JustdialSpider(scrapy.Spider):
    name = 'justdial'
    # handle_httpstatus_list = [400]
    # headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
    # handle_httpstatus_list = [403, 404]
    allowed_domains = ['justdial.com']
    start_urls = ['https://www.justdial.com/delhi-ncr/chemists/page-1']

    # def start_requests(self):
    #     headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
    #     for url in self.start_urls:
    #         self.log("I visited: " + url)
    #         yield Request(url, headers=headers)

    def parse(self, response):
        self.log("I visited the site: " + response.url)
        urls = response.xpath('//a/@href').extract()
        self.log("urls: " + str(urls))
```
This is the error shown in the terminal:
```
2017-08-18 18:32:25 [scrapy.core.engine] INFO: Spider opened
2017-08-18 18:32:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-18 18:32:25 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in D:\Scrapy\justdial\.scrapy\httpcache
2017-08-18 18:32:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/robots.txt> (referer: None) ['cached']
2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/delhi-ncr/chemists/page-1> (referer: None) ['cached']
2017-08-18 18:32:25 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.justdial.com/delhi-ncr/chemists/page-1>: HTTP status code is not handled or not allowed
```
I have seen similar questions on Stack Overflow and tried the suggestions, as you can see in the commented-out code above:

- changing user agents
- setting `handle_httpstatus_list = [400]`
Note: the website (https://www.justdial.com/delhi-ncr/chemists/page-1) is not blocked on my system; it opens fine in Chrome/Mozilla. I get the same error with another site (https://www.practo.com/bangalore#doctor-search) as well.
When I set the user agent via the `user_agent` spider attribute, the spider starts to work. Setting the headers on the request is not enough, because they get overridden by the default user agent string. Set the spider attribute (the same way you set `start_urls`):

```
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
```

Try it.