python - Error 403: HTTP status code is not handled or not allowed in Scrapy
This is the code I have written to scrape the Justdial website.
    import scrapy
    from scrapy.http.request import Request


    class JustdialSpider(scrapy.Spider):
        name = 'justdial'
        # handle_httpstatus_list = [400]
        # headers={'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
        # handle_httpstatus_list = [403, 404]
        allowed_domains = ['justdial.com']
        start_urls = ['https://www.justdial.com/delhi-ncr/chemists/page-1']

        # def start_requests(self):
        #     headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        #     for url in self.start_urls:
        #         self.log("I visited :---------------------------------- " + url)
        #         yield Request(url, headers=headers)

        def parse(self, response):
            self.log("I visited site:---------------------------------------------- " + response.url)
            urls = response.xpath('//a/@href').extract()
            self.log("Urls-------: " + str(urls))

This is the error shown in the terminal:
    2017-08-18 18:32:25 [scrapy.core.engine] INFO: Spider opened
    2017-08-18 18:32:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2017-08-18 18:32:25 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in d:\scrapy\justdial\.scrapy\httpcache
    2017-08-18 18:32:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/robots.txt> (referer: None) ['cached']
    2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/delhi-ncr/chemists/page-1> (referer: None) ['cached']
    2017-08-18 18:32:25 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.justdial.com/delhi-ncr/chemists/page-1>: HTTP status code is not handled or not allowed

I have seen similar questions on Stack Overflow and tried their suggestions, as you can see in the commented-out code. I tried:
- changing the user agent
- setting handle_httpstatus_list = [400]
Note: the website (https://www.justdial.com/delhi-ncr/chemists/page-1) is not blocked on my system. When I open it in Chrome or Mozilla Firefox, it opens fine. I get the same error for the (https://www.practo.com/bangalore#doctor-search) site as well.
When I set the user agent using the user_agent spider attribute, it starts to work. Setting the request headers may not be enough, as the header seems to get overridden by the default user agent string. So set the spider attribute
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1" (the same way you set start_urls), and try it.