python - Error 403: HTTP status code is not handled or not allowed in scrapy


This is the code I have written to scrape the justdial website:

import scrapy
from scrapy.http.request import Request

class JustdialSpider(scrapy.Spider):
    name = 'justdial'
    # handle_httpstatus_list = [400]
    # headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
    # handle_httpstatus_list = [403, 404]
    allowed_domains = ['justdial.com']
    start_urls = ['https://www.justdial.com/delhi-ncr/chemists/page-1']

    # def start_requests(self):
    #     headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
    #     for url in self.start_urls:
    #         self.log("I visited: " + url)
    #         yield Request(url, headers=headers)

    def parse(self, response):
        self.log("I visited the site: " + response.url)
        urls = response.xpath('//a/@href').extract()
        self.log("urls: " + str(urls))

This is the error shown in the terminal:

2017-08-18 18:32:25 [scrapy.core.engine] INFO: Spider opened
2017-08-18 18:32:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-18 18:32:25 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in d:\scrapy\justdial\.scrapy\httpcache
2017-08-18 18:32:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/robots.txt> (referer: None) ['cached']
2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/delhi-ncr/chemists/page-1> (referer: None) ['cached']
2017-08-18 18:32:25 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.justdial.com/delhi-ncr/chemists/page-1>: HTTP status code is not handled or not allowed

I have seen similar questions on StackOverflow and tried their suggestions, as you can see in the commented-out code above. I tried (see the sketch after this list):

  • Changing the user agent

  • Setting handle_httpstatus_list = [400]
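
For reference, this is roughly what those two attempts look like when enabled, reconstructed from the commented-out lines in the spider above (the Firefox user-agent string and the [403, 404] list are taken from those comments):

import scrapy
from scrapy.http.request import Request

class JustdialSpider(scrapy.Spider):
    name = 'justdial'
    allowed_domains = ['justdial.com']
    start_urls = ['https://www.justdial.com/delhi-ncr/chemists/page-1']

    # Attempt: let 403/404 responses reach parse() instead of being dropped
    # by the HttpError spider middleware (this only silences the "not handled
    # or not allowed" message, it does not make the site return 200).
    handle_httpstatus_list = [403, 404]

    # Attempt: send a browser User-Agent header on each request.
    def start_requests(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) '
                                 'Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield Request(url, headers=headers)

    def parse(self, response):
        self.log("I visited the site: " + response.url)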

Note: the website (https://www.justdial.com/delhi-ncr/chemists/page-1) is not blocked on my system; when I open it in Chrome/Mozilla, it loads fine. I get the same error with the (https://www.practo.com/bangalore#doctor-search) site as well.

When you set the user agent using the user_agent spider attribute, it starts to work. Setting the request headers alone is not enough, because it gets overridden by the default user agent string. Set the spider attribute

user_agent = "mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.1 (khtml, gecko) chrome/22.0.1207.1 safari/537.1" 

(the same way you set start_urls), and try it.
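
As a minimal sketch, the spider with that attribute added would look like this (same start URL and parse() as in the question, with the user-agent string quoted above):

import scrapy

class JustdialSpider(scrapy.Spider):
    name = 'justdial'
    allowed_domains = ['justdial.com']
    start_urls = ['https://www.justdial.com/delhi-ncr/chemists/page-1']

    # Spider-level user agent: Scrapy's UserAgentMiddleware picks this up,
    # so every request (the start_urls included) is sent with this string.
    user_agent = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
                  "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1")

    def parse(self, response):
        self.log("I visited the site: " + response.url)
        urls = response.xpath('//a/@href').extract()
        self.log("urls: " + str(urls))

Equivalently, you can set USER_AGENT in the project's settings.py to apply the same string to every spider in the project; the spider attribute just scopes it to this one spider.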

