python - Error 403 : HTTP status code is not handled or not allowed in scrapy -


I have written this code to scrape the Justdial website.

    import scrapy
    from scrapy.http.request import Request


    class JustdialSpider(scrapy.Spider):
        name = 'justdial'
        # handle_httpstatus_list = [400]
        # headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
        # handle_httpstatus_list = [403, 404]
        allowed_domains = ['justdial.com']
        start_urls = ['https://www.justdial.com/delhi-ncr/chemists/page-1']

        # def start_requests(self):
        #     headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        #     for url in self.start_urls:
        #         self.log("I visited: " + url)
        #         yield Request(url, headers=headers)

        def parse(self, response):
            self.log("I visited the site: " + response.url)
            urls = response.xpath('//a/@href').extract()
            self.log("urls: " + str(urls))

This error is showing in the terminal:

    2017-08-18 18:32:25 [scrapy.core.engine] INFO: Spider opened
    2017-08-18 18:32:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2017-08-18 18:32:25 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in D:\scrapy\justdial\.scrapy\httpcache
    2017-08-18 18:32:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/robots.txt> (referer: None) ['cached']
    2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/delhi-ncr/chemists/page-1> (referer: None) ['cached']
    2017-08-18 18:32:25 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.justdial.com/delhi-ncr/chemists/page-1>: HTTP status code is not handled or not allowed

I have seen similar questions on Stack Overflow and tried their suggestions, as you can see in the commented-out code above:

  • changing User-Agents

  • setting handle_httpstatus_list = [400]

Note: the site (https://www.justdial.com/delhi-ncr/chemists/page-1) is not blocked on my system; when I open it in Chrome/Firefox it loads fine. I get the same error with (https://www.practo.com/bangalore#doctor-search) as well.

When you set the User-Agent via the user_agent spider attribute, it starts to work. Setting the header on the request alone is not enough, because it gets overridden by the default user agent string. Set it as a spider attribute:

    user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"

(the same way you set start_urls). Try it.


