python - Scrapy returns repeated, out-of-order results when using a for loop, but not when going link by link
I am attempting to use Scrapy to crawl a site. Here is my code:

```python
import scrapy


class ArticleSpider(scrapy.Spider):
    name = "article"
    start_urls = [
        'http://www.irna.ir/en/services/161',
    ]

    def parse(self, response):
        for linknum in range(1, 15):
            next_article = response.xpath(
                '//*[@id="newsimageverticalitems"]/div[%d]/div[2]/h3/a/@href' % linknum
            ).extract_first()
            next_article = response.urljoin(next_article)
            yield scrapy.Request(next_article)

        for text in response.xpath('//*[@id="ctl00_ctl00_contentplaceholder_contentplaceholder_newscontent4_bodylabel"]'):
            yield {
                'article': text.xpath('./text()').extract()
            }

        for tag in response.xpath('//*[@id="ctl00_ctl00_contentplaceholder_contentplaceholder_newscontent4_bodytext"]'):
            yield {
                'tag1': tag.xpath('./div[3]/p[1]/a/text()').extract(),
                'tag2': tag.xpath('./div[3]/p[2]/a/text()').extract(),
                'tag3': tag.xpath('./div[3]/p[3]/a/text()').extract(),
                'tag4': tag.xpath('./div[3]/p[4]/a/text()').extract(),
            }

        yield response.follow('http://www.irna.ir/en/services/161', callback=self.parse)
```
But this returns a weird mixture of repeated and out-of-order items in the JSON output, and it skips some links: https://pastebin.com/lvkjhrrt

However, when I set linknum to a single number instead of looping, the code works fine.

Why does iterating change the results?
As @tarunlalwani stated, the current approach is not right. You should:

- In the `parse` method, extract the links to the articles on the page and yield requests for scraping them with a callback named e.g. `parse_article`.
- Still in the `parse` method, check whether the button for loading more articles is present and, if so, yield a request for a URL of the pattern http://www.irna.ir/en/services/161/pageN. (This can be found in the browser's developer tools under XHR requests on the Network tab.)
- Define a `parse_article` method to extract the article text and tags from the details page and yield the item.
Below is the final spider:

```python
import scrapy


class IrnaSpider(scrapy.Spider):
    name = 'irna'
    base_url = 'http://www.irna.ir/en/services/161'

    def start_requests(self):
        yield scrapy.Request(self.base_url, meta={'page_number': 1})

    def parse(self, response):
        # Yield a request per article link, handled by parse_article.
        for article_url in response.css('.datalistcontainer h3 a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(article_url), callback=self.parse_article)

        # If the "more" button is present, request the next page.
        page_number = response.meta['page_number'] + 1
        if response.css('#morebutton'):
            yield scrapy.Request(
                '{}/page{}'.format(self.base_url, page_number),
                callback=self.parse,
                meta={'page_number': page_number})

    def parse_article(self, response):
        yield {
            'text': ' '.join(response.xpath(
                '//p[@id="ctl00_ctl00_contentplaceholder_contentplaceholder_newscontent4_bodylabel"]/text()'
            ).extract()),
            'tags': [tag.strip() for tag in response.xpath(
                '//div[@class="tags"]/p/a/text()'
            ).extract() if tag.strip()]
        }
```
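The list comprehension in `parse_article` drops empty strings and surrounding whitespace from the extracted tag texts. In isolation it behaves like this (the raw strings below are made-up samples, not real site data):

```python
# XPath text extraction often returns strings with stray whitespace
# and empty entries; strip them and keep only non-empty tags.
raw_tags = ['  Iran  ', '\n', 'Economy', '', ' Oil ']
tags = [tag.strip() for tag in raw_tags if tag.strip()]
print(tags)  # ['Iran', 'Economy', 'Oil']
```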