python - Scrapy returns repeated, out-of-order results when using a for loop, but not when going link by link
I am attempting to use Scrapy to crawl a site. Here is my code:

```python
import scrapy


class ArticleSpider(scrapy.Spider):
    name = "article"
    start_urls = [
        'http://www.irna.ir/en/services/161',
    ]

    def parse(self, response):
        for linknum in range(1, 15):
            next_article = response.xpath(
                '//*[@id="newsimageverticalitems"]/div[%d]/div[2]/h3/a/@href' % linknum
            ).extract_first()
            next_article = response.urljoin(next_article)
            yield scrapy.Request(next_article)

        for text in response.xpath('//*[@id="ctl00_ctl00_contentplaceholder_contentplaceholder_newscontent4_bodylabel"]'):
            yield {
                'article': text.xpath('./text()').extract()
            }

        for tag in response.xpath('//*[@id="ctl00_ctl00_contentplaceholder_contentplaceholder_newscontent4_bodytext"]'):
            yield {
                'tag1': tag.xpath('./div[3]/p[1]/a/text()').extract(),
                'tag2': tag.xpath('./div[3]/p[2]/a/text()').extract(),
                'tag3': tag.xpath('./div[3]/p[3]/a/text()').extract(),
                'tag4': tag.xpath('./div[3]/p[4]/a/text()').extract(),
            }

        yield response.follow('http://www.irna.ir/en/services/161', callback=self.parse)
```
But this returns a weird mixture of repeated and out-of-order items in the JSON output, and it skips some links: https://pastebin.com/lvkjhrrt

However, when I set linknum to a single number instead of looping, the code works fine.

Why does iterating change the results?
As @tarunlalwani stated, the current approach is not right. You should:

- In the `parse` method, extract the links to the articles on the page and yield requests for scraping them with a callback named e.g. `parse_article`.
- Still in the `parse` method, check whether the button for loading more articles is present and, if so, yield a request for a URL of the pattern http://www.irna.ir/en/services/161/pageN. (This can be found in the browser's developer tools under XHR requests on the Network tab.)
- Define a `parse_article` method to extract the article text and tags from the details page and yield the item.
Below is the final spider:

```python
import scrapy


class IrnaSpider(scrapy.Spider):
    name = 'irna'
    base_url = 'http://www.irna.ir/en/services/161'

    def start_requests(self):
        yield scrapy.Request(self.base_url, meta={'page_number': 1})

    def parse(self, response):
        # Yield a request per article link, handled by parse_article.
        for article_url in response.css('.datalistcontainer h3 a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(article_url), callback=self.parse_article)

        # If the "more" button is present, request the next page.
        page_number = response.meta['page_number'] + 1
        if response.css('#morebutton'):
            yield scrapy.Request(
                '{}/page{}'.format(self.base_url, page_number),
                callback=self.parse,
                meta={'page_number': page_number})

    def parse_article(self, response):
        yield {
            'text': ' '.join(response.xpath(
                '//p[@id="ctl00_ctl00_contentplaceholder_contentplaceholder_newscontent4_bodylabel"]/text()'
            ).extract()),
            'tags': [tag.strip() for tag in response.xpath(
                '//div[@class="tags"]/p/a/text()'
            ).extract() if tag.strip()]
        }
```
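The list comprehension in `parse_article` drops empty strings and surrounding whitespace from the extracted tag texts. In isolation it behaves like this (the raw strings below are made-up samples, not real site data):

```python
# XPath text extraction often returns strings with stray whitespace
# and empty entries; strip them and keep only non-empty tags.
raw_tags = ['  Iran  ', '\n', 'Economy', '', ' Oil ']
tags = [tag.strip() for tag in raw_tags if tag.strip()]
print(tags)  # ['Iran', 'Economy', 'Oil']
```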