python - Scrapy crawling speed is so slow (6 pages/min)!
I am new to Scrapy and built a project using scrapy startproject zhanlang. When I start the spider with scrapy crawl zhanlang -o zhanlang.csv, it runs very slowly: only 6 pages/min! Here is the code:
    def after_login(self, response):
        # The site requires a login; this method runs after logging in.
        yield Request(
            url="https://movie.douban.com/subject/26363254/comments?start=0&limit=20&sort=new_score&status=p",
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.parse
        )

    def parse(self, response):
        for comment in response.xpath('//div[@class="comment-item"]'):
            # Create a fresh item per comment so yielded items are independent.
            item = ZhanlangItem()
            item['name'] = comment.xpath('./div[@class="avatar"]/a/@title').extract_first()
            item['text'] = comment.xpath('./div[@class="comment"]/p/text()').extract()
            item['vote'] = comment.xpath('.//span[@class="votes"]/text()').extract_first()
            yield item

        # extract_first() returns None on the last page instead of raising IndexError.
        next_page_url = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_page_url is not None:
            next_page_url = "https://movie.douban.com/subject/26363254/comments" + next_page_url
            print next_page_url
            yield Request(
                url=next_page_url,
                meta={'cookiejar': response.meta['cookiejar']},
                callback=self.parse
            )
Here are the settings:
    DOWNLOAD_DELAY = 0.5
    # The download delay setting will honor only one of:
    CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'zhanlang.middlewares.RandomUserAgentMiddleware': 400,
    }
My middlewares.py is:
    from fake_useragent import UserAgent
    import requests, random, json
    import base64


    class RandomUserAgentMiddleware(object):
        # Sets a randomly chosen User-Agent on each request.

        def __init__(self, crawler):
            super(RandomUserAgentMiddleware, self).__init__()
            self.ua = UserAgent()
            self.ua_type = crawler.settings.get('RANDOM_UA_TYPE', 'random')

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def process_request(self, request, spider):
            def get_ua():
                return getattr(self.ua, self.ua_type)

            request.headers.setdefault('User-Agent', get_ua())
Why does it crawl so slowly? How should I increase the speed? Thanks.
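
As a quick sanity check on the numbers in this question, here is a minimal sketch of the arithmetic. The helper names are hypothetical and not part of Scrapy. It assumes the spider fetches pages one after another (each response yields only the next-page request) and that RANDOMIZE_DOWNLOAD_DELAY is left at its default, which scales the delay by a random factor between 0.5 and 1.5.

```python
# Rough throughput arithmetic for a crawl that fetches one page at a time.
# All function names here are illustrative, not Scrapy APIs.

def seconds_per_page(pages_per_minute):
    """Average wall-clock time spent on each page at the observed rate."""
    return 60.0 / pages_per_minute


def delay_only_ceiling(download_delay):
    """Pages per minute if the download delay were the only cost."""
    return 60.0 / download_delay


def worst_case_random_delay(download_delay):
    """Longest delay under the default 0.5x to 1.5x randomization."""
    return 1.5 * download_delay


observed = seconds_per_page(6)              # rate reported in the question
ceiling = delay_only_ceiling(0.5)           # DOWNLOAD_DELAY from the settings
worst_delay = worst_case_random_delay(0.5)

print("observed: %.1f s per page" % observed)
print("delay-only ceiling: %.0f pages/min" % ceiling)
print("worst-case delay: %.2f s" % worst_delay)
```

If these assumptions hold, the configured delay accounts for well under one second of the roughly ten seconds spent per page, so most of the time would have to come from somewhere else, such as server response time or per-request work in a middleware. The arithmetic alone cannot say which.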