web scraping - Why does Scrapy hash the URL when downloading an image?
Why does Scrapy hash the URL when downloading an image? As seen here https://doc.scrapy.org/en/latest/topics/media-pipeline.html, SHA1 is used to hash the URL to give a name to each image. Is there a practical advantage to doing this?
I don't think there are any major advantages to storing files under SHA1-hashed URLs.
There are a few minor advantages though:
- Getting rid of unsafe characters: characters such as / and : are not file-system safe, so having a filename made only of word characters plus .jpg is convenient.
- Constant length (might be useful in rare cases).
- Easy to validate and dupe-filter, since the same URL always comes out as the same filename.
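The three points above can be checked with a few lines of plain Python. This is only a sketch of the naming scheme described in the docs, not Scrapy's actual code path; the helper name sha1_image_path and the example.com URLs are made up for illustration.

```python
import hashlib


def sha1_image_path(url):
    """Hypothetical helper mimicking the default 'full/<sha1>.jpg' naming."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/%s.jpg" % image_guid


# Made-up URLs: one with unsafe characters, one short and plain.
url_a = "https://example.com/images/cat photo.png?size=large"
url_b = "https://example.com/i.png"

path_a = sha1_image_path(url_a)
path_b = sha1_image_path(url_b)

# The hashed part is always 40 hex characters, whatever the URL looks like,
# and hashing the same URL again gives the same path.
print(path_a)
print(path_b)
print(path_a == sha1_image_path(url_a))
```

The cost is that the resulting names say nothing about where the image came from, which is why you might want to override the naming.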
Personally I think it's a pretty lazy solution. Fortunately it can be extended, though it's not as straightforward as it should be.
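    import re

    from scrapy.pipelines.images import ImagesPipeline


    class MyImagesPipeline(ImagesPipeline):

        def filename(self, url):
            # keep only letters, digits and '-', '_', '.'; drop everything else
            return re.sub(r'[^A-Za-z0-9._-]', '', url)

        def file_path(self, request, response=None, info=None, *, item=None):
            # original code
            # image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
            # return 'full/%s.jpg' % (image_guid)
            # our code (use request.url, since response can be None here)
            return 'full/' + self.filename(request.url)

        def thumb_path(self, request, thumb_id, response=None, info=None, *, item=None):
            # thumb_id keeps thumbnails of different sizes apart
            return 'thumb/%s/%s' % (thumb_id, self.filename(request.url))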
Then enable the pipeline in settings.py.
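A minimal sketch of what that could look like, assuming the class lives in a module called myproject/pipelines.py (that path, the priority value and the storage directory are placeholders, adjust them to your project):

```python
# settings.py

# Register the custom pipeline in place of the stock ImagesPipeline.
# "myproject.pipelines" is a hypothetical module path.
ITEM_PIPELINES = {
    "myproject.pipelines.MyImagesPipeline": 1,
}

# Directory where downloaded images are stored (placeholder path).
IMAGES_STORE = "/path/to/image/dir"
```

ITEM_PIPELINES and IMAGES_STORE are the standard Scrapy settings for this; only the values shown here are assumptions.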