web scraping - Why does Scrapy hash the URL when downloading an image?


Why does Scrapy hash the URL when downloading an image? As seen at https://doc.scrapy.org/en/latest/topics/media-pipeline.html, SHA1 is used to hash the URL and give each image its name. Is there a practical advantage to doing this?

I don't think there are any major advantages to storing files under SHA1-hashed URLs.
There are a few minor advantages, though:

  • Getting rid of unsafe characters: characters such as / and : are not filesystem-safe, so a filename made only of word characters plus .jpg is convenient.
  • Constant length (might be useful in rare cases).
  • Easy to validate for dupe filtering, since the same URL always comes out with the same filename.
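The three points above can be seen in a small standalone sketch. It mimics the naming scheme quoted in the answer's code comment (`'full/%s.jpg'` with a SHA1 hex digest) using only the standard library; the function name `hashed_image_path` and the example URLs are made up for illustration and are not part of Scrapy.

```python
import hashlib


def hashed_image_path(url: str) -> str:
    """Build a 'full/<sha1 hex digest>.jpg' path from a URL (illustrative helper)."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/%s.jpg" % image_guid


if __name__ == "__main__":
    short_url = "https://example.com/a.jpg"
    long_url = "https://example.com/some/very/long/path/cat photo.jpg?size=large&v=2"

    # Only hex digits appear in the name, whatever characters the URL contains.
    print(hashed_image_path(long_url))

    # A SHA1 hex digest is always 40 characters, so both paths have the same length.
    print(len(hashed_image_path(short_url)) == len(hashed_image_path(long_url)))

    # The same URL always maps to the same filename.
    print(hashed_image_path(short_url) == hashed_image_path(short_url))
```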

Personally, I think it's a pretty lazy solution. Fortunately it can be extended, though it's not as straightforward as it should be.

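    import re

    from scrapy.pipelines.images import ImagesPipeline


    class MyImagesPipeline(ImagesPipeline):
        def filename(self, url):
            # keep letters, digits, '_', '-' and '.'; replace everything else
            return re.sub(r'[^\w.-]', '_', url)

        def file_path(self, request, response=None, info=None, *, item=None):
            # original code
            # image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
            # return 'full/%s.jpg' % (image_guid)
            # our code (uses request.url, since response may be None here)
            return 'full/' + self.filename(request.url)

        def thumb_path(self, request, thumb_id, response=None, info=None, *, item=None):
            # include thumb_id so different thumbnail sizes do not overwrite each other
            return 'thumb/%s/%s' % (thumb_id, self.filename(request.url))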
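The name-cleaning step can also be tried on its own, without Scrapy installed. The sketch below is one possible way to do it, assuming that anything other than ASCII letters, digits, `_`, `-` and `.` should become an underscore; the helper name `safe_filename` and the URLs are hypothetical.

```python
import re


def safe_filename(url: str) -> str:
    """Replace every character that is not an ASCII letter, digit, '_', '-' or '.'."""
    return re.sub(r"[^\w.-]", "_", url, flags=re.ASCII)


if __name__ == "__main__":
    print(safe_filename("https://example.com/a/b.jpg"))

    # Two different URLs can collapse to the same cleaned name.
    first = safe_filename("https://example.com/a/b.jpg")
    second = safe_filename("https://example.com/a?b.jpg")
    print(first == second)
```

The last comparison shows the trade-off: readable names are no longer guaranteed to be unique per URL, which is something the hashed names largely avoid.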

Then enable the pipeline in settings.py.
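The settings entry itself is missing from this copy of the answer, so here is a rough sketch of what it might look like. `ITEM_PIPELINES` and `IMAGES_STORE` are standard Scrapy settings; the module path `myproject.pipelines`, the priority value and the storage directory are assumptions to adapt to your own project.

```python
# settings.py (sketch; adjust the module path to wherever the pipeline class lives)
ITEM_PIPELINES = {
    # hypothetical module path, with an arbitrary priority value
    "myproject.pipelines.MyImagesPipeline": 1,
}

# Directory where downloaded images are stored (assumed location)
IMAGES_STORE = "images"
```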

