web scraping - Why does Scrapy hash the URL when downloading an image?

September 15, 2013

Why does Scrapy hash the URL when downloading an image? As seen at https://doc.scrapy.org/en/latest/topics/media-pipeline.html, SHA-1 is used to hash the URL and give each image its name. Is there any practical advantage to doing this?

I don't think there are any major advantages to storing files under SHA-1-hashed URLs.
There are a few minor advantages, though:

Getting rid of unsafe characters: characters such as /:. are not file-system safe, so having a filename made only of word characters + .jpg is convenient.
Constant length (might be useful in rare cases).
Easy to validate and dupe-filter, since the same URL always produces the same filename.
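The three points above can be checked with a few lines of plain Python. This is a minimal sketch and not Scrapy's own code: `hashed_name` is a hypothetical helper that applies SHA-1 to the URL as described in the question, and the example URLs are made up.

```python
import hashlib
import string


def hashed_name(url):
    """Hypothetical helper: SHA-1 hex digest of the URL plus a .jpg suffix."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + ".jpg"


# Made-up URLs, used only for illustration.
short_url = "https://example.com/a.jpg"
long_url = "https://example.com/some/very/long/path/image.jpg?size=large&v=2"

name_a = hashed_name(short_url)
name_b = hashed_name(long_url)

# 1. Only hex digits before the ".jpg" suffix, so no '/', ':' or '?' survives.
assert set(name_a[:-4]) <= set(string.hexdigits)

# 2. Constant length: 40 hex characters + 4 for ".jpg", whatever the URL length.
assert len(name_a) == len(name_b) == 44

# 3. Deterministic: the same URL always maps to the same filename.
assert hashed_name(short_url) == name_a
```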

Personally, I think it's a pretty lazy solution. Fortunately it can be extended, though it's not as straightforward as it should be.

    import re

    from scrapy.pipelines.images import ImagesPipeline


    class MyImagesPipeline(ImagesPipeline):

        def filename(self, url):
            # keep only letters, digits, '-', '_' and '.'
            return re.sub(r'[^A-Za-z0-9\-_.]', '', url)

        def file_path(self, request, response=None, info=None, *, item=None):
            # original code
            # image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
            # return 'full/%s.jpg' % (image_guid)
            # our code (request.url, because response may be None here)
            return 'full/' + self.filename(request.url)

        def thumb_path(self, request, thumb_id, response=None, info=None, *, item=None):
            return 'thumb/' + self.filename(request.url)
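To see what such names look like, here is a small standalone sketch. It assumes the `filename` helper is meant to keep only letters, digits, `-`, `_` and `.`; `safe_name` is a hypothetical stand-in for it, and the URLs are made up.

```python
import re


def safe_name(url):
    """Hypothetical stand-in for the pipeline's filename() helper."""
    # Drop every character that is not a letter, digit, '-', '_' or '.'.
    return re.sub(r"[^A-Za-z0-9\-_.]", "", url)


print(safe_name("https://example.com/a/b.jpg"))  # httpsexample.comab.jpg

# Caveat: distinct URLs can collapse to the same name once '/' is removed.
assert safe_name("https://example.com/a/b.jpg") == safe_name("https://example.com/ab.jpg")
```

Unlike the hashed names, these stay readable, but they are neither constant-length nor guaranteed to be unique, so a real project may want to replace unsafe characters with a separator rather than drop them.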

Then enable the pipeline in settings.py.
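A settings fragment along these lines should do it. `ITEM_PIPELINES` and `IMAGES_STORE` are standard Scrapy settings, but the module path below is an assumption: it supposes the class is called `MyImagesPipeline` and lives in `pipelines.py` of a project package named `myproject`, so adjust it to your own layout.

```python
# settings.py
# 'myproject' is a placeholder for your project's package name.
ITEM_PIPELINES = {
    "myproject.pipelines.MyImagesPipeline": 1,
}

# Directory where downloaded images are stored (placeholder path).
IMAGES_STORE = "images"
```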
