Python - Extracting a JSON response in Scrapy
How do I use Scrapy to scrape an API that returns JSON? The JSON looks like this:
"records": [ { "uri": "https://www.example.com", "access": { "update": false }, "id": 17059, "vid": 37614, "name": "mylibery", "claim": null, "claimedby": null, "authoruid": "3", "lifecycle": "l", "companytype": "s", "ugcstate": 10, "companylogo": { "filename": "mylibery-logo.png", "filepath": "sites/default/files/imagecache/company_logo_70/mylibery-logo.png" }
I tried this code:
import scrapy
import json

class ApiItem(scrapy.Item):
    url = scrapy.Field()
    name = scrapy.Field()

class ExampleSpider(scrapy.Spider):
    name = 'api'
    allowed_domains = ["site.com"]
    start_urls = [l.strip() for l in open('pages.txt').readlines()]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
        jsonresponse = json.loads(response.body_as_unicode())
        item = ApiItem()
        item["url"] = jsonresponse["uri"]
        item["name"] = jsonresponse["name"]
        return item
"pages.txt" list of api pages want scrape , want extract "uri" , "name" , save csv.
But it throws this error:
2017-08-18 13:23:02 [scrapy] ERROR: Spider error processing <GET https://www.investiere.ch/proxy/api2/v1/companies?extra%5bimagecache%5d=company_logo_70&fields=companytype,lifecycle&page=8&parameters%5binclude_skipped%5d=yes> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/twisted/internet/defer.py", line 651, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/habenn/projects/inapi/inapi/spiders/example.py", line 22, in parse
    item["url"] = jsonresponse["uri"]
KeyError: 'uri'
Going by the example you've given, it should be this:
item["url"] = jsonresponse["records"][0]["uri"] item["name"] = jsonresponse["records"][0]["name"]
EDIT:

To get all the uris and names from the response, use this:
def parse(self, response):
    ...
    for record in jsonresponse["records"]:
        item = ApiItem()
        item["url"] = record["uri"]
        item["name"] = record["name"]
        yield item
Note in particular the replacement of return with yield.
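The same loop can be exercised without Scrapy at all: a sketch with a hypothetical parse_records generator that mirrors the parse() above, minus the Scrapy plumbing, on a made-up two-record payload:

```python
import json

def parse_records(body):
    """Yield one plain dict per record in the "records" list --
    same shape as the yield-based parse() above."""
    for record in json.loads(body)["records"]:
        yield {"url": record["uri"], "name": record["name"]}

body = ('{"records": [{"uri": "https://a.example", "name": "a"},'
        ' {"uri": "https://b.example", "name": "b"}]}')

# Because parse_records is a generator (yield, not return),
# it produces one item per record instead of stopping at the first.
items = list(parse_records(body))
print(items)
# [{'url': 'https://a.example', 'name': 'a'},
#  {'url': 'https://b.example', 'name': 'b'}]
```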