Python 2.7 - bs4: parse all URLs across a given number of archive pages
I wrote a function to find the article links on each page of a given page range, parse the links, and save the content.
However, instead of getting all the articles, the function is returning only the content of the first article of each page.
I guess I should go down one level, but I can't seem to find the solution myself.
See the code below:
from bs4 import BeautifulSoup
import urllib2

def getsoupfromurl(url):
    # Fetch a page with a browser-like User-Agent and parse it with lxml
    req = urllib2.Request(url, headers={'user-agent': "chrome"})
    con = urllib2.urlopen(req)
    page = con.read()
    soup = BeautifulSoup(page, "lxml")
    con.close()
    return soup

base_url = 'http://monitor.icef.com/'
articles = []
num_pages = 10

for i in range(num_pages):
    url = 'http://monitor.icef.com/page/{}'.format(i)
    soup = getsoupfromurl(url)
    print("..{}".format(i))
    # [0] selects only the first "article-header" div on the archive page
    for link in soup.find_all("div", class_="article-header")[0].find_all("a"):
        soup = getsoupfromurl(link.get("href"))
        paras = soup.find_all("div", class_="article-content")[0].find_all("p")
        content = " ".join([para.get_text() for para in paras])
        articles.append(content)
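One likely cause is the `[0]` index on the `article-header` lookup: it keeps only the first header `div` on each archive page, so only the first article's link is ever followed. Below is a minimal sketch of collecting the links from every header instead. It assumes each article on an archive page has its own `div` with class `article-header`. The helper name `extract_article_links` and the sample markup are hypothetical and used only for illustration, and the built-in `html.parser` is used so the sketch runs without `lxml`.

```python
from bs4 import BeautifulSoup


def extract_article_links(html, parser="html.parser"):
    """Return the href of every link inside every div.article-header."""
    soup = BeautifulSoup(html, parser)
    links = []
    # Loop over ALL header divs rather than indexing the first one with [0]
    for header in soup.find_all("div", class_="article-header"):
        # href=True skips anchors that have no href attribute
        for a in header.find_all("a", href=True):
            links.append(a["href"])
    return links


# Hypothetical archive-page markup, standing in for a downloaded page
sample_html = """
<div class="article-header"><a href="http://example.com/a1">One</a></div>
<div class="article-header"><a href="http://example.com/a2">Two</a></div>
"""
article_links = extract_article_links(sample_html)
```

In the loop from the question, the same idea would mean iterating over every `article-header` div before calling `find_all("a")`, and storing each fetched article in a separate variable (for example `article_soup`) so that the archive page's `soup` is not overwritten inside the loop.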