python 2.7 - bs4: parse all URLs across a given number of archive pages

August 15, 2011

I wrote a function to find the article links on each page of a given page range, parse those links, and save the content.

However, instead of getting all the articles, the function returns only the content of the first article on each page.

I guess I should go down one level, but I can't seem to find the solution myself.

See the code below:

from bs4 import BeautifulSoup
import urllib2

def getsoupfromurl(url):
    # Fetch a page with a browser-like User-Agent and parse it with lxml.
    req = urllib2.Request(url, headers={'User-Agent': "chrome"})
    con = urllib2.urlopen(req)
    page = con.read()
    soup = BeautifulSoup(page, "lxml")
    con.close()
    return soup

base_url = 'http://monitor.icef.com/'
articles = []
num_pages = 10
for i in range(num_pages):
    url = 'http://monitor.icef.com/page/{}'.format(i)
    soup = getsoupfromurl(url)
    print("..{}".format(i))
    # Note: [0] selects only the first "article-header" div on the page.
    for link in soup.find_all("div", class_="article-header")[0].find_all("a"):
        soup = getsoupfromurl(link.get("href"))
        paras = soup.find_all("div", class_="article-content")[0].find_all("p")
        content = " ".join([para.get_text() for para in paras])
        articles.append(content)
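One likely cause of the "first article only" behaviour is the `[0]` index on the result of `find_all("div", class_="article-header")`, which discards every header after the first. The sketch below illustrates the difference on a small inline HTML string rather than the live site. The sample markup, the `example.com` URLs, and the helper name `collect_article_links` are assumptions made for illustration; the real pages may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one archive page; the real markup may differ.
SAMPLE_PAGE = """
<div class="article-header"><h2><a href="http://example.com/a1">First</a></h2></div>
<div class="article-header"><h2><a href="http://example.com/a2">Second</a></h2></div>
<div class="article-header"><h2><a href="http://example.com/a3">Third</a></h2></div>
"""

def collect_article_links(soup):
    """Return the href of every link inside every article-header div."""
    links = []
    # Loop over all header divs instead of indexing [0],
    # which would keep only the first one.
    for header in soup.find_all("div", class_="article-header"):
        for a in header.find_all("a", href=True):
            links.append(a["href"])
    return links

soup = BeautifulSoup(SAMPLE_PAGE, "html.parser")

# What the question's expression collects: links from the first header only.
first_only = [a.get("href")
              for a in soup.find_all("div", class_="article-header")[0].find_all("a")]

# What iterating over every header collects.
all_links = collect_article_links(soup)

print(first_only)
print(all_links)
```

In the original loop, the same idea would mean iterating over `collect_article_links(soup)` in place of the `[0]` expression. It may also help to store each article's parsed page in a separate variable (for example `article_soup`, a hypothetical name) so the archive page's `soup` is not overwritten inside the loop.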
