Python Selenium Scraping Crashes, Can I Find Elements For Part of The Web Page?
I'm trying to scrape data from a website. The website has a 'load more products' button. I'm using:
driver.find_element_by_xpath("""//*[@id="showmoreresult"]""").click()
to hit the button, and this loops for a set number of iterations.
The problem I'm running into is that once the iterations have completed, I want to extract text from the webpage using:
posts = driver.find_elements_by_class_name("hotproductdetails")
However, this seems to crash Chrome and I can get no data out. What I'd like to do is populate posts with only the new products that have loaded after each iteration.
After 'load more' has been clicked, I want to grab the text from the 50 products that have just loaded, append it to a list and continue.
I can run the line posts = driver.find_elements_by_class_name("hotproductdetails")
within each iteration, but that grabs every element on the page every time, which slows down the process.
Is there any way of achieving this in Selenium, or am I limited by using this library?
This is the full script:
import csv
import time
from selenium import webdriver
import pandas as pd

def cexscrape():
    print('loading chrome...')
    chromepath = r"c:\users\leonk\documents\python scripts\chromedriver.exe"
    driver = webdriver.Chrome(chromepath)
    driver.get(url)  # url is defined elsewhere in the original script
    print('prepping webpage...')
    time.sleep(2)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # keep clicking 'load more' until it fails twice in a row, up to 1000 clicks
    y = 0
    breakclause = exceptcheck = False
    while y < 1000 and breakclause == False:
        y += 1
        time.sleep(0.5)
        try:
            driver.find_element_by_xpath("""//*[@id="showmoreresult"]""").click()
            exceptcheck = False
            print('load count', y, '...')
        except:
            if exceptcheck:
                breakclause = True
            else:
                exceptcheck = True
            print('load count', y, '...lag...')
            time.sleep(2)
            continue
    print('grabbing elements...')
    posts = driver.find_elements_by_class_name("hotproductdetails")
    cats = driver.find_elements_by_class_name("supercatlink")
    print('generating lists...')
    catlist = []
    postlist = []
    for cat in cats:
        catlist.append(cat.text)
    print('categories complete...')
    for post in posts:
        postlist.append(post.text)
    print('products complete...')
    return postlist, catlist

prods, cats = cexscrape()

print('extracting lists...')
cat = []
subcat = []
prodname = []
sellprice = []
buycash = []
buyvoucher = []

for c in cats:
    cat.append(c.split('/')[0])
    subcat.append(c.split('/')[1])

for p in prods:
    prodname.append(p.split('\n')[0])
    sellprice.append(p.split('\n')[2])
    if 'webuy' in p:
        buycash.append(p.split('\n')[4])
        buyvoucher.append(p.split('\n')[6])
    else:
        buycash.append('nan')
        buyvoucher.append('nan')

print('generating dataframe...')
df = pd.DataFrame(
    {'category': cat,
     'sub category': subcat,
     'product name': prodname,
     'sell price': sellprice,
     'cash buy price': buycash,
     'voucher buy price': buyvoucher})

print('writing csv...')
df.to_csv('data.csv', sep=',', encoding='utf-8')
print('completed!')
Use XPath and limit the products you get. If you get 50 products each time, use the below:
"(//div[@class='hotproductdetails'])[position() > {} , position() <= {}])".format ((page -1 ) * 50, page * 50)
This will give you 50 products every time; increase the page number to get the next lot. Doing it all in one go will crash it anyway.
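To tie this into the loop from the question, here is a rough sketch (untested, and assuming the page really does render exactly 50 new products per click, and that driver is the webdriver.Chrome instance from the question). On each pass it reads only the current slice of 50 products, then clicks 'load more', and stops once the button can no longer be clicked:

import time

postlist = []
page = 0
more = True
while more and page < 1000:
    page += 1
    # read only this iteration's slice of 50 products, not the whole page
    xpath = ("(//div[@class='hotproductdetails'])"
             "[position() > {} and position() <= {}]").format((page - 1) * 50, page * 50)
    for post in driver.find_elements_by_xpath(xpath):
        postlist.append(post.text)
    try:
        driver.find_element_by_xpath('//*[@id="showmoreresult"]').click()
        time.sleep(0.5)  # crude wait for the next batch to render
    except Exception:
        more = False  # button gone or not clickable: nothing left to load

Because each pass only touches 50 elements, the work per iteration stays constant instead of growing with the total size of the page.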