Java - How to select two (or more) HTML elements that exist at the same tree level with Jsoup?
I'm working on a project and have run into a problem. I need to scrape data from a website that contains the following HTML code:
<div class="lin-curso" style="border: 0;">
  <div class="lin-area-c3"> vagas 2017 </div>
</div>
<div class="box10">
  <div class="lin-area-c1"> l160 </div>
  <div class="lin-area-c2"> acupuntura </div>
  <div class="lin-area-c3"> [lic-1º cic] </div>
</div>
<div class="lin-curso">
  <div class="lin-curso-c1"> </div>
  <div class="lin-curso-c2"> 3155 </div>
  <div class="lin-curso-c3"> <a href="detcursopi.asp?codc=l160&code=3155" title="3155/l160">instituto politécnico de setúbal - escola superior de saúde</a> </div>
  <div class="lin-curso-c4"> 20 </div>
</div>
<br>
<div class="box10">
  <div class="lin-area-c1"> 9059 </div>
  <div class="lin-area-c2"> administração e gestão de empresas </div>
  <div class="lin-area-c3"> [lic-1º cic] </div>
</div>
<div class="lin-curso">
  <div class="lin-curso-c1"> </div>
  <div class="lin-curso-c2"> 2270 </div>
  <div class="lin-curso-c3"> <a href="detcursopi.asp?codc=9059&code=2270" title="2270/9059">universidade católica portuguesa - faculdade de ciências económicas e empresariais</a> </div>
  <div class="lin-curso-c4"> n.d. </div>
</div>
<br>
<div class="box10">
  <div class="lin-area-c1"> 8056 </div>
  <div class="lin-area-c2"> administração e gestão pública </div>
  <div class="lin-area-c3"> [lic-1º cic] </div>
</div>
<div class="lin-curso">
  <div class="lin-curso-c1"> </div>
  <div class="lin-curso-c2"> 4275 </div>
  <div class="lin-curso-c3"> <a href="detcursopi.asp?codc=8056&code=4275" title="4275/8056">instituto superior de ciências da administração</a> </div>
  <div class="lin-curso-c4"> 20 </div>
</div>
<br>
<div class="box10">
  <div class="lin-area-c1"> 8194 </div>
  <div class="lin-area-c2"> administração da guarda nacional republicana </div>
  <div class="lin-area-c3"> [mest integ] </div>
</div>
<div class="lin-curso">
  <div class="lin-curso-c1"> </div>
  <div class="lin-curso-c2"> 7510 </div>
  <div class="lin-curso-c3"> <a href="detcursopi.asp?codc=8194&code=7510" title="7510/8194">academia militar</a> </div>
  <div class="lin-curso-c4"> n.d. </div>
</div>
<br>
<div class="box10">
  <div class="lin-area-c1"> 9672 </div>
  <div class="lin-area-c2"> administração e marketing </div>
  <div class="lin-area-c3"> [lic-1º cic] </div>
</div>
box10 and lin-curso should form a single element, but they don't. In some cases there is one box10 followed by one lin-curso, while in others there are several lin-curso for one box10. If box10 and lin-curso were one element there wouldn't be a problem. Is there a way I can associate the two?
Edit: the website link is http://www.dges.gov.pt/guias/indcurso.asp?letra=a
and the element is ".inside".
The solution to your problem is easy when you navigate between sibling elements. In your case the div with class box10 plays the role of a header in a table, and the sibling divs with class lin-curso play the role of the table's data rows. I suggest first selecting all divs with class box10:
Elements boxes = doc.select("div.box10");
Then you can iterate over the boxes and do two major things:
- extract the data you are interested in from this div (it contains 3 child nodes, divs with classes lin-area-c1, lin-area-c2 and lin-area-c3)
- select the sibling nodes with class lin-curso and extract the data from them.
Jsoup provides a method called Element.nextElementSibling() that returns the next sibling element of the element it is called on, or null when there is no next sibling. When you call it on a div.box10 element, the next sibling element is a div.lin-curso.
A sibling in this case means the node that follows the specified node at the same tree level.
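The sibling walk described above can be sketched without Jsoup or a network connection. This is only a model under stated assumptions: Node, asSiblings and groupRows are hypothetical names invented for illustration, Node.next stands in for what Element.nextElementSibling() returns, and the row code 2271 is made up to show a header with two rows.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SiblingWalkSketch {

    /** Hypothetical stand-in for an HTML element: a class name, a text, and the next sibling. */
    static final class Node {
        final String cssClass;
        final String text;
        Node next; // stays null for the last sibling, mirroring nextElementSibling()

        Node(String cssClass, String text) {
            this.cssClass = cssClass;
            this.text = text;
        }
    }

    /** Links the nodes in order so that each one points at the sibling after it. */
    static List<Node> asSiblings(Node... nodes) {
        for (int i = 0; i + 1 < nodes.length; i++) {
            nodes[i].next = nodes[i + 1];
        }
        return List.of(nodes);
    }

    /** Maps each "box10" header text to the texts of the "lin-curso" rows directly after it. */
    static Map<String, List<String>> groupRows(List<Node> siblings) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Node node : siblings) {
            if (!node.cssClass.equals("box10")) {
                continue; // only headers start a group
            }
            List<String> rows = new ArrayList<>();
            Node row = node.next;
            // Stop at the end of the sibling list or at the first non-row element (e.g. <br>)
            while (row != null && row.cssClass.equals("lin-curso")) {
                rows.add(row.text);
                row = row.next;
            }
            result.put(node.text, rows);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Node> siblings = asSiblings(
                new Node("box10", "L160"),
                new Node("lin-curso", "3155"),
                new Node("br", ""),
                new Node("box10", "9059"),
                new Node("lin-curso", "2270"),
                new Node("lin-curso", "2271"), // invented second row
                new Node("br", ""),
                new Node("box10", "9672"));    // header with nothing after it
        System.out.println(groupRows(siblings));
    }
}
```

The null check in the while loop matters for the last header: it has no following sibling, so it ends up with an empty list of rows instead of causing a NullPointerException.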
Example solution
Below you can find example code that parses the given website and prints the table to the console output:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

final class TestMain {
    public static void main(String[] args) throws IOException {
        // Download and parse the page
        Document doc = Jsoup.connect("http://www.dges.gov.pt/guias/indcurso.asp?letra=a").get();
        // Every div.box10 acts as a header for the rows that follow it
        Elements boxes = doc.select("div.box10");
        for (Element box : boxes) {
            String linAreaC1 = box.select(".lin-area-c1").text();
            String linAreaC2 = box.select(".lin-area-c2").text();
            String linAreaC3 = box.select(".lin-area-c3").text();
            System.out.printf("%s: %s %s%n", linAreaC1, linAreaC2, linAreaC3);
            // Walk the following siblings for as long as they are div.lin-curso rows;
            // nextElementSibling() returns null after the last sibling
            Element linCurso = box.nextElementSibling();
            while (linCurso != null && linCurso.hasClass("lin-curso")) {
                String linCursoC2 = linCurso.select(".lin-curso-c2").text();
                String linCursoC3 = linCurso.select(".lin-curso-c3").text();
                String linCursoC4 = linCurso.select(".lin-curso-c4").text();
                System.out.printf("%s\t%s\t%s%n", linCursoC2, linCursoC3, linCursoC4);
                linCurso = linCurso.nextElementSibling();
            }
            System.out.println("==============================");
        }
    }
}

I hope it helps.