java - How to select two (or more) HTML elements that exist at the same tree level with Jsoup?


I'm working on a project and have run into a problem. I need to scrape data from a website that contains the following HTML code:

<div class="lin-curso" style="border: 0;">
    <div class="lin-area-c3">vagas 2017</div>
</div>
<div class="box10">
    <div class="lin-area-c1">l160</div>
    <div class="lin-area-c2">acupuntura</div>
    <div class="lin-area-c3">[lic-1º cic]</div>
</div>
<div class="lin-curso">
    <div class="lin-curso-c1">&nbsp;</div>
    <div class="lin-curso-c2">3155</div>
    <div class="lin-curso-c3">
        <a href="detcursopi.asp?codc=l160&amp;code=3155" title="3155/l160">instituto politécnico de setúbal - escola superior de saúde</a>
    </div>
    <div class="lin-curso-c4">20</div>
</div>
<br>
<div class="box10">
    <div class="lin-area-c1">9059</div>
    <div class="lin-area-c2">administração e gestão de empresas</div>
    <div class="lin-area-c3">[lic-1º cic]</div>
</div>
<div class="lin-curso">
    <div class="lin-curso-c1">&nbsp;</div>
    <div class="lin-curso-c2">2270</div>
    <div class="lin-curso-c3">
        <a href="detcursopi.asp?codc=9059&amp;code=2270" title="2270/9059">universidade católica portuguesa - faculdade de ciências económicas e empresariais</a>
    </div>
    <div class="lin-curso-c4">n.d.</div>
</div>
<br>
<div class="box10">
    <div class="lin-area-c1">8056</div>
    <div class="lin-area-c2">administração e gestão pública</div>
    <div class="lin-area-c3">[lic-1º cic]</div>
</div>
<div class="lin-curso">
    <div class="lin-curso-c1">&nbsp;</div>
    <div class="lin-curso-c2">4275</div>
    <div class="lin-curso-c3">
        <a href="detcursopi.asp?codc=8056&amp;code=4275" title="4275/8056">instituto superior de ciências da administração</a>
    </div>
    <div class="lin-curso-c4">20</div>
</div>
<br>
<div class="box10">
    <div class="lin-area-c1">8194</div>
    <div class="lin-area-c2">administração da guarda nacional republicana</div>
    <div class="lin-area-c3">[mest integ]</div>
</div>
<div class="lin-curso">
    <div class="lin-curso-c1">&nbsp;</div>
    <div class="lin-curso-c2">7510</div>
    <div class="lin-curso-c3">
        <a href="detcursopi.asp?codc=8194&amp;code=7510" title="7510/8194">academia militar</a>
    </div>
    <div class="lin-curso-c4">n.d.</div>
</div>
<br>
<div class="box10">
    <div class="lin-area-c1">9672</div>
    <div class="lin-area-c2">administração e marketing</div>
    <div class="lin-area-c3">[lic-1º cic]</div>
</div>

A box10 and its lin-curso should form one element, but they don't. In some cases there is one box10 followed by one lin-curso, but in others there are several lin-curso divs for one box10. If box10 and lin-curso were a single element there wouldn't be a problem. Is there a way I can associate the two?

Edit: the website link is http://www.dges.gov.pt/guias/indcurso.asp?letra=a and the element is ".inside".

The solution to your problem is easy once you work with sibling elements. In your case a div with class box10 plays the role of a header in a table, and the sibling divs with class lin-curso that follow it play the role of the table's data rows. I suggest first selecting all divs with class box10:

Elements boxes = doc.select("div.box10");

Then you can iterate over the boxes and do two major things:

  1. Extract the data you are interested in from the div itself (it contains 3 child nodes, divs with classes lin-area-c1, lin-area-c2 and lin-area-c3).
  2. Select the following sibling nodes with class lin-curso and extract the data from them.

Jsoup provides a method called Element.nextElementSibling() that returns the next sibling element of the element you call it on, or null if there is none. When you call it on a div.box10 element, the next sibling element is a div.lin-curso.

Sibling in this case means the node that follows the specified node at the same tree level.
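The header-and-rows relationship can be tried out without Jsoup or a network call. The following is only a sketch of the grouping logic under stated assumptions: the `Node` class, the `group` method and the sample values are hypothetical stand-ins (not part of Jsoup or of the page), with each sibling reduced to its CSS class and its text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SiblingGrouping {

    // Hypothetical stand-in for a DOM element: only its CSS class and its text.
    static final class Node {
        final String cssClass;
        final String text;

        Node(String cssClass, String text) {
            this.cssClass = cssClass;
            this.text = text;
        }
    }

    // Groups every "lin-curso" node under the "box10" node it directly follows.
    static Map<String, List<String>> group(List<Node> siblings) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (int i = 0; i < siblings.size(); i++) {
            Node header = siblings.get(i);
            if (!header.cssClass.equals("box10")) {
                continue; // not a header: skips <br> and rows that precede any header
            }
            List<String> rows = new ArrayList<>();
            // Walk forward while the following siblings are data rows.
            // The bounds check plays the role of a null check on the next sibling.
            int j = i + 1;
            while (j < siblings.size() && siblings.get(j).cssClass.equals("lin-curso")) {
                rows.add(siblings.get(j).text);
                j++;
            }
            result.put(header.text, rows);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample mirroring the page layout: header, rows, <br>, header, ...
        List<Node> siblings = Arrays.asList(
                new Node("lin-curso", "vagas 2017"), // row before any header
                new Node("box10", "l160"),
                new Node("lin-curso", "3155"),
                new Node("br", ""),
                new Node("box10", "9059"),
                new Node("lin-curso", "2270"),
                new Node("lin-curso", "0000"),       // invented second row
                new Node("br", ""),
                new Node("box10", "9672"));          // header with no rows

        group(siblings).forEach((header, rows) -> System.out.println(header + " -> " + rows));
    }
}
```

The design choice here is that a row belongs to the closest preceding header, and anything that is not a lin-curso (such as a `<br>`) ends the current group. A header at the very end of the list simply gets an empty group.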

Example solution

Below you can find example code that parses the given website and prints the table to the console output:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

final class TestMain {

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://www.dges.gov.pt/guias/indcurso.asp?letra=a").get();

        // Each div.box10 acts as a header for the rows that follow it.
        Elements boxes = doc.select("div.box10");

        for (Element box : boxes) {
            String linAreaC1 = box.select(".lin-area-c1").text();
            String linAreaC2 = box.select(".lin-area-c2").text();
            String linAreaC3 = box.select(".lin-area-c3").text();

            System.out.printf("%s: %s %s%n", linAreaC1, linAreaC2, linAreaC3);

            // Walk the following siblings while they are div.lin-curso rows.
            // nextElementSibling() returns null after the last sibling, so check for it.
            Element linCurso = box.nextElementSibling();

            while (linCurso != null && linCurso.hasClass("lin-curso")) {
                String linCursoC2 = linCurso.select(".lin-curso-c2").text();
                String linCursoC3 = linCurso.select(".lin-curso-c3").text();
                String linCursoC4 = linCurso.select(".lin-curso-c4").text();

                System.out.printf("%s\t%s\t%s%n", linCursoC2, linCursoC3, linCursoC4);

                linCurso = linCurso.nextElementSibling();
            }

            System.out.println("==============================");
        }
    }
}

I hope it helps.

