java - Using jsoup to scrape data from a webpage in between specific tags
I am currently developing a program that allows me to collect the 5 most recent fanfiction stories added to an AO3 (Archive of Our Own) fandom. These stories are added to an ArrayList that I have set up to hold the fanfiction submissions of the past week. At the end of every week I plan on having the ArrayList's contents dumped to a text file, to allow me to paste them into a Reddit post on a subreddit. Now, to prevent duplicates, I wanted to compare the newly parsed stories with the stories held in the ArrayList.
(Additional info: the bot will check the webpage every 30 minutes.)
The part I'm getting caught on is the actual parsing of the webpage and getting the content between the HTML tags.
I looked at CSS selectors, but I'm still left thoroughly confused, as every example seems to use a website that is easy to scrape from, such as IMDb.
From basic research, it looks like, within the main body I'm looking at, the stories are inside an ordered list tag.
    <ol class="work index group">
        <li class="work blurb group" id="work_10504812" role="article">...</li>
        <li class="work blurb group" id="work_9656693" role="article">...</li>
        <li class="work blurb group" id="work_11814486" role="article">...</li>
        //goes on for ~20 more stories
        <li class="work blurb group" id="work_11687247" role="article">...</li>
    </ol>

So, for clarity's sake, each list item is a single story located within the ordered list. Within one list item is the following (the ordered list tag is added for context):
    <ol class="work index group">
      <li class="work blurb group" id="work_10504812" role="article">
        <!--title, author, fandom-->
        <div class="header module">
          <h4 class="heading">
            <a href="/works/10504812">pocket healer</a>
            <!-- not cache -->
            <a rel="author" href="/users/overnoot/pseuds/overnoot">overnoot</a>
          </h4>
          <h5 class="fandoms heading">
            <span class="landmark">fandoms:</span>
            <a class="tag" href="/tags/overwatch%20(video%20game)/works">overwatch (video game)</a>
          </h5>
          <!--required tags-->
          <ul class="required-tags">
            <li> <a class="help symbol question modal modal-attached" title="symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="rating-general-audience rating" title="general audiences"><span class="text">general audiences</span></span></a></li>
            <li> <a class="help symbol question modal modal-attached" title="symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="warning-no warnings" title="no archive warnings apply"><span class="text">no archive warnings apply</span></span></a></li>
            <li> <a class="help symbol question modal modal-attached" title="symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="category-femslash category" title="f/f"><span class="text">f/f</span></span></a></li>
            <li> <a class="help symbol question modal modal-attached" title="symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="complete-no iswip" title="work in progress"><span class="text">work in progress</span></span></a></li>
          </ul>
          <p class="datetime">17 aug 2017</p>
        </div>
        <!--warnings again, cast, freeform tags-->
        <h6 class="landmark heading">tags</h6>
        <ul class="tags commas">
          <li class="warnings"><strong><a class="tag" href="/tags/no%20archive%20warnings%20apply/works">no archive warnings apply</a></strong></li>
          <li class="relationships"><a class="tag" href="/tags/fareeha%20%22pharah%22%20amari*s*angela%20%22mercy%22%20ziegler/works">fareeha "pharah" amari/angela "mercy" ziegler</a></li>
          <li class="characters"><a class="tag" href="/tags/fareeha%20%22pharah%22%20amari/works">fareeha "pharah" amari</a></li>
          <li class="characters"><a class="tag" href="/tags/angela%20%22mercy%22%20ziegler/works">angela "mercy" ziegler</a></li>
          <li class="characters"><a class="tag" href="/tags/winston%20(overwatch)/works">winston (overwatch)</a></li>
          <li class="characters"><a class="tag" href="/tags/lena%20%22tracer%22%20oxton/works">lena "tracer" oxton</a></li>
          <li class="freeforms"><a class="tag" href="/tags/tiny%20pharah%20and%20tiny%20mercy/works">tiny pharah , tiny mercy</a></li>
          <li class="freeforms"><a class="tag" href="/tags/fluff/works">fluff</a></li>
          <li class="freeforms last"><a class="tag" href="/tags/cute/works">cute</a></li>
        </ul>
        <!--summary-->
        <h6 class="landmark heading">summary</h6>
        <blockquote class="userstuff summary">
          <p>angela , fareeha wake find tiny alternate versions of have appeared , imprinted on them. how these tiny pharahs , mercies impact work @ overwatch , more importantly how impact feelings have each other.</p>
        </blockquote>
        <!--stats-->
        <dl class="stats">
          <dt class="language">language:</dt> <dd class="language">english</dd>
          <dt class="words">words:</dt> <dd class="words">35,143</dd>
          <dt class="chapters">chapters:</dt> <dd class="chapters">10/11</dd>
          <dt class="comments">comments:</dt> <dd class="comments"><a href="/works/10504812?show_comments=true&view_full_work=true#comments">168</a></dd>
          <dt class="kudos">kudos:</dt> <dd class="kudos"><a href="/works/10504812?view_full_work=true#comments">438</a></dd>
          <dt class="bookmarks">bookmarks:</dt> <dd class="bookmarks"><a href="/works/10504812/bookmarks">35</a></dd>
          <dt class="hits">hits:</dt> <dd class="hits">5890</dd>
        </dl>
      </li>

I wanted to extract the title, author, URL, summary, and rating.
So far I've gathered the locations of the items I want to extract, but I have no actual idea how to do so.

Title:

    <a href="/works/10504812">pocket healer</a>

Author:

    <a rel="author" href="/users/overnoot/pseuds/overnoot">overnoot</a>

URL:

    <li class="work blurb group" id="work_10504812" role="article">
    <!--(http://archiveofourown.com/works/<the number after 'work_'>)-->

Summary:

    <blockquote class="userstuff summary"> <p> (summary goes here) </p> </blockquote>

Rating:

    <li> <a class="help symbol question modal modal-attached" title="symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="rating-general-audience rating" title="general audiences"><span class="text">general audiences</span></span></a></li>

Additional question: is it possible to iterate through the contents of the ordered list in a for loop?
The current code I have set up for opening the webpage is below.

    while (true) {
        try {
            String url = "http://archiveofourown.org/tags/fareeha%20%22pharah%22%20amari*s*angela%20%22mercy%22%20ziegler/works";
            Document doc = Jsoup.connect(url).get(); // returns the parsed webpage
            doc.select("<narrow down ordered list>");
            // run a loop to go through the first 5 items
            Thread.sleep(THIRTY_MINUTES);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
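One thing worth noting about the loop above: if the fetch throws, control jumps to the catch block, the sleep is skipped, and the loop retries immediately. A minimal sketch of an alternative using the standard `ScheduledExecutorService` follows. The names `PollingSketch`, `guarded`, `startPolling` and `checkForNewStories` are made up for illustration, and the task body is only a placeholder for the real fetch-and-parse step.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class PollingSketch {

    // Wraps a task so that a runtime exception is logged instead of propagating.
    // A scheduled task that throws would otherwise suppress all later runs.
    static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (RuntimeException ex) {
                ex.printStackTrace();
            }
        };
    }

    // Runs the task right away, then again 30 minutes after each run finishes.
    static ScheduledExecutorService startPolling(Runnable task) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(guarded(task), 0, 30, TimeUnit.MINUTES);
        return scheduler;
    }

    public static void main(String[] args) {
        // Hypothetical placeholder: the real task would fetch and parse the page,
        // catching the checked IOException from the fetch inside the lambda.
        Runnable checkForNewStories = () -> System.out.println("checking the works page...");
        startPolling(checkForNewStories); // the scheduler thread keeps the JVM alive
    }
}
```

With this shape a failed check still waits the full 30 minutes before the next attempt, so a temporary outage does not turn into a tight retry loop.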
You can use the Document.select(String cssSelector) method, which returns Elements that you can iterate over. For example, ol.work > li will return the li elements that are first-level children of the ol.work element. You can use it to iterate over the stories.
Consider the following part of the code:
    Elements ol = doc.select("ol.work > li");
    for (Element li : ol) {
        String title = li.select("h4.heading a").first().text();
        String author = li.select("h4.heading a[rel=author]").text();
        String id = li.attr("id").replaceAll("work_", "");
        String url = "http://archiveofourown.com/works/" + id;
        String summary = li.select("blockquote.summary").text();
        String rating = li.select("span.rating").text();
        System.out.println("title: " + title);
        System.out.println("author: " + author);
        System.out.println("id: " + id);
        System.out.println("url: " + url);
        System.out.println("summary: " + summary);
        System.out.println("rating: " + rating);
    }

In this example we go through the li elements in a for-loop and extract the expected content. As you can see, I use the select method for every data extraction, limited to the current li element. The Element.text() method returns the body of the element as plain text, removing the tags if present.
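The question also asks for only the first five stories. A minimal sketch of that follows, assuming jsoup 1.11.1 or later (for `Element.selectFirst`) and a small made-up HTML fragment in place of the live page. `FirstFiveSketch` and `firstStories` are illustrative names, not part of jsoup.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class FirstFiveSketch {

    // Returns "title by author" for at most `limit` stories in the listing.
    static List<String> firstStories(Document doc, int limit) {
        Elements items = doc.select("ol.work > li");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, items.size()); i++) {
            Element li = items.get(i);
            // selectFirst returns null when nothing matches, so guard before text().
            Element titleLink = li.selectFirst("h4.heading a");
            if (titleLink == null) {
                continue;
            }
            String author = li.select("h4.heading a[rel=author]").text();
            result.add(titleLink.text() + " by " + author);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up listing with seven entries, standing in for the fetched page.
        StringBuilder html = new StringBuilder("<ol class=\"work index group\">");
        for (int i = 1; i <= 7; i++) {
            html.append("<li class=\"work blurb group\" id=\"work_").append(i).append("\">")
                .append("<h4 class=\"heading\"><a href=\"/works/").append(i).append("\">story ")
                .append(i).append("</a> <a rel=\"author\" href=\"/users/u").append(i)
                .append("\">author ").append(i).append("</a></h4></li>");
        }
        html.append("</ol>");
        Document doc = Jsoup.parse(html.toString());
        for (String line : firstStories(doc, 5)) {
            System.out.println(line);
        }
    }
}
```

The null check matters because `first()` and `selectFirst` both return null when the selector matches nothing, which would otherwise throw on a blurb with unexpected markup.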
Running the code above on the HTML you put in the question produces the following output:

    title: pocket healer
    author: overnoot
    id: 10504812
    url: http://archiveofourown.com/works/10504812
    summary: angela , fareeha wake find tiny alternate versions of have appeared , imprinted on them. how these tiny pharahs , mercies impact work @ overwatch , more importantly how impact feelings have each other.
    rating: general audiences

I hope it helps.
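For the duplicate check mentioned in the question, one option is to key each story on its work ID, since the same work keeps the same ID across checks. A minimal sketch using only the standard library follows. `WeeklyLogSketch` and `addIfNew` are hypothetical names, and each story is reduced to a plain description string for brevity.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class WeeklyLogSketch {
    // Work IDs recorded so far this week, used for a fast duplicate check.
    private final Set<String> seenIds = new HashSet<>();
    // Stories in the order they were first seen.
    private final List<String> stories = new ArrayList<>();

    // Adds the story only if its work ID has not been recorded yet.
    // Returns true when the story was new.
    boolean addIfNew(String workId, String description) {
        if (!seenIds.add(workId)) {
            return false; // Set.add returns false when the ID is already present
        }
        stories.add(description);
        return true;
    }

    List<String> stories() {
        return stories;
    }

    public static void main(String[] args) {
        WeeklyLogSketch log = new WeeklyLogSketch();
        log.addIfNew("10504812", "pocket healer by overnoot");
        log.addIfNew("10504812", "pocket healer by overnoot"); // ignored as a duplicate
        System.out.println(log.stories().size());
    }
}
```

Comparing IDs rather than titles avoids treating two different works with the same title as duplicates.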