web scraping - How to scrape multiple tables that have no IDs or classes using R


I'm trying to scrape a webpage using R: http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all (all pages).

I'm new to programming, and everywhere I've looked, tables are identified by IDs, divs, or classes. On this page there are none. The data is stored in table format. How should I scrape it?

This is what I did:

    library(rvest)

    webpage <- read_html("http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all")

    # Every <table> node on the page, in document order
    tbls <- html_nodes(webpage, "table")
    head(tbls)

    # Keep the 9th and 10th tables and parse them into data frames
    tbls_ls <- webpage %>%
      html_nodes("table") %>%
      .[9:10] %>%
      html_table(fill = TRUE)

    colnames(tbls_ls[[1]]) <- c("mobile make", "state", "district",
                                "police station", "status", "mobile type(gsm/cdma)",
                                "fir/dd/gd dat")

You can scrape the table data by targeting the CSS id of each table. It looks like each page is composed of three different tables pasted one after another. Two of the tables have the #autonumber15 CSS id, while the third (in the middle) has the #autonumber16 CSS id.

I put together a simple code example that should get you started in the right direction.

    suppressMessages(library(tidyverse))
    suppressMessages(library(rvest))

    # Define a function to scrape the table data from a page
    get_page <- function(page_id = 1) {
      # Default link
      link <- "http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all&page_no="
      # Build the link
      link <- paste0(link, page_id)

      # Get the tables' data
      wp <- read_html(link)
      wp %>%
        html_nodes("#autonumber16, #autonumber15") %>%
        html_table(fill = TRUE) %>%
        bind_rows()
    }

    # Get the data for the first 3 pages
    iter_page <- 1:3
    # Progress bar
    pb <- progress_estimated(length(iter_page))

    # This code iterates over pages 1 through 3 and applies the get_page()
    # function defined earlier. The Sys.sleep() part pauses the code
    # after each iteration so that the server is not overloaded with requests.
    map_df(iter_page, ~ {
      pb$tick()$print()
      df <- get_page(.x)
      Sys.sleep(sample(10, 1) * 0.1)
      as_tibble(df)
    })
    #> # tibble: 72 x 4
    #>                       X1           X2                         X3
    #>                    <chr>        <chr>                      <chr>
    #>  1      fir/dd/gd number       000165                      state
    #>  2        fir/dd/gd date   17/08/2017                   district
    #>  3 mobile type(gsm/cdma)          gsm             police station
    #>  4           mobile make   samsung j2              mobile number
    #>  5   missing/stolen date   23/04/2017                imei number
    #>  6           complainant   akeel khan complainant contact number
    #>  7                status stolen/theft report date/time on zipnet
    #>  8                  <NA>         <NA>                       <NA>
    #>  9      fir/dd/gd number   fir no 37/                      state
    #> 10        fir/dd/gd date   17/08/2017                   district
    #> # ... with 62 more rows, and 1 more variables: X4 <chr>
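
If a table really has no id or class, selecting by CSS id is not possible, but a table can still be picked by its position in the document or with an XPath predicate. Here is a minimal sketch of both approaches next to id-based selection. The inline HTML, the `t1` id, and the variable names are made up for illustration and are not taken from zipnet.in.

```r
library(rvest)

# Hypothetical stand-in page: one table with an id, one without.
page <- read_html('
<html><body>
  <table id="t1"><tr><td>a</td><td>1</td></tr></table>
  <table><tr><td>b</td><td>2</td></tr></table>
</body></html>')

# All <table> nodes, in document order.
tbls <- html_nodes(page, "table")

# 1) Select by id with a CSS selector (only works if the table has an id).
by_id <- html_table(html_nodes(page, "#t1"))[[1]]

# 2) Select by position: the second table on the page.
by_pos <- html_table(tbls[[2]])

# 3) Select with XPath: every table that has no id attribute.
no_id <- html_nodes(page, xpath = "//table[not(@id)]")
by_xpath <- html_table(no_id)[[1]]
```

Positional selection (as in the `.[9:10]` step in the question) is the most fragile of the three, since it breaks whenever the page adds or removes a table, so it may be worth checking `length(tbls)` before indexing.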
