web scraping - How to scrape multiple tables that are without IDs or Class using R
I'm trying to scrape a webpage using R: http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all (all pages).
I'm new to programming, and everywhere I've looked, tables are identified by IDs, divs, or classes. On this page there are none, but the data is stored in table format. How should I scrape it?
This is what I did:
    library(rvest)

    webpage <- read_html("http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all")

    # All <table> nodes on the page
    tbls <- html_nodes(webpage, "table")
    head(tbls)

    # Keep the 9th and 10th tables and convert them to data frames
    tbls_ls <- webpage %>%
      html_nodes("table") %>%
      .[9:10] %>%
      html_table(fill = TRUE)

    colnames(tbls_ls[[1]]) <- c("mobile make", "state", "district", "police station", "status", "mobile type(gsm/cdma)", "fir/dd/gd dat")
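When tables seem to have no identifiers, it can help to list every table's attributes and size before choosing one by position. The following is only a minimal sketch on a made-up inline HTML snippet (not the real zipnet.in page); the `overview` data frame and the sample markup are assumptions for illustration.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page: one layout table
# without an id and one data table that does carry an id.
html <- '<html><body>
  <table><tr><td>layout only</td></tr></table>
  <table id="autonumber15">
    <tr><td>status</td><td>stolen/theft</td></tr>
  </table>
</body></html>'

page <- read_html(html)
tbls <- html_nodes(page, "table")

# One row per table: its position, its id attribute (NA if absent),
# and how many <tr> rows it contains.
overview <- data.frame(
  index = seq_along(tbls),
  id    = html_attr(tbls, "id"),
  rows  = vapply(tbls, function(tbl) length(html_nodes(tbl, "tr")), integer(1)),
  stringsAsFactors = FALSE
)
print(overview)
```

If an `id` column like this turns out to be non-empty on the real page, selecting by that id is usually less fragile than a positional index such as `.[9:10]`.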
You can scrape the table data by targeting the CSS ID of each table. It looks like each page is composed of 3 different tables pasted one after another. Two of the tables have the #autonumber15 CSS ID, while the third (in the middle) has the #autonumber16 CSS ID.
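To illustrate the selector idea in isolation, here is a small sketch that runs on an inline HTML string rather than the live site. The markup and cell values are invented for the example; only the two ID selectors come from the description above.

```r
library(rvest)

# Hypothetical page: two tables share one id, a third has another id,
# and a fourth table without any id should be ignored.
html <- '<html><body>
  <table id="autonumber15"><tr><td>a</td><td>1</td></tr></table>
  <table id="autonumber16"><tr><td>b</td><td>2</td></tr></table>
  <table id="autonumber15"><tr><td>c</td><td>3</td></tr></table>
  <table><tr><td>layout only</td></tr></table>
</body></html>'

page <- read_html(html)

# A comma-separated CSS selector matches nodes with either id.
picked <- html_nodes(page, "#autonumber15, #autonumber16")

# html_table() returns one data frame per matched table;
# rbind() stacks them because they have the same columns.
tables   <- html_table(picked)
combined <- do.call(rbind, tables)
print(combined)
```

Note that the ID selector is case-sensitive in most setups, so the spelling has to match the page source exactly.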
I put together a simple code example that should get you started in the right direction.
    suppressMessages(library(tidyverse))
    suppressMessages(library(rvest))

    # Define a function to scrape the table data from a page
    get_page <- function(page_id = 1) {
      # Default link
      link <- "http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all&page_no="
      # Build the link
      link <- paste0(link, page_id)
      # Get the tables data
      wp <- read_html(link)
      wp %>%
        html_nodes("#autonumber16, #autonumber15") %>%
        html_table(fill = TRUE) %>%
        bind_rows()
    }

    # Get the data for the first 3 pages
    iter_page <- 1:3
    # Progress bar
    pb <- progress_estimated(length(iter_page))

    # This code iterates over pages 1 through 3 and applies the get_page()
    # function defined earlier. The Sys.sleep() part pauses the code
    # after each iteration so that the server is not overloaded with requests.
    map_df(iter_page, ~ {
      pb$tick()$print()
      df <- get_page(.x)
      Sys.sleep(sample(10, 1) * 0.1)
      as_tibble(df)
    })
    #> # A tibble: 72 x 4
    #>    X1                    X2           X3
    #>    <chr>                 <chr>        <chr>
    #>  1 fir/dd/gd number      000165       state
    #>  2 fir/dd/gd date        17/08/2017   district
    #>  3 mobile type(gsm/cdma) gsm          police station
    #>  4 mobile make           samsung j2   mobile number
    #>  5 missing/stolen date   23/04/2017   imei number
    #>  6 complainant           akeel khan   complainant contact number
    #>  7 status                stolen/theft report date/time on zipnet
    #>  8 <NA>                  <NA>         <NA>
    #>  9 fir/dd/gd number      fir no 37/   state
    #> 10 fir/dd/gd date        17/08/2017   district
    #> # ... with 62 more rows, and 1 more variables: X4 <chr>
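The result above is in a label/value layout (label in X1, value in X2, label in X3, value in X4), with an all-NA row between records. If one row per record is wanted, something along these lines might work. This is a rough base R sketch on a tiny made-up data frame: the `raw` object and the "delhi", "north", "south", "000200" and "missing" values are hypothetical placeholders, not scraped data.

```r
# Hypothetical miniature of the scraped layout: two records
# separated by a row that is NA in every column.
raw <- data.frame(
  X1 = c("fir/dd/gd number", "status", NA, "fir/dd/gd number", "status"),
  X2 = c("000165", "stolen/theft", NA, "000200", "missing"),
  X3 = c("state", "district", NA, "state", "district"),
  X4 = c("delhi", "north", NA, "delhi", "south"),
  stringsAsFactors = FALSE
)

# Separator rows are NA everywhere; each one starts a new record.
is_sep <- rowSums(!is.na(raw)) == 0
record <- cumsum(is_sep) + 1
keep   <- !is_sep

# Stack the two label/value column pairs into one long table.
long <- data.frame(
  record = rep(record[keep], 2),
  field  = c(raw$X1[keep], raw$X3[keep]),
  value  = c(raw$X2[keep], raw$X4[keep]),
  stringsAsFactors = FALSE
)
long <- long[!is.na(long$field), ]

# Spread the labels into columns: one row per record.
wide <- reshape(long, idvar = "record", timevar = "field", direction = "wide")
print(wide)
```

The sketch assumes every label appears at most once per record; repeated labels would need extra handling before the reshape step.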