31 Aug 2012

Follow-Up: Making a Word Cloud for a Search Result from GScholarScraper_3.1


Here's a short follow-up on how to produce a word cloud for a search result from GScholarScraper_3.1:

# File-Name: GScholarScraper_3.1.R
# Date: 2012-08-22
# Author: Kay Cichini
# Email: kay.cichini@gmail.com
# Purpose: Scrape Google Scholar search result
# Packages used: XML
# Licence: CC BY-SA-NC
#
# Arguments:
# (1) input:
# A search string as used in Google Scholar search dialog
#
# (2) write:
# Logical, should a table be written to the user's default directory?
# if TRUE ("T") a CSV-file with hyperlinks to the publications will be created.
#
# Difference to version 3:
# (3) added "since" argument - define year since when publications should be returned..
# defaults to 1900..
#
# (4) added "citation" argument - logical, if "0" citations are included
# defaults to "1" and no citations will be included..
# added field "YEAR" to output 
#
# Caveat: if a submitted search string gives more than 1000 hits there seem
# to be some problems (I guess I'm being blocked by Google for scraping the site like a robot)
#
# And, there is an issue with this error message:
# > Error in htmlParse(URL): 
# > error in creating parser for http://scholar.google.com/scholar?q
# I haven't figured this one out yet; most likely also a Google blocking mechanism.
# Reconnecting / getting a new IP address helps.


GScholar_Scraper <- function(input, since = 1900, write = F, citation = 1) {

    require(XML)

    # putting together the search-URL:
    URL <- paste("http://scholar.google.com/scholar?q=", input, "&as_sdt=1,5&as_vis=", 
                 citation, "&as_ylo=", since, sep = "")
    cat("\nThe URL used is: ", "\n----\n", paste("* ", "http://scholar.google.com/scholar?q=", input, "&as_sdt=1,5&as_vis=", 
                 citation, "&as_ylo=", since, " *", sep = ""))
    
    # get content and parse it:
    doc <- htmlParse(URL)
    
    # number of hits:
    h1 <- xpathSApply(doc, "//div[@id='gs_ab_md']", xmlValue)
    h2 <- strsplit(h1, " ")[[1]][2] 
    num <- as.integer(gsub("[[:punct:]]", "", h2))
    cat("\n\nNumber of hits: ", num, "\n----\n", "If this number is far from the returned results\nsomething might have gone wrong..\n\n", sep = "")
    
    # If there are no results, stop and throw an error message:
    if (is.na(num) || num == 0) {
        stop("\n\n...There is no result for the submitted search string!")
    }
    
    pages.max <- ceiling(num/100)
    
    # 'start' as used in URL:
    start <- 100 * 1:pages.max - 100
    
    # Collect URLs as list:
    URLs <- paste("http://scholar.google.com/scholar?start=", start, "&q=", input, 
                  "&num=100&as_sdt=1,5&as_vis=", citation, "&as_ylo=", since, sep = "")
    
    scraper_internal <- function(x) {
        
        doc <- htmlParse(x, encoding="UTF-8")
        
        # titles:
        tit <- xpathSApply(doc, "//h3[@class='gs_rt']", xmlValue)
        
        # publication:
        pub <- xpathSApply(doc, "//div[@class='gs_a']", xmlValue)
        
        # links:
        lin <- xpathSApply(doc, "//h3[@class='gs_rt']/a", xmlGetAttr, "href")
        
        # summaries are truncated and thus won't be used..  
        # abst <- xpathSApply(doc, "//div[@class='gs_rs']", xmlValue)
        # ..to be extended for individual needs
        options(warn=(-1))
        dat <- data.frame(TITLES = tit, PUBLICATION = pub, 
                          YEAR = as.integer(gsub(".*\\s(\\d{4})\\s.*", "\\1", pub)),
                          LINKS = lin)
        options(warn=0)
        return(dat)
    }

    result <- do.call("rbind", lapply(URLs, scraper_internal))
    if (write == T) {
      result$LINKS <- paste("=Hyperlink(","\"", result$LINKS, "\"", ")", sep = "")
      write.table(result, "GScholar_Output.CSV", sep = ";", 
                  row.names = F, quote = F)
      # note: shell.exec() opens the file with its associated program (Windows only)
      shell.exec("GScholar_Output.CSV") 
      } else {
      return(result)
    }
}
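
# Not part of the original script: a minimal sketch of how one might address
# the caveats above (Google blocking after too many requests, the occasional
# htmlParse() error) by pausing between page requests and catching parser
# failures. The names 'fetch_page' and 'pause' are hypothetical; the sketch
# assumes the XML package is loaded, as inside the function above.
fetch_page <- function(url, pause = 5) {
    Sys.sleep(pause)  # wait a few seconds between requests
    tryCatch(htmlParse(url, encoding = "UTF-8"),
             error = function(e) {
                 message("htmlParse() failed for: ", url)
                 NULL  # skip this page instead of aborting the whole run
             })
}
# e.g., inside GScholar_Scraper the pages could be parsed via fetch_page and
# the failed ones dropped:
# docs <- Filter(Negate(is.null), lapply(URLs, fetch_page))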

# EXAMPLE:

input <- "allintitle:amphibian+diversity"
df <- GScholar_Scraper(input, since = 1980, citation = 1)

#install.packages("tm")
library(tm)

#install.packages("wordcloud")
library(wordcloud)

corpus <- Corpus(VectorSource(df$TITLES))
corpus <- tm_map(corpus, function(x)removeWords(x, c(stopwords(), "PDF", "B", "DOC", "HTML", "BOOK", "CITATION")))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

# remove numbers from strings:
d <- d[!grepl("[0-9]", d$word), ]  # grepl avoids dropping all rows when nothing matches

# print wordcloud:
wordcloud(d$word, d$freq)
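
# Optionally, wordcloud() accepts a few more arguments; a small variation
# (not part of the original script) using a minimum word frequency and an
# RColorBrewer palette - the palette choice is just an example:
library(RColorBrewer)
wordcloud(d$word, d$freq, min.freq = 2, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))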



25 Aug 2012

Toy Example with GScholarScraper_3.1

A commentator on my blog brought up this nice idea of how to use the GScholarScraper function for bibliometrics.
I altered the code a little so that you can set a year since when results should be returned, and I added a field to the output holding the year of publication. With this you can simply do something like this:

input <- "intitle:metapopulation"
df <- GScholar_Scraper(input, since = 1980, citation = 1)
nrow(df)
hist(df$YEAR, xlab = "Year", 
     main = "Frequency of Publications with\n\"METAPOPULATION\" in Title")

14 Nov 2011

In Reply to Ben Bolker's Post "Google Scholar (still) sucks"

Replying to Ben Bolker's post "Google Scholar (still) sucks":

Ben,

thanks for illustrating the issue in your post!

The main purpose of my function GScholarScraper is to retrieve titles - simply because this is the best we can get from Google Scholar. Abstracts are truncated and thus shouldn't be used for meta-analysis. Titles are sometimes truncated too, as you said, and there is no way around that, though this happens less often and is less severe than with abstracts.

The CSV output is optional; the data frame with word frequencies and the word cloud are always returned. For any other output one can easily add a few appropriate lines to the script.
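
For instance, a tiny sketch (the file name is made up; d is the word-frequency data frame as in the word cloud example above):

write.csv2(d, "word_frequencies.csv", row.names = FALSE)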

My opinion:
My function is good for a quick summary and illustration of a query-result.

Tony's function is evidently better if you want to pull all fields of a given query (authors, titles, abstracts, ...).

I wonder if people have come across ROpenSci? I guess that might be very interesting in this context!

Last remark: Of course, a Google Scholar API would resolve all our problems in this regard..

Best,
Kay

9 Nov 2011

R-Function GScholarScraper to Webscrape Google Scholar Search Result

NOTE: You'll find the update HERE and HERE.

NOTE: The script is currently not working because the code of the Google Scholar site has changed...
I'll see to this as soon as I find some spare time for it!

NOTE: If you try to access Google Scholar programmatically, consider these words of caution:
http://stackoverflow.com/questions/7523961/google-scholar-with-matlab/7587994#7587994
...

Based on my previous post on Web Scraping, I coded and uploaded the function "GScholarScraper" HERE for testing!
The function will pull all (!) results, processing pages in chunks of 100 results/titles, and return a file with all titles, links, etc. It will also produce a word cloud using the words in the publication titles.
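
In case you wonder how the chunking works: Google Scholar accepts a "start" offset, so the script requests 100-result pages at offsets 0, 100, 200, and so on. A tiny sketch with a made-up number of hits:

num <- 437                               # assumed number of hits, for illustration
pages.max <- ceiling(num / 100)          # 5 pages are needed
start <- 100 * seq_len(pages.max) - 100  # offsets 0, 100, 200, 300, 400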

Please try your own search strings and report errors, etc.!

Built and run properly under:
R version 2.13.0 (2011-04-13) and R version 2.13.2 (2011-09-30)

Platform: i386-pc-mingw32/i386 (32-bit) locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] stringr_0.5 tm_0.5-6 wordcloud_1.2 Rcpp_0.9.7

loaded via a namespace (and not attached):
[1] plyr_1.5.1 slam_0.1-23

PS: Errors reported lately (see comments) were resolved and the source code was updated.