Phil Ferriere's OSS Work: R

In our previous post, we covered {mscsweblm4r}, a R package available on CRAN that wraps the Microsoft Cognitive Services Web Language Model REST API. In today's post, we'll go over {mscstexta4r}, a new R package that gives us access to another silo of Microsoft's NLP technology, its Text Analytics API.

What does {mscstexta4r} do?

Per Microsoft's website, the Microsoft Cognitive Services Text Analytics REST API is a suite of text analytics web services built with Azure Machine Learning to analyze unstructured text. By exposing R bindings for this API, {mscstexta4r} allows the following operations:

Sentiment analysis - Is a sentence or document generally positive or negative?
Topic detection - What's being discussed across a list of documents/reviews/articles?
Language detection - What language is a document written in?
Key talking points extraction - What's being discussed in a single document?

Sentiment analysis

The API returns a numeric score between 0 and 1. Scores close to 1 indicate positive sentiment and scores close to 0 indicate negative sentiment. Sentiment score is generated using classification techniques. The input features of the classifier include n-grams, features generated from part-of-speech tags, and word embeddings. English, French, Spanish and Portuguese text are supported.

Sentiment analysis is often applied to product and business reviews (Amazon, Yelp, TripAdvisor, etc.) for marketing/customer service purposes. It is also increasingly used in fintech for stock prediction using Twitter opinion mining, general stock market behavior prediction, etc.

Topic detection

This API returns the detected topics for a list of submitted text records. A topic is identified with a key phrase, which can be one or more related words. This API requires a minimum of 100 text records to be submitted, but is designed to detect topics across hundreds to thousands of records. The API is designed to work well for short, human-written text such as reviews and user feedback. English is the only language supported at this time.

Topic detection/mining can certainly help identify predominant themes across vast quantities of reviews but it can also be used to detect and index similar documents in large corpora, or cluster documents by their inferred topic.

Language detection

This API returns the detected language and a numeric score between 0 and 1. Scores close to 1 indicate 100% certainty that the identified language is correct. A total of 120 languages are supported.

Extraction of key talking points

This API returns a list of strings denoting the key talking points in the input text. English, German, Spanish, and Japanese text are supported.

How do I install this package?

First things first: before you can use the {mscstexta4r} R package, you will need to have a valid account with Microsoft Cognitive Services. Once you have an account, Microsoft will provide you with a (free) API key listed under your subscriptions. After you've configured {mscstexta4r} with your API key, as explained in the Reference Manual on CRAN, you will be able to call the Text Analytics REST API from R, up to your maximum number of transactions per month and per minute -- again, for free.

You can install the latest stable version of {mscstexta4r} from CRAN as follows:

install.packages("mscstexta4r")

After following the Reference Manual instructions regarding API key configuration, you'll be able to use the package with:

library(mscstexta4r)
textaInit()

How do I use {mscstexta4r}?

With this package, as with {mscsweblm4r}, we've tried to hide as much of the complexity associated with RESTful API HTTP calls as possible. The following example demonstrates how trivial it is to use the sentiment analysis API from R:

docsText <- c(
  "Loved the food, service and atmosphere! We'll definitely be back.",
  "Very good food, reasonable prices, excellent service.",
  "It was a great restaurant.",
  "If steak is what you want, this is the place.",
  "The atmosphere is pretty bad but the food is quite good.",
  "The food is quite good but the atmosphere is pretty bad.",
  "The food wasn't very good.",
  "I'm not sure I would come back to this restaurant.",
  "While the food was good the service was a disappointment.",
  "I was very disappointed with both the service and my entree."
)
docsLanguage <- rep("en", length(docsText))

tryCatch({

  # Perform sentiment analysis
  textaSentiment(
    documents = docsText,    # Input sentences or documents
    languages = docsLanguage
    # "en"(English, default)|"es"(Spanish)|"fr"(French)|"pt"(Portuguese)
)

}, error = function(err) {

  # Print error
  geterrmessage()

})
#> texta [https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment]
#> 
#> --------------------------------------
#>              text               score 
#> ------------------------------ -------
#>  Loved the food, service and   0.9847 
#>  atmosphere! We'll definitely         
#>            be back.                   
#> 
#>   Very good food, reasonable   0.9831 
#>   prices, excellent service.          
#> 
#>   It was a great restaurant.   0.9306 
#> 
#>   If steak is what you want,   0.8014 
#>       this is the place.              
#> 
#>  The atmosphere is pretty bad  0.4998 
#>  but the food is quite good.          
#> 
#> The food is quite good but the  0.475 
#>   atmosphere is pretty bad.           
#> 
#>   The food wasn't very good.   0.1877 
#> 
#> I'm not sure I would come back 0.2857 
#>      to this restaurant.              
#> 
#>  While the food was good the   0.08727
#> service was a disappointment.         
#> 
#>  I was very disappointed with  0.01877
#>    both the service and my            
#>            entree.                    
#> --------------------------------------

It doesn't get any simpler than this. The results are conveniently formatted as a dataframe, as with all the other {mscstexta4r} functions. We've already covered in detail why it is important to use tryCatch() in our previous post, so we won't go over it again, except to remind you that HTTP requests over a network and the Internet can fail. tryCatch(), in our opinion, offers the best mechanism to handle those failures.

Synchronous vs asynchronous execution

All but one core text analytics functions execute exclusively in synchronous mode. textaDetectTopics() is the only function that can be executed either synchronously or asynchronously. Why? Because topic detection is typically a "batch" operation meant to be performed on thousands of related documents (product reviews, research articles, etc.).

What's the difference?

When textaDetectTopics() executes synchronously, you must wait for it to finish before you can move on to the next task. When textaDetectTopics() executes asynchronously, you can move on to something else before topic detection has completed.

When to run which mode

If you're performing topic detection in batch mode (from an R script), we recommend using the textaDetectTopics() in synchronous mode, in which case, again, it will return only after topic detection has completed.

If you need to operate the same R console while topic detection is being performed by the Microsoft Cognitive services servers, you should call textaDetectTopics() in asynchronous mode and then call textaDetectTopicsStatus() periodically yourself, until the Microsoft Cognitive Services server complete topic detection and results become available.

For additional details -- and plenty of sample code! -- please take a look at the Vignette included with the package or review the README in the package's GitHub repo.

How fast is {mscstexta4r}?

The answer to that question depends on many factors, including the kind of Internet service provider you are using. In our case, with this package and {mscsweblm4r}, we've been very impressed by the low latency of most HTTP calls, despite our less-than-stellar ISP.

This applies to sentiment analysis, language detection, and key talking points extraction, for which we get results almost immediately on a small number of short documents. With topic detection, the API requires a minimum of 100 documents. In our experience, topic detection often took between 7 and 9 minutes on batches of 100 typical Yelp-size reviews (if there is such a thing).

If you'd like to get a feel for the speed of execution of the {mscstexta4r} package without having to write any code for it, you may want to try MSCSShiny, the web app we created to test/demo our packages. You can try it live on shinyapps.io, or download it from GitHub. Note that if the web app on shinyapps.io doesn't show up, its monthly max usage for the free tier has already been exceeded. If the app shows up but doesn't give you results, it is because there are either too many people trying to use it at the same time, or the Microsoft Cognitive Services rate limiters have kicked in. For a fair assessment, I highly recommend that you install the web app from GitHub, set it up to use your personal API key, and run it on your own machine inside the wonderful RStudio.

Have fun!

Links

{mscstexta4r} on CRAN, on GitHub
MSCSShiny on shinyapps.io, on GitHub

All Microsoft Cognitive Services components are Copyright (c) Microsoft.

Microsoft Cognitive Services -- formerly known as Project Oxford -- are a set of APIs, SDKs and services that developers can use to add AI features to their applications. Those features include emotion and video detection; facial, speech and vision recognition; and speech and language understanding.

The Microsoft Cognitive Services website provides several code samples that illustrate how to use their APIs from C#, Java, JavaScript, ObjC, PHP, Python, Ruby, and... you guessed it -- if you want to test drive their service from R, you're pretty much on your own.

To experiment with Microsoft's NLP technology in R, we've developed R bindings for a subset of their work, the Web Language Model API.

What is {mscsweblm4r}?

{mscsweblm4r} is a R package, downloadable from CRAN, that simply wraps the Microsoft Cognitive Services Web Language Model REST API,

Per Microsoft's website, this API uses smoothed Backoff N-gram language models (supporting Markov order up to 5) that were trained on four web-scale American English corpora collected by Bing (web page body, title, anchor and query).

The MSCS Web Language Model REST API supports four lookup operations:

Calculate the joint probability that a sequence of words will appear together.
Compute the conditional probability that a specific word will follow an existing sequence of words.
Get the list of words (completions) most likely to follow a given sequence of words.
Insert spaces into a string of words adjoined together without any spaces (hashtags, URLs, etc.).

How do I get started?

To use the {mscsweblm4r} package, you **MUST** have a valid account with Microsoft Cognitive Services. Once you have an account, Microsoft will provide you with a (free) API key. The key will be listed under your subscriptions.

At the time of this writing, we're allowed 100,000 transactions per month, 1,000 per minute, more than enough for us to evaluate the technology -- for free...

After you've configured {mscsweblm4r} with your API key (see the Reference manual on CRAN for details), you will be able to call the Web Language Model REST API from R, up to your maximum number of transactions per month and per minute.

How hard is it to use?
{mscsweblm4r} tries to make it as simple as possible. Like most other published packages, you can install it from CRAN with:

install.packages("mscsweblm4r")

After you've followed the Reference Manual instructions regarding API key configuration, you'll be ready to use the package with:

library(mscsweblm4r)
weblmInit()

Here's a trivial example that allows you to get the list of word candidates most likely to follow a very common sentence beginning -- "how are you"

tryCatch({

  # Generate next words
  weblmGenerateNextWords(
    precedingWords = "how are you",  # ASCII only
    modelToUse = "title",            # "title"|"anchor"|"query"(default)|"body"
    orderOfNgram = 4L,               # 1L|2L|3L|4L|5L(default)
    maxNumOfCandidatesReturned = 5L  # Default: 5L
  )

}, error = function(err) {

  # Print error
  geterrmessage()

})
#> weblm [https://api.projectoxford.ai/text/weblm/v1.0/generateNextWords?model=title&words=how%20are%20you&order=4&maxNumOfCandidatesReturned=5]
#>
#> ---------------------
#>  word    probability
#> ------- -------------
#>  doing     -1.105
#>
#>   in       -1.239
#>
#> feeling    -1.249
#>
#>  going     -1.378
#>
#>  today      -1.43
#> ---------------------

No surprise there -- "how are you doing" is a common expression, indeed. What's with the negative probability, though? In the world of N-gram probabilistic language models, using log10(probabilities) is a common strategy to avoid numeric underflows when multiplying very small numbers (n-gram probabilities) together -- see page 22 of Dan Jurafsky's PDF on N-gram Language Models, here.

What matters more, here, is that weblmGenerateNextWords()'s results are conveniently formatted as a data.frame, as is the case for all {mscsweblm4r} core functions.

Why do I need to use tryCatch()?

The Web Language Model API is a RESTful API. HTTP requests over a network and the Internet can fail. Because of congestion, because the web site might be down for maintenance, because you decided to enjoy your cup of Starbucks coffee and work in the courtyard and your wireless connection suddenly dropped, etc. There are many possible points of failure.

The API call can also fail if you've exhausted your call volume quota or are exceeding the API calls rate limit. Unfortunately, Microsoft Cognitive Services does not expose an API you can query to check if you're about to exceed your quota, for instance. The only way you'll know is by looking at the error code returned after an API call has failed.

Therefore, you must write your R code with failure in mind. And for that, our preferred way is to use tryCatch(). Its mechanism may appear a bit daunting at first, but it is well documented.

We've also included many examples in the Reference manual that illustrate how to use it with other {mscsweblm4r} functions.

How fast is {mscsweblm4r}?

The internet service provider we use (and that shall rename nameless) isn't exactly known for its performance, to say the least. Despite this, we've often been floored by how low the latency was for most HTTP requests.

Yes, this is purely anecdotal, and your experience may turn out differently from ours. What we're hoping, though, is that some of you will be interested in using our package as a building block for your own evaluation framework. If you do, please share your results with us. We will gladly make room for them on the package's GitHub page for the NLP community at large to review.

Can I try a live demo?

Sure, you can. To validate our package, we developed a Shiny web app called MSCSShiny. The source code is available here. We've also published the application to shinyapps.io.

After our free shinyapps.io credits for this app expire, if you still want to try the web application, you'll be able to download the code from GitHub and run it locally, or push it to your favorite publishing platform. In all cases, you will have to configure it to use your own Web LM API key.

Enjoy!

Links

{mscsweblm4r} on CRAN

{mscsweblm4r} on GitHub

MSCSShiny on GitHub

MSCSShiny on shinyapps.io

All Microsoft Cognitive Services components are Copyright © Microsoft.

Phil Ferriere's OSS Work

Monday, June 20, 2016

Sentiment Analysis and Topic Detection in R using Microsoft Cognitive Services

What does {mscstexta4r} do?

Sentiment analysis

Topic detection

Language detection

Extraction of key talking points

How do I install this package?

How do I use {mscstexta4r}?

Synchronous vs asynchronous execution

What's the difference?

When to run which mode

How fast is {mscstexta4r}?

Links

Tuesday, May 24, 2016

Using the Microsoft Cognitive Services Web Language Model REST API from R

What is {mscsweblm4r}?

How do I get started?

Why do I need to use tryCatch()?

How fast is {mscsweblm4r}?

Can I try a live demo?

Links