Little useless-useful R functions – Markov babbler

This one is named, yes, you guessed it, after Markov chains. 🙂 The "babbler" part hints at the simplicity of this useless R function.

It’s a simple calculation of the probability of one word following another, and drawing from word sequences that appear multiple times is reminiscent of a Markov chain (although this is not quite it!).

The gist is tokenizing the text into words, counting how often each word follows the preceding ones, and calculating the probabilities.
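To make the counting step concrete, here is a minimal base-R sketch on a made-up sentence. The sentence and the variable names are illustrative assumptions, not part of the original function, and it uses an order of 1 (single-word states) for brevity. It pairs every word with its successor, counts the pairs, and divides each count by its row total.

```r
# Toy input; the sentence is made up for illustration.
text <- "the wolf ate the girl and the wolf slept"

# Tokenise on whitespace.
tokens <- strsplit(text, "\\s+")[[1]]

# Pair each token with its successor (order 1, for brevity).
from <- head(tokens, -1)
to   <- tail(tokens, -1)

# Count each (from, to) pair, then convert counts to row-wise probabilities.
counts <- table(from, to)
probs  <- prop.table(counts, margin = 1)

# Transition probabilities for the words observed after "the".
round(probs["the", ], 2)
```

In this toy text "the" is followed by "wolf" twice and by "girl" once, so those two transitions get probabilities 2/3 and 1/3.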

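library(stringr)  # str_split()
library(dplyr)    # %>%, group_by(), summarise(), mutate(), filter()
library(igraph)   # graph_from_data_frame()
library(ggraph)   # ggraph(), geom_edge_link(), geom_node_label()
library(ggplot2)  # theme_void(), labs()

markov_babbler <- function(text, order = 2, n = 50, by_word = TRUE) {
  # Tokenise into words (split on whitespace) or into single characters
  tokens <- if (by_word) str_split(text, "\\s+")[[1]] else unlist(str_split(text, ""))
  tokens <- tokens[tokens != ""]
  stopifnot(length(tokens) > order)  # need at least one (state, next token) pair
  
  # TODO: add the removal of full stops, ...
  # (`token` is a placeholder for that step and is not used below)
  token <- c('I', 'I am', 'to', 'all', 'Oh')
  
  # Each state (`from`) is `order` consecutive tokens; `to` is the token that follows it
  df <- data.frame(
    from = sapply(seq_len(length(tokens) - order), function(i) paste(tokens[i:(i + order - 1)], collapse = " ")),
    to = tokens[(order + 1):length(tokens)],
    stringsAsFactors = FALSE
  )
  
  # Count every (from, to) pair and turn the counts into per-state probabilities
  probs <- df %>%
    group_by(from, to) %>%
    summarise(freq = n(), .groups = "drop") %>%
    group_by(from) %>%
    mutate(prob = freq / sum(freq))
  
  # Start from a random state and seed the output with its tokens
  current <- sample(unique(probs$from), 1)
  output <- unlist(str_split(current, " "))
  
  # Repeatedly draw the next token, weighted by its transition probability
  for (i in seq_len(n)) {
    next_word <- probs %>% filter(from == current)
    if (nrow(next_word) == 0) break  # dead end: this state was never followed by anything
    next_token <- sample(next_word$to, 1, prob = next_word$prob)
    output <- c(output, next_token)
    current <- paste(tail(output, order), collapse = " ")  # slide the window forward
  }
  # (the function body continues with the plotting step shown further down)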

With this in mind, I took Little Red Riding Hood (Brothers Grimm) and plugged the story into the function, in both English and Slovenian.
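For readers who want to try the babbling idea without any packages, here is a stripped-down base-R sketch. It is not the function from this post: `babble_sketch` is a hypothetical helper name, it uses single-word states only, and the input sentence is made up. It keeps every observed successor of each word, repeats included, so a uniform draw from that list matches the observed frequencies.

```r
# A base-R-only sketch of the babbling step; `babble_sketch` is a hypothetical name.
babble_sketch <- function(text, n = 10) {
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens <- tokens[tokens != ""]
  stopifnot(length(tokens) >= 2)  # need at least one word pair

  # For each word, collect all words that followed it (repeats kept),
  # so uniform sampling from the list equals sampling by frequency.
  successors <- split(tail(tokens, -1), head(tokens, -1))

  # Start from a random word that has at least one successor.
  current <- sample(names(successors), 1)
  output <- current

  for (i in seq_len(n)) {
    candidates <- successors[[current]]
    if (is.null(candidates)) break  # dead end: word was never followed by anything
    current <- sample(candidates, 1)
    output <- c(output, current)
  }
  paste(output, collapse = " ")
}

set.seed(42)  # fixed seed only so repeated runs give the same babble
babble_sketch("once upon a time there was a sweet little girl and a wolf", n = 8)
```

Every adjacent word pair in the result also occurs somewhere in the input, which is why the output tends to read as almost-sensible text.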

Playing around with useless statistics is fun. Useless fun 🙂

And no function is complete without a little ggplot for drawing the network of words.

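  # Keep only transitions seen more than once, so the graph stays readable
  g <- graph_from_data_frame(probs %>% filter(freq > 1), directed = TRUE)
  plot <- ggraph(g, layout = "fr") +
    geom_edge_link(aes(edge_alpha = prob, edge_width = prob), color = "firebrick") +
    geom_node_label(aes(label = name), size = 4, repel = TRUE) +
    theme_void() +
    labs(title = "Markov Chain: Token Transitions")
  # Assumed ending (the excerpt stops above): show the plot, return the babbled text
  print(plot)
  paste(output, collapse = if (by_word) " " else "")
}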

As always, the complete code is available on GitHub in the Useless_R_function repository. The sample file in this repository is here (filename: Markov_babbler.R). Check the repository for future updates.

Happy R-coding and stay healthy!

