Little useless-useful R functions – Markov babbler


Update: 2025-09-10
[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers].



This one is named, yes, you guessed it, after Markov chains. 🙂 The "babbler" part is there to connote the simplicity of this useless R function.





It's a simple calculation of the probability that one word follows another. Drawing the next word from those that repeatedly appear chained together is reminiscent of a Markov chain (although this is not really one!).





The gist is tokenization of the words, counting how often each sequence appears, and calculating the probabilities.
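To make the three steps concrete, here is a minimal base R sketch on a made-up sentence. The sentence and the variable names are only illustrative assumptions, and it looks at single-word transitions (order 1) rather than the pairs of words used by the function.

```r
# Toy sentence, invented for illustration (not from the story used later)
text <- "the wolf saw the girl and the wolf ran"

# 1. Tokenization: split on whitespace
tokens <- strsplit(text, "\\s+")[[1]]

# 2. Counting: pair each token with the token that follows it
from <- head(tokens, -1)
to   <- tail(tokens, -1)
counts <- table(from, to)

# 3. Probabilities: normalise each row so it sums to 1
probs <- prop.table(counts, margin = 1)

# "the" is followed by "wolf" twice and by "girl" once
probs["the", "wolf"]  # 2/3
probs["the", "girl"]  # 1/3
```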





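library(stringr)  # str_split()
library(dplyr)    # %>%, group_by(), summarise(), mutate(), filter()

markov_babbler <- function(text, order = 2, n = 50, by_word = TRUE) {
  # Tokenize into words (split on whitespace) or into single characters
  tokens <- if (by_word) str_split(text, "\\s+")[[1]] else unlist(str_split(text, ""))
  tokens <- tokens[tokens != ""]

  # TODO: add the removal of full stops, ...
  # (this vector is not used further in this listing)
  token <- c('I', 'I am', 'to', 'all', 'Oh')

  # Pair every run of `order` consecutive tokens ("from") with the token that follows it ("to")
  df <- data.frame(
    from = sapply(seq_len(length(tokens) - order), function(i) paste(tokens[i:(i + order - 1)], collapse = " ")),
    to = tokens[(order + 1):length(tokens)],
    stringsAsFactors = FALSE
  )

  # Count each (from, to) pair and convert the counts to probabilities per "from"
  probs <- df %>%
    group_by(from, to) %>%
    summarise(freq = n(), .groups = "drop") %>%
    group_by(from) %>%
    mutate(prob = freq / sum(freq))

  # Start from a random state, then repeatedly draw the next token
  current <- sample(unique(probs$from), 1)
  output <- unlist(str_split(current, " "))

  for (i in seq_len(n)) {
    next_word <- probs %>% filter(from == current)
    if (nrow(next_word) == 0) break  # dead end: nothing ever followed this state
    next_token <- sample(next_word$to, 1, prob = next_word$prob)
    output <- c(output, next_token)
    current <- paste(tail(output, order), collapse = " ")
  }

  # The listing stops after the loop; returning the generated text is an assumption
  paste(output, collapse = " ")
}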
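The generation loop is easier to follow on a tiny table. The sketch below uses a hand-written, hypothetical transition table with single-word states (order 1) and plain base R, so the numbers are made up and only the mechanics mirror the loop: look up the current state, draw the next token weighted by its probability, and stop at a dead end.

```r
# Hypothetical transition table, invented for illustration
probs <- data.frame(
  from = c("the", "the", "wolf", "wolf", "girl"),
  to   = c("wolf", "girl", "ran", "slept", "ran"),
  prob = c(2/3, 1/3, 0.5, 0.5, 1),
  stringsAsFactors = FALSE
)

set.seed(42)  # only to make the draw repeatable
current <- "the"
output <- current

for (i in 1:5) {
  candidates <- probs[probs$from == current, ]
  if (nrow(candidates) == 0) break  # "ran" and "slept" have no successors
  # Weighted draw: more frequent continuations are picked more often
  next_token <- sample(candidates$to, 1, prob = candidates$prob)
  output <- c(output, next_token)
  current <- next_token  # with order 1 the new state is just the last token
}

paste(output, collapse = " ")
```

Every path in this toy table ends after two steps, so the loop always stops early through the `break`, which is the same dead-end check the function relies on.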



With this in mind, I took Little Red Riding Hood (Brothers Grimm) and plugged the story into the function, in both English and Slovenian.
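A story comes with full stops, commas and quotation marks, and the function has a note about removing them. One possible way to do that before passing the text in is sketched below. The helper name clean_text() is hypothetical and not part of the original code, and lowercasing is an extra assumption that merges "The" and "the" into one token.

```r
# Hypothetical helper, not part of the original function
clean_text <- function(text) {
  text <- tolower(text)                    # assumption: treat "Oh" and "oh" alike
  text <- gsub("[[:punct:]]+", " ", text)  # replace punctuation with spaces
  text <- gsub("\\s+", " ", text)          # collapse repeated whitespace
  trimws(text)
}

clean_text("Oh, Grandmother! What big ears you have.")
# "oh grandmother what big ears you have"
```

Note that this also splits contractions ("don't" becomes "don t"), so it is a rough cleanup rather than proper tokenization.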












Playing around with useless statistics is fun. Useless fun 🙂





And no function is complete without a little ggplot for drawing the network of words.





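library(igraph)  # graph_from_data_frame()
library(ggraph)  # ggraph(), geom_edge_link(), geom_node_label(); attaches ggplot2

# `probs` is the transition table built inside markov_babbler();
# only transitions seen more than once are drawn
g <- graph_from_data_frame(probs %>% filter(freq > 1), directed = TRUE)
p <- ggraph(g, layout = "fr") +
  geom_edge_link(aes(edge_alpha = prob, edge_width = prob), color = "firebrick") +
  geom_node_label(aes(label = name), size = 4, repel = TRUE) +
  theme_void() +
  labs(title = "Markov Chain: Token Transitions")
p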



As always, the complete code is available on GitHub in the Useless_R_function repository (filename: Markov_babbler.R). Check the repository for future updates.





Happy R-coding and stay healthy!








