{"id":20265,"date":"2025-03-08T00:00:00","date_gmt":"2025-03-08T00:00:00","guid":{"rendered":"https:\/\/thierrymoudiki.github.io\/\/blog\/2025\/03\/08\/r\/python\/llms\/word-online"},"modified":"2025-03-08T00:00:00","modified_gmt":"2025-03-08T00:00:00","slug":"word-online-recreating-karpathys-char-rnn-with-supervised-linear-online-learning-of-word-embeddings-for-text-completion","status":"publish","type":"post","link":"https:\/\/python-bloggers.com\/2025\/03\/word-online-recreating-karpathys-char-rnn-with-supervised-linear-online-learning-of-word-embeddings-for-text-completion\/","title":{"rendered":"Word-Online: recreating Karpathy&#8217;s char-RNN (with supervised linear online learning of word embeddings) for text completion"},"content":{"rendered":"<div style=\\\"border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;\\\">\r\n<i>This article was first published on  <strong>\r\n<a href=\"https:\/\/thierrymoudiki.github.io\/\/blog\/2025\/03\/08\/r\/python\/llms\/word-online\"> T. Moudiki's Webpage - Python <\/a><\/strong>, and kindly contributed to <a href=\/about\/>python-bloggers<\/a>.  (You can report issue about the content on this page <a href=\/contact-us\/>here<\/a>)\r\n<br\/>Want to share your content on python-bloggers?<a href=\/add-your-blog\/> click here<\/a>.<\/i>\r\n<\/div>\n<p>In this post, I implement a simple word completion model, based on <a href=\"https:\/\/karpathy.github.io\/2015\/05\/21\/rnn-effectiveness\/\">Karpathy\u2019s char-RNN<\/a>, but using <strong>supervised linear online learning of word embeddings<\/strong>. More precisely, I use the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.linear_model.SGDClassifier.html\">SGDClassifier<\/a> from <code>scikit-learn<\/code>, which is <strong>a simple linear classifier that can be updated incrementally<\/strong>.<\/p>\n<p>Keep in mind that this is an illustrative example, based on a few words and small vocabulary. 
There are many, many ways to improve the model, and many other configurations could be envisaged. So, feel free to experiment and <a href=\"https:\/\/github.com\/thierrymoudiki\/word-online\">extend this example<\/a>. Nonetheless, the grammatical structure of the generated text (don\u2019t generalize this result yet) is surprisingly good.<\/p>\n<p>My two cents (a non-scientific extrapolation): artificial <em>neural<\/em> networks are not intrinsically better than other methods; what matters is a model with high capacity, capable of learning and generalizing well.<\/p>\n<p>Here is how to reproduce the example, assuming you named the file <code>word-online.py<\/code> (the repository is named <a href=\"https:\/\/github.com\/thierrymoudiki\/word-online\"><code>word-online<\/code><\/a>):<\/p>\n<pre>uv venv venv --python=3.11\nsource venv\/bin\/activate\nuv pip install -r requirements.txt\n<\/pre>\n<pre>python word-online.py\n<\/pre>\n<p><code>word-online.py<\/code> contains the following code:<\/p>\n<h1 id=\"python-version\">Python version<\/h1>\n<pre>import numpy as np\nimport gensim\nimport time  # Used for the optional delay between generated words\n\nfrom collections import deque\nfrom tqdm import tqdm\nfrom sklearn.linear_model import SGDClassifier\n\n\n# Sample text\ntext = \"\"\"Hello world, this is an online learning example with word embeddings.\n          It learns words and generates text incrementally using an SGD classifier.\"\"\"\n\ndef debug_print(x):\n    print(f\"{x}\")\n\n# Tokenization (simple space-based)\nwords = text.lower().split()\nvocab = sorted(set(words))\nvocab.append(\"&lt;UNK&gt;\")  # Add unknown token for OOV words\n\n# Train Word2Vec model (or load pretrained embeddings)\nembedding_dim = 50  # Change to 100\/300 if using a larger model\nword2vec = gensim.models.Word2Vec([words], vector_size=embedding_dim, window=5, min_count=1, sg=0)\n\n# Create word-to-index mapping\nword_to_idx = {word: i for i, word in enumerate(vocab)}\nidx_to_word = {i: word for word, i in word_to_idx.items()}\n\n# Hyperparameters\ncontext_size = 12  # Number of context words used for prediction (default: 10)\nlearning_rate = 0.005\nepochs = 10\n\n# Prepare training data\nX_train, y_train = [], []\n\nfor i in tqdm(range(len(words) - context_size)):\n    context = words[i:i + context_size]\n    target = words[i + context_size]\n    # Convert context words to embeddings\n    context_embedding = np.concatenate([word2vec.wv[word] for word in context])\n    X_train.append(context_embedding)\n    y_train.append(word_to_idx[target])\n\nX_train, y_train = np.array(X_train), np.array(y_train)\n\n# Initialize SGD-based classifier\nclf = SGDClassifier(loss=\"hinge\", max_iter=1, learning_rate=\"constant\", eta0=learning_rate)\n\n# Online training (stochastic updates, multiple passes)\nfor epoch in tqdm(range(epochs)):\n    for i in range(len(X_train)):\n        clf.partial_fit([X_train[i]], [y_train[i]], classes=np.arange(len(vocab)))\n\n# Softmax function for probability scaling\ndef softmax(logits):\n    exp_logits = np.exp(logits - np.max(logits))  # Subtract max for numerical stability\n    return exp_logits \/ np.sum(exp_logits)\n\n\ndef sample_from_logits(logits, k=5, temperature=1.0, random_seed=None):\n    \"\"\" Applies Top-K sampling &amp; Temperature scaling \"\"\"\n    logits = np.array(logits) \/ temperature  # Apply temperature scaling\n    probs = softmax(logits)  # Convert logits to probabilities\n    # Select top-K indices\n    top_k_indices = np.argsort(probs)[-k:]\n    top_k_probs = probs[top_k_indices]\n    top_k_probs \/= top_k_probs.sum()  # Normalize\n    # Only reseed when a seed is explicitly given; reseeding with the same fixed\n    # seed on every call would repeat the same draw for identical distributions\n    if random_seed is not None:\n        np.random.seed(random_seed)\n    # Sample from Top-K distribution\n    return np.random.choice(top_k_indices, p=top_k_probs)\n\n\ndef generate_text(seed=\"this is\", length=20, k=5, temperature=1.0, random_state=123, delay=3):\n    seed_words = seed.lower().split()\n\n    # Ensure context has `context_size` words (pad with zero vectors if needed)\n    while len(seed_words) &lt; context_size:\n        seed_words.insert(0, \"&lt;PAD&gt;\")\n\n    context = deque(\n        [word_to_idx[word] if word in word_to_idx else -1 for word in seed_words[-context_size:]],\n        maxlen=context_size\n    )\n\n    generated = seed\n    previous_word = seed\n\n    for _ in range(length):\n        # Generate embeddings, use a zero vector if word is missing\n        context_embedding = np.concatenate([\n            word2vec.wv[idx_to_word[idx]] if idx in idx_to_word else np.zeros(embedding_dim)\n            for idx in context\n        ])\n        logits = clf.decision_function([context_embedding])[0]  # Get raw scores\n        # Sample next word using Top-K &amp; Temperature scaling\n        pred_idx = sample_from_logits(logits, k=k, temperature=temperature)\n        next_word = idx_to_word.get(pred_idx, \"&lt;PAD&gt;\")\n        \n        print(f\"Generating next word: {next_word}\")\n        time.sleep(delay)  # Optional pause between words\n        \n        # Capitalize the word that starts a new sentence\n        if previous_word.endswith(\".\"):\n            generated += \" \" + next_word.capitalize()\n        else:\n            generated += \" \" + next_word\n        previous_word = next_word\n        context.append(pred_idx)\n\n    return generated\n\n# Generate text\nprint(\"\\\n\\\n Generated Text:\")\nseed = \"This is a\"\nprint(seed)\nprint(generate_text(seed, length=12, k=1, delay=0))  # delay = pause in seconds before each word; delay=0 prints instantly\n<\/pre>\n<pre>100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 10\/10 [00:00&lt;00:00, 
12164.45it\/s]\n100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 10\/10 [00:01&lt;00:00,  8.34it\/s]\n\n\n Generated Text:\nThis is a\nGenerating next word: classifier.\nGenerating next word: an\nGenerating next word: sgd\nGenerating next word: classifier.\nGenerating next word: and\nGenerating next word: generates\nGenerating next word: text\nGenerating next word: incrementally\nGenerating next word: using\nGenerating next word: an\nGenerating next word: sgd\nGenerating next word: classifier.\nThis is a classifier. An sgd classifier. And generates text incrementally using an sgd classifier.\n<\/pre>\n<p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/thierrymoudiki.github.io\/images\/2025-03-08\/2025-03-08-image1.gif?w=578&#038;ssl=1\" alt=\"image-title-here\" \/><\/p>\n<h1 id=\"r-version\">R version<\/h1>\n<pre>%%R\nlibrary(reticulate)\nlibrary(progress)\nlibrary(stats)\n\n# Initialize Python modules through reticulate\nnp &lt;- import(\"numpy\")\ngensim &lt;- import(\"gensim\")\ntime &lt;- import(\"time\")  # Added for the delay parameter\n\n# Sample text \ntext &lt;- \"This is a model used for classification purposes. 
It applies continuous learning on word vectors, converting words into embeddings, learning from those embeddings, and gradually producing text through the iterative process of an SGD classifier.\"\n\ndebug_print &lt;- function(x) {\n  print(paste0(x))\n}\n\n# Tokenization (simple space-based)\nwords &lt;- strsplit(tolower(text), \"\\\\s+\")[[1L]]\nvocab &lt;- sort(unique(words))\nvocab &lt;- c(vocab, \"&lt;UNK&gt;\")  # Add unknown token for OOV words\n\n# Train Word2Vec model (or load pretrained embeddings)\nembedding_dim &lt;- 50L  # Change to 100\/300 if using a larger model\nword2vec &lt;- gensim$models$Word2Vec(list(words), vector_size=embedding_dim, window=5L, min_count=1L, sg=0L)\n\n# Ensure \"&lt;UNK&gt;\" is in the Word2Vec vocabulary\n# This is the crucial step to fix the KeyError\nif (!(\"&lt;UNK&gt;\" %in% word2vec$wv$index_to_key)) {\n  word2vec$wv$add_vector(\"&lt;UNK&gt;\", rep(0, embedding_dim))  # Add \"&lt;UNK&gt;\" with a zero vector\n}\n\n\n# Create word-to-index mapping\nword_to_idx &lt;- setNames(seq_along(vocab) - 1L, vocab)  # 0-based indexing to match Python\nidx_to_word &lt;- setNames(vocab, as.character(word_to_idx))\n\n# Hyperparameters\ncontext_size &lt;- 12L  # Default 10, Words used for prediction context\nlearning_rate &lt;- 0.005\nepochs &lt;- 10L\n    \n# Prepare training data\nX_train &lt;- list()\ny_train &lt;- list()\n\npb &lt;- progress_bar$new(total = length(words) - context_size)\nfor (i in 1L:(length(words) - context_size)) {\n  context &lt;- words[i:(i + context_size - 1L)]\n  target &lt;- words[i + context_size]\n  # Convert context words to embeddings\n  context_vectors &lt;- lapply(context, function(word) as.array(word2vec$wv[word]))\n  context_embedding &lt;- np$concatenate(context_vectors)\n  X_train[[i]] &lt;- context_embedding\n  y_train[[i]] &lt;- word_to_idx[target]\n  pb$tick()\n}\n\n# Initialize SGD-based classifier\nsklearn &lt;- import(\"sklearn.linear_model\")\nclf &lt;- sklearn$SGDClassifier(loss=\"hinge\", 
max_iter=1L, learning_rate=\"constant\", eta0=learning_rate)\n\n# Online training (stochastic updates, multiple passes)\npb &lt;- progress_bar$new(total = epochs)\nfor (epoch in 1L:epochs) {\n  for (i in 1L:length(X_train)) {\n    # Use the list version for indexing individual samples\n    clf$partial_fit(\n      np$array(list(X_train[[i]])), \n      np$array(list(y_train[[i]])), \n      classes=np$arange(length(vocab))\n    )\n  }\n  pb$tick()\n}\n\n# Softmax function for probability scaling\nsoftmax_fn &lt;- function(logits) {\n  exp_logits &lt;- exp(logits - max(logits))  # Stability trick\n  return(exp_logits \/ sum(exp_logits))\n}\n\nsample_from_logits &lt;- function(logits, k=5L, temperature=1.0, random_seed=123L) {\n  # Applies Top-K sampling &amp; Temperature scaling.\n  # Note: logits[j] (1-based R position j) is the score of class j - 1, since the\n  # classifier uses 0-based class ids; returns are therefore shifted by -1L.\n  logits &lt;- as.numeric(logits) \/ temperature  # Apply temperature scaling\n  probs &lt;- softmax_fn(logits)  # Convert logits to probabilities\n  \n  # Select top-K indices - ensure k doesn't exceed the length of logits\n  k &lt;- min(k, length(logits))\n  sorted_indices &lt;- order(probs)\n  top_k_indices &lt;- sorted_indices[(length(sorted_indices) - k + 1L):length(sorted_indices)]\n  \n  # Handle case where k=1 specially\n  if (k == 1L) {\n    return(top_k_indices - 1L)  # Convert 1-based position to 0-based class id\n  }\n  \n  top_k_probs &lt;- probs[top_k_indices]\n  # Ensure probabilities sum to 1\n  top_k_probs &lt;- top_k_probs \/ sum(top_k_probs)\n  \n  # Check if all probabilities are valid\n  if (any(is.na(top_k_probs)) || length(top_k_probs) != length(top_k_indices)) {\n    # If there are issues with probabilities, just return the highest probability item\n    return(top_k_indices[which.max(probs[top_k_indices])] - 1L)\n  }\n  \n  # Sample from Top-K distribution\n  set.seed(random_seed)\n  return(sample(top_k_indices, size=1L, prob=top_k_probs) - 1L)\n}\n\ngenerate_text &lt;- function(seed=\"this is\", length=20L, k=5L, temperature=1.0, random_state=123L, delay=3L) {\n  seed_words &lt;- strsplit(tolower(seed), \"\\\\s+\")[[1L]]\n  \n  # 
Ensure context has `context_size` words (pad with zero vectors if needed)\n  while (length(seed_words) &lt; context_size) {\n    seed_words &lt;- c(\"&lt;PAD&gt;\", seed_words)\n  }\n  \n  # Use a fixed-size list as a ring buffer\n  context &lt;- vector(\"list\", context_size)\n  for (i in 1L:context_size) {\n    word &lt;- tail(seed_words, context_size)[i]\n    if (word %in% names(word_to_idx)) {\n      context[[i]] &lt;- word_to_idx[word]\n    } else {\n      context[[i]] &lt;- -1L\n    }\n  }\n  \n  # Track position in the ring buffer\n  context_pos &lt;- 1L\n  \n  generated &lt;- seed\n  previous_word &lt;- seed\n  \n  for (i in 1L:length) {\n    # Generate embeddings, use a zero vector if word is missing\n    context_vectors &lt;- list()\n    for (idx in unlist(context)) {\n      if (as.character(idx) %in% names(idx_to_word)) {\n        word &lt;- idx_to_word[as.character(idx)]\n        context_vectors &lt;- c(context_vectors, list(as.array(word2vec$wv[word])))\n      } else {\n        context_vectors &lt;- c(context_vectors, list(np$zeros(embedding_dim)))\n      }\n    }\n    \n    context_embedding &lt;- np$concatenate(context_vectors)\n    logits &lt;- clf$decision_function(np$array(list(context_embedding)))[1L,]\n    \n    # Sample next word using Top-K &amp; Temperature scaling\n    pred_idx &lt;- sample_from_logits(logits, k=k, temperature=temperature, random_seed=random_state+i)\n    next_word &lt;- if (as.character(pred_idx) %in% names(idx_to_word)) {\n      idx_to_word[as.character(pred_idx)]\n    } else {\n      \"&lt;PAD&gt;\"\n    }\n    \n    print(paste0(\"Generating next word: \", next_word))\n    if (delay &gt; 0) {\n      time$sleep(delay)  # Added delay\n    }\n    \n    if (substr(previous_word, nchar(previous_word), nchar(previous_word)) == \".\" &amp;&amp; \n        previous_word != \"\" &amp;&amp; previous_word != seed) {\n      generated &lt;- paste0(generated, \" \", toupper(substr(next_word, 1, 1)), substr(next_word, 2, 
nchar(next_word)))\n    } else {\n      generated &lt;- paste0(generated, \" \", next_word)\n    }\n    \n    previous_word &lt;- next_word\n    \n    # Update context (ring buffer style)\n    context[[context_pos]] &lt;- pred_idx\n    context_pos &lt;- (context_pos %% context_size) + 1L\n  }\n  \n  return(generated)\n}\n    \ncat(\"\\n\\n Generated Text:\\n\")\nseed &lt;- \"This classifier is\"\ncat(seed, \"\\n\")\nresult &lt;- generate_text(seed, length=2L, k=3L, delay=0L)  # delay seconds for next word generation\nprint(result)    \n<\/pre>\n<pre>Generated Text:\nThis classifier is \n[1] \"Generating next word: for\"\n[1] \"Generating next word: text\"\n[1] \"This classifier is for text\"\n<\/pre>\n\n<div style=\\\"border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;\\\">\r\n<div style=\\\"text-align: center;\\\">To <strong>leave a comment<\/strong> for the author, please follow the link and comment on their blog: <strong><a href=\"https:\/\/thierrymoudiki.github.io\/\/blog\/2025\/03\/08\/r\/python\/llms\/word-online\"> T. 
Moudiki's Webpage - Python <\/a><\/strong>.<\/div>\r\n<hr \/>\r\nWant to share your content on python-bloggers?<a href=\/add-your-blog\/ rel=\\\"nofollow\\\"> click here<\/a>.\r\n<\/div>","protected":false},"excerpt":{"rendered":"<p>R and Python implementations of word completion<\/p>\n","protected":false},"author":8,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[4],"tags":[],"class_list":["post-20265","post","type-post","status-publish","format-standard","hentry","category-data-science"],"aioseo_notices":[],"jetpack_featured_media_url":"","jetpack-related-posts":[],"jetpack_sharing_enabled":true,"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/python-bloggers.com\/wp-json\/wp\/v2\/posts\/20265","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/python-bloggers.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/python-bloggers.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/python-bloggers.com\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/python-bloggers.com\/wp-json\/wp\/v2\/comments?post=20265"}],"version-history":[{"count":2,"href":"https:\/\/python-bloggers.com\/wp-json\/wp\/v2\/posts\/20265\/revisions"}],"predecessor-version":[{"id":20267,"href":"https:\/\/python-bloggers.com\/wp-json\/wp\/v2\/posts\/20265\/revisions\/20267"}],"wp:attachment":[{"href":"https:\/\/python-bloggers.com\/wp-json\/wp\/v2\/media?parent=20265"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/python-bloggers.com\/wp-json\/wp\/v2\/categories?post=20265"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/python-bloggers.com\/wp-json\/wp\/v2\/tags?post=20265"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}