A Primer on Language Models

Kevin Siswandi
6 min read · Apr 23, 2024

Language modeling is the foundation of large language models such as GPT.

A language model assigns a probability to a piece of text, estimated from a corpus (reference text).

🟢 Causal Language Model = measures the probability of the next word

🟢 Masked Language Model = measures the probability of a word masked out in the text.

Masked language modeling (MLM) in BERT. Source: https://www.sbert.net/examples/unsupervised_learning/MLM/README.html

🟢 N-Gram Language Model = estimates probabilities from the number of times each N-gram sequence appears in the corpus (i.e. maximum likelihood estimation based on exact sequence counts).

Probability of an entire sequence under an N-gram model. Source: https://web.stanford.edu/~jurafsky/slp3/3.pdf
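To make the N-gram idea concrete, here is a minimal bigram (2-gram) model built purely from counts. The toy corpus, whitespace tokenization, and lack of smoothing are simplifying assumptions for illustration:

```python
from collections import Counter, defaultdict

# Toy corpus (assumption: whitespace tokenization, no smoothing)
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each other word
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def bigram_prob(prev, nxt):
    """Maximum likelihood estimate: P(nxt | prev) = count(prev nxt) / count(prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

print(bigram_prob("the", "cat"))  # 0.25: "the" appears 4 times, once followed by "cat"
```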

There are two ways to evaluate a language model:

  1. Intrinsic evaluation = how well the model estimates the probability of a held-out text
  2. Extrinsic evaluation = how well the model helps perform another task.

Intrinsic evaluation can be measured on the validation/test set using perplexity, which indicates how “surprising” the text is.

  • Low perplexity: high probability for the text (less surprising)
  • High perplexity: low probability for the text (more surprising)
Formula for perplexity. Source: https://community.deeplearning.ai/t/confused-about-perplexity-formula/205866/3
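As a rough sketch (assuming some model has already assigned a probability to every token of the held-out text), perplexity can be computed like this:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(-(1/N) * sum(log p_i)), i.e. the inverse probability
    of the held-out text, normalised by the number of tokens N."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Hypothetical per-token probabilities assigned by a model
print(perplexity([0.2, 0.5, 0.1, 0.4]))      # less surprising text -> lower perplexity
print(perplexity([0.01, 0.02, 0.05, 0.01]))  # more surprising text -> higher perplexity
```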

For extrinsic evaluation, a typical task for testing the language model is basic arithmetic, e.g. what is 1 + 2 equal to?

  • Use established benchmarks for extrinsic evaluation.
  • The test set should not be part of training data.

Text Generation in Traditional Language Models

Greedy Search

A simple generation strategy is greedy search: iteratively generate the most likely next word given the previous text.

✅Pros: simple and deterministic.

❌Cons:

  • doesn’t necessarily lead to the overall highest-probability sequence.
  • tends to generate short and uninteresting text.
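As a minimal sketch, greedy search simply keeps taking the argmax of the next-word distribution. The next_word_distribution function here is a hypothetical stand-in for any language model:

```python
def greedy_generate(prompt_tokens, next_word_distribution, max_new_tokens=10):
    """Iteratively append the single most likely next word (greedy search)."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_word_distribution(tokens)  # hypothetical model: dict of word -> probability
        best_word = max(dist, key=dist.get)    # greedy: pick the argmax only
        tokens.append(best_word)
        if best_word == "<eos>":               # stop at an assumed end-of-sequence marker
            break
    return tokens
```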

Beam Search

Beam search builds on greedy search by keeping the k highest-probability sequences (i.e. greedy search is beam search with k = 1).

✅Pros: deterministic and can generate higher-probability sequences than greedy search.

❌Cons:

  • more computationally expensive than greedy search.
  • tends to generate bland or uninteresting text.

Sampling

Sampling randomly chooses the next word, weighted by probability.

✅Pros:

  • generates more interesting sequences
  • can be combined with other text generation strategies.

❌Cons: not deterministic
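All three strategies are available through the Hugging Face transformers API. A rough sketch using GPT-2 (the model choice and parameter values are purely illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The language model", return_tensors="pt")

# Greedy search: always pick the most likely next token
greedy = model.generate(**inputs, max_new_tokens=20, do_sample=False)

# Beam search: keep the k highest-probability sequences (here k = 5)
beam = model.generate(**inputs, max_new_tokens=20, num_beams=5, do_sample=False)

# Sampling: draw the next token at random, weighted by probability
sampled = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50)

for output in (greedy, beam, sampled):
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```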

As a side note: I built a next-word prediction app based on N-grams back in 2014. To use it, simply enter at least two words, and it will suggest some possible next words. The app is built to be simple, fast (near-instantaneous results), reactive (produces predictions as the user types) and lightweight (minimal RAM and CPU requirements). However, the app is hosted on the free shinyapps.io platform, so it may take some time to load when accessed for the first time.
Here’s the link to the app ➡️ [https://kevinsis.shinyapps.io/wordapp/]

The Transformer Architecture

To begin with, we need to represent text as numerical vectors, where words with similar meanings are represented as similar vectors (vector similarity can be defined with, e.g., Euclidean or cosine distance). This is where word embeddings come into play, as they allow the computational processing of N-dimensional word vectors.

The word vectors should also take context into account, as the same word may mean different things in different contexts → the notion of context vectors.
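For example, similarity between word vectors is commonly measured with cosine similarity. The 4-dimensional embeddings below are made up for illustration, not taken from a real model:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors: close to 1 means similar direction."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical word embeddings
cat = np.array([0.8, 0.1, 0.6, 0.2])
dog = np.array([0.7, 0.2, 0.5, 0.3])
car = np.array([0.1, 0.9, 0.1, 0.8])

print(cosine_similarity(cat, dog))  # high: similar meanings
print(cosine_similarity(cat, car))  # lower: dissimilar meanings
```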

Tokenization

Tokenization is the splitting of a block of text into smaller pieces called tokens. A popular method is to use subword tokens (splitting into whole words is not enough, while splitting into individual characters is overkill). Algorithms such as Byte-Pair Encoding (BPE) can be used for subword tokenization and also handle new words, as follows.

Strategy for dealing with new words: split uncommon words into two or more subwords. The individual subwords are likely more common and have been seen by the model, and they often give a clue to the meaning of the new word (examples: cryptocurrency, deepfake). Furthermore, the subword tokens are themselves represented using context vectors.
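For instance, GPT-2's tokenizer (which uses BPE) splits rare words into more common subword pieces; the exact splits depend on the learned vocabulary:

```python
from transformers import AutoTokenizer

# GPT-2's tokenizer uses Byte-Pair Encoding (BPE)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Rare words get split into subword pieces (the exact split depends on the learned vocabulary)
print(tokenizer.tokenize("cryptocurrency"))
print(tokenizer.tokenize("deepfake"))

# Common words usually remain a single token
print(tokenizer.tokenize("the"))
```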

Self-attention Mechanism

The objective of a neural language model can be either to predict the next token in the sequence or to calculate the probability of a whole sequence of text. The input to the neural network is the word embeddings, which are transformed via a series of nonlinear operations to produce an output token (probability scoring is typically done via softmax).

Neural language models may store a lookup table of vectors for every token, then use a context-vector maker to contextualize them into context vectors. This context-vector maker also weights the importance of other words to decide which meaning to use for a word that might have multiple meanings. Here, we need a function that calculates the relevance of one word to the context of another word, given the word vectors (without context) as input.

The context vector for a particular word is then just a linear combination of transformed word vectors, with weighting based on the relevance scores.
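A minimal numpy sketch of this idea, i.e. (single-head) scaled dot-product self-attention; the random matrices stand in for learned projection weights:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Each row of X is a word vector (no context); each output row is a context vector:
    a linear combination of (transformed) word vectors weighted by relevance scores."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # relevance of every word to every other word
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ V                            # weighted combination -> context vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                       # 4 tokens, 8-dimensional word vectors
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)        # (4, 8): one context vector per token
```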

The transformer block. Source: Wikipedia.

The transformer block uses self-attention in the following way:

  • use positional encoding to include information about the position of tokens.
  • include a feed-forward neural network to introduce nonlinearity.
  • run several self-attention heads in parallel (multi-head attention) and stack multiple transformer layers: the lower layers likely deal with basic syntax, while the higher layers deal with more complex reasoning. LLMs such as GPT-3 have 96 layers.
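A rough PyTorch sketch of such a stack (the sizes here are illustrative, not those of any particular model):

```python
import torch
import torch.nn as nn

# One transformer block: multi-head self-attention + feed-forward network
block = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)

# Stack several blocks (GPT-3 stacks 96 of them)
encoder = nn.TransformerEncoder(block, num_layers=6)

# (sequence length, batch, embedding dim); positional encoding is not shown here
tokens = torch.randn(10, 1, 512)
print(encoder(tokens).shape)  # torch.Size([10, 1, 512])
```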

Sesame Street: BERT (vs. GPT)

Transformer training works well because:

  • it parallelises (previous approaches were serial, processing one word at a time).
  • it can be trained effectively on GPUs/TPUs (which PyTorch and TensorFlow support).

BERT (Bidirectional Encoder Representations from Transformers) focuses on masked language modeling. The context-vector maker used by BERT is the transformer encoder, which produces ready-to-use context vectors (e.g., passed to another classifier for sentiment analysis using the [CLS] token).

GPT focuses on causal language modeling using the decoder architecture. The text given to a causal language model is known as a prompt. However, other language models may have been trained further, e.g. to follow instructions.
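Both flavours are easy to try with the transformers pipeline API (the model names below are just common defaults, not a recommendation):

```python
from transformers import pipeline

# Masked language modeling with BERT (encoder): fill in the blank
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))

# Causal language modeling with GPT-2 (decoder): continue a prompt
generator = pipeline("text-generation", model="gpt2")
print(generator("The capital of France is", max_new_tokens=5))
```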


Aside: the trend of using Sesame Street characters to name language models started with ELMo (by the Allen Institute) in 2017, then BERT (by Google) in 2018, soon followed by ERNIE, KERMIT, and so on.

Practical Aspects of Working with LLMs

Large Language Models (LLMs), such as GPT-3 and beyond, can perform a broad range of tasks when given a few examples (a.k.a. few-shot) or even no examples (i.e. zero-shot), without any gradient updates.

Source: https://arxiv.org/abs/2005.14165

As training an LLM from scratch can be prohibitively expensive, use pre-trained models, which can be downloaded with the Python transformers library from Hugging Face.
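A sketch of few-shot prompting with a downloaded pre-trained model: the prompt format follows the GPT-3 paper's translation example, and GPT-2 is used here only as a small stand-in (it will follow the format far less reliably than larger models):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Few-shot prompt: a handful of examples followed by the query; no gradient updates involved
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "plush giraffe =>"
)
print(generator(prompt, max_new_tokens=10)[0]["generated_text"])
```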

When deploying an LLM-based application, also be aware of two well-known problems: prompt injection and hallucination.

Prompt injection attack on Bing Chat. Source: https://arstechnica.com/information-technology/2023/02/ai-powered-bing-chat-spills-its-secrets-via-prompt-injection-attack/

Reinforcement Learning with Human Feedback (RLHF)

In reinforcement learning, we define a way to obtain a reward (the higher the better) for each output. Similarly, in RLHF the language model is fine-tuned to maximise the expected reward across all outputs. Here, we prepare a dataset of pairwise comparisons and learn human preferences as a separate NLP problem. Note that RLHF agents are rewarded for responses that humans prefer, but not necessarily for factual ones (i.e. the model can still hallucinate, although its output improves).
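The preference-learning step can be sketched as a pairwise ranking loss on a reward model. This is a simplified stand-in for illustration, not the actual InstructGPT training code:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise comparison loss: pushes the reward of the human-preferred
    response above the reward of the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical scalar rewards produced by a reward model for a batch of comparison pairs
reward_chosen = torch.tensor([1.2, 0.3, 0.8])
reward_rejected = torch.tensor([0.4, 0.5, -0.1])
print(preference_loss(reward_chosen, reward_rejected))  # lower when preferred responses score higher
```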

RLHF to train InstructGPT. Source: https://openai.com/research/instruction-following

Contact the Author

If you would like to scale up your organization’s LLMOps capability, feel free to contact me.

👆https://github.com/physicist91
👆https://www.linkedin.com/in/kevinsiswandi/
