LLM Primer
This section provides a high-level overview of key concepts related to Large Language Models (LLMs). If you are familiar with LLMs, you may want to proceed to LLM Ops.
Large Language Models (LLMs) are built on the Transformer architecture and serve as a foundational component in enabling computers to understand and generate human language. Drawing from large datasets, they learn linguistic structures, semantics, and context through a process called pre-training.
These models, in their most sophisticated form (e.g., GPT-4), produce text that is remarkably human-like, able to answer questions, translate languages, and handle tasks such as document analysis, SQL generation, and creative writing.
Model Foundations
Large language models are built by training on massive datasets of tokenized text to learn patterns and relationships between words. Through a compute-intensive process, the model ingests sequences of tokens and learns to predict the next token in context. The training dataset is what makes each model unique and plays the largest role in model performance.
As models train, their parameters are tuned to generate human-like responses. The size of a model, measured in parameters, typically determines its power and performance; however, a trend toward highly capable smaller models emerged in the second half of 2023 thanks to innovations in training and inference. While state-of-the-art models have hundreds of billions of parameters, open-source models typically range from 7 to 65 billion parameters.
- Name: Token
- Type: Numerical representation of language
- Description: Tokens are a unique numerical representation of a word or partial word. Tokenization allows LLMs to handle and process text data. Most LLMs have a 1.3:1 ratio of tokens to English words.
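As a concrete illustration, here is a minimal tokenization sketch using OpenAI's tiktoken package (an assumption; any tokenizer library works the same way):

```python
# Tokenize a sentence and compare token count to word count.
# Assumes the tiktoken package is installed (pip install tiktoken).
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")  # BPE encoding used by recent OpenAI models

text = "Large language models convert text into tokens before processing it."
tokens = encoder.encode(text)

print(tokens[:5])         # the first few token IDs (integers)
print(len(text.split()))  # word count
print(len(tokens))        # token count, often around 1.3x the word count for English
```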
- Name: Parameters
- Type: Internalized knowledge
- Description: Parameters represent the learned patterns and relationships between tokens within the training data. ML engineers convert massive datasets into tokenized training data. Commonly used datasets include The Pile, CommonCrawl, and OpenAssistant Conversations, as well as websites such as Reddit, StackOverflow, and GitHub.
- Name: Training
- Type: Model computation
- Description: Training is the process of converting tokenized content into model parameters; the result is a reusable model for inference. The model is fed sequences of training tokens and learns to predict the next token in each sequence. The goal is to tune the model's parameters so that it produces accurate and contextually appropriate responses.
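The next-token objective can be sketched in a few lines. This is a simplified PyTorch illustration (a real training loop adds batching, mixed precision, and distributed compute), where `model` stands in for any causal language model that maps token IDs to logits:

```python
# Next-token prediction: shift the sequence by one position and minimize
# cross-entropy between the predicted distribution and the actual next token.
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, seq_len) integer tensor of training tokens
    inputs = token_ids[:, :-1]   # tokens the model sees
    targets = token_ids[:, 1:]   # the "next token" at each position
    logits = model(inputs)       # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

# Each optimization step backpropagates this loss to adjust the parameters:
# loss = next_token_loss(model, batch); loss.backward(); optimizer.step()
```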
- Name: Model Size
- Type: Number of parameters
- Description: The number of parameters is the typical measurement of model size. State-of-the-art models (GPT-4, PaLM 2) trend toward hundreds of billions of parameters or more, while emerging open-source models (MPT, Falcon) trend between 7B and 65B parameters.
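Parameter counts can be inspected directly. A small sketch using the Hugging Face transformers library, with GPT-2 (~124M parameters) standing in for a larger checkpoint purely because it downloads quickly:

```python
# Count a model's parameters with Hugging Face transformers.
# GPT-2 is used only because it is small; the same lines work for 7B-65B
# open-source checkpoints given enough memory and disk.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.0f}M parameters")
```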
Model Inference
Inference is the process of using a trained LLM to generate new content. The model is loaded into memory, new data is presented in the form of a prompt, and the model generates a completion. The size of the context window has a significant impact on how much information the model can take into account when generating.
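A minimal inference sketch with the Hugging Face transformers text-generation pipeline; the checkpoint name is illustrative, and any causal language model can be swapped in:

```python
# Inference: load a trained model, feed it a prompt, and sample a completion.
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")  # small example checkpoint

prompt = "Large language models are"
result = generator(prompt, max_new_tokens=40, do_sample=True)

print(result[0]["generated_text"])  # prompt followed by the generated completion
```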
- Name: Context Window
- Type: Total tokens at inference
- Description: The context window is the total number of tokens used during inference, including both the input prompt and the generated output. Early versions of GPT-3 and most open-source models have a context window of 2,048-4,096 tokens, while GPT-3.5 Turbo supports up to 16k, GPT-4 Turbo 128k, and Claude 2.1 200k.
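A quick way to reason about the window is to check that prompt tokens plus the requested completion fit inside it; a sketch assuming the tiktoken tokenizer from earlier and an illustrative 4,096-token limit:

```python
# Budget check: prompt tokens + requested completion tokens must fit in the window.
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

def fits_in_context(prompt, max_completion_tokens, context_window=4096):
    prompt_tokens = len(encoder.encode(prompt))
    return prompt_tokens + max_completion_tokens <= context_window

print(fits_in_context("Summarize the attached report.", max_completion_tokens=500))  # True
```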
- Name: Prompt
- Type: Initial model input
- Description: A prompt provides the initial input that steers the model's response in a particular direction. Like setting the stage, a prompt focuses the model on a specific topic, style, or genre, narrowing the model's internal search space.
- Name: Completion
- Type: Model-generated response
- Description: The completion is the text generated by the model in response to the prompt. The length and variability of the completion depend on the prompt and on model configuration parameters such as temperature and max tokens.
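To see how configuration parameters shape the completion, here is a hedged sketch using transformers' generate() with two temperature settings; the checkpoint and parameter values are illustrative:

```python
# Temperature and max-token settings control completion length and variability.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

inputs = tokenizer("Write a haiku about databases:", return_tensors="pt")

# Low temperature: safer, more predictable text. High temperature: more varied output.
for temperature in (0.2, 1.2):
    output_ids = model.generate(
        **inputs,
        max_new_tokens=30,                    # caps the completion length
        do_sample=True,
        temperature=temperature,
        pad_token_id=tokenizer.eos_token_id,  # silences the missing-pad-token warning
    )
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```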
Hardware Requirements
Large language models require substantial computational resources for both training and inference. A cloud GPU server costs between $100 and $1,000 per day depending on the number of GPUs and the amount of memory.
Training LLMs typically requires multiple high-end GPUs, such as NVIDIA's A100 or H100, which offer large memory capacity (80 GB per card) and strong processing capabilities. Meta's LLaMA model (65B parameters) was trained on 2,048 A100 80 GB GPUs over 21 days. Renting 2,048 A100 GPUs from AWS (256 P4DE instances) for 21 days would cost approximately $3.8M.
Inference breakthroughs enable running LLMs locally on Apple M1/M2 hardware.
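The LLaMA rental estimate above is a simple back-of-the-envelope calculation; the hourly rate below is an assumed placeholder chosen to reproduce the ~$3.8M figure, and real AWS pricing varies by region and reservation term:

```python
# Back-of-the-envelope GPU rental cost for a LLaMA-65B-scale training run.
GPUS_NEEDED = 2048
GPUS_PER_INSTANCE = 8      # P4DE instances carry 8x A100 80 GB
HOURLY_RATE_USD = 29.5     # assumed per-instance rate (placeholder, not a quoted price)
TRAINING_DAYS = 21

instances = GPUS_NEEDED // GPUS_PER_INSTANCE          # 256 instances
cost = instances * HOURLY_RATE_USD * 24 * TRAINING_DAYS
print(f"{instances} instances, ~${cost / 1e6:.1f}M")  # ~$3.8M at this rate
```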
- Name: Training
- Type: Millions of dollars
- Description: Training hardware requirements are a function of parameter count, batch size, and training time. The compute used in the largest training runs has doubled roughly every 3.4 months, resulting in significant breakthroughs each year. GPT-3 cost roughly $5M to train in 2020 but would cost around $500k in 2023, primarily due to software improvements.
- Name: Finetuning
- Type: Thousands of dollars
- Description: Finetuning hardware requirements are a function of model size, batch size, and finetuning time. Finetuning takes a pre-trained model and trains it further on a new dataset; it is significantly faster and cheaper than training from scratch. The LoRA (Low-Rank Adaptation) method enables finetuning a 65B-parameter model on a new instruction-following dataset in hours to days. Finetuning requires roughly 12x the GPU memory of the model size.
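A hedged sketch of a LoRA setup using the Hugging Face peft library; GPT-2 and the `c_attn` target module are illustrative stand-ins for a larger instruction-tuning run:

```python
# LoRA: freeze the pretrained weights and train small low-rank adapter matrices.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the adapter matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection layer in GPT-2
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```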
- Name: Inference
- Type: Hundreds of dollars
- Description: Inference hardware requirements are a function of model size, context window, and the number of concurrent inference requests. Inference typically requires about 2.1x the GPU memory of the model size, since both the model parameters and the intermediate activations of the forward pass must be stored. Quantization breakthroughs enable running LLMs on significantly smaller hardware, requiring as little as 70% of the GPU memory of the model size.
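The rules of thumb above can be turned into a quick estimator. The 2.1x and 0.7x multipliers come from this section; the assumption that "model size" means the fp16 weight footprint (~2 bytes per parameter) is mine:

```python
# Rough GPU memory estimate for serving a model, based on the multipliers above.
def inference_memory_gb(params_billions, quantized=False):
    model_size_gb = params_billions * 2     # fp16 weights: ~2 bytes per parameter (assumption)
    multiplier = 0.7 if quantized else 2.1  # rules of thumb quoted in this section
    return model_size_gb * multiplier

print(f"13B model, fp16:      ~{inference_memory_gb(13):.0f} GB")
print(f"13B model, quantized: ~{inference_memory_gb(13, quantized=True):.0f} GB")
```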