What is Prompt Engineering? A Complete Guide for Beginners

Andrej Karpathy Prompt Engineering November 2, 2025 | 0

Complete Guide to LLM Comparison Parameters

Model Size (Parameters)
Training Data Size
Context Window Length
Performance on Benchmarks
Inference Speed
Cost (Training/Inference)
Modality Support
Openness (Open-Source vs. Proprietary)
Safety and Alignment
Energy Efficiency
Fine-Tuning Capabilities
Bias and Fairness Metrics
Robustness to Adversarial Inputs

Complete Analysis: This comprehensive guide explores all 13 critical parameters for evaluating Large Language Models. Each section includes technical details, practical implications, and authoritative references to help you make informed decisions when selecting LLMs for your applications.

Model Size (Parameters)

Model size, measured in parameters, represents the number of learnable weights in a neural network. These parameters encode the model's knowledge and determine its capacity for understanding and generating human-like text. Early transformer models contained hundreds of millions of parameters, while modern LLMs have scaled to trillions through architectural innovations.

Small Models (1-7B)

Large Models (500B+)

Architectural Innovation: Mixture of Experts

Modern models like DeepSeek-V3 (671B parameters) use Mixture-of-Experts architectures where only a fraction of parameters activate for each token. This approach maintains massive knowledge capacity while optimizing inference efficiency. While DeepSeek-V3 has 671B total parameters, it only uses 37B active parameters per token, dramatically reducing computational requirements.

Parameter Range	Typical Use Cases	Inference Hardware	Examples
1-7 Billion	Mobile apps, edge computing, personal assistants	Consumer GPUs, mobile chips	Llama 3 8B, Mistral 7B, Phi-3
8-70 Billion	Enterprise applications, specialized tools, research	Single high-end GPU	GPT-4, Claude 3, Gemini Pro
100+ Billion	Research, massive-scale applications, foundation models	Multi-GPU clusters	GPT-4 Turbo, Claude 3 Opus, Gemini Ultra

References & Further Reading

Training Data Size

The quantity, quality, and diversity of training data fundamentally shape model capabilities. While early LLMs trained on hundreds of gigabytes of text, contemporary models consume tens of terabytes from web pages, books, academic papers, code repositories, and scientific literature.

Training Data Scale Evolution (Tokens)

GPT-3 (2020)

300B tokens

Llama 2 (2023)

2T tokens

Mistral 8x22B (2024)

8.5T tokens

Gemini Ultra (2024)

13T+ tokens

Key Insight: Data Quality vs. Quantity

Recent research demonstrates that carefully curated, high-quality datasets often outperform larger but noisier training corpora. The trend is shifting toward sophisticated data filtering, deduplication, and synthetic data generation to overcome data scarcity in high-quality domains.

Context Window Length

Context window determines how much information a model can process and reference in a single interaction. From early transformers with 512-token limits to modern architectures handling millions of tokens, this parameter has seen the most dramatic scaling.

Short Context (4K tokens)

Extended Context (1M+ tokens)

Technical Innovation: Efficient Attention Mechanisms

Traditional transformer attention scales quadratically with sequence length, making long contexts computationally prohibitive. Innovations like FlashAttention, Ring Attention, and sliding window approaches have enabled orders-of-magnitude context expansion.

Context Length	Practical Applications	Technical Requirements	Example Models
4K-8K tokens	Brief conversations, email responses, short documents	Standard transformer, minimal optimization	GPT-3.5, Llama 2 7B
32K-128K tokens	Legal documents, technical manuals, codebases	Optimized attention, memory management	Claude 3, GPT-4 Turbo
200K-1M+ tokens	Research papers, books, complex multi-document analysis	Advanced attention, distributed processing	Gemini 1.5 Pro, Claude 3.5

Performance on Benchmarks

Standardized benchmarks provide objective measures of model capabilities across reasoning, knowledge, coding, and safety domains. However, benchmark performance must be interpreted critically, as optimization for specific tests doesn't always translate to real-world capability.

MMLU (Massive Multitask Language Understanding) Performance

GPT-4 (2023)

86.4%

Gemini Ultra

90.0%

Claude 3 Opus

88.1%

Benchmark Limitations

Many benchmarks suffer from "test set contamination" where models have seen evaluation data during training. Additionally, specialized prompting techniques and benchmark-specific optimizations can inflate scores without corresponding improvements in general capabilities.

Inference Speed

Inference speed measures how quickly a model can process input and generate output, typically measured in tokens per second (t/s). This parameter is crucial for real-time applications like chatbots, translation services, and interactive tools where latency directly impacts user experience.

Factors Affecting Inference Speed

Model architecture, parameter count, hardware optimization, batch size, and context length all significantly impact inference performance. Quantization and model distillation techniques can dramatically improve speed with minimal quality loss.

Optimization Techniques

FlashAttention, KV caching, speculative decoding, and model quantization are among the most effective methods for improving inference speed without sacrificing model quality.

Model Size	Typical Speed (t/s)	Hardware Requirements	Optimization Potential
7B parameters	50-100 t/s	Single GPU (e.g., RTX 4090)	High (4-bit quantization)
13B-34B parameters	20-50 t/s	High-end GPU (e.g., A100)	Medium (8-bit quantization)
70B+ parameters	5-20 t/s	Multi-GPU setup	Low (model parallelism)

References

Cost (Training/Inference)

The cost of developing and deploying LLMs includes both training expenses (one-time, substantial investment) and inference costs (ongoing, usage-based). Understanding these economics is crucial for business viability and resource allocation.

Training Cost Breakdown

Training a large foundation model can cost millions of dollars in compute resources alone. For example, GPT-4's training is estimated to have cost over $100 million. These costs include cloud computing fees, data acquisition, engineering time, and electricity.

API Pricing Comparison (per 1M tokens)

GPT-4 Turbo (input)

$10.00

Claude 3 Opus (input)

$15.00

Gemini Ultra (input)

$7.50

Llama 3 70B (self-hosted)

~$2.50*

*Estimated infrastructure and electricity costs

Cost Optimization Strategies

Effective cost management involves selecting the right model size for the task, implementing caching strategies, using smaller models for simpler tasks, and considering self-hosting for high-volume applications. Model quantization and efficient batching can reduce inference costs by 2-4x.

Modality Support

Modality support refers to a model's ability to process and generate different types of data beyond text, including images, audio, video, and structured data. Multimodal models can understand relationships across different data types, enabling more sophisticated applications.

Text-Only Models

Traditional LLMs that process and generate only text. Examples include GPT-3, Llama 2, and Mistral models. These remain highly effective for language-focused tasks.

Vision-Language Models

Models that understand both images and text, enabling tasks like visual question answering, image captioning, and document understanding. Examples include GPT-4V, Gemini Pro Vision, and LLaVA.

Audio-Visual-Language Models

Advanced models capable of processing text, images, and audio simultaneously. These can perform complex tasks like video description, audio transcription with context, and multimodal reasoning.

Modality Support	Key Capabilities	Example Models	Common Applications
Text-only	Text generation, translation, summarization	GPT-3.5, Llama 2, Claude Instant	Chatbots, content creation, coding assistants
Text + Image	Visual question answering, image description	GPT-4V, Gemini Pro, LLaVA	Document analysis, accessibility tools, e-commerce
Text + Image + Audio	Multimodal reasoning, video understanding	Gemini Ultra, GPT-4o	Video analysis, interactive education, complex AI assistants

References

Openness (Open-Source vs. Proprietary)

The openness spectrum ranges from fully open-source models with permissive licenses to completely proprietary systems with restricted access. This dimension affects customization, transparency, cost, and deployment options.

Fully Open Source

Fully Proprietary

Open Source Advantages

Full transparency, customization capabilities, no vendor lock-in, community contributions, and self-hosting options. Examples: Llama 2, Mistral, Falcon models.

Proprietary Advantages

Often superior performance, managed infrastructure, enterprise support, compliance certifications, and integrated ecosystems. Examples: GPT-4, Claude, Gemini.

License Type	Commercial Use	Modification Rights	Example Models
Permissive Open Source	Allowed	Full rights	Mistral 7B, Falcon 180B
Research/Non-commercial	Restricted	Full rights	LLaMA 2 (initially), BLOOM
Proprietary API	Via API only	No access	GPT-4, Claude 3, Gemini Ultra

Strategic Considerations

The choice between open-source and proprietary models depends on factors like data privacy requirements, customization needs, budget constraints, and technical expertise. Many organizations adopt a hybrid approach, using proprietary models for cutting-edge capabilities and open-source models for specific, customized applications.

Safety and Alignment

Safety and alignment refer to techniques and measures that ensure AI systems behave helpfully, honestly, and harmlessly according to human values and intentions. This includes preventing harmful outputs, reducing bias, and ensuring reliable behavior.

Alignment Techniques

Reinforcement Learning from Human Feedback (RLHF), Constitutional AI, red teaming, and value learning are primary methods for aligning LLMs with human values. These techniques help models refuse harmful requests, provide accurate information, and maintain appropriate boundaries.

Reinforcement Learning from Human Feedback (RLHF)

A three-step process involving supervised fine-tuning, reward model training, and reinforcement learning optimization. Used by OpenAI and Anthropic to align models with human preferences.

Constitutional AI

Anthropic's approach where models are trained to follow a set of principles or a "constitution" that guides their behavior, with minimal human feedback in the training loop.

Red Teaming

Systematic testing where dedicated teams attempt to make models produce harmful outputs, with results used to improve safety measures and training data.

Safety Evaluation Metrics

Standard safety evaluations measure refusal rates for harmful requests, truthfulness in responses, adherence to instructions, and resistance to jailbreaking attempts. Models are tested across categories like harassment, illegal advice, misinformation, and privacy violations.

References

Energy Efficiency

Energy efficiency measures the computational resources required for training and inference relative to model performance. As LLMs grow larger, their environmental impact and operational costs have become significant considerations.

Estimated Training Energy Consumption (GPT-3 Equivalent)

Small Model (7B)

~300 MWh

Medium Model (70B)

~1,000 MWh

Large Model (500B+)

~2,500+ MWh

Sustainable AI Practices

Efficient model architectures, sparsity, quantization, and renewable energy-powered data centers are reducing the environmental impact of LLMs. The AI community is increasingly focused on developing more efficient training methods and inference optimization techniques.

Efficiency Strategy	Energy Reduction	Performance Impact	Implementation Complexity
Model Quantization	2-4x	Minimal (with advanced methods)	Low
Architecture Optimization	3-5x	None (can improve)	High
Mixture of Experts	5-10x	None (can improve)	High
Knowledge Distillation	2-3x	Small degradation	Medium

Fine-Tuning Capabilities

Fine-tuning allows adapting pre-trained foundation models to specific domains, tasks, or styles through additional training on specialized datasets. This process dramatically improves performance on target tasks while maintaining general capabilities.

Full Fine-Tuning

Updates all model parameters using domain-specific data. Most effective but computationally expensive. Requires significant resources and carries risk of catastrophic forgetting.

Parameter-Efficient Fine-Tuning (PEFT)

Techniques like LoRA (Low-Rank Adaptation) that update only a small subset of parameters. Dramatically reduces computational requirements while maintaining most of the performance benefits.

Instruction Tuning

Training models to follow instructions and conversational patterns. Crucial for creating helpful chat assistants that understand and respond appropriately to user requests.

Fine-Tuning Method	Compute Requirements	Data Requirements	Typical Use Cases
Full Fine-Tuning	High (similar to pre-training)	Large (10K+ examples)	Domain adaptation, major capability shifts
LoRA / PEFT	Low (10-20% of full)	Medium (1-10K examples)	Task specialization, style adaptation
Prompt Tuning	Very Low (1-5% of full)	Small (100-1K examples)	Light customization, rapid prototyping

References

Bias and Fairness Metrics

Bias in LLMs refers to systematic errors or unfair treatment of certain groups, often reflecting biases present in training data. Measuring and mitigating bias is crucial for developing fair, equitable AI systems.

Sources of Bias

Training data imbalances, societal stereotypes reflected in text, annotation biases, and modeling choices can all introduce or amplify biases. These can manifest as demographic disparities, stereotyping, or unequal treatment across groups.

Demographic Bias

Performance differences across demographic groups in tasks like sentiment analysis, toxicity detection, or question answering. Measured using datasets like BOLD and BBQ.

Representation Bias

Unequal representation or stereotyping of social groups in model generations. Assessed through prompt-based tests measuring stereotype reinforcement.

Allocational Bias

Unequal allocation of resources or opportunities in model recommendations or decisions. Particularly important for high-stakes applications like hiring or lending.

Bias Category	Measurement Approach	Common Benchmarks	Mitigation Strategies
Demographic	Performance disparity analysis	BOLD, BBQ, CrowS-Pairs	Data balancing, adversarial training
Representational	Sterotype measurement in generations	StereoSet, SEAT	Counter-stereotypical training, debiasing
Allocational	Decision fairness across groups	Custom task-specific evaluations	Fairness constraints, equalized odds

References

Robustness to Adversarial Inputs

Robustness measures how well models maintain performance and safety when faced with unusual, noisy, or deliberately adversarial inputs. This includes resistance to prompt injection, jailbreaking, and distribution shifts.

Adversarial Attacks

Deliberately crafted inputs designed to bypass model safeguards or produce incorrect outputs. Common techniques include prompt injection, jailbreaking, and semantic perturbations.

Distribution Shifts

Performance degradation when models encounter inputs different from their training distribution. This includes domain adaptation challenges and temporal distribution shifts.

Defense Strategies

Adversarial training, input sanitization, ensemble methods, and anomaly detection can improve robustness. Regular red teaming helps identify and address vulnerabilities.

Vulnerability Type	Example Attacks	Impact	Defense Approaches
Jailbreaking	Role-playing, encoding, special tokens	Safety bypass, harmful content generation	RLHF, constitutional AI, input filtering
Prompt Injection	Instruction overwriting, context poisoning	Unauthorized actions, data extraction	Prompt separation, privilege control
Semantic Attacks	Paraphrasing, typographical errors	Performance degradation, incorrect outputs	Data augmentation, adversarial training

Robustness Evaluation

Comprehensive robustness evaluation includes testing on out-of-distribution data, adversarial examples, and edge cases. Benchmarks like AdvGLUE and ANLI provide standardized ways to measure model robustness across different types of challenges.

References

Key Insights for Model Selection

Choosing the right LLM requires balancing multiple competing parameters based on specific use cases. No single model excels across all dimensions, and the optimal choice depends on deployment constraints, performance requirements, and cost considerations.

For real-time applications: Prioritize inference speed and cost over maximum capability. For research and analysis: Focus on context length and reasoning benchmarks. For production systems: Consider fine-tuning capabilities and robustness. As the field evolves, new architectures and training approaches continue to reshape these trade-offs.

Andrej Karpathy

Andrej Karpathy is a computer scientist and AI educator known for his work in deep learning and computer vision. A founding member of OpenAI and former Director of AI at Tesla, he now leads Eureka Labs, advancing AI education through courses like LLM101n and his “Zero to Hero” series.

What is Prompt Engineering? A Complete Guide for Beginners

Table of Contents

Model Size (Parameters)

Architectural Innovation: Mixture of Experts

References & Further Reading

Training Data Size

Training Data Scale Evolution (Tokens)

Key Insight: Data Quality vs. Quantity

Context Window Length

Technical Innovation: Efficient Attention Mechanisms

Performance on Benchmarks

MMLU (Massive Multitask Language Understanding) Performance

Benchmark Limitations

Inference Speed

Factors Affecting Inference Speed

Optimization Techniques

References

Cost (Training/Inference)

Training Cost Breakdown

API Pricing Comparison (per 1M tokens)

Cost Optimization Strategies

Modality Support

Text-Only Models

Vision-Language Models

Audio-Visual-Language Models

References

Openness (Open-Source vs. Proprietary)

Open Source Advantages

Proprietary Advantages

Strategic Considerations

Safety and Alignment

Alignment Techniques

Reinforcement Learning from Human Feedback (RLHF)

Constitutional AI

Red Teaming

Safety Evaluation Metrics

References

Energy Efficiency

Estimated Training Energy Consumption (GPT-3 Equivalent)

Sustainable AI Practices

Fine-Tuning Capabilities

Full Fine-Tuning

Parameter-Efficient Fine-Tuning (PEFT)

Instruction Tuning

References

Bias and Fairness Metrics

Sources of Bias

Demographic Bias

Representation Bias

Allocational Bias

References

Robustness to Adversarial Inputs

Adversarial Attacks

Distribution Shifts

Defense Strategies

Robustness Evaluation

References

Key Insights for Model Selection

Andrej Karpathy

Leave a Reply Cancel reply

Inside the Hub

Terms & Policies

Resources

Perfect Prompt Hub