Understanding LLMs in LangChain

LangChain connects your code to large language models (LLMs). Before you build anything serious, you need a solid understanding of what these models are, how they work at a basic level, and how LangChain talks to them. This knowledge prevents confusion when you see unexpected behavior and helps you make better decisions when building applications.

What Is a Large Language Model

A large language model is a computer program trained on massive amounts of text: books, websites, articles, code, and more. During training, the model learns patterns in human language: how words relate to each other, how sentences form, how ideas connect. After training, the model predicts the most likely next token (roughly, the next word or word fragment) given the text that came before it. It repeats this prediction one token at a time, appending each result to the input, until the response is complete.

The Autocomplete Analogy

Your phone's autocomplete suggests the next word as you type. An LLM does the same thing, but at a scale and sophistication level that makes the output sound like a thoughtful, knowledgeable human wrote it.

Autocomplete on your phone:
  "I want to go to the" → "store"  (simple, one word)

LLM completing a prompt:
  "Explain how photosynthesis works" →
  [generates a full, accurate, well-structured explanation
   drawing on patterns learned from thousands of science texts]

The fundamental mechanism is the same. The difference is scale, training data, and model architecture.
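To make the "predict the next word" idea concrete, here is a toy sketch. It counts which word follows which in a tiny made-up corpus and suggests the most frequent follower. The function names and the corpus are invented for illustration; a real LLM uses a neural network over tokens, not a lookup table of word pairs.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    words = text.lower().split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    candidates = follows.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Tiny made-up corpus: "the" is followed by "store" twice and "park" once
corpus = "i want to go to the store . i want to go to the park . we walk to the store ."
model = train_bigrams(corpus)
print(predict_next(model, "the"))  # prints "store"
```

Generating a whole sentence is the same step in a loop: predict a word, append it, and predict again from the new ending.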

Tokens: The Unit of Text LLMs Use

LLMs do not process text character by character or word by word. They process tokens. A token is roughly 0.75 words on average, but the exact size varies. Common words like "the" and "is" are single tokens. Longer or rarer words get split into multiple tokens. Spaces and punctuation can also be tokens.

"LangChain is powerful"
 ─────────────────────
 Token 1: "Lang"
 Token 2: "Chain"
 Token 3: " is"
 Token 4: " powerful"
 = 4 tokens total

"Hello"
 ──────
 Token 1: "Hello"
 = 1 token

Why does this matter? API pricing is based on tokens, and the context window limit (how much text the model can process at once) is measured in tokens too. When you build a LangChain application that feeds documents into the model, you need to know roughly how many tokens your text uses so you do not exceed the model's limit.

Quick Token Estimation

Text Length                      Approximate Tokens
─────────────────────────────────────────────────────────
1 page of text (~500 words)      ~650 tokens
Short novel (~75,000 words)      ~100,000 tokens
Long document                    May exceed model limit
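If you want a quick estimate in code, the 0.75 words-per-token rule of thumb can be turned into a small helper. This is only a rough sketch: the function name is made up, it counts whitespace-separated words, and a real tokenizer for your specific model will give different numbers.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb from this lesson; real tokenizers vary

def estimate_tokens(text):
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    word_count = len(text.split())
    return math.ceil(word_count / WORDS_PER_TOKEN)

print(estimate_tokens("LangChain is powerful"))  # 3 words, so the estimate is 4
```

Use an estimate like this for early planning only. When exact counts matter (billing, hard context limits), count with the tokenizer that matches your model.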

Two Types of Models in LangChain

LangChain distinguishes between two main model types. Understanding the difference prevents confusion when reading documentation or code examples.

LLMs (Text Completion Models)

The older style. You send a text string in and get a text string back. The model treats the whole interaction as a text completion task. Example: you send "The capital of France is" and it completes the sentence with "Paris."

Chat Models

The modern style. You send a structured list of messages, each labeled as coming from the system, the human user, or the AI assistant. The model understands the conversation structure and responds accordingly. ChatGPT uses this format. Most modern applications use Chat Models.

LLM (older style):
  Input:  "Translate 'hello' to Spanish:"
  Output: "hola"

Chat Model (modern style):
  Messages sent:
  ┌─────────────────────────────────────────────┐
  │ system:    "You are a helpful translator."  │
  │ human:     "Translate 'hello' to Spanish."  │
  └─────────────────────────────────────────────┘
  Output:
  ┌─────────────────────────────────────────────┐
  │ assistant: "The Spanish word for 'hello'    │
  │             is 'hola'."                     │
  └─────────────────────────────────────────────┘

This course focuses on Chat Models because they are the standard for modern applications. LangChain provides a consistent interface for both types, so the skills transfer easily.
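The difference in input shape can be sketched in plain Python. The helper below flattens a list of role-labeled messages into the single string a completion-style model would receive. The function name and the "role: text" layout are assumptions made for illustration; this is not how LangChain or any provider actually formats prompts.

```python
def flatten_for_completion(messages):
    """Join (role, text) pairs into one prompt string for a completion-style model."""
    lines = [f"{role}: {text}" for role, text in messages]
    lines.append("assistant:")  # leave the last turn open for the model to complete
    return "\n".join(lines)

# Chat-style input: a structured list of role-labeled messages
chat_input = [
    ("system", "You are a helpful translator."),
    ("human", "Translate 'hello' to Spanish."),
]

# Completion-style input: one plain string
completion_input = flatten_for_completion(chat_input)
print(completion_input)
```

With a Chat Model you never do this flattening yourself. You pass the structured list and the model handles the roles.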

How LangChain Connects to AI Models

LangChain uses a concept called model integrations. Each AI provider (OpenAI, Google, Anthropic, Mistral, Hugging Face) has a separate LangChain package that handles the specific details of that provider's API. Your code calls a standard LangChain interface, and the integration handles the translation to that provider's format.

Your LangChain Code
        │
        │  model.invoke(messages)
        │
        ▼
┌─────────────────┐
│  LangChain      │
│  Interface      │
│  (ChatModel)    │
└────────┬────────┘
         │
    ┌────┴─────────────────────────────┐
    │                                  │
    ▼                                  ▼
OpenAI Integration            Google Integration
(langchain-openai)            (langchain-google-genai)
    │                                  │
    ▼                                  ▼
OpenAI API                    Google Gemini API
(sends HTTP request)          (sends HTTP request)
    │                                  │
    ▼                                  ▼
GPT-4 Model                   Gemini Pro Model

This architecture means your application logic does not change when you switch AI providers. You change the import and the line that creates the model (the class name and model string), supply that provider's API key, and everything else stays the same.
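The idea behind this design can be sketched without any network calls. The two classes below are hypothetical stand-ins, not real LangChain integrations, and they return plain strings where a real chat model returns a message object. What matters is that the application function only depends on the shared invoke() method.

```python
from typing import Protocol

class ChatModel(Protocol):
    """The shared interface: every integration exposes invoke(messages)."""
    def invoke(self, messages: list) -> str: ...

class FakeOpenAIChat:
    """Hypothetical stand-in for a provider integration; makes no network call."""
    def __init__(self, model: str):
        self.model = model

    def invoke(self, messages: list) -> str:
        return f"[{self.model}] reply to: {messages[-1][1]}"

class FakeGeminiChat:
    """A second hypothetical integration with the same invoke() signature."""
    def __init__(self, model: str):
        self.model = model

    def invoke(self, messages: list) -> str:
        return f"[{self.model}] reply to: {messages[-1][1]}"

def answer(model: ChatModel, question: str) -> str:
    """Application logic: depends only on the shared interface."""
    messages = [("system", "You are a helpful assistant."), ("human", question)]
    return model.invoke(messages)

model = FakeOpenAIChat(model="gpt-4o")         # swap this line...
# model = FakeGeminiChat(model="gemini-pro")   # ...to change provider
print(answer(model, "What is a token?"))
```

The answer() function never mentions a provider, which is why swapping the model object does not touch the rest of the code.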

The Three Message Types in Chat Models

When you send messages to a Chat Model, each message has a role. LangChain uses specific classes for each role.

SystemMessage

Sets the behavior and persona of the AI for the entire conversation. This is where you tell the AI to act as a customer service agent, a Python expert, a friendly tutor, or anything else your application needs. The system message is usually set once and stays fixed throughout the conversation.

HumanMessage

Represents text coming from the user. Every question, request, or input from the end user of your application becomes a HumanMessage.

AIMessage

Represents the AI's previous responses. When you implement memory (covered in a later topic), you include past AIMessages in the conversation list so the model knows what it already said.

A conversation in LangChain:

[
  SystemMessage("You are a helpful cooking assistant."),
  HumanMessage("What goes well with garlic?"),
  AIMessage("Garlic pairs well with olive oil, tomatoes, and herbs..."),
  HumanMessage("Can I add it to pasta?"),
]

The model sees the full list and generates the next reply
with the full context of the conversation.
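Here is a small sketch of how such a list grows over a conversation. To keep it runnable without installing anything, it uses (role, text) tuples instead of the message classes. To my understanding, LangChain chat models also accept this tuple shorthand with the roles "system", "human", and "ai", but treat that as an assumption to verify. The add_turn helper is made up for this example.

```python
def add_turn(history, human_text, ai_text):
    """Return a new history with one human/AI exchange appended."""
    return history + [("human", human_text), ("ai", ai_text)]

# The system message is set once at the start
history = [("system", "You are a helpful cooking assistant.")]

# After the first exchange, store both the question and the model's reply
history = add_turn(
    history,
    "What goes well with garlic?",
    "Garlic pairs well with olive oil, tomatoes, and herbs...",
)

# The new question goes last; the earlier turns travel with it as context
next_request = history + [("human", "Can I add it to pasta?")]

for role, text in next_request:
    print(f"{role}: {text}")
```

The model itself stores nothing between calls. It only "remembers" the garlic question because the earlier turns are sent again with the new one.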

Key Model Parameters

When you create a model object in LangChain, you can configure several parameters that control the model's behavior.

model (required)

The specific model version to use. Examples: "gpt-4o", "gpt-3.5-turbo", "gemini-pro". Each model has different capabilities, speed, and cost.

temperature

Controls how random the output is. Values typically range from 0 to 1 (some models allow up to 2). A temperature of 0 makes the model nearly deterministic: it almost always gives the same answer to the same question. A temperature of 0.9 makes the output more varied and creative.

temperature = 0.0
  Question: "What is 2 + 2?"
  Answer:   "4"   (almost always the same)

temperature = 0.9
  Question: "Write a sentence about coffee."
  Answer:   (different every time, more creative)

For factual tasks (answering questions about your documents, extracting data), use a low temperature. For creative tasks (writing, brainstorming, generating ideas), use a higher temperature.
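To see why a low temperature gives steadier answers, it helps to look at the arithmetic. The sketch below applies the usual softmax-with-temperature formula to made-up scores for three candidate tokens. This is a common way to implement temperature, shown here as an illustration; how a given provider samples internally is not covered in this lesson.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; temperature 0 is treated as pure argmax."""
    if temperature == 0:
        best = max(range(len(scores)), key=lambda i: scores[i])
        return [1.0 if i == best else 0.0 for i in range(len(scores))]
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0, 0.2, 0.9):
    probs = softmax_with_temperature(scores, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all of the probability lands on the top candidate, so the model keeps picking it. At a higher temperature the other candidates get a real chance, which is where the variety comes from.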

max_tokens

Limits the length of the model's response. If the model reaches this limit, it stops mid-sentence if necessary. Setting this prevents unexpectedly long (and expensive) responses. A good default for most conversational tasks is 1000 to 2000 tokens.
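Because a response can be cut off at the limit, it is useful to check for that in code. The sketch below assumes OpenAI-style metadata, where the dictionary on response.response_metadata carries a "finish_reason" key whose value is "length" when the token limit ended the reply. Other providers may use different keys or values, and the helper name is made up.

```python
def was_truncated(response_metadata):
    """True if the metadata reports that generation stopped at the token limit."""
    return response_metadata.get("finish_reason") == "length"

# Hand-written dictionaries standing in for response.response_metadata
cut_off = {"finish_reason": "length", "model_name": "gpt-4o"}
complete = {"finish_reason": "stop", "model_name": "gpt-4o"}

print(was_truncated(cut_off))   # True
print(was_truncated(complete))  # False
```

If replies are truncated often, raise max_tokens or ask the model for shorter answers in the system message.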

timeout

How many seconds to wait for a response before giving up. Useful in production applications where you cannot let a slow API response freeze your entire application.

from langchain_openai import ChatOpenAI

model = ChatOpenAI(
    model="gpt-4o",
    temperature=0.2,
    max_tokens=1000,
    timeout=30
)

Synchronous vs Streaming Responses

By default, LangChain waits for the model to generate the complete response before returning it to your code. This is called synchronous (blocking) behavior. For short responses it feels instant. For long responses, there is a noticeable wait.

Streaming sends the response back token by token (roughly word by word) as it is generated, just like how ChatGPT types out answers progressively. LangChain supports streaming with a small change to how you call the model.

Standard (waits for full response):
  model.invoke(messages)
  → waits 3 seconds →
  → returns full text at once

Streaming (returns as it generates):
  model.stream(messages)
  → returns word... by... word...
  → (user sees text appearing in real time)

For command-line scripts, standard invoke is fine. For web applications where users watch a chatbox, streaming creates a much better user experience. LangChain makes switching between the two trivial.
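The consuming side of streaming is just a loop over chunks. The sketch below uses a hypothetical fake_stream generator in place of a real model so it runs offline. With a real chat model you would loop over model.stream(messages) and print each chunk's content attribute instead.

```python
import time

def fake_stream(text, delay=0.0):
    """Hypothetical stand-in for model.stream(): yields the reply in small pieces."""
    for piece in text.split(" "):
        time.sleep(delay)  # simulate generation time per chunk
        yield piece + " "

def consume_stream(chunks):
    """Print each chunk as soon as it arrives and return the full reply."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)  # flush so the text shows up immediately
        parts.append(chunk)
    print()
    return "".join(parts).strip()

full_reply = consume_stream(fake_stream("Garlic works well in pasta."))
```

Collecting the chunks while printing them gives you both the live display and the complete text to store afterwards.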

Model Comparison: Which to Use

Model           Speed    Cost    Best For
──────────────────────────────────────────────────────
GPT-3.5-turbo   Fast     Low     Simple tasks, testing
GPT-4o          Medium   Medium  Most production uses
GPT-4           Slow     High    Complex reasoning
Gemini Pro      Fast     Free*   Budget-conscious users
Claude Haiku    Fast     Low     Speed-critical apps
Claude Sonnet   Medium   Medium  Balanced performance

* Free tier with limits

Start with GPT-3.5-turbo or Gemini Pro while learning. They are fast and cheap (or free), which lets you experiment without worrying about API costs. Switch to a more capable model when your application's needs demand it.

Context Window: The Model's Working Memory

Every LLM has a context window — the maximum amount of text it can process in a single call. This includes your prompt, any documents you feed it, and its own response.

Context Window = Prompt + Documents + Response

GPT-3.5-turbo:  16,000 tokens  (~12,000 words)
GPT-4o:        128,000 tokens  (~96,000 words)
Gemini 1.5:  1,000,000 tokens  (~750,000 words)

When you build a document question-answering system, you cannot just dump an entire 200-page PDF into the context window. It would exceed the limit of many models, and even where it fits, sending it with every question is slow and expensive. LangChain provides smart ways to search your documents and only send the relevant sections (covered in the Document Loaders and Embeddings topics).
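The formula Context Window = Prompt + Documents + Response translates directly into a budget check. The helper below is a minimal sketch with a made-up name. The window sizes come from the list above, and the token counts are invented inputs.

```python
def fits_in_context(prompt_tokens, document_tokens, max_response_tokens, context_window):
    """Check Prompt + Documents + Response against the model's context window.

    Returns (fits, remaining), where remaining is negative when over budget.
    """
    needed = prompt_tokens + document_tokens + max_response_tokens
    return needed <= context_window, context_window - needed

# Same request against a 16,000-token window and a 128,000-token window
print(fits_in_context(500, 20_000, 1_000, context_window=16_000))   # (False, -5500)
print(fits_in_context(500, 20_000, 1_000, context_window=128_000))  # (True, 106500)
```

Note that the space reserved for the response counts against the window too, so leave room for max_tokens when deciding how much document text to send.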

Practical Example: Calling a Model with Different Settings

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

load_dotenv()

# Model for factual, consistent answers
factual_model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Model for creative writing
creative_model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.9)

messages = [
    SystemMessage("You are a helpful assistant."),
    HumanMessage("Give me one word to describe the ocean.")
]

factual_response = factual_model.invoke(messages)
creative_response = creative_model.invoke(messages)

print("Factual:", factual_response.content)
print("Creative:", creative_response.content)

Run this script a few times. The factual model (temperature=0) returns the same or very similar word each time. The creative model (temperature=0.9) varies more. This hands-on observation builds intuition for how temperature affects output.

Model Response Object

When you call model.invoke(), the return value is not just a string. It is an AIMessage object with several useful attributes.

response = model.invoke(messages)

response.content         # The text reply (string)
response.response_metadata  # Token usage, model name
response.id              # Unique identifier for this response

The response_metadata attribute is a dictionary of provider-specific details. With OpenAI models it includes token usage information: how many tokens were in your input (prompt_tokens) and how many the model generated (completion_tokens). Other providers may use different key names. Monitoring this helps you manage costs and understand why certain calls are slow or expensive.
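Token counts turn into a cost estimate with simple arithmetic. The sketch below assumes OpenAI-style usage data, which to my understanding sits under response.response_metadata["token_usage"]. The helper name is made up, and the prices are placeholders, not any provider's real rates.

```python
def estimate_cost(token_usage, input_price_per_million, output_price_per_million):
    """Estimate the cost of one call in dollars from its token counts."""
    input_cost = token_usage["prompt_tokens"] * input_price_per_million / 1_000_000
    output_cost = token_usage["completion_tokens"] * output_price_per_million / 1_000_000
    return input_cost + output_cost

# Hand-written stand-in for response.response_metadata["token_usage"]
usage = {"prompt_tokens": 1200, "completion_tokens": 300, "total_tokens": 1500}

# Placeholder prices in dollars per million tokens (not real rates)
cost = estimate_cost(usage, input_price_per_million=1.0, output_price_per_million=2.0)
print(f"${cost:.6f}")
```

Input and output tokens are usually priced differently, which is why the two counts are kept separate rather than using total_tokens.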

Why LangChain Adds Value Over Direct API Calls

You could call the OpenAI API directly without LangChain. Many tutorials show exactly that. LangChain's value appears when you combine model calls with other components.

Direct API call:
  ┌─────────────┐        ┌──────────────┐
  │  Your code  │───────▶│  OpenAI API  │
  └─────────────┘        └──────────────┘

LangChain application:
  ┌─────────────┐
  │  Your code  │
  └──────┬──────┘
         │
  ┌──────▼──────────────────────────────────┐
  │  LangChain                              │
  │  ┌────────┐  ┌─────────┐  ┌──────────┐  │
  │  │ Memory │  │Retriever│  │  Tools   │  │
  │  └────────┘  └─────────┘  └──────────┘  │
  └──────────────────┬──────────────────────┘
                     │
              ┌──────▼──────┐
              │  AI Model   │
              └─────────────┘

The moment your application needs to remember conversations, search documents, or use external tools — all common requirements in real products — LangChain's abstraction layer saves you days of work.

Summary

Large language models process text as tokens, have a context window limit, and come in two styles: text completion and chat. LangChain wraps these models with a consistent interface, letting you swap providers without rewriting your code. You control model behavior through parameters like temperature and max_tokens. The model returns an AIMessage object containing the response text and usage metadata.
