Putting It All Together: Lessons from Building a LangGraph Agent

This final part covers main.py, the CLI entry point that ties everything together, followed by a real end-to-end session trace and three production-readiness topics that this project raises but doesn't fully solve: conversation memory, context window growth, and streaming. We'll close with an honest take on when LangGraph is the right tool and when it isn't.
The entry point: main.py
Every file in this project has been shown except one. main.py is the CLI that launches the agent: a loop that takes user input, invokes the graph, and prints the response. It's intentionally thin:
# main.py
from agent.graph import build_graph

WELCOME_BANNER = """
╔══════════════════════════════════════════════════════════════╗
║                      City Events Agent                       ║
║                                                              ║
║  I can help you with:                                        ║
║    • Finding local events in cities worldwide                ║
║    • Searching the web for information                       ║
║    • Checking current weather conditions                     ║
║                                                              ║
║  Type 'quit' or 'exit' to leave.                             ║
╚══════════════════════════════════════════════════════════════╝
"""


def run_cli():
    print(WELCOME_BANNER)
    # Build the graph once, before the loop starts
    graph = build_graph()
    print("Agent ready. Ask me anything!\n")

    while True:
        try:
            user_input = input("You: ").strip()
        except (EOFError, KeyboardInterrupt):
            # Ctrl-D or Ctrl-C exits cleanly
            print("\nGoodbye!")
            break

        if not user_input:
            continue
        if user_input.lower() in ("quit", "exit", "q"):
            print("Goodbye!")
            break

        try:
            # Only the current message is passed: no memory between turns
            result = graph.invoke(
                {"messages": [{"role": "user", "content": user_input}]}
            )
            assistant_msg = result["messages"][-1].content
            print(f"\nAssistant: {assistant_msg}\n")
        except Exception as exc:
            # Last-resort catch for failures the tools did not handle
            print(f"\nError: {exc}\n")


if __name__ == "__main__":
    run_cli()
A few deliberate choices here are worth noting.
The graph is built once, outside the loop. build_graph() is called before the while True loop begins. This matters because init_chat_model and bind_tools initialise the LLM client and generate the tool schemas. Doing this on every user message would add latency and waste resources. Build once, invoke many.
Each invocation is stateless. Notice that every call to graph.invoke() passes only the current user message, not the full conversation history. This means the agent has no memory between turns. Each question is answered in isolation. For a CLI demo this is fine, but it's the first thing you'd fix for a real conversational product. We'll come back to this.
The last message is the response. After graph.invoke() returns, result["messages"][-1].content gives us the final AI message. By the time the graph reaches END, the last item in the messages list is always the assistant's natural-language response — everything that came before it (tool calls, tool results, intermediate AI messages) is accumulated state that the LLM used to produce that final answer.
Errors are caught at the CLI level. The try/except around graph.invoke() is a last-resort catch. Individual tools handle their own failures internally and return structured errors. This outer catch exists for unexpected failures in the graph itself, such as a bad API key that makes the LLM request fail or a network failure that the Tavily client doesn't handle internally. Note that it does not cover build_graph(), which runs before the loop.
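The two layers of error handling described above can be sketched without any LangGraph dependency. In this minimal sketch, fake_fetch, get_weather, run_turn, and broken_invoke are hypothetical stand-ins rather than the project's actual code: the tool turns an expected failure into a structured result the LLM could read, while the outer handler only sees failures that nothing else caught.

```python
from typing import Callable


def fake_fetch(city: str) -> dict:
    """Hypothetical stand-in for a weather API call."""
    if city == "Atlantis":
        raise LookupError(f"no weather data for {city}")
    return {"city": city, "temp_c": 31}


def get_weather(city: str) -> dict:
    """Tool-level handling: expected failures become structured results."""
    try:
        return {"ok": True, "data": fake_fetch(city)}
    except LookupError as exc:
        # The LLM can read this and explain the problem to the user.
        return {"ok": False, "error": str(exc)}


def run_turn(invoke: Callable[[str], str], user_input: str) -> str:
    """CLI-level handling: last-resort catch for anything unexpected."""
    try:
        return f"Assistant: {invoke(user_input)}"
    except Exception as exc:
        return f"Error: {exc}"


def broken_invoke(_: str) -> str:
    # Simulates a failure in the graph itself, e.g. a rejected API key.
    raise RuntimeError("authentication failed")


print(get_weather("Lagos"))
print(get_weather("Atlantis"))
print(run_turn(broken_invoke, "hello"))
```

The design point is that the tool-level result stays inside the conversation, where the model can react to it, whereas the outer catch ends the turn with an error line.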
Here's what a real session looks like with the agent running. The user asks a multi-part question:
╔══════════════════════════════════════════════════════════════╗
║ City Events Agent ║
╚══════════════════════════════════════════════════════════════╝
Agent ready. Ask me anything!
You: What's on in Lagos this weekend, and what's the weather like?
Assistant: Here's what I found for Lagos this weekend:
Events:
• Afrobeats Night at Eko Hotel — Saturday 8pm
• Lagos Art Fair, Victoria Island — Saturday & Sunday, 10am–6pm
• Comedy at Terra Kulture — Sunday 7pm
Weather:
Currently 31°C with light clouds. Humidity at 74%,
light breeze from the southwest. Good conditions for
outdoor events.
You: quit
Goodbye!
Two tool calls, one coherent response. The LLM queried the database, fetched the weather, and synthesised both into a single answer without any explicit orchestration from our side. That's the graph doing its job.
Three things you would consider in production
1. The agent has no memory between turns
As noted above, each graph.invoke() call starts fresh, so the conversation has no continuity. If a user asks "What about Abuja?" after asking about Lagos, the agent has no idea what "what about" refers to.
LangGraph solves this with checkpointers. A checkpointer persists graph state between invocations using a thread ID. The simplest version uses in-memory storage:
from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()
# `builder` is the uncompiled StateGraph (the object build_graph() compiles)
graph = builder.compile(checkpointer=checkpointer)

# Each invocation now carries a thread ID
config = {"configurable": {"thread_id": "session_001"}}
result = graph.invoke(
    {"messages": [{"role": "user", "content": user_input}]},
    config=config,
)
With a checkpointer in place, LangGraph automatically loads the previous state for that thread ID before each invocation and saves the updated state after. The full conversation history persists across turns without any changes to your nodes, tools, or edges. MemorySaver keeps everything in process memory, so state is lost on restart. For production, swap it for SqliteSaver or PostgresSaver (shipped in the separate langgraph-checkpoint-sqlite and langgraph-checkpoint-postgres packages) to persist state durably.
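To make the load-then-save cycle concrete, here is a toy model of what a checkpointer does. It is not LangGraph's implementation: ToyCheckpointer, echo_agent, and invoke_with_memory are hypothetical names, and a real checkpointer stores full graph state rather than a bare message list.

```python
from collections import defaultdict


class ToyCheckpointer:
    """Keeps one message list per thread ID, in memory only."""

    def __init__(self) -> None:
        self._threads = defaultdict(list)

    def load(self, thread_id: str) -> list:
        # Return a copy so callers cannot mutate stored state by accident.
        return list(self._threads[thread_id])

    def save(self, thread_id: str, messages: list) -> None:
        self._threads[thread_id] = list(messages)


def echo_agent(messages: list) -> dict:
    """Hypothetical stand-in for the graph: reports how many user turns it sees."""
    turns = sum(1 for m in messages if m["role"] == "user")
    return {"role": "assistant", "content": f"I can see {turns} user message(s)."}


def invoke_with_memory(cp: ToyCheckpointer, thread_id: str, user_input: str) -> str:
    history = cp.load(thread_id)                      # 1. load previous state
    history.append({"role": "user", "content": user_input})
    reply = echo_agent(history)                       # 2. run the "graph"
    history.append(reply)
    cp.save(thread_id, history)                       # 3. save updated state
    return reply["content"]


cp = ToyCheckpointer()
print(invoke_with_memory(cp, "session_001", "What's on in Lagos?"))
print(invoke_with_memory(cp, "session_001", "What about Abuja?"))
print(invoke_with_memory(cp, "session_002", "Hello"))
```

The second call on session_001 sees both user messages, while session_002 starts from an empty history, which is the isolation a thread ID provides.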
2. Context window growth
Once history persists across turns, the agent's state grows with every turn. In Part 2, we saw that a single three-tool query produces eight messages. Ten turns of conversation with multi-tool queries could put 50 to 80 messages in the context window, which is well within limits for a single session but worth planning for in a longer-lived agent.
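The message counts above follow from simple arithmetic. This sketch assumes the tools are called one at a time, so each tool adds an AI tool-call message and a tool result on top of the user message and the final answer. That assumption is mine, chosen because it reproduces the eight-message figure, and messages_per_turn and messages_after are hypothetical helpers.

```python
def messages_per_turn(tool_calls: int) -> int:
    """User message + (AI tool call + tool result) per tool + final AI answer."""
    return 1 + 2 * tool_calls + 1


def messages_after(turns: int, tool_calls_per_turn: int) -> int:
    """Total messages in state after a number of identical turns."""
    return turns * messages_per_turn(tool_calls_per_turn)


print(messages_per_turn(3))   # 8
print(messages_after(10, 2))  # 60
print(messages_after(10, 3))  # 80
```

Under the same assumption, a mix of five one-tool turns and five two-tool turns gives 50 messages, the low end of the range. If the model requests several tools in a single AI message, the count per turn is lower.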
There are two standard approaches:
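Message trimming:
LangChain provides trim_messages(), which truncates the message list to fit within a token budget. With strategy="last" it keeps the most recent exchanges, and with include_system=True it also preserves the system message. You'd apply it inside the chatbot node before calling the LLM:
from langchain_core.messages import trim_messages

def chatbot(state: State):
    # Keep the newest messages that fit the budget; `llm` counts the tokens
    trimmed = trim_messages(
        state["messages"],
        max_tokens=4096,
        strategy="last",
        token_counter=llm,
        include_system=True,  # keep the system message, if there is one
        start_on="human",     # avoid starting on an orphaned tool result
    )
    return {"messages": [llm_with_tools.invoke(trimmed)]}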
Summarisation:
Instead of trimming, you can add a summarisation node that periodically compresses older messages into a single summary message. This preserves context that trimming would discard, at the cost of a second LLM call. For most conversational agents, trimming is sufficient and simpler.
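A minimal sketch of the summarisation idea, using plain dicts instead of LangChain message objects. compress_history, fake_summarise, and the keep_last threshold are hypothetical; in a real graph the summariser would be an LLM call and the result would be written back to state.

```python
from typing import Callable


def fake_summarise(messages: list) -> str:
    """Stand-in for an LLM call that condenses older messages."""
    return f"Summary of {len(messages)} earlier messages."


def compress_history(
    messages: list,
    keep_last: int = 4,
    summarise: Callable[[list], str] = fake_summarise,
) -> list:
    """Replace everything but the last `keep_last` messages with one summary."""
    if len(messages) <= keep_last:
        return messages  # nothing to compress yet
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {"role": "system", "content": summarise(older)}
    return [summary] + recent


history = [{"role": "user", "content": f"message {i}"} for i in range(10)]
compressed = compress_history(history)
print(len(compressed))           # 5
print(compressed[0]["content"])  # Summary of 6 earlier messages.
```

A production version would also need to avoid cutting between an AI tool call and its tool result, which this sketch ignores.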
For the City Events Agent as built, which is just a CLI tool with short sessions, neither is needed. But if you extend it to a web API with persistent sessions, this becomes a real concern around the 20-turn mark.
3. Streaming responses
Right now, graph.invoke() blocks until the entire execution completes: every tool call is resolved and the final LLM response is generated before anything is printed. For queries that hit three tools, this can feel sluggish. The user sees nothing for several seconds, then gets the full response at once.
LangGraph's graph.stream() method fixes this by yielding events as they happen. You can stream at different granularities: per node completion, or per token if the LLM supports it:
# Stream node-level events
# (`config` is the thread-ID dict from the checkpointer example)
for event in graph.stream(
    {"messages": [{"role": "user", "content": user_input}]},
    config=config,
):
    for node_name, node_output in event.items():
        # node_output is the state update from that node
        last_msg = node_output["messages"][-1]
        if hasattr(last_msg, "content") and last_msg.content:
            print(last_msg.content)

# For token-level streaming, use astream_events (async).
# `async for` must run inside a coroutine, e.g. asyncio.run(stream_tokens(...))
async def stream_tokens(user_input: str):
    async for event in graph.astream_events(
        {"messages": [{"role": "user", "content": user_input}]},
        config=config,
        version="v2",
    ):
        if event["event"] == "on_chat_model_stream":
            chunk = event["data"]["chunk"].content
            print(chunk, end="", flush=True)
Token-level streaming with astream_events requires an async runtime and an LLM that supports streaming (most modern ones do). For a CLI tool, node-level streaming, which shows tool results as they arrive before the final response, is often enough to make the agent feel responsive without the added complexity of async Python.
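The difference between waiting for a full result and handling events as they arrive is easier to see with a toy generator. This sketch does not use LangGraph: fake_graph_stream is a hypothetical stand-in that yields one {node_name: update} dict per step, mimicking the shape of the node-level events above, and render is a hypothetical helper that turns each event into progress lines.

```python
from typing import Iterator


def fake_graph_stream(question: str) -> Iterator[dict]:
    """Yields one {node_name: state_update} dict per completed step."""
    # First chatbot step only requests a tool, so it carries no text.
    yield {"chatbot": {"messages": [{"type": "ai", "content": ""}]}}
    yield {"tools": {"messages": [{"type": "tool", "content": "31C, light clouds"}]}}
    yield {"chatbot": {"messages": [{"type": "ai", "content": f"Answer to: {question}"}]}}


def render(event: dict) -> list:
    """Turn one event into zero or more printable progress lines."""
    lines = []
    for node_name, update in event.items():
        last = update["messages"][-1]
        if not last["content"]:
            continue  # skip AI messages that only request tools
        label = "tool result" if last["type"] == "tool" else "assistant"
        lines.append(f"[{node_name}] {label}: {last['content']}")
    return lines


# Each line is printed as soon as its step finishes, not at the end.
for event in fake_graph_stream("weather in Lagos?"):
    for line in render(event):
        print(line)
```

Labelling each line with its node name is one way to show the user that a tool has returned while the final answer is still being generated.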
When LangGraph is the wrong tool
Every tool has a domain where it's the right answer and a domain where it's overkill. LangGraph is no exception.
Use LangGraph when your application genuinely needs loops, conditional branching, or persistent state across multiple LLM calls. Multi-tool agents like this one, autonomous research pipelines, workflow automation with human-in-the-loop checkpoints, and multi-agent systems are exactly what LangGraph is designed for.
Don't use LangGraph when your application is a single LLM call, a fixed sequence of steps, or a simple RAG pipeline. A single prompt-response cycle doesn't need a state machine. A document summariser that chunks, embeds, retrieves, and generates in a fixed order is better expressed as a plain function or a simple LangChain LCEL chain. The graph adds conceptual overhead without adding capability.
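As a rough illustration of the fixed-order case, here is a tiny retrieve-then-generate pipeline written as plain functions. Everything in it is a hypothetical stand-in: chunk splits on a fixed word count, retrieve scores chunks by word overlap instead of embeddings, and generate formats a string where an LLM call would go.

```python
DOC = (
    "Lagos hosts an art fair this weekend on Victoria Island. "
    "Abuja has a jazz festival next month in the city centre."
)


def chunk(text: str, size: int = 8) -> list:
    """Split text into chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def retrieve(chunks: list, query: str, k: int = 1) -> list:
    """Rank chunks by how many query words they share (stand-in for embeddings)."""
    q = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return ranked[:k]


def generate(query: str, context: list) -> str:
    """Stand-in for the LLM call that would answer from the retrieved context."""
    return f"Q: {query} | context: {' / '.join(context)}"


def answer(text: str, query: str) -> str:
    # The whole "pipeline" is one expression: no state, no branching, no loop.
    return generate(query, retrieve(chunk(text), query))


print(answer(DOC, "jazz festival"))
```

Because the steps always run in the same order and nothing loops back, a graph would add structure here without adding behaviour.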
If you've been following along and building as you go, you now have a working multi-tool LangGraph agent and a mental model for extending it. That's the goal. The framework will evolve; the model of stateful graphs for agentic applications is here to stay.



