
Building Code Agents with Hugging Face smolagents

In the fast-evolving world of AI, agents have emerged as one of the most exciting frontiers. Thanks to projects like Hugging Face's smolagents, building specialized, secure, and powerful code agents has never been easier. In this post, we'll walk through agent development: how to build code agents, how to execute their code securely, how to monitor and evaluate them, and finally how to design a deep research agent.

A Brief History of Agents

Agents have evolved dramatically over the past few years. Early LLM applications were static: users asked a question; models generated an answer. No memory, no decision-making, no real "agency."

But researchers dreamed of more: systems that could plan, decide, adapt, and act autonomously.

We can think of agency on a continuum:

  • Level 0: Stateless response (classic chatbots)
  • Level 1: Short-term memory and reasoning (ReAct pattern)
  • Level 2: Long-term memory, dynamic tool use
  • Level 3: Recursive self-improvement, autonomous goal setting (still experimental)

Early attempts at agency faced an "S-curve" of effectiveness. Initially, more agency added more confusion than benefit. But with improvements in prompting, tool use, and memory architectures, we're now climbing the second slope: agents are finally becoming truly effective.

Today, with frameworks like smolagents, you can build capable agents that write, execute, and even debug code in a secure and monitored environment.

Introduction to Code Agents

Code agents are agents specialized to generate and execute code to achieve a goal. Instead of just answering, they act programmatically.

Let's build a basic code agent with Hugging Face's smolagents:

from smolagents import CodeAgent, HfApiModel

# CodeAgent plans in code: it writes and runs Python to solve the task.
# HfApiModel uses the Hugging Face Inference API as the model backend.
agent = CodeAgent(tools=[], model=HfApiModel())

response = agent.run("Write a function that calculates the factorial of a number.")

print(response)

What's happening?

  • We initialize the agent with a model backend.
  • We run a user query.
  • The agent responds by writing and executing Python code.

Sample Output:

def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)
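Since the agent's answer is ordinary Python, it is worth sanity-checking the generated function before trusting it, for example:

```python
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)

# A couple of quick checks on the generated function.
assert factorial(0) == 1
assert factorial(5) == 120  # 5! = 120
```

This habit of verifying agent output leads naturally into the next topic: executing agent-written code safely.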

Secure Code Execution

Running arbitrary code is risky. Even a well-meaning agent could:

  • Try to use undefined commands.
  • Import dangerous modules.
  • Enter infinite loops.

To build safe agents, we must:

  1. Capture exceptions:

try:
    exec(agent_code)
except Exception as e:
    print(f"Error occurred: {e}")

  2. Filter undefined commands: use a restricted execution environment, e.g., exec with sanitized globals and locals dictionaries.

  3. Prevent dangerous imports: scan code for forbidden modules like os and subprocess, or disable built-ins selectively.

  4. Handle infinite loops: run code in a separate thread or process with a timeout.

  5. Sandbox execution: use Python's multiprocessing or even Docker-based isolation for truly critical applications.
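The import-filtering step can be sketched with Python's ast module, which is more robust than naive keyword matching (the module blacklist here is just an illustrative choice):

```python
import ast

# Illustrative blacklist; tune it to your threat model.
FORBIDDEN_MODULES = {"os", "sys", "subprocess", "socket", "shutil"}

def has_forbidden_import(code: str) -> bool:
    """Return True if the code imports any blacklisted module."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # Covers "import os" and "import os.path as p".
            if any(alias.name.split(".")[0] in FORBIDDEN_MODULES
                   for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # Covers "from subprocess import run".
            if node.module and node.module.split(".")[0] in FORBIDDEN_MODULES:
                return True
    return False
```

Parsing the AST catches aliased imports that a plain substring scan would miss, though a determined adversary can still evade static checks — which is why process isolation remains the last line of defense.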

Example Secure Exec:

import multiprocessing

def _worker(code):
    # Whitelist only the builtins the agent code may touch;
    # anything else (open, __import__, ...) raises a NameError.
    try:
        exec(code, {"__builtins__": {"print": print, "range": range}})
    except Exception as e:
        print(f"Execution error: {e}")

def safe_exec(code, timeout=2):
    # Run untrusted code in a separate process so a hang can be killed.
    # The worker is module-level so it can be pickled under the "spawn"
    # start method used on Windows and macOS.
    p = multiprocessing.Process(target=_worker, args=(code,))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.terminate()
        print("Terminated due to timeout!")

Monitoring and Evaluating the Agent

Good agents aren't just built; they are monitored and improved over time.

Enter Phoenix — an OpenTelemetry-based observability tool for LLM applications, configured through its phoenix.otel module.

Key Metrics to Track:

  • Latency (response time)
  • Success/Error rates
  • Token usage
  • User feedback
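Latency and success/error status can be captured without any telemetry backend at all; run_with_metrics below is a hypothetical helper, not part of smolagents:

```python
import time

def run_with_metrics(agent, prompt):
    """Run one agent call and record latency and success/error status."""
    start = time.perf_counter()
    try:
        result = agent.run(prompt)
        status = "success"
    except Exception:
        result = None
        status = "error"
    # Metrics dict can be logged or aggregated into success/error rates.
    return result, {"latency_s": time.perf_counter() - start, "status": status}
```

A wrapper like this is useful for quick experiments; for production, a proper tracing backend gives you the full picture.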

Integration Example:

from phoenix.otel import register
from openinference.instrumentation.smolagents import SmolagentsInstrumentor

# Point an OpenTelemetry tracer provider at your Phoenix instance.
tracer_provider = register(project_name="code_agent")
# Instrument smolagents so each agent step emits a trace span.
SmolagentsInstrumentor().instrument(tracer_provider=tracer_provider)

# Your agent code here
agent.run("Write a quicksort algorithm.")

With this, every agent interaction is automatically traced and sent to your telemetry backend.

You can visualize execution traces, errors, and resource usage to continuously fine-tune the agent.

Building a Deep Research Agent

Sometimes, writing code isn't enough — agents need to research, retrieve information, and act based on live data.

We can supercharge our code agent with Tavily, a web search API that gives the agent retrieval-augmented generation (RAG) style access to live information on the web.

Example:

from smolagents import CodeAgent, HfApiModel, tool
from tavily import TavilyClient

client = TavilyClient(api_key="YOUR_TAVILY_API_KEY")

@tool
def web_search(query: str) -> str:
    """Search the web with Tavily and return the raw results.

    Args:
        query: The search query to run.
    """
    return str(client.search(query))

agent = CodeAgent(tools=[web_search], model=HfApiModel())

response = agent.run("Find the latest algorithm for fast matrix multiplication and implement it.")
print(response)

Now your agent can:

  • Search academic papers.
  • Extract up-to-date methods.
  • Code the solution dynamically.

Building agents that combine reasoning, execution, and real-world retrieval unlocks a whole new level of capability.

Final Thoughts

We are entering a new era where agents can autonomously reason, code, research, and improve.

Thanks to lightweight frameworks like Hugging Face's smolagents, powerful browsing tools like Tavily, and robust monitoring with Phoenix.otel, building secure, powerful, and monitored code agents is now within reach for any developer.

The frontier of autonomous programming is wide open.

What will you build?