Building an Intelligent Agent Framework from Scratch: An Intriguing Challenge for Automatic Problem Solving

May 22, 2024

Intelligent agent frameworks may well be the next milestone in artificial intelligence: they integrate models into a cohesive system that can autonomously manipulate external resources to solve a wide range of problems. Popular frameworks, from LangChain to LlamaIndex, have garnered much attention. However, their design philosophies often remain stuck in the Airflow era: they rely on predefined pipeline scaffolding, which forces developers to write additional code for each new use case. It seems that the more "intelligent" these systems are, the more human intervention they require.

A truly intelligent agent framework should be adaptive and flexible, like AutoGen, MetaGPT, phidata, and CrewAI. Among these, MetaGPT's architecture stands out as particularly impressive.

The applications that can be generated by MetaGPT

In vertical domains, we have seen efforts like Devin AI Developer. Although some demonstrations might be exaggerated, they at least show that teams are exploring this direction with enthusiasm.

However, these projects are typically built by top-tier teams working full-time. So, what kind of intelligent agent framework can one person develop in their spare time? Recently, I decided to find out by attempting to build an intelligent agent framework from scratch. What follows are three use cases that, while relatively simple, effectively demonstrate the dynamic problem-solving capabilities of this framework.

Use Case 1: Calculating How Much NVIDIA Stock You Can Buy

This example illustrates that no matter how powerful a model is, it cannot solve every problem in a single inference step, especially when the problem requires interaction with the external world. The model must take external feedback, such as a live stock price, into account before it can proceed to the next step.

In this scenario, the agent first breaks down the user's request into smaller tasks, then identifies suitable tools to address each sub-task. Moreover, the information from previous steps (such as objectives and results) is passed to subsequent steps, ensuring that each step is aware of the overall goal.
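The decompose-then-execute loop described above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual code: the names `Context`, `StepResult`, and `run_plan`, the three tool functions, and the hard-coded price and budget are all assumptions made for the example. In the real framework the plan would come from the model rather than from a hand-written list.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    objective: str   # what the step was trying to achieve
    result: object   # what the tool returned


@dataclass
class Context:
    goal: str                                   # the user's overall request
    history: List[StepResult] = field(default_factory=list)

    def latest(self, objective: str) -> object:
        """Return the most recent result recorded for an objective."""
        for step in reversed(self.history):
            if step.objective == objective:
                return step.result
        raise KeyError(objective)


# Hypothetical tools. The numbers are placeholders, not market data.
def get_stock_price(ctx: Context) -> float:
    return 100.0


def get_budget(ctx: Context) -> float:
    return 1050.0


def compute_shares(ctx: Context) -> int:
    # This step depends on the results of the two earlier steps.
    price = ctx.latest("look up share price")
    budget = ctx.latest("determine budget")
    return int(budget // price)


Plan = List[Tuple[str, Callable[[Context], object]]]


def run_plan(goal: str, plan: Plan) -> Context:
    """Run each step in order, recording its objective and result."""
    ctx = Context(goal=goal)
    for objective, tool in plan:
        ctx.history.append(StepResult(objective, tool(ctx)))
    return ctx


# A hand-written plan standing in for one produced by a model.
EXAMPLE_PLAN: Plan = [
    ("look up share price", get_stock_price),
    ("determine budget", get_budget),
    ("compute affordable shares", compute_shares),
]
```

Calling `run_plan("How many shares can I buy?", EXAMPLE_PLAN)` returns a context whose last history entry holds the share count. Because every tool receives the whole context, each step can see both the overall goal and everything that came before it.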

It is important to note that the tools invoked here, and even their existence, are not predefined but are dynamically discovered and utilized by the intelligent agent. This design approach offers the system broad adaptability and allows new external resources (including tools and knowledge) to be added at any time without disrupting the agent's operation.
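One way to picture dynamic discovery is a catalog that the agent searches at run time instead of a fixed list wired into the code. The sketch below is an assumption about how this could look, not the framework's implementation: `Tool`, `ToolCatalog`, and the word-overlap scoring are hypothetical, and a real system would more likely use embedding-based retrieval.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class Tool:
    name: str
    description: str            # natural-language text the agent searches over
    run: Callable[..., object]


def _words(text: str) -> Set[str]:
    """Lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


class ToolCatalog:
    """Hypothetical catalog: tools are plain data and can be added at any time."""

    def __init__(self) -> None:
        self._tools: List[Tool] = []

    def register(self, tool: Tool) -> None:
        self._tools.append(tool)

    def search(self, query: str, top_k: int = 1) -> List[Tool]:
        """Rank tools by how many words their description shares with the query."""
        query_words = _words(query)
        scored = [(len(query_words & _words(t.description)), t) for t in self._tools]
        matches = [pair for pair in scored if pair[0] > 0]
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return [tool for _, tool in matches[:top_k]]


catalog = ToolCatalog()
catalog.register(Tool("stock_price", "look up the current stock price for a ticker symbol",
                      lambda ticker: 100.0))  # placeholder value, not market data
catalog.register(Tool("line_chart", "plot a time series as a line chart",
                      lambda series: "chart.png"))
```

Searching the catalog for "current stock price for NVIDIA" returns the `stock_price` tool, and a tool registered later becomes searchable immediately, with no change to the agent loop.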

Use Case 2: Plotting Global Birth Rates

Similar in logic to Use Case 1, this example demonstrates the intelligent agent's diverse output capabilities. The agent can present results in various ways, such as generating a time series plot. A recent example of this type of functionality is ChatGPT's new data analysis feature, which can address such tasks.

In this scenario, the agent generates an image. However, the framework is easily extensible to produce other types of content, such as reports and animations, which can then be delivered anywhere: emailed, uploaded to the cloud, or sent through messaging apps. Moreover, the generated content can serve as input for downstream tasks and be used by other agents.

Use Case 3: Tracking OpenAI Employee Movement

This use case is purely for gossip but follows the same logic as Use Case 1. By searching for OpenAI departure information online and then looking up each person's background, the intelligent agent showcases its robust data integration and processing capabilities.

Although this example may not have practical value, it demonstrates the agent's ability to handle real-time data and integrate information from multiple sources. The same capability can be extended to applications such as real-time social media monitoring and other automation scenarios that provide users with timely, valuable information.

Why Build an Agent Framework from Scratch?

It's clear that a framework developed by a team working full-time is bound to be more polished than one built by a single person in their spare time. So why did I decide to build an agent framework from scratch?

1. Deepening Understanding of Agent Principles

The best way to learn something is by doing it. By constructing an intelligent agent framework from the ground up, I can delve deeply into the technical details and validate theoretical hypotheses through practical application. This journey from zero to one is an immersive and comprehensive learning experience.

There's a method called the Feynman Technique, where explaining knowledge to others enhances understanding. In the realm of computer science, I prefer the "Linus Learning Method": don't just talk about it, code it.

2. Overcoming the Limitations of Existing Frameworks

Current popular frameworks like LangChain and LlamaIndex rely heavily on predefined pipeline structures. While effective, these structures lack flexibility and scalability. Developers often need to write additional code to handle new tasks or tools, which leads to bloated frameworks. An ideal intelligent agent framework should be adaptive, able to plan dynamically based on the current environment and available resources, even when those resources are limited.

True innovation lies in breaking free from existing paradigms and finding more fundamental, universally applicable solutions. This challenge is not just technical but also a reshaping of our approach to problem-solving.

3. Declarative Problem Solving

Declarative problem solving aims to let developers solve complex problems by describing what they want, without writing code. Although this path is long and arduous, we must take the first step. Current frameworks like LangChain still require coding to address new tasks, whereas we aspire to solve complex problems through simple conversation.

By redesigning the framework, we aim to significantly lower the barrier to problem-solving, enabling agents to autonomously complete tasks with minimal human intervention. This can greatly enhance development efficiency and democratize technology. The ultimate goal of technology is not just to improve efficiency but to liberate human creativity, making technology an extension and amplifier of our thinking.

Framework Characteristics

The implemented agent framework possesses the following characteristics:

- Model Agnostic: The framework can be configured to work with multiple large language models without needing code changes.
- Tool Agnostic: It can flexibly call various tools without pre-integrating them into the framework. In this framework, tools and external knowledge are treated equally, merely as data. The framework is highly extensible, allowing tools to be dynamically discovered and utilized via interface implementations.
- Knowledge Agnostic: The framework can dynamically acquire and utilize external information without pre-existing knowledge.
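To make "model agnostic" concrete, here is a small sketch of selecting a model from configuration rather than code. Everything in it is assumed for illustration: the `ChatModel` protocol, the `MODEL_FACTORIES` table, and the two toy models stand in for real LLM clients and are not part of the framework described here.

```python
from typing import Callable, Dict, Protocol


class ChatModel(Protocol):
    """Hypothetical interface: anything with a complete() method qualifies."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Toy stand-in for a real LLM client."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UppercaseModel:
    """Second toy stand-in, to show that swapping models needs no code change."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


# Maps a configuration value to a factory. Supporting a new model means
# adding one entry here; the agent logic stays untouched.
MODEL_FACTORIES: Dict[str, Callable[[], ChatModel]] = {
    "echo": EchoModel,
    "upper": UppercaseModel,
}


def load_model(config: dict) -> ChatModel:
    name = config["model"]
    if name not in MODEL_FACTORIES:
        raise ValueError(f"unknown model: {name!r}")
    return MODEL_FACTORIES[name]()


def answer(config: dict, prompt: str) -> str:
    """Agent code depends only on the ChatModel interface."""
    return load_model(config).complete(prompt)
```

With this layout, `answer({"model": "echo"}, "hi")` and `answer({"model": "upper"}, "hi")` run the same agent code against different models; only the configuration value changes.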

Small Steps to Great Leaps

Currently, the functionalities showcased are still quite basic. During testing, the framework revealed several issues:

- Sensitivity to User Descriptions: Sometimes, a slight change in the description leads to entirely different plans and tool calls. This indicates that the prompt engineering still needs improvement. Auto-prompting technology might help mitigate this issue.
- Challenges in Tool Discovery: Tools are considered external knowledge in this framework. The framework has no prior knowledge of available tools and searches for them when solving problems. While this ensures tool agnosticism, it makes tool discovery significantly harder. As the number of tools increases, retrieval efficiency can decrease, similar to issues observed with ChatGPT, indicating that this is still an unresolved area.
- Memory Challenges: The framework's memory capabilities are currently simple, merely recording the results of previous steps. For complex problems, this approach can consume a significant share of the model's input context.
- Contextless Planning: The current framework relies on a plan that is fixed before execution starts. If the plan is correct, the problem is solved; if it is incorrect, the framework cannot recover. This method works for simple problems but has a high failure rate for complex ones.
  - A better approach would be dynamic planning with backtracking support, allowing the framework to solve problems through trial and error.
  - Additionally, it is best to discover available tools before making plans, enabling more targeted planning.
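The backtracking idea can be sketched as a depth-first search over candidate tools for each step. This is an illustration of the proposed direction, not something the framework does today. The `solve` function, the convention that a tool returns `None` on failure, and the three example tools with their placeholder numbers are all assumptions.

```python
from typing import Callable, List, Optional, Sequence

# A tool receives the results so far and returns a value, or None on failure.
Tool = Callable[[list], Optional[object]]


def solve(steps: Sequence[Sequence[Tool]],
          results: Optional[list] = None) -> Optional[list]:
    """Try candidate tools for each step in order; back up when a branch fails."""
    results = [] if results is None else results
    if len(results) == len(steps):
        return results                      # every step has a result
    for tool in steps[len(results)]:
        outcome = tool(results)
        if outcome is None:
            continue                        # this candidate failed; try the next one
        solved = solve(steps, results + [outcome])
        if solved is not None:
            return solved
        # Later steps failed with this outcome: fall through to the next candidate.
    return None                             # no candidate worked; the caller backtracks


# Hypothetical tools with placeholder values (not market data).
def fetch_price_as_text(results: list) -> Optional[object]:
    return "n/a"                            # succeeds, but is useless downstream


def fetch_price_as_number(results: list) -> Optional[object]:
    return 100.0


def shares_for_budget(results: list) -> Optional[object]:
    price = results[-1]
    if not isinstance(price, float):
        return None                         # cannot compute from a non-numeric price
    return int(1050 // price)
```

Here `solve([[fetch_price_as_text, fetch_price_as_number], [shares_for_budget]])` first tries the text tool, fails at the second step, backs up, and then succeeds with the numeric tool. A fixed plan that committed to the first tool would have stopped at the failure instead.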

Technology as an Extension and Evolution

Historically, technological advancements have extended human capabilities. Cars and airplanes extend and automate human physical abilities, while computers extend and automate human cognitive processes. Early computer technology encoded human ideas about how to carry out a task as programs, which we call algorithms. These include widely used techniques like quicksort, dynamic programming, and backtracking. However, each algorithm typically solves one specific type of problem, so programmers must combine them to address more complex ones.

Machine learning further abstracts the process of mimicking human thought to a higher level. Today, machine learning models not only perform single tasks but also combine fundamental components to solve more complex problems. This enables us to create more general and adaptive intelligent systems capable of handling a broad range of complex tasks.

Some may worry about the role of humans in this context. This is a profound question. Despite technological advances, the human role will not disappear. Instead, our role will shift from executors to designers and supervisors, ensuring that technology develops in line with our values and goals.

Yexi's Small Thinking
