From ChatGPT to Smart Agents: The Next Frontier in App Integration
Milos Svana
It has been over a year since OpenAI introduced ChatGPT and brought the power of AI and large language models (LLMs) to the average consumer. But the true long-term game changer may well be the APIs that let companies and independent hackers all over the world integrate LLMs seamlessly into their own apps. Developers are having heated discussions about how to use this technology to build truly useful apps that provide real value instead of just copying what OpenAI does. We want to contribute to this discussion by showing you how we think about developing autonomous agents at profiq. But first, a bit of background.
Cognitive architectures and agents
In his recent blog post, the creator of LangChain talks about different “cognitive architectures”. A cognitive architecture describes how an LLM-based app is orchestrated. The two biggest questions to consider in this context are: (1) how to provide context to the LLM, and (2) how the LLM-based application makes decisions. Each cognitive architecture answers these questions differently.
The blog post organizes available cognitive architectures into a hierarchy based on how much decision-making power they give to the LLM. Single API calls are at the bottom. The LLM provides some output, but the decision about when to call the LLM, what the prompt will look like, and what to do with the output is hardcoded in the application.
The absolute top of the hierarchy is occupied by agents. Agents use LLMs to be as autonomous as possible. The user only provides a high-level description of a task they want to accomplish. Agents then autonomously “think” about what sequences of steps would lead to completing the task. They perform all individual steps and use the output of previous steps to reason about what to do next.
At profiq, we wanted to understand a bit more about how such agents work, and the best way to do that is to build one from scratch. Our goal was to keep the design as simple as possible while still providing some modularity. The agent framework we developed works with fairly standard components: the user can set the agent’s system message, which defines its high-level behavior, and when deciding what to do next, the agent has access to its own message history. The agent can also use various tools to make stuff happen. What might be slightly unusual is the inclusion of what we call a context message, which gives the agent its current “view of the world”. In our design, we decided to encapsulate tools and context message generation into plugins. Let’s have a look at how they work.
Agents live in an environment
Agents can’t operate in a vacuum. If we want them to be useful, we have to give them information about the outside environment and a set of actuators to modify this environment. We fulfill this requirement with plugins. If the LLM is the agent’s brain, then plugins are its eyes and hands.
Let’s get a bit more technical. How exactly can a plugin help an agent sense and manipulate its environment? We know that its brain, the LLM, communicates mainly through text. So the plugin has to report what it “sees” as text. In our agent architecture, it does this by providing a context message — a simple string summarizing the environment from the agent’s point of view. An agent can have multiple plugins. You can imagine this as having multiple apps open at the same time. For example, a calendar plugin could show scheduled events to the agent, and a to-do list plugin could show a list of tasks to be done. For a human, we would put these side by side on a screen; for the agent, we simply concatenate their text representations into a single context message.
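To make this concrete, the combined context message can be built by simply joining the context messages of all registered plugins. Here is a minimal sketch of that idea; the plugins argument and the separator are our assumptions for illustration, not a fixed API:

def build_context_message(plugins) -> str:
    # Concatenate the "view of the world" reported by every plugin into one string
    return "\n\n".join(plugin.context_message for plugin in plugins)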
Other architectures rely on the message and action history to convey information about the current state of the agent’s environment. But this approach can lead to several problems, such as using more tokens than actually needed, confusing the agent when the message history contains multiple previous states of the world, or losing information when the history is truncated to fit the model’s token limit. In contrast, the context message never becomes a permanent part of the agent’s message history. It is generated from scratch and appended to the request for each interaction.
What about acting on and modifying the environment? Our plugins provide tools for the agent to use as it sees fit. We utilize OpenAI’s function calling feature to make this happen. When working with the chat completion API, you can provide a list of functions GPT can “call”. Each function is defined by a JSON object containing its name, description, and parameters. When GPT decides to “call” a function, it returns a special type of response instead of a simple text message. You can detect this function call in your application and react to it, for example by calling an actual function defined in your code. This mechanism allows GPT to interact with external systems.
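To make this concrete, here is a minimal sketch of function calling with the OpenAI Python SDK (v1, using the tools parameter). The add_todo schema below is only an illustration that mirrors the tool we define later:

from openai import OpenAI

client = OpenAI()

# JSON description of a single function ("tool") the model is allowed to call
tools = [{
    "type": "function",
    "function": {
        "name": "add_todo",
        "description": "Adds a new item to the todo list.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "The title of the todo item."},
            },
            "required": ["title"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "Remind me to buy milk."}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    # GPT decided to "call" a function instead of answering with plain text
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)  # e.g. add_todo {"title": "Buy milk"}
else:
    print(message.content)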
We implement each tool as a simple method in a class representing a specific plugin. By tagging a method with a @tool decorator, we register it as a tool. As the agent crunches through the task at hand, it can at any point ask all plugins to tell it which tools are available and call some of them when appropriate.
Let’s have a look at a simple plugin for managing a to-do list. We are using a Plugin class from an agent-building framework we are developing at profiq:
class TodoPlugin(Plugin):
    name: str = "TodoPlugin"
    todos: list[dict] = []

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    @property
    def context_message(self) -> str:
        ctx = "LIST OF TODOS:\n"
        for todo in self.todos:
            ctx += f"[{'COMPLETED' if todo['completed'] else 'TODO'}] {todo['title']}\n"
        return ctx

    @tool
    def add_todo(self, title: str):
        """
        Adds a new item to the todo list.

        :param str title: The title of the todo item.
        """
        self.todos.append({"title": title, "completed": False})
        return f"Added todo: {title}"

    @tool
    def mark_completed(self, title: str):
        """
        Marks a todo item as completed.

        :param str title: The title of the todo item.
        """
        for todo in self.todos:
            if todo["title"] == title:
                todo["completed"] = True
                return f"Marked todo as completed: {title}"
        return f"Could not find todo: {title}"

    @tool
    def remove(self, title: str):
        """
        Removes a todo item from the list.

        :param str title: The title of the todo item.
        """
        for todo in self.todos:
            if todo["title"] == title:
                self.todos.remove(todo)
                return f"Removed todo: {title}"
        return f"Could not find todo: {title}"
Quite simple, right? Notice that there is no need to define the JSON description for the add_todo(), mark_completed(), and remove() tools required by the OpenAI API. Instead, we use a bit of introspection magic to generate this description automatically from each method’s signature and docstring. Pretty cool, right?
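We won’t walk through the whole implementation here, but a stripped-down version of such a decorator could look roughly like the sketch below. This is not our actual code; it assumes :param-style docstrings and treats every parameter as a string:

import inspect

def tool(method):
    """Mark a plugin method as a tool and attach an OpenAI-style JSON schema to it."""
    doc = inspect.getdoc(method) or ""
    description = doc.split(":param")[0].strip()

    properties = {}
    for name in inspect.signature(method).parameters:
        if name == "self":
            continue
        properties[name] = {"type": "string"}  # simplification: every parameter is a string

    method.is_tool = True
    method.schema = {
        "name": method.__name__,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": list(properties),
        },
    }
    return method

The plugin base class can then collect every method that has is_tool set and hand the generated schemas to the OpenAI API.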
Creating an agent
We have a limb, but it’s quite useless if it isn’t connected to a brain that controls it. Let’s create a simple agent and connect the to-do manager plugin to it:
todo_plugin = TodoPlugin()
agent = Agent(agent_name="TodoAgent", model="gpt-4-1106-preview")
agent.add_plugin(todo_plugin)
We can now ask the agent to perform various tasks:
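In code, such a session could look something like this; the method names below are assumptions based on the description that follows, not the framework’s final API:

# Hypothetical driver code illustrating a single interaction
interaction = agent.generate_interaction(prompt="Add 'write blog post' and 'review PRs' to my todo list")
print(interaction)                     # inspect the proposed response or tool call first
agent.commit_interaction(interaction)  # accept it: update the history and run the requested tool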
The example above shows another important feature of our agent architecture: the agent works by performing a sequence of interactions. Each interaction is a two-step process. First, we ask the agent to generate an interaction: the agent sends a request to GPT with the message history, the context message, and optionally a user prompt, and asks it to generate a response. The response can be a piece of text or a tool call request. Second, we commit the interaction: we append the response from GPT to the message history, call a tool if GPT asks us to do so, and append the tool call result to the message history as well. After committing, we can start a new interaction.
Why this two-step process? It’s no secret that autonomous agents are not very reliable yet. By separating the generation and commit steps, we can try many different interactions by tweaking parameters such as the system message, the user prompt, the model we are using, and so on. We can also completely overwrite the agent’s response. This allows us to iterate quickly towards an interaction that satisfies our needs. When we are happy with what we see, we perform a commit. We believe this kind of developer supervision can help us build better agents capable of autonomously performing very complex tasks consisting of a large number of steps. We could even use the developer feedback to generate a dataset for fine-tuning an LLM to improve its performance as an agent’s brain. We will talk more about this idea in a future article. For now, let’s get back to our to-do agent.
We added a few items to our to-do list. We can check the context message generated by our plugin and then perform a few more interesting interactions:
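With a couple of hypothetical items added (the exact items are made up; the format comes from the context_message property above), inspecting the context message could look like this:

print(todo_plugin.context_message)
# LIST OF TODOS:
# [TODO] Write blog post
# [TODO] Review pull requests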
After listing our to-do items, we were able to mark one of the tasks as completed by invoking the mark_completed() tool. What’s even more impressive is that we could ask a general question that doesn’t correspond to any tool provided by the todo plugin. Since enough context was provided, GPT had no issue answering it.
What’s next?
As we already mentioned, we would like to build on top of what we introduced in this article. We are already working on a UI to make developer supervision as easy as possible, as well as on a mechanism for storing agent configurations and interactions so they can later be used to fine-tune an LLM. You can visit our GitHub repo to check the current status of our work.
We also want to create agents that can help both you and us in day-to-day work. We have many QA experts at profiq and we want to make them even better. So the first practical application of our agent framework will be QA automation. Stay tuned!
And what about you? Are you thinking about autonomous agents or other cool LLM applications? Let’s discuss this in the comments!