

Managing an AI developer: Lessons Learned from SMOL AI — Part 2

Last Updated on June 28, 2023 by Editorial Team

Author(s): Meir Kanevskiy

Originally published on Towards AI.

Source: Image by DALL-E

In Part 1 of our exploration into managing an AI agent developer, we embarked on an experiment to tackle the challenges that arise in overseeing an automated workforce. We entrusted the role of the “junior developer” to SMOL AI’s rapidly evolving developer, as detailed in its readme:

a prototype of a “junior developer” agent (aka smol dev).

The focal point of our experiment was a specific problem statement: generating a naming scheme that follows the naming convention of running Docker containers. We provided two prompts to the AI agent. The initial prompt outlined the requirement for a function that generates random names consisting of an adjective and a noun:

Please write a naming scheme generator function that generates a random name
to the likes of running docker containers, consisting of an adjective and
a noun, where adjective is an emotional description, e.g. dramatic,
and noun is an abstract entity like manifold.
It has to contain up to 7 adjective options for every letter and
up to 7 nouns for every letter.

The second prompt introduced customization of theme pairs, allowing users to provide their own JSON file with options for each letter of the alphabet, and featured much more elaborate and formalized project specifications:

This project is a naming scheme generator.
It has to be subject to the following tech stack:
1. Programming language: wherever possible, but not limited to python
It has to be subject to the following specification:
1. Having an importable and callable method in python that returns a random name
2. Said name has to be a pair of an adjective and a noun starting with the same letter to the likes of docker container naming convention
3. Adjective and noun have to be of specific theme. Default themes should be emotional description, e.g. dramatic for adjective and an abstract entity like manifold for noun.
4. Apart from the default themes, theme pair should be customizable by providing a json with list of available options per each letter of the English alphabet. Providing the json can be constrained to a certain folder inside the project structure. Said json should be explicitly read by python's json library to a dictionary and then used throughout other modules of the project
5. it has to have tests in pytest, evaluating extensiveness of fulfilling the above specifications

Through this experiment, we delved into the complexities of managing an AI agent, highlighting the importance of understanding the agent's thinking process and providing appropriate guidance. The second prompt yielded a significant improvement in core functionality and also added tests that had been omitted by default.

This point in our development experiment is a fitting one to explore the question of optimal stopping. An analogy from a different AI application is what is known in autonomous driving as a takeover or intervention event, where the human driver takes the steering wheel back from the autopilot. However, unlike in autonomous driving, where an intervention is considered a catastrophic failure due to the clear safety implications, such interventions might be commonplace in automated development for years to come, if not indefinitely. While an autonomous driving agent has the closed-form purpose of getting passengers from A to B while maximizing safety and speed, in all types of creative labor there might be a threshold of idea origination unachievable by an automated agent. Even if this is not the case, supervision and responsibility over the final delivery would likely remain with the human supervisor, if not in the role of a more senior developer, then as an equivalent of the current project manager or team lead.

With this in mind, let's explore two different paths from the point where we stopped in Part 1:

  1. We will try to finalize the project ourselves
  2. We will ask smol dev to iterate on the result, slightly modifying the prompt

Takeover

Functional code

A first glance at the test runs leaves much to be desired: as expected, none of the tests ran, due to import issues. First, we will correct the obvious linting and import mistakes: the json import is omitted, and the necessary JSON reads are mistakenly written as Python imports:

--- a/smol_ai_developer/generated_namer2/naming_scheme_generator/generator.py
+++ b/smol_ai_developer/generated_namer2/naming_scheme_generator/generator.py
@@ -1,25 +1,23 @@
+import json

--- a/smol_ai_developer/generated_namer2/naming_scheme_generator/themes/__init__.py
+++ b/smol_ai_developer/generated_namer2/naming_scheme_generator/themes/__init__.py
@@ -1,2 +1,7 @@
-from .default_theme import default_theme
-from .custom_theme_example import custom_theme_example
+import json
+
+with open("./naming_scheme_generator/themes/default_theme.json") as f:
+ default_theme = json.load(f)
+
+with open("./naming_scheme_generator/themes/custom_theme_example.json") as f:
+ custom_theme_example = json.load(f)

Upon a second glance, we can see how deep the misplaced dictionary levels go. We need to unify them to fit the 'letter > nouns/adjectives' structure, not the other way around:

--- a/smol_ai_developer/generated_namer2/naming_scheme_generator/generator.py
+++ b/smol_ai_developer/generated_namer2/naming_scheme_generator/generator.py
@@ -1,25 +1,23 @@
+import json
 import random
 from typing import Dict, List
 from .themes import default_theme

-def generate_name(theme_adjective: Dict[str, List[str]], theme_noun: Dict[str, List[str]]) -> str:
-    letter = random.choice(list(theme_adjective.keys()))
-    adjective = random.choice(theme_adjective[letter])
-    noun = random.choice(theme_noun[letter])
+def generate_name(theme) -> str:
+    letter = random.choice(list(theme.keys()))
+    adjective = random.choice(theme[letter]['adjectives'])
+    noun = random.choice(theme[letter]['nouns'])
     return f"{adjective}_{noun}"

-def load_custom_theme(file_path: str) -> Dict[str, List[str]]:
+def load_custom_theme(file_path: str):
     with open(file_path, "r") as file:
         custom_theme = json.load(file)
     return custom_theme

 def generate_name_with_theme(theme: str = "default") -> str:
     if theme == "default":
-        theme_adjective = default_theme["adjectives"]
-        theme_noun = default_theme["nouns"]
+        theme = default_theme
     else:
-        custom_theme = load_custom_theme(theme)
-        theme_adjective = custom_theme["adjectives"]
-        theme_noun = custom_theme["nouns"]
+        theme = load_custom_theme(theme)

-    return generate_name(theme_adjective, theme_noun)
+    return generate_name(theme)
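
For reference, the 'letter > nouns/adjectives' structure that the corrected code now assumes looks like the following (a hypothetical, abbreviated excerpt; the actual generated theme files contain up to seven options per letter):

```json
{
  "a": {
    "adjectives": ["anxious", "amused", "adamant"],
    "nouns": ["abyss", "axiom", "aura"]
  },
  "z": {
    "adjectives": ["zen", "zealous"],
    "nouns": ["zephyr", "zenith"]
  }
}
```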

In smol dev's defense, we must say that the control flow is arguably correct, and the function typings, which are strictly speaking optional, are (almost) correct as well.
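
For completeness, an annotation that would accurately describe the corrected signature (omitted from the diff above for brevity) could look like this:

```python
from typing import Dict, List

# A theme maps each letter to its adjective and noun options.
def generate_name(theme: Dict[str, Dict[str, List[str]]]) -> str:
    ...
```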

At this point, our module is actually working:

In [1]: from naming_scheme_generator.generator import generate_name_with_theme

In [2]: large_name_list = [generate_name_with_theme() for _ in range(100)]

In [3]: len(set(large_name_list))
Out[3]: 62

In [4]: large_name_list[42]
Out[4]: 'zen_zephyr'
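
Since the specification requires the adjective and the noun to start with the same letter, a quick sanity check over the sample (our own addition, not part of the generated code) could look like this:

```python
# Verify the docker-style convention: both words share the first letter.
assert all(
    adjective[0] == noun[0]
    for adjective, noun in (name.split("_") for name in large_name_list)
)
```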

This is actually substantial progress, as we have now corrected the two glaring mistakes we identified immediately upon the initial review of the results in Part 1. As we'll shortly see, there was room for much subtler mistakes.

Since tests were included in the specification, let's take a look at them.

Testing

The agent created two sets of tests, effectively distinguishing between the main generation functionality and theme customization. Both files have several issues, which can largely be divided into two groups:

  1. Technical issues, such as using generated_name.split(" ") while the generator file explicitly hard-codes "_" between the adjective and the noun. All of these cases can be resolved with a simple replace. More nuanced examples include function signatures that contradict the main module, which in Python are optional (in fact, in some instances the signatures are incorrect, while the usage of their outputs in the tests is accurate).
  2. Design issues, like referencing default_theme.adjectives or self.assertIsInstance(custom_theme, themes.Theme).

The second issue is arguably a symptom of a more significant problem we've noted from the beginning: not only can the dev agent fail to maintain consistent indexing in the nested dictionary of our theme structure, but it can sometimes incorrectly assume a different structure altogether. While writing the tests, the agent made the erroneous assumption that a theme is not a dictionary at all, but a custom object with adjectives and nouns attributes.

In other words, what our agent likely lacks is an explicit design specification.
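Had we been willing to descend to that level of detail, the missing specification could be as small as an explicit typed interface. A hypothetical sketch of what could have been added to the prompt or to shared_dependencies.md:

```python
from typing import Dict, List, TypedDict

class LetterOptions(TypedDict):
    """Options available for a single letter of the alphabet."""
    adjectives: List[str]
    nouns: List[str]

# A theme maps each lowercase letter to its adjective/noun options.
Theme = Dict[str, LetterOptions]
```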

These issues necessitate substantial, but ultimately trivial, corrections in the testing code, as the tested functions’ signatures do not require any specifications of the theme object’s interface at all:

@@ -5,19 +5,19 @@ from naming_scheme_generator.themes import default_theme
 class TestGenerator(unittest.TestCase):

     def test_generate_name_default_theme(self):
-        name = generate_name(default_theme.adjectives, default_theme.nouns)
+        name = generate_name(default_theme)
         self.assertIsNotNone(name)
         self.assertTrue(isinstance(name, str))
-        self.assertEqual(len(name.split(" ")), 2)
-        self.assertEqual(name.split(" ")[0][0], name.split(" ")[1][0])
+        self.assertEqual(len(name.split("_")), 2)
+        self.assertEqual(name.split("_")[0][0], name.split("_")[1][0])

     def test_generate_name_custom_theme(self):
         custom_theme = load_custom_theme("naming_scheme_generator/themes/custom_theme_example.json")
-        name = generate_name(custom_theme.adjectives, custom_theme.nouns)
+        name = generate_name(custom_theme)
         self.assertIsNotNone(name)
         self.assertTrue(isinstance(name, str))
-        self.assertEqual(len(name.split(" ")), 2)
-        self.assertEqual(name.split(" ")[0][0], name.split(" ")[1][0])
+        self.assertEqual(len(name.split("_")), 2)
+        self.assertEqual(name.split("_")[0][0], name.split("_")[1][0])
@@ -7,33 +7,34 @@ class TestThemes(unittest.TestCase):

     def test_load_custom_theme(self):
         custom_theme_path = "naming_scheme_generator/themes/custom_theme_example.json"
-        custom_theme = themes.load_custom_theme(custom_theme_path)
-        self.assertIsInstance(custom_theme, themes.Theme)
+        custom_theme = generator.load_custom_theme(custom_theme_path)
+        self.assertIsInstance(custom_theme, dict)

         with open(custom_theme_path, "r") as f:
             custom_theme_data = json.load(f)

         for letter, theme_data in custom_theme_data.items():
-            self.assertEqual(custom_theme.adjectives[letter], theme_data["adjectives"])
-            self.assertEqual(custom_theme.nouns[letter], theme_data["nouns"])
+            self.assertEqual(custom_theme[letter]['adjectives'], theme_data["adjectives"])
+            self.assertEqual(custom_theme[letter]['nouns'], theme_data["nouns"])

     def test_generate_name_default_theme(self):
         default_theme = themes.default_theme
-        generated_name = generator.generate_name(default_theme.adjectives, default_theme.nouns)
+        generated_name = generator.generate_name(default_theme)
         self.assertIsInstance(generated_name, str)
-        self.assertEqual(len(generated_name.split(" ")), 2)
+        self.assertEqual(len(generated_name.split("_")), 2)

-        adjective, noun = generated_name.split(" ")
+        adjective, noun = generated_name.split("_")
         self.assertEqual(adjective[0], noun[0])

     def test_generate_name_custom_theme(self):
         custom_theme_path = "naming_scheme_generator/themes/custom_theme_example.json"
-        custom_theme = themes.load_custom_theme(custom_theme_path)
-        generated_name = generator.generate_name(custom_theme.adjectives, custom_theme.nouns)
+        custom_theme = generator.load_custom_theme(custom_theme_path)
+        generated_name = generator.generate_name(custom_theme)
+        print(generated_name)
         self.assertIsInstance(generated_name, str)
-        self.assertEqual(len(generated_name.split(" ")), 2)
+        self.assertEqual(len(generated_name.split("_")), 2)

-        adjective, noun = generated_name.split(" ")
+        adjective, noun = generated_name.split("_")

Agent’s self-correction

This time, we will not amend the prompt. While we want to allow the dev agent to improve on its previous result, we deliberately avoid expanding on specific design details mentioned above, as doing so would shift our role from general product ownership to a more involved development role. This iteration, however, was a surprising failure — the overridden repo now looks like this:

├── naming_scheme_generator
│   └── themes
├── project
│   ├── default_themes.json
│   └── naming_scheme_generator.py
├── shared_dependencies.md
└── tests
    └── test_naming_scheme_generator.py

with the themes folder, likely intended for theme customization, left empty, and naming_scheme_generator.py containing unparsed GPT-4 dialogue fragments, its code referencing the nonexistent theme_loader and name_generator modules:

Code for project/naming_scheme_generator.py:

```python
import random
import re
from theme_loader import load_theme_options
from name_generator import generate_name

def generate_naming_scheme():
    theme_options = load_theme_options()
    name = generate_name(theme_options)
    while not re.match(r'^[a-z]_[a-z]+$', name):
        name = generate_name(theme_options)
    return name.replace('_', '-')

if __name__ == '__main__':
    print(generate_naming_scheme())
```

This file imports the `load_theme_options` function from `theme_loader.py` and the `generate_name` function from `name_generator.py`. It defines a `generate_naming_scheme` function that generates a random name based on the loaded theme options and validates that the name is in the format of an adjective and a noun starting with the same letter separated by an underscore. Finally, it replaces the underscore with a hyphen and returns the generated name.

The `if __name__ == '__main__'` block is used for testing purposes and prints a generated name when the file is run as a script.

The test file also bears the same problem, likely an uncovered edge case in parsing the model output:

```
import pytest
from project.naming_scheme_generator import generate_name

def test_generate_name():
    name = generate_name()
    assert isinstance(name, str)
    assert len(name.split()) == 2
    assert name[0].lower() == name.split()[1][0].lower()
    assert name.split()[0] in ['dramatic', 'melancholic', 'romantic', 'whimsical', 'nostalgic', 'mysterious', 'hopeful', 'futuristic', 'fantastical', 'enigmatic']
    assert name.split()[1] in ['manifold', 'matrix', 'mystery', 'mirage', 'muse', 'machine', 'momentum', 'myth', 'maze', 'magnet']
```

Taking over at this point would be considerably more challenging than starting over from the very first prompt: the test file, for instance, imports generate_name from project.naming_scheme_generator, which only defines generate_naming_scheme. This is something one has to be prepared for when managing automated AI development: consistency of improvements between iterations is much harder to achieve than consistency within a single repository. At this point, the development route has hit a dead end, and it would be unwise to continue from here. A possible course of action would be to revert to the previous step and either take over, as we did above, or retry the self-improvement iteration.

Conclusions

Overall, the current progress in automated agent development can be seen as lightning-fast. Developing such a system with even minimal consistency in its behavior is no small feat.

The example we've shown might suggest a certain pattern in designing requirements for such an engine, one that somewhat resembles the once widely used waterfall approach: a detailed requirements and specifications document at project kickoff, containing as many minute implementation details as possible.


This approach will likely be continually reinforced by the growing context windows of large language models. On the other hand, transitioning to an iterative approach like classic agile or its variations may not be a trivial choice. Such an approach shifts significant weight from the initial design specifications to continuous interactions between different project members, product owners, and the end client. Supporting such interactions falls within the scope of the automating agent itself, rather than the underlying model, as is the case with increasing context windows.

While our coverage is limited in terms of functionality and product scope, the technical stack (a dynamically typed language) and the task itself (requiring a combination of different content types, such as code and static data) hopefully allowed us to uncover potential pitfalls and gain useful insights.

Thank you for reading!


Published via Towards AI
