Managing an AI developer: Lessons Learned from SMOL AI — Part 1
Last Updated on June 28, 2023 by Editorial Team
Author(s): Meir Kanevskiy
Originally published on Towards AI.
One of the most interesting ramifications of the recent breakthroughs, specifically in large language models, is the potential for building automated agents capable of fulfilling work projects using these models. The theoretical capabilities here are vast. Sometimes, when communicating with a chat-based model like ChatGPT, a simple follow-up prompt such as “improve your answer” or “make your answer more accurate” can significantly enhance the initial query’s response. However, building such automated agents raises an old problem in a new form: effective management and responsibility. How do you manage such an agent? Managing real human beings is no simple task and has spawned thousands of pages of literature, as well as popular and polished practices and methodologies based on decades of experience. Can these practices be applied to AI-agent-performed development? What factors should be considered? What metrics should be assigned to projects delivered by AI agents? While we cannot fully answer these questions, let’s consider a specific case and see what we can learn from it.
The task
In various situations, there is a need to label different entities, such as configurations, datasets, models, color schemes, or any other meaningful group of similar items in a project. It would be helpful to quickly assign recognizable names to these entities: recognizable to the human eye (unlike a UUID4), and numerous (more than the colors of a rainbow one can readily recall). Frequent users of, for example, the docker CLI or wandb may have already recognized the pattern. Running Docker containers are automatically labeled with rather funny and, owing to their absurdity, easily discernible names like heuristic_einstein or musing_babbage. If we pause right here, a human reading this article probably needs no further instruction to state, execute, and deliver the project we are about to hand over to our AI agent. Human perception is an amazingly complex thing, drawing on a lifetime of semantics and abstractions we take for granted. We analyze, assume, and extrapolate our observations at once, without even thinking of the vast baggage of experience behind those abilities. When approaching a language model, however, one has to be judicious and humble about one’s understanding of how the model actually thinks. For our experiment, we will use the brilliant and rapidly evolving SMOL AI developer, which is, according to its own readme:
Human-centric & Coherent Whole Program Synthesis aka your own personal junior developer
Initial problem statement
So, let’s say we want to write a reusable program (not just a snippet) that generates a random name following a naming scheme similar to that of Docker. According to SMOL Dev’s readme, an initial prompt needs to be provided along with the supported model’s API key (in this case, GPT-4). The result is then evaluated, and the run is repeated if necessary.
So let’s write our initial prompt. Since, luckily, we don’t have to rely on our own creativity, let’s make the naming scheme a little more nuanced:
Please write a naming scheme generator function
that generates a random name to the likes of running docker containers,
consisting of an adjective and a noun,
where adjective is an emotional description, e.g. dramatic,
and noun is an abstract entity like manifold.
It has to contain up to 7 adjective options for every letter
and up to 7 nouns for every letter.
Initial result
Having installed the necessary Python requirements, running smol dev on a prompt is as easy as:
export OPENAI_API_KEY=***
python main_no_modal.py ./prompts/prompt.md
Scope
Smol dev’s working cycle took several minutes and produced the following repo:
├── adjectives.py
├── naming_scheme_generator.py
├── nouns.py
└── shared_dependencies.md
The main function was quite straightforward and, arguably, the random letter one-liner is even elegant:
import random
from adjectives import adjectives
from nouns import nouns

def get_random_element(arr):
    return random.choice(arr)

def generate_random_name():
    random_letter = chr(97 + random.randint(0, 25))
    adjective = get_random_element(adjectives[random_letter])
    noun = get_random_element(nouns[random_letter])
    return f"{adjective}-{noun}"

if __name__ == "__main__":
    print(generate_random_name())
Two elements draw attention here, as they were not explicitly specified by the prompt:
- The file is executable from the command line and has a designated __main__ branch.
- The random element choice is abstracted into its own function.
Reasoning
It’s hard to determine the exact motivation behind introducing these elements into the solution, but if done by a human, the __main__ clause would likely be dictated by the desire to easily turn the standalone function described in the prompt into a ready-to-use CLI tool. This simple improvement would allow the function to be run in the shell and chained with other commands. A suitable term for this could be "MMVP" (Minimal Minimum Viable Product): an add-on to the prompt resolution that requires minimal additional effort to make it actually usable.
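As a rough illustration (a sketch of my own: the module name is taken from the repo tree above, and the sample output is invented), the guard lets the same file serve both as an importable module and as a small shell command:

# Sketch only: assumes we are in the generated repo's root directory.
from naming_scheme_generator import generate_random_name  # importing prints nothing

print(generate_random_name())  # e.g. "dramatic-dimension"

# Running the file directly, e.g. `python naming_scheme_generator.py`,
# prints a name straight to the shell, ready to be chained with other commands.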
Abstracting the random choice function, on the other hand, is a more obvious addition that eases expected further modifications.
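For instance (again a sketch of my own, not part of the generated repo), the helper could be swapped for a seeded variant to make names reproducible without touching generate_random_name():

import random

# Hypothetical modification: a fixed seed makes the generated names reproducible.
_rng = random.Random(42)

def get_random_element(arr):
    # Same interface as the generated helper, different selection strategy.
    return _rng.choice(arr)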
Functionality
Correct integration, including importing the corresponding iterable and randomly indexing it with the (almost, as we’ll shortly see) correct length constraints, is not surprising to users familiar with GPT’s brilliance. Now let’s examine the components that provide the actual functionality: adjectives.py and nouns.py.
nouns.py fits our prompt well, with slight deviations from the requested semantic scope, which, in all honesty, was vague enough to allow that:
nouns = {
    'a': ['abyss', 'angel', 'artifact', 'anomaly', 'algorithm', 'atmosphere', 'antenna'],
    'b': ['beacon', 'bubble', 'boundary', 'balance', 'butterfly', 'breeze', 'blossom'],
    'c': ['cosmos', 'catalyst', 'crystal', 'conundrum', 'cipher', 'cascade', 'crescendo'],
    'd': ['dimension', 'dynamo', 'dream', 'duality', 'dawn', 'dusk', 'divergence'],
    'e': ['echo', 'eternity', 'enigma', 'essence', 'evolution', 'eclipse', 'equilibrium'],
    'f': ['frequency', 'flux', 'fountain', 'fusion', 'fable', 'fantasy', 'form'],
    'g': ['galaxy', 'gravity', 'glimmer', 'glow', 'genesis', 'garden', 'groove'],
    'h': ['harmony', 'horizon', 'haze', 'hallucination', 'hologram', 'hypnosis', 'haven'],
    'i': ['infinity', 'illusion', 'insight', 'imagination', 'intuition', 'inception', 'impulse'],
    'j': ['juxtaposition', 'jubilation', 'jigsaw', 'journey', 'jolt', 'junction', 'jazz'],
    'k': ['kaleidoscope', 'karma', 'key', 'kinetic', 'kismet', 'kinesis', 'knight'],
    'l': ['labyrinth', 'light', 'luminescence', 'levitation', 'landscape', 'legacy', 'loop'],
    'm': ['manifold', 'matrix', 'mystery', 'mirage', 'metamorphosis', 'momentum', 'myth'],
    'n': ['nebula', 'nexus', 'nocturne', 'novelty', 'nirvana', 'numen', 'nucleus'],
    'o': ['oasis', 'odyssey', 'oracle', 'oscillation', 'origin', 'overture', 'oblivion'],
    'p': ['paradox', 'portal', 'pulse', 'pattern', 'phantom', 'prophecy', 'paragon'],
    'q': ['quasar', 'quantum', 'quintessence', 'quiver', 'quandary', 'quicksilver', 'quest'],
    'r': ['resonance', 'reverie', 'rhythm', 'radiance', 'reflection', 'rift', 'revelation'],
    's': ['symphony', 'spectrum', 'shadow', 'silhouette', 'solstice', 'serendipity', 'sanctuary'],
    't': ['transcendence', 'tesseract', 'twilight', 'tide', 'tempo', 'trance', 'threshold'],
    'u': ['universe', 'utopia', 'unity', 'umbra', 'uplift', 'ultra', 'undulation'],
    'v': ['vortex', 'vision', 'vibration', 'voyage', 'veil', 'vivid', 'vanguard'],
    'w': ['wavelength', 'whisper', 'wonder', 'wisdom', 'web', 'wind', 'wave'],
    'x': ['xenon', 'xanadu', 'x-ray', 'xenolith', 'xylem', 'xenogenesis', 'xylophone'],
    'y': ['yonder', 'yearning', 'yugen', 'yin', 'yang', 'yawn', 'yesteryear'],
    'z': ['zenith', 'zephyr', 'zone', 'zodiac', 'zigzag', 'zombie', 'zeppelin']
}
adjectives.py, however, reveals an interesting surprise:
adjectives = {
    'a': ['angry', 'anxious', 'amused', 'adorable', 'amazing', 'ambitious', 'artistic'],
    'b': ['brave', 'bored', 'bitter', 'bold', 'bashful', 'bewildered', 'benevolent'],
    # ...
    'z': ['zealous', 'zany', 'zestful', 'zen', 'zippy', 'zombie-like', 'zigzag']
}
This dictionary would obviously be unusable in our case: it would raise an unhandled KeyError whenever a letter other than ‘a’, ‘b’, or ‘z’ is chosen.
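A minimal sketch of the failure mode, using a truncated stand-in for the generated dictionary rather than the file itself:

# Stand-in for the truncated adjectives.py; only three letters are populated.
adjectives = {
    'a': ['angry', 'anxious'],
    'b': ['brave', 'bored'],
    'z': ['zealous', 'zany'],
}

# Any of the other 23 letters fails exactly as generate_random_name() would.
try:
    print(adjectives['m'][0])
except KeyError as err:
    print(f"Unhandled letter: {err}")  # -> Unhandled letter: 'm'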
Reasoning
When done by a human, such a mistake would be classic evidence of a lack of testing:
- A functional block was implemented with an incomplete placeholder.
- Sufficient testing was not performed.
- The placeholder was left in the final solution.
The interesting deviation in our case is the involvement of two actors: the SMOL AI agent and the actual content supplier in the form of a language model. As is evident from the length of the supplied lists, this dictionary was intended to be a finalized answer to the query for supplying the adjectives component. However, the language model sometimes omits repetitive code and provides an initial example without completing it. In a chat use-case, this often takes the form of an answer containing general directions for performing a task instead of actual code. This occurs when the prompt does not emphasize providing code specifically and is more of a general question on a matter.
Even less “human” in this case is the mismatch between the attention to detail in the main file described above and the fact that this incompleteness was overlooked.
Improvement
Requirements
Taking the above considerations into account, let’s engineer a more advanced prompt. This time, we’ll describe our needs in a more formal manner, closer to how an actual project might be specified, while adding some useful degrees of freedom:
This project is a naming scheme generator.
It has to be subject to the following tech stack:
1. Programming language: wherever possible, but not limited to python
It has to be subject to the following specification:
1. Having an importable and callable method in python that returns a random name
2. Said name has to be a pair of an adjective and a noun starting with the same letter to the likes of docker container naming convention
3. Adjective and noun have to be of specific theme. Default themes should emotional description, e.g. dramatic for adjective and an abstract entity like manifold for noun.
4. Apart from the default themes, theme pair should be customizable by providing a json with list of available options per each letter of the English alphabet. Providing the json can be constrained to a certain folder inside the project structure.
5. it has to have tests, evaluating extensiveness of fulfilling the above specifications
- Stating the tech stack: It is by chance that the provided solution was in Python. We never explicitly asked for it, and the referenced Docker naming scheme is, expectedly, written in Go.
- Specifying the use-case: We wouldn’t want to leave the usability of our solution to the model’s consideration alone. As we’ve seen, it can be reduced to an MMVP, as described above.
- Static data parametrization: Although the referenced Docker naming scheme is hardcoded, generally one might want to avoid such an approach, especially for the central functionality of the project.
- Testing: As we’ve seen, as opposed to adding CLI usability of its own accord, the model did not provide any unit tests for the solution. This had serious ramifications, resulting in a completely broken key component (the adjectives list).
Improved result
This time the run took 11 minutes and produced a much more elaborate repo structure:
├── naming_scheme_generator
│ ├── generator.py
│ ├── __init__.py
│ └── themes
│ ├── custom_theme_example.json
│ ├── default_theme.json
│ └── __init__.py
├── shared_dependencies.md
└── tests
├── __init__.py
├── test_generator.py
└── test_themes.py
This structure is sound and well-aligned with our requirements above. The word registry has been moved to a single JSON file that correctly lists all the letters in the following format. Notice that the model settled on the minimal plural count of two words per letter, as we did not specify how many words each letter should have:
"A": {
"adjectives": ["anxious", "amazing"],
"nouns": ["artifact", "ambience"]
}
generator.py showed another interesting change. Note that the improvised "MMVP" in the form of a __main__ branch for CLI usage is now absent, as we explicitly stated that the usage should be an "importable and callable method in Python":
import random
from typing import Dict, List
from .themes import default_theme

def generate_name(theme_adjective: Dict[str, List[str]], theme_noun: Dict[str, List[str]]) -> str:
    letter = random.choice(list(theme_adjective.keys()))
    adjective = random.choice(theme_adjective[letter])
    noun = random.choice(theme_noun[letter])
    return f"{adjective}_{noun}"

def load_custom_theme(file_path: str) -> Dict[str, List[str]]:
    with open(file_path, "r") as file:
        custom_theme = json.load(file)
    return custom_theme

def generate_name_with_theme(theme: str = "default") -> str:
    if theme == "default":
        theme_adjective = default_theme["adjectives"]
        theme_noun = default_theme["nouns"]
    else:
        custom_theme = load_custom_theme(theme)
        theme_adjective = custom_theme["adjectives"]
        theme_noun = custom_theme["nouns"]
    return generate_name(theme_adjective, theme_noun)
Multiple problems become apparent here:
- Import errors: The json library import is missing.
- The .themes folder (leaving aside the taste question of using . imports) does not have any Python-importable default_theme. Instead, it's a JSON file meant to be read from Python code. The same error is present in the __init__.py file of the themes folder:
from .default_theme import default_theme
from .custom_theme_example import custom_theme_example
- The theme dictionary, if read as-is from the JSON file, has a structure opposite to that implied by the code. The theme itself doesn’t have “adjective” and “noun” keys. Its keys are letters of the alphabet, each nested with adjectives and nouns.
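One possible fix-up for this last point, sketched here as a hedged example rather than something present in the generated repo (the helper name load_theme is my own), would be to read the per-letter JSON shown earlier and swap the nesting into the shape generate_name() expects:

import json
from typing import Dict, List

def load_theme(file_path: str) -> Dict[str, Dict[str, List[str]]]:
    # Read the per-letter structure, e.g. {"A": {"adjectives": [...], "nouns": [...]}, ...}
    with open(file_path, "r") as file:
        per_letter = json.load(file)
    # Invert the nesting into {"adjectives": {letter: [...]}, "nouns": {letter: [...]}},
    # which is what generate_name() indexes into.
    return {
        "adjectives": {letter: entry["adjectives"] for letter, entry in per_letter.items()},
        "nouns": {letter: entry["nouns"] for letter, entry in per_letter.items()},
    }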
Due to the problems listed above, the written tests fail at the import stage and cannot be executed. However, their usage of the imported functions is correct, and their extensiveness is genuinely impressive:
import unittest
from naming_scheme_generator.generator import generate_name, load_custom_theme
from naming_scheme_generator.themes import default_theme

class TestGenerator(unittest.TestCase):
    def test_generate_name_default_theme(self):
        name = generate_name(default_theme.adjectives, default_theme.nouns)
        self.assertIsNotNone(name)
        self.assertTrue(isinstance(name, str))
        self.assertEqual(len(name.split(" ")), 2)
        self.assertEqual(name.split(" ")[0][0], name.split(" ")[1][0])

    def test_generate_name_custom_theme(self):
        custom_theme = load_custom_theme("naming_scheme_generator/themes/custom_theme_example.json")
        name = generate_name(custom_theme.adjectives, custom_theme.nouns)
        self.assertIsNotNone(name)
        self.assertTrue(isinstance(name, str))
        self.assertEqual(len(name.split(" ")), 2)
        self.assertEqual(name.split(" ")[0][0], name.split(" ")[1][0])

    def test_load_custom_theme(self):
        custom_theme = load_custom_theme("naming_scheme_generator/themes/custom_theme_example.json")
        self.assertIsNotNone(custom_theme)
        self.assertTrue(hasattr(custom_theme, "adjectives"))
        self.assertTrue(hasattr(custom_theme, "nouns"))

if __name__ == '__main__':
    unittest.main()
Intermediate conclusions
With the above said and considering the very limited scope of our “test project,” which hardly allows for a component breakdown, one could argue that:
- In the initial case, the “core functionality” of the solution was faulty. Even though the utility code was correct, the human supervisor would have had to come up with the missing words, of which there were plenty. The additional effort in this case would be of a creative nature.
- In the latter case, the necessary improvements are technical. Missing imports need to be corrected, the default theme workflow needs to be changed from a Pythonic import to a JSON read (similar to the already correct custom theme), and the nesting levels in the word dictionary need to be swapped.
- In the latter case, the project infrastructure is correct. Theme customization is separated from the main generator code, and extensive tests are correctly organized and separated into a distinct folder.
The shortcomings of the second solution are much easier to correct and could be delegated to a junior developer, while the shortcomings of the first case would correspond to a major core-algorithm issue in a more complex project, requiring the involvement of a more senior and qualified human supervisor.
Of course, it’s also worth mentioning that, with all of the above, the achievements the SMOL AI contributors have attained in such a short time are fascinating. The magnificence of the latest developments in large language models makes it easy to theorize about automating them; bringing a solution like this to an actually usable implementation is a different class of achievement.
In Part 2 we’ll take a look at further iterations and see if SMOL Dev can actually become that junior developer and improve its own results.
Thank you for reading!