How Far Should You Go to Perfect Your AI System?
Author(s): Konstantin Babenko
Originally published on Towards AI.
Validating a conversational AI system is like peeling an onion: remove one layer and there is always another underneath. Each phase must be tested to confirm how the AI system integrates and operates as a whole.
Conversational AI systems force us to go beyond traditional software testing paradigms and techniques. Unit testing, integration testing, and system testing remain important, but we also need to pay attention to behaviors specific to language models. This article examines the key testing phases that are critical to a conversational AI system's success, along with real-life examples of testing AI applications in each phase.
Unit Testing
Unit testing is the base level of testing: it verifies that individual components work as intended. Here, developers often run into unexpected behaviors of language models. An answer that looks superb in isolation can sound unconvincing in a slightly different context.
Consider one of our cases: developing a chatbot for an airline. During unit testing, the team realized that although the model answered questions about flight schedules very well, it underperformed when dealing with dates. Given the phrase "next Friday", the program returned ambiguous answers depending on the current date, which underlined the need for deeper temporal reasoning. As a result, the developers had to tune date-expression recognition and incorporate a context-sensitive temporal model.
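One way to make such failures reproducible is to isolate the date logic behind a helper that accepts an explicit reference date. Below is a minimal pytest sketch, assuming a hypothetical resolve_relative_date helper and one agreed-upon convention for what "next Friday" means; it illustrates the testing pattern rather than the airline team's actual code.

```python
from datetime import date, timedelta

import pytest


def resolve_relative_date(expression: str, reference: date) -> date:
    """Hypothetical helper: resolve a relative date against an explicit
    reference date instead of 'today', so tests stay deterministic."""
    if expression == "next Friday":
        days_ahead = (4 - reference.weekday()) % 7  # Friday is weekday 4
        return reference + timedelta(days=days_ahead or 7)
    raise ValueError(f"Unsupported expression: {expression}")


@pytest.mark.parametrize(
    "reference, expected",
    [
        (date(2024, 9, 16), date(2024, 9, 20)),  # Monday -> that week's Friday
        (date(2024, 9, 20), date(2024, 9, 27)),  # Friday -> the following Friday
    ],
)
def test_next_friday_is_deterministic(reference, expected):
    # The expected values encode one agreed-upon convention; the point is
    # that the answer no longer shifts with the machine's current date.
    assert resolve_relative_date("next Friday", reference) == expected
```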
When AI models are applied to customer interactions, unit testing also covers responses to common customer queries, exception handling, and fallback behavior. It is essential that the AI answers a wide range of questions accurately and handles unknown questions politely.
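Here is a minimal sketch of such unit tests, assuming a hypothetical answer function that returns a reply and a confidence score; the stand-in logic only illustrates the assertions a real suite would make against the model.

```python
FALLBACK_TEXT = "I'm not sure about that, but I can connect you with an agent."


def answer(query: str) -> dict:
    """Hypothetical stand-in for the chatbot's reply function."""
    known = {"baggage allowance": "You may bring one carry-on bag up to 8 kg."}
    for topic, reply in known.items():
        if topic in query.lower():
            return {"text": reply, "confidence": 0.95}
    return {"text": FALLBACK_TEXT, "confidence": 0.2}


def test_common_query_is_answered_confidently():
    result = answer("What is the baggage allowance for economy?")
    assert "carry-on" in result["text"]
    assert result["confidence"] >= 0.8


def test_unknown_query_falls_back_politely():
    result = answer("Can my ferret travel in the cabin?")
    assert result["text"] == FALLBACK_TEXT  # graceful handoff, no invented policy
```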
Integration Testing
With integration testing, the phase that follows unit testing, things get more complicated. Subsystems that perform optimally when tested alone can turn out to be incompatible once integrated. It is common to find a well-designed language model that performs poorly when paired with a simplistic knowledge base or a crude UI.
For example, in one of our projects, we discovered that when the AI assistant was integrated with the customer database, it slowed down noticeably. In light of this integration issue, the data retrieval system required a complete overhaul. The lag was discovered during load testing, which entailed running high usage scenarios to determine how the system performs under heavy load.
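A simplified load-test sketch in the spirit of that exercise follows: ask_assistant is a hypothetical stand-in for the integrated assistant-plus-database call, and the latency budget is an assumed figure, not the project's real threshold.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def ask_assistant(query: str) -> str:
    """Hypothetical stand-in for the integrated assistant + customer-database call;
    a real test would hit a staging endpoint instead."""
    time.sleep(0.05)  # simulated end-to-end latency
    return f"answer to: {query}"


def measure_latencies(n_requests: int = 100, concurrency: int = 20) -> list:
    def timed_call(i: int) -> float:
        start = time.perf_counter()
        ask_assistant(f"Where is order {i}?")
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(n_requests)))


def test_p95_latency_stays_within_budget():
    latencies = measure_latencies()
    p95 = statistics.quantiles(latencies, n=20)[-1]  # 95th percentile
    assert p95 < 2.0, f"p95 latency too high: {p95:.2f}s"  # assumed budget
```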
Another example involves an e-commerce company that wanted to integrate its AI-powered recommendation system with a large product database and a dynamically updated user profile system. While individual tests established that each component functioned properly, integration tests exposed performance issues in generating recommendations from live user data, which required further work on database queries and caching.
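One common remedy, sketched below under assumptions of our own rather than as the company's actual fix, is to cache rarely changing product metadata with a coarse expiry bucket while leaving live user-profile data uncached. The fetch_product_from_db stub and the TTL value are hypothetical.

```python
import time
from functools import lru_cache


def fetch_product_from_db(product_id: str) -> dict:
    """Hypothetical stand-in for the slow catalogue query found during integration."""
    time.sleep(0.2)
    return {"id": product_id, "name": f"Product {product_id}", "price": 19.99}


def _time_bucket(ttl_seconds: int = 300) -> int:
    # Coarse bucket so cached entries effectively expire every ttl_seconds.
    return int(time.time() // ttl_seconds)


@lru_cache(maxsize=10_000)
def _cached_product(product_id: str, bucket: int) -> dict:
    return fetch_product_from_db(product_id)


def get_product(product_id: str) -> dict:
    # Product metadata changes rarely, so it is safe to cache; live user-profile
    # data bypasses this layer entirely and is always read fresh.
    return _cached_product(product_id, _time_bucket())
```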
System Testing
System testing has its own challenges in store. This is where the true character of your AI system is revealed: the good, the bad, and the ugly. An AI designed for casual conversation can score 10/10 on all previous tests and still struggle to stay on topic or understand user intent in longer, more complex exchanges.
AI assistants for specialized applications, such as medical chatbots, may be inclined to use complicated terminology that confuses customers. This kind of issue can be identified by role-playing extended customer interactions and then fine-tuning the language model to give simpler answers. System testing makes sure such situations won't occur in a production setting.
Let's look at each system testing phase separately.
1. Comprehensive Assessments
The test engineer executes functional tests to verify all the significant functionalities of the AI system. For instance, when testing an AI financial assistant, these evaluations confirm that financial analyses and recommendations align with sound financial frameworks and applicable laws.
2. Usage Scenarios
Realistic usage scenarios are tested to understand how the AI system will act in real life. These tests emulate real-world user behavior to ensure stability in live conditions. For example, an e-commerce AI has to answer a broad range of questions about products, returns, and shipping when interacting with real customers.
3. Extended Conversations
Long-interaction tests check whether the AI's context and coherence remain intact over an extended engagement. For instance, in a customer service setting, the chatbot should handle and address issues appropriately while maintaining the context of the conversation no matter how long it lasts (a minimal sketch of such a check follows this list).
4. Stress Testing
Stress tests are conducted to determine how robust the AI system is during periods of heavy demand and usage. As an illustration, a retail AI system can be tested under high traffic loads to check its ability to respond to a sudden influx of requests due to a flash sale.
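As promised above, here is a minimal illustration of the extended-conversation check. It uses a hypothetical SupportBot whose stand-in logic simply recalls the first order number mentioned; a real test would call the deployed assistant and score its replies.

```python
class SupportBot:
    """Hypothetical multi-turn bot that keeps the running dialogue history."""

    def __init__(self):
        self.history = []

    def chat(self, user_message: str) -> str:
        self.history.append(user_message)
        # Stand-in logic: recall the first order number mentioned in the dialogue.
        for message in self.history:
            for token in message.split():
                if token.startswith("#"):
                    return f"Regarding order {token}: I'm looking into it."
        return "Could you share your order number?"


def test_context_survives_a_long_conversation():
    bot = SupportBot()
    bot.chat("Hi, my order #A1234 arrived damaged.")
    for i in range(30):  # filler turns between the complaint and the follow-up
        bot.chat(f"Also, an unrelated question number {i}.")
    reply = bot.chat("So what happens with my original problem now?")
    assert "#A1234" in reply  # the issue raised 30 turns ago is still in context
```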
Once these exhaustive tests are performed, the test engineer gathers all the test data from the Test Data Repository (TDR). The data is then compiled into the system testing report, which outlines the system's strengths, weaknesses, and the areas that need further work.
This comprehensive report helps stakeholders make well-informed decisions about further improving the AI system so that it meets user expectations in practical scenarios.
User Acceptance Testing
The last stage of quality assurance, user acceptance testing (UAT), is where theory meets reality. Real users often behave more creatively than controlled testing can predict. A gaming company experienced this firsthand when its AI game master, meant to provide an engaging narrative for players, was consistently tricked by gamers into disclosing plot endings and game directions. Such interactions were not foreseen during the earlier test phases, which underlines how unpredictable usage patterns can be.
Let's explore the UAT process in detail. During this phase, real users engage with the AI assistant, often a limited group drawn from the target market for the finished product. The interactions are kept as close to real-life usage as possible to confirm that the AI system behaves correctly under real conditions.
For instance, the gaming company might release its AI game master in beta to a limited number of gamers. These users would interact with the AI in every way possible, constantly putting the system to the test and uncovering its flaws.

Once real users have interacted with the system, their feedback is captured through the Feedback System. This feedback loop is important because it provides vital information on how end users perceive the AI system. The Feedback System gathers the observations, synthesizes them, and identifies areas that need improvement. The processed feedback is then passed back to the engineering team for further iteration. Refining the AI system iteratively based on user feedback ensures that end-user expectations are met and the user experience is up to standard.
In our example, based on the feedback received, the development team might choose to improve the coherence of the storyline and/or add measures to detect attempts to manipulate the AI.
UAT is the last and perhaps most important stage in testing a conversational AI system. It marks the transition from the laboratory to an operational setting, where the system has to work and deliver value to users in sometimes unpredictable real-world conditions.
Special Considerations for Testing LLMs
Testing LLMs in particular multiplies the difficulties, because the models are complex and language is a subtle phenomenon. These models require targeted testing in specific areas before they can be considered dependable and efficient.
One significant issue with LLMs is identifying bias. Bias in AI usually stems from training data that does not fully or faithfully represent social and cultural diversity, and it can result in recommendations or answers that are stereotyped or prejudiced.
For instance, when testing the system, it is crucial to consider how the AI behaves when encountering different dialects and sociolects. The AI may handle particular vernaculars or ethnic dialects better than others, pointing to a bias that requires correction. To address such biases, more diverse data must be added to the training set, and specific unit tests can be run to ensure that the AI offers fair and inclusive responses.
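One lightweight way to encode such checks is a parametrized test that sends the same intent phrased in different varieties of English and asserts comparable helpfulness. The answer function and the is_helpful heuristic below are hypothetical placeholders for the real model call and a proper evaluation rubric.

```python
import pytest

# The same intent phrased in different varieties of English; a biased system
# might serve some speakers noticeably worse than others.
PHRASINGS = [
    "Where is my parcel?",
    "Where's my package at?",
    "My package ain't arrived yet, can you check on it?",
]


def answer(query: str) -> str:
    """Hypothetical stand-in for the real model call in a test environment."""
    return "I can help with that. Could you share your tracking number, please?"


def is_helpful(reply: str) -> bool:
    # Crude proxy; a production suite might use a rubric or an evaluator model.
    return "tracking number" in reply.lower()


@pytest.mark.parametrize("query", PHRASINGS)
def test_equally_helpful_across_language_varieties(query):
    assert is_helpful(answer(query))
```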
Another difficulty concerns language specifics across different languages and dialects. An AI model might handle standard, textbook language perfectly well, yet fail to understand colloquialisms or regional idioms and therefore communicate poorly. One solution is retraining the models on corpora from the target region. Consider an international e-commerce platform: its AI customer service agent must understand both formal and informal vocabulary. During system testing, many multilingual test cases would be run to assess the AI's flexibility and performance in different linguistic contexts.
Similarly, context management, a long-sought objective of conversational AI, often turns out to be the weak link in the later stages of prototyping. For instance, in a pilot test of an AI built for a financial advisory firm, the system performed well in simple interactions but struggled in prolonged conversations, occasionally giving advice based on information provided early on and never updated as the discussion progressed. Addressing this requires sophisticated context handling and sustained learning from dialogue flows to keep inferences precise and timely.
In knowledge-intensive financial advisory applications, the AI should preserve context over long interactions. During unit and system testing, particular attention should be paid to how the AI behaves over an extended dialogue and whether it updates its advice as new information arrives.
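Below is a minimal sketch of that kind of check, assuming a hypothetical AdvisorBot: the user revises a figure mid-conversation, and the test asserts that the advice follows the latest value rather than the first one.

```python
class AdvisorBot:
    """Hypothetical advisory bot that should base advice on the latest figure given."""

    def __init__(self):
        self.budget = None

    def chat(self, message: str) -> str:
        # Stand-in logic: remember the most recent dollar amount mentioned.
        for word in message.replace("$", "").split():
            cleaned = word.replace(",", "").rstrip(".!?")
            if cleaned.isdigit():
                self.budget = float(cleaned)
        if "recommend" in message.lower() and self.budget is not None:
            return f"Based on your ${self.budget:,.0f} budget, consider a diversified index fund."
        return "Noted."


def test_advice_reflects_the_latest_information():
    bot = AdvisorBot()
    bot.chat("I can invest about $10,000 this year.")
    bot.chat("Actually, make that $4,000 because of some upcoming expenses.")
    reply = bot.chat("What do you recommend?")
    assert "4,000" in reply and "10,000" not in reply
```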
For performance evaluation of LLMs, metrics such as response efficacy, response time, and user satisfaction can be used; these metrics help monitor whether the AI system is functioning smoothly and properly.
Real-time monitoring dashboards, stress testing, and performance benchmarking can be employed to determine how well the LLM handles queries under different circumstances. This monitoring ensures that the system performs well not only under normal loads but also under pressure.
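As a sketch of how such telemetry might be aggregated, the snippet below assumes a simple record format of our own, not a specific monitoring product's API.

```python
import statistics

# Assumed telemetry format: each record logs latency and whether the user
# rated the answer as helpful (e.g., a thumbs-up in the chat widget).
interactions = [
    {"latency_s": 0.8, "helpful": True},
    {"latency_s": 1.4, "helpful": True},
    {"latency_s": 3.1, "helpful": False},
    {"latency_s": 0.9, "helpful": True},
]

latencies = [record["latency_s"] for record in interactions]
report = {
    "avg_latency_s": round(statistics.mean(latencies), 2),
    "max_latency_s": max(latencies),
    "satisfaction_rate": sum(record["helpful"] for record in interactions) / len(interactions),
}
print(report)  # in practice, fed to a dashboard and checked against alert thresholds
```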
Feedback and iteration continue throughout the lifecycle of LLM-based systems. This approach makes it easier to detect new problems as users provide feedback during beta testing and enables continuous improvement through successive iterations.
For example, if users report that the AI's responses are sometimes unclear, successive improvement cycles would refine response generation to reduce verbosity and increase clarity. This continuous feedback loop helps the AI move steadily closer to the level users expect.
Final Thoughts
To summarize, testing and deploying conversational AI systems is a complex process that must be approached from several angles beyond the standard software testing paradigm. From unit tests that target specific components to system tests that exercise the whole product, each phase reveals its own issues, especially when working with LLMs.
Testing is not only about meeting strictly defined quantitative criteria; it is about developing an AI that converses with human beings fairly, transparently, and, of course, accurately. The path to that goal is filled with trials, iterations, and obstacles, yet it results in robust, ethical, and efficient AI systems that deliver tangible value.