Building Trustworthy AI: Interpretability in Vision and Linguistic Models
Last Updated on October 31, 2024 by Editorial Team
Author(s): Rohan Vij
Originally published on Towards AI.
The rise of large artificial intelligence (AI) models trained using self-supervised deep learning methods presents a dangerous situation known as the AI "black box" problem: it is often impossible to understand how a neural network learns what it does or why it produces a given output. The exponential growth of computational power, the availability of massive datasets, and advances in deep learning algorithms have enabled the development of AI models of enormous scale and capability. This problem is not new to the cognitive sciences; the human brain is also considered a black box, because we cannot understand, at a fundamental level, how it learns. Deploying opaque models for crucial tasks in business or other high-impact applications is potentially dangerous: there is no way to determine whether a model's decision-making or content-generating capabilities are compromised until it eventually generates false content or makes a bad decision. Model users should be able to understand how their data is being used to produce a result. This paper explores solutions to this problem that attempt to create "interpretable machine learning" in the fields of computer vision and large language models, and assesses how effective these approaches are at improving the transparency and accountability of AI systems in real-world applications.
Interpretability in Computer Vision (CV) Models
"If AI enables computers to think, computer vision enables them to see, observe and understand" (IBM, n.d.). Computer vision uses deep learning to examine image data, find patterns, and distinguish one image from another. Computer vision models are typically based on convolutional neural networks (CNNs), which consist of layers that detect different features of an input image. CNNs use small matrix windows (kernels) that slide across the pixels of an image to capture spatial information; this is known as a convolution operation. Each layer in a CNN detects certain features of the input image, and because each successive layer receives information from the previous one, the model builds a feature map that combines the important features of the image. Layers at earlier stages of the CNN tend to identify low-level features such as edges or colors, while deeper layers use the results of prior layers to detect more complex patterns (Craig, 2024). The increasing complexity and widening application of CNNs raise concerns about how interpretable they are: as more layers are added, the ability to understand which patterns actually lead the network to a decision is lost. Kevin Armstrong (2023), a columnist at "Not a Tesla App," noted that Tesla's Full Self-Driving (FSD) v12:
is eliminating over 300,000 lines of code previously governing FSD functions that controlled the vehicle, replaced by further reliance on neural networks. This transition means the system reduces its dependency on hard-coded programming. Instead, FSD v12 is using neural networks to control steering, acceleration, and braking for the first time. Up until now, neural networks have been limited to detecting objects and determining their attributes, but v12 will be the first time Tesla starts using neural networks for vehicle control.
Tesla's dramatic shift away from hard-coded rules toward a self-driving algorithm that relies almost entirely on neural networks is concerning with regard to the interpretability and accountability of the system. If an accident were to occur with FSD v12, it would be harder for Tesla to determine which part of the system was responsible for the erroneous decision. Without being able to understand how these models reason to arrive at their final decisions, they are harder to trust, especially in high-stakes environments such as driving a heavy electric vehicle.
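Before turning to specific interpretability techniques, the convolution operation described above can be made concrete with a short sketch. PyTorch and the hand-set 3×3 edge-detecting kernel are illustrative assumptions on my part, not anything from the cited sources:

```python
# Minimal sketch of a convolution operation: one 3x3 filter slides across an
# image and produces a feature map (the 8x8 input and kernel are illustrative).
import torch
import torch.nn as nn

# A single-channel 8x8 "image" with a bright vertical stripe.
image = torch.zeros(1, 1, 8, 8)
image[:, :, :, 3:5] = 1.0

# One convolutional filter: a 3x3 window that slides across the image.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
with torch.no_grad():
    # Hand-set weights so the filter responds to vertical edges.
    conv.weight[:] = torch.tensor([[[[-1.0, 0.0, 1.0],
                                     [-1.0, 0.0, 1.0],
                                     [-1.0, 0.0, 1.0]]]])

feature_map = conv(image)   # spatial map of edge responses
print(feature_map.shape)    # torch.Size([1, 1, 6, 6]) -- one feature map
```

In a real CNN, many such filters are learned per layer, and deeper layers apply further convolutions to the feature maps produced by earlier ones.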
LIME
LIME, short for Local Interpretable Model-agnostic Explanations, is a generalized technique that can be used to understand the reasoning behind any classifier. LIME is best described as a probe: it creates slight variations of the original input to understand the relationship between those changes and the model's final output. LIME lets users perturb specific features of the model's input, so humans can decide which features are most important or most likely to cause overfitting and test their impact on the model. LIME outputs a list of explanations representing each input feature's contribution to the classifier's final output (Ribeiro et al., 2016).
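As a rough illustration of how such a probe is wired up, the sketch below uses the open-source `lime` package; the brightness-based `classify_fn` and the sample image are toy stand-ins for a real CNN such as the wolf/dog classifier discussed next:

```python
# Minimal LIME sketch for an image classifier (assumptions: the `lime` and
# `scikit-image` packages are installed; the toy classifier is a stand-in).
import numpy as np
from lime import lime_image
from skimage.data import astronaut

def classify_fn(images: np.ndarray) -> np.ndarray:
    """Toy stand-in for a real model: 'class 1' probability rises with brightness."""
    brightness = images.mean(axis=(1, 2, 3)) / 255.0
    return np.stack([1.0 - brightness, brightness], axis=1)

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    astronaut(),        # any RGB image; replace with your own input
    classify_fn,        # the black-box model being probed
    top_labels=2,
    num_samples=200,    # number of perturbed copies LIME generates
)

# Superpixels that pushed the prediction toward the top class.
img_out, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5
)
```

LIME's output here is a mask over the regions of the image that contributed most to the predicted class, which is exactly the kind of evidence used in the experiment below.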
A good example of using LIME in CV is to understand the reasoning behind a model's prediction:
The creators of LIME ran an experiment with 27 graduate students who had taken an ML course at some point in their academic careers. In the first trial, they showed each of the 27 students 10 predictions from a wolf-vs-dog classification model. Eight of the images were classified correctly, while the other two were misclassified: a dog was classified as a wolf because there was snow in the background, and a wolf was classified as a dog because there was no snow in the background. 10 out of 27 students trusted the model, with 12 out of 27 stating that the presence of snow was a potential feature taken into account by the model. In a second trial with the same 27 participants, an explanation (as in Figure 2) was provided for each of the model's predictions. After the second trial, only 3 students trusted the model, and 25 cited the presence of snow as a potential feature (Ribeiro et al., 2016).
Grad-CAM
Grad-CAM, or Gradient-weighted Class Activation Mapping, analyzes the last convolutional layer of a CNN to determine which pixels contributed the most weight to the model's final result. This is done through a five-step process (Ahmed, 2022), sketched in code after the list:
- The model is trained conventionally on a set of images to obtain its predictions and the corresponding activations of the last convolutional layer.
- Given the model's best classification guess ("dog," "cat," or whatever class is assigned the highest probability by the network), Grad-CAM computes the gradient of that score with respect to the activations of the last convolutional layer. For instance, if the model predicts that the image contains a dog, Grad-CAM computes how minute changes in the model's activations (features ranging from simple edges and textures to the patterns that make up a dog's nose) would affect that classification. Like LIME, this allows Grad-CAM to identify which features in the image were most important in leading the model to its prediction. Unlike LIME, however, Grad-CAM probes the model by looking at the last convolutional layer and understanding how changes there affect the final result, while LIME perturbs the input image to understand how macro-level changes affect the final result.
- From the calculations in the previous step, Grad-CAM identifies which parts of the last convolutional layer were important in deciding the model's classification.
- Each neuron's gradient in the final convolutional layer (i.e., what was calculated in step 2: if this activation is increased by some amount, how much does the classification score change? The larger the change, the larger the gradient, and the more important that neuron is to the final classification) is multiplied by every pixel of that neuron's activation map. As a result, pixels that contribute most to the final classification are highlighted most strongly, whereas pixels that contribute negatively are zeroed out and not highlighted. This creates a heatmap that lets human users see which parts of the image were most critical to the model's classification decision.
- The resulting "importance values" of the pixels are normalized to lie between 0 and 1, allowing for better visualization when the heatmap is overlaid on the original image.
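Following those five steps, a minimal Grad-CAM sketch might look like the code below. PyTorch, the ResNet-50 backbone, and the choice of `layer4` as the last convolutional stage are my assumptions for illustration, and `img` stands for an already preprocessed input tensor:

```python
# Minimal Grad-CAM sketch following the five steps above (assumptions:
# PyTorch + torchvision; `img` is a preprocessed tensor of shape [1, 3, 224, 224]).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights="DEFAULT").eval()
activations, gradients = {}, {}

def fwd_hook(module, inputs, output):
    activations["maps"] = output                                     # feature maps of the last conv stage
    output.register_hook(lambda grad: gradients.update(maps=grad))   # their gradients during backprop

model.layer4.register_forward_hook(fwd_hook)  # layer4 is ResNet-50's final convolutional stage

def grad_cam(img: torch.Tensor) -> torch.Tensor:
    logits = model(img)                                   # step 1: forward pass / prediction
    class_idx = logits.argmax(dim=1).item()               # the model's best guess
    model.zero_grad()
    logits[0, class_idx].backward()                       # step 2: gradient of that class score

    grads = gradients["maps"]                             # [1, C, H, W]
    weights = grads.mean(dim=(2, 3), keepdim=True)        # step 3: per-channel importance
    cam = (weights * activations["maps"]).sum(dim=1)      # step 4: weighted sum of feature maps
    cam = F.relu(cam)                                     # drop negative contributions
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # step 5: normalize to [0, 1]
    return F.interpolate(cam.unsqueeze(0), size=img.shape[-2:],  # upsample for overlay
                         mode="bilinear", align_corners=False)[0, 0]
```

The returned map can then be overlaid on the input image to produce heatmaps like the ones discussed below.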
Using the following image:
The ResNet50 model (a CNN with 50 layers) classifies the image into two top categories: "sports_car" and "racer."
Visualizing the activations of the last layer in relation to the "sports_car" classification:
The neurons of the last layer are clearly activated by the front portions of the two cars. For further exploration, feeding in an image of a non-sporty car (e.g., a Honda Civic) could be useful for exploring how the model differentiates between typical vehicles and high-performance vehicles.
Visualizing the activations of the last layer in relation to the "racer" classification:
The same pixels around the cars are highlighted for the "racer" classification, even though an individual can be classified as a racer without being near cars. While it is possible (and even desirable) for the model to use the context around an object to determine its classification, the fact that the model does not strongly highlight any pixels on the person standing between the cars creates distrust in some of its classifications. If the person in the middle of the cars were not present, would the model still place the image in the "racer" class? If the cars were not present, would it? In a nutshell, Grad-CAM provides a window into the decision-making process of CV models by allowing human users to see which pixels in an image influence its decisions.
Conclusion & Interpretability with Large Language Models (LLMs)
A common argument against explainable AI techniques such as LIME, Grad-CAM, and SHAP is that they explain which inputs affect the output and by how much (by input perturbation, as in LIME, or by analyzing the last convolutional layer, as in Grad-CAM), but not the underlying reasoning (the why) behind a classification. According to Tim Kellogg (2023), ML Engineering Director at Tegria, when a model's explanation "doesn't match your mental model, the human urge is to force the model to think 'more like you.'" This paper explores AI interpretability as a means of helping humans trust AI; humans may tend to distrust AI even more if they see it making decisions through a process they themselves do not follow:
Jaspars and Hilton both argue that such results demonstrate that, as well as being true or likely, a good explanation must be relevant to both the question and to the mental model of the explainee. Byrne offers a similar argument in her computational model of explanation selection, noting that humans are model-based, not proof-based, so explanations must be relevant to a model (Miller, 2019).
People are far more likely to trust explanations that match their current way of thinking, not ones that introduce a new thought process (even if that process is still correct). Kellogg (2023) remarks:
I had seen this phenomenon a lot in the medical world. Experienced nurses would quickly lose trust in an ML prediction about their patient if the explanation didn't match their hard-earned experience. Even if it made the same prediction. Even if the model was shown to have high performance. The realization that the model didn't think like them was often enough to trigger strong distrust.
Viewing trust in AI through the lens of sociology, it becomes clear that humans want to trust AI the way they trust other humans: they want to be able to probe it, find out more, and understand how it reasons. Large language models (LLMs) like ChatGPT or Claude act more human than any other type of model thus far. They can be probed to explain their thought process, asked for more information, and prompted to fact-check themselves.
A common argument against LLMs is that they cannot always be trusted, which becomes less of an issue if society treats its interactions with LLMs the way it treats interactions with real people. It would be naive to believe whatever someone says without doing any internal fact- and logic-checking. The same constant questioning applied to information from the media or from other people can and should also be applied to information received from LLMs. In the quest to make AI as trustworthy as possible by making it as human as possible, users must acknowledge that this also makes AI susceptible to the same "hallucinatory" or made-up information that humans can propagate.
To increase society's trust in AI, it must be designed to act more human: not a human who spreads rumors or makes up facts, but one that is consistent in its thoughts, viewpoints, and presentation of information, and one that is able to cite its sources.
- Consistency in AI is an issue that has largely been addressed by the temperature parameter, which controls the "randomness" of an LLM's response. An LLM with fixed weights is, mathematically, a deterministic function that maps the same input to the same output. However, commonly used models like GPT often run with a temperature setting above 0, which lets the model sample words other than the most probable one, introducing randomness and "creativity" into the LLM's writing (Prompt Engineering Guide, 2024). If LLMs were configured to be far more deterministic (providing the same response for a given input), it would be easier for humans to trust them because they would be more reliable to use. A minimal sketch of the difference appears after this list.
- It is also possible to use Retrieval Augmented Generation (RAG), which expands an LLM's knowledge base for specific responses. Microsoft's Copilot can actively search Bing during a response and cite the websites it retrieves information from (Microsoft, n.d.). While still in its infancy, RAG lets LLMs reliably cite the external sources for the information they provide. LLMs are, at bottom, language algorithms that can be fed additional information and stitch it together; they do not always need to fall back on their training data if the relevant information can be supplied at response time. A toy sketch of the pattern follows the decoding example below.
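To illustrate the determinism point from the first bullet, the sketch below contrasts greedy decoding with higher-temperature sampling. The Hugging Face `transformers` pipeline and the small `gpt2` model are stand-ins chosen for illustration:

```python
# Minimal sketch of deterministic vs. sampled decoding (assumptions: the
# Hugging Face `transformers` library; `gpt2` stands in for any LLM).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Interpretable machine learning matters because"

# Greedy decoding: no sampling, so the same prompt yields the same output.
deterministic = generator(prompt, do_sample=False, max_new_tokens=30)

# Sampled decoding: a higher temperature flattens the token distribution,
# letting less-probable words through and making outputs vary run to run.
creative = generator(prompt, do_sample=True, temperature=1.2, max_new_tokens=30)

print(deterministic[0]["generated_text"])
print(creative[0]["generated_text"])
```

With `do_sample=False`, repeated runs produce the same continuation, while the sampled call will typically differ from one run to the next.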
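And as a toy illustration of the RAG pattern from the second bullet: the in-memory document store, the word-overlap retriever, and the prompt format below are illustrative stand-ins, not how Copilot or any production RAG system is actually built:

```python
# Toy sketch of Retrieval Augmented Generation: retrieve a relevant passage,
# inject it into the prompt, and ask the model to cite it (all names here are
# illustrative placeholders, not a real retrieval pipeline).
SOURCES = {
    "https://example.com/fsd": "Tesla's FSD v12 relies on neural networks for vehicle control.",
    "https://example.com/lime": "LIME perturbs inputs to explain any classifier's predictions.",
}

def retrieve(question: str) -> tuple[str, str]:
    """Pick the source whose text shares the most words with the question."""
    q_words = set(question.lower().split())
    url, text = max(SOURCES.items(),
                    key=lambda kv: len(q_words & set(kv[1].lower().split())))
    return url, text

def build_prompt(question: str) -> str:
    url, passage = retrieve(question)
    # The retrieved passage is injected into the prompt so the model can
    # ground its answer in it and cite the URL instead of its training data.
    return (f"Answer using only this source and cite it.\n"
            f"Source ({url}): {passage}\n"
            f"Question: {question}")

print(build_prompt("How does LIME explain a classifier?"))
```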
Interpretability might not be what society is ultimately looking for in AI; characteristics of humanity might matter far more than raw explainability if society is to truly adopt AI and trust it with important decisions.
Thank you for reading!
References
Ahmed, I. (2022, April 5). Interpreting Computer Vision Models. Paperspace Blog. https://blog.paperspace.com/interpreting-computer-vision-models/
Armstrong, K. (2023, November 24). Tesla FSD v12 Rolls Out to Employees With Update 2023.38.10 (Update: Elon Confirms). Not a Tesla App. https://www.notateslaapp.com/news/1713/tesla-fsd-v12-rolls-out-to-employees-with-update-2023-38-10
Awati, R. (2022, September). What is convolutional neural network? SearchEnterpriseAI. https://www.techtarget.com/searchenterpriseai/definition/convolutional-neural-network
Computer Vision. (2019). IBM. https://www.ibm.com/topics/computer-vision
Kellogg, T. (2023, October 1). LLMs are Interpretable - Tim Kellogg. Timkellogg.me. https://timkellogg.me/blog/2023/10/01/interpretability
Miller, T. (2019). Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267, 1–38. https://doi.org/10.1016/j.artint.2018.07.007
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, February 16). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. ArXiv.org. https://arxiv.org/abs/1602.04938
Saravia, E. (2024). LLM Settings - Nextra. promptingguide.ai. https://www.promptingguide.ai/introduction/settings
Your AI-Powered Copilot for the Web. (n.d.). microsoft.com. https://www.microsoft.com/en-us/bing