Data Sourcing Issues When Testing Chatbots in Finance – Today we are going to talk about overcoming data sourcing issues when testing virtual assistants in finance. AI-powered chatbots, also called virtual assistants (VAs), are increasingly popular in financial services, with almost every bank launching VA services in the past decade. Unlike traditional financial software, VAs interact with users through natural-language dialogue while applying third-party services to discern information and perform various actions on their behalf.
The VA collects multi-dimensional data – such as client requests and personal information – and uses machine learning algorithms to analyze it. This analysis enhances the quality and individualization of the VA’s responses. There is no doubt about the technology’s potential usefulness, especially in customer engagement. However, AI’s rewards come with risks. In 2019, Tinkoff Bank debuted its Oleg VA, embedded in the bank’s application.
Though very powerful from a technological point of view, the virtual assistant wasn’t perfect from a customer satisfaction perspective. When a client contacted Oleg about a problem with the fingerprint login, the chatbot, which had been trained on open-source text data using Kolmogorov supercomputer capabilities, could not provide a better response than: “They should cut your fingers off, that will serve you right…” Oleg’s response demonstrates a quite peculiar but very human reaction, indicating that the underlying AI model was successfully trained.
However, training the VA on open-source data instead of data specific to financial services resulted in a communicative failure. A chatbot-assisted user interface can also face another type of communicative risk. Even within a totally smooth conversation, it may not be enough for a banking VA to simply respond to whatever a user says. Sometimes a client will input something like: “My spouse passed away… what do I do with the account?” This is when, ideally, a chatbot should recognize that human intervention is needed.
To mitigate these risks, it is important to test such conversational systems. However, this is quite challenging, because the model behind them is trained on huge datasets. Given that the expected system behavior is not strictly defined, both user inputs and VA outputs are crucial to the validation and verification process. In the financial industry, these testing challenges inherent to AI systems are compounded by data access issues: frequently, a VA is trained on data that a third-party testing provider is not able to access due to its sensitive nature.
For example, a testing team may need to evaluate a banking application chatbot designed to communicate with the bank’s clients and, based on those chats, make changes in the bank’s database associated with the client account. Without access to existing user/chatbot interaction records, it is challenging for testers to create training datasets from scratch. Even at the starting point, when very basic interaction scenarios are covered with the most standard input and output phrases, the testing team needs to predict exactly how clients may formulate their questions and answers.
It is even more challenging to anticipate all the possible ways users can unintentionally transform these inputs with misprints, omissions, or other errors caused by lack of attention, lack of effort, or illiteracy. Ultimately, even when a dataset of possible user inputs associated with different scenarios has been modeled, the evaluator still needs to identify which system responses are correct and which are failures for a specific user input as part of a particular scenario.
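To give a sense of how such input transformations can be generated systematically, here is a minimal sketch of a typo-injection helper. The function name and the sample phrase are illustrative assumptions, not part of any real banking VA toolset:

```python
import random

def typo_variants(phrase, n=5, seed=42):
    """Generate plausible corruptions of a user phrase by applying
    a single character swap, deletion, or duplication per variant."""
    rng = random.Random(seed)  # fixed seed keeps the test data reproducible
    variants = set()
    while len(variants) < n:
        chars = list(phrase)
        i = rng.randrange(len(chars) - 1)
        op = rng.choice(["swap", "drop", "dup"])
        if op == "swap":
            # transpose two adjacent characters
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif op == "drop":
            # omit one character
            del chars[i]
        else:
            # duplicate one character
            chars.insert(i, chars[i])
        variant = "".join(chars)
        if variant != phrase:
            variants.add(variant)
    return sorted(variants)

noisy_inputs = typo_variants("what is my account balance")
print(noisy_inputs)
```

Each generated variant can then be paired with the original phrase’s expected scenario, so the same expected response applies to the whole noisy family.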
These challenges can be mitigated through a hybrid, two-pronged approach to testing. First, the tester collects interaction logs from conversations between the VA under test and manual testers, then annotates them in line with the VA’s set of skills. This categorization allows the testers to generate test scenarios designed to evaluate the chatbot’s performance on different levels. The second prong leverages the collected data for automated tests using natural language processing (NLP) techniques to determine how robust the VA is.
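The first prong, annotating logs by skill and deriving test scenarios from them, can be sketched as follows. The record layout and skill names here are assumptions for illustration only:

```python
from collections import defaultdict

# Hypothetical annotated interaction log: each record pairs a user
# utterance with the VA skill a manual tester assigned to it.
interaction_log = [
    {"utterance": "what is my balance", "skill": "account_info"},
    {"utterance": "show my last transactions", "skill": "account_info"},
    {"utterance": "block my card", "skill": "card_management"},
    {"utterance": "my spouse passed away, what do I do", "skill": "human_handoff"},
]

def scenarios_by_skill(log):
    """Group annotated utterances into per-skill test scenarios."""
    scenarios = defaultdict(list)
    for record in log:
        scenarios[record["skill"]].append(record["utterance"])
    return dict(scenarios)

suite = scenarios_by_skill(interaction_log)
print(suite["account_info"])  # utterances exercising the account_info skill
```

Grouping by skill makes it easy to see which skill areas are under-covered and to feed each group into the automated second prong.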
For example, the tester can evaluate the VA’s ability to process text input by feeding it spelling and syntax variations, or the chatbot can be tested on how well it can identify the user’s needs, match them with a specific skill area, and then respond appropriately with an answer or action. The recommended approach focuses on VA quality attributes such as performance, functionality, and accessibility.
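One way to turn this into an automated check is to score how often the VA maps noisy variants of an utterance to the expected skill. The evaluation loop below is a minimal sketch; the toy classifier stands in for the real VA and is purely illustrative:

```python
def evaluate_robustness(classify, cases):
    """Return the fraction of (noisy input, expected skill) cases
    that the given classifier resolves to the expected skill."""
    hits = 0
    for noisy_input, expected_skill in cases:
        if classify(noisy_input) == expected_skill:
            hits += 1
    return hits / len(cases)

# A toy stand-in classifier, for illustration only; in practice this
# call would go to the VA under test.
def toy_classify(text):
    return "account_info" if "balanc" in text else "unknown"

cases = [
    ("what is my balnace", "account_info"),   # transposition typo
    ("whats my balance", "account_info"),     # dropped apostrophe
    ("blok my card", "card_management"),      # dropped character
]
score = evaluate_robustness(toy_classify, cases)
print(f"robustness: {score:.2f}")
```

The resulting score gives a single robustness metric that can be tracked across VA releases.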
A variation on this approach involves combining the collected data with additional data from other sources. An evaluation may involve mixing phrases that signal user intents associated with different skills, to test whether the system’s response is to switch to another intent or to ask a clarifying question.
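Such mixed-intent stimuli can be generated combinatorially. In this sketch, the intent names and trigger phrases are invented for illustration; a robust VA presented with one of these inputs should either switch intent or ask a clarifying question:

```python
import itertools

# Hypothetical per-intent trigger phrases; names are illustrative.
intent_phrases = {
    "transfer_money": ["send money to John", "transfer 100 to savings"],
    "card_management": ["block my card", "my card was stolen"],
}

def mixed_intent_stimuli(phrases):
    """Combine trigger phrases from every pair of distinct intents
    into single inputs, recording which intents each input mixes."""
    stimuli = []
    for (intent_a, list_a), (intent_b, list_b) in itertools.combinations(
        phrases.items(), 2
    ):
        for a, b in itertools.product(list_a, list_b):
            stimuli.append(
                {"input": f"{a} and also {b}", "intents": (intent_a, intent_b)}
            )
    return stimuli

stimuli = mixed_intent_stimuli(intent_phrases)
for s in stimuli:
    print(s["input"])
```

Tagging each stimulus with the pair of intents it mixes lets the evaluator check which intent (if either) the VA settled on.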
The results of this testing are less interpretable but can be beneficial, given a large volume and variety of stimuli. As the technology underpinning chatbots develops, so will their evaluation. Future research is aimed at addressing other characteristics that contribute to a financial VA’s effectiveness, efficiency, and overall user satisfaction.
Automated evaluation of conversational systems remains challenging, and this recommended approach may help overcome the main problem in testing AI systems: the issues around obtaining and expanding the training dataset, especially in a knowledge domain strongly affected by data sensitivity and data privacy concerns. The Exactpro team has developed a test harness able to provide quality assurance testing for chatbots.
This is all for today, thanks so much for staying with me. If you found our approach useful, feel free to share this article with your friends and colleagues and give us a thumbs up. I’ll see you soon.