The Four Pillars of Chatbot Architecture
And Why Two of These Pillars Need to Be Deprecated
Most chatbot architectures rest on four pillars: intents, entities, the dialog flow (a state machine), and scripts.
The First Pillar: Intents
In most chatbot design endeavors, the process starts with intents. But what are intents? Think of it like this: a large part of what we call the human experience is intent discovery. If a clerk or general assistant is behind a desk and a customer walks up to them, the assistant’s first action is intent discovery: trying to work out the intention of the person entering the store, bank or company.
We perform intent discovery dozens of times a day, without even thinking about it.
Google is the biggest intent discovery machine in the world!
The Google search engine can be considered a single dialog-turn chatbot. Its main aim is to discover your intent, and then return relevant information based on the discovered intent. Even the way we search has changed, almost without our noticing: we no longer search with keywords, but in natural language.
Intents can be seen as purposes or goals expressed in a customer’s dialog input. By recognizing the intent expressed in a customer’s input, the assistant can select an applicable next action.
Existing customer conversations can be instrumental in compiling a list of possible user intents. This data can come from speech analytics (call recordings) or from live agent chat conversations. Lastly, think of intents as the verb.
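To make the idea of recognizing an intent from user input a little more concrete, here is a minimal sketch. It assumes a handful of made-up intents and example utterances, and it scores an input by simple word overlap (Jaccard similarity). A production assistant would use a trained classifier instead; this only illustrates the mapping from utterance to intent.

```python
import re

# Hypothetical intents and example utterances, for illustration only.
TRAINING_EXAMPLES = {
    "get_weather": ["what is the weather like", "will it rain tomorrow", "weather forecast please"],
    "buy_something": ["i want to buy a phone", "can i order a laptop", "purchase a new charger"],
    "greet": ["hello", "hi there", "good morning"],
}


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def classify_intent(utterance, examples=TRAINING_EXAMPLES):
    """Return (intent, score) for the example with the highest word overlap."""
    tokens = tokenize(utterance)
    best_intent, best_score = None, 0.0
    for intent, samples in examples.items():
        for sample in samples:
            sample_tokens = tokenize(sample)
            union = tokens | sample_tokens
            # Jaccard similarity: shared words divided by all distinct words.
            score = len(tokens & sample_tokens) / len(union) if union else 0.0
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent, best_score


intent, score = classify_intent("Will it rain in Paris tomorrow?")
print(intent, round(score, 2))
```

An input that shares no words with any example returns an intent of None, which the assistant could treat as a cue to ask the user to rephrase.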
The Second Pillar: Entities
Entities can be seen as the nouns.
Entities are the information in the user input that is relevant to the user’s intentions.
Where intents can be seen as verbs (the action a user wants to execute), entities represent nouns (for example, the city, the date, the time, the brand or the product). Consider this: when the intent is to get a weather forecast, the relevant location and date entities are required before the application can return an accurate forecast.
Recognizing entities in the user’s input helps you to craft more useful, targeted responses. For example, you might have a #buy_something intent. When a user makes a request that triggers the #buy_something intent, the assistant’s response should reflect an understanding of what the customer wants to buy. You can add a product entity, and then use it to extract information from the user input about the product that the customer is interested in.
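As a rough illustration of a product entity being pulled out of the user input, the sketch below matches the input against a small dictionary of values and synonyms. The entity names, values and synonyms are assumptions made up for this example; real platforms typically combine dictionaries like this with trained extractors.

```python
import re

# Hypothetical entity dictionary: entity name -> canonical value -> synonyms.
ENTITY_VALUES = {
    "product": {
        "laptop": ["laptop", "notebook"],
        "phone": ["phone", "smartphone", "mobile"],
    },
    "city": {
        "Cape Town": ["cape town"],
        "London": ["london"],
    },
}


def extract_entities(utterance, entity_values=ENTITY_VALUES):
    """Return {entity: [canonical values]} for every synonym found as a whole word."""
    text = utterance.lower()
    found = {}
    for entity, values in entity_values.items():
        for canonical, synonyms in values.items():
            for synonym in synonyms:
                # \b ensures we match whole words, not fragments of longer words.
                if re.search(r"\b" + re.escape(synonym) + r"\b", text):
                    found.setdefault(entity, []).append(canonical)
                    break  # one match per canonical value is enough
    return found


print(extract_entities("I want to buy a smartphone in London"))
```

With the product entity in hand, a response to the #buy_something intent can refer to the specific product rather than to "something".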
The Third Pillar: Dialog Flow
The dialog contains the blocks, or states, that a user navigates between. Each dialog node is associated with one or more intents and/or entities. These intents and entities constitute the condition under which that node is accessed.
The dialog node also contains the output to the customer, in the form of a script, or wording if you like.
This is one of the most boring and laborious tasks in creating a chatbot. It can become complex, and changes made in one area can inadvertently impact another. A lack of consistency can also lead to unplanned user experiences. Scaling this environment is tricky, especially if you want to scale across a large organisation.
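A hand-built dialog flow of this kind might look something like the sketch below: a list of nodes, each with a condition on the intent and entities and a scripted output. The node names, conditions and wording are all hypothetical. Note that the order of the nodes matters, which hints at why changes in one place can affect another.

```python
# Hypothetical dialog nodes: each pairs a condition on intent and entities
# with a scripted response. Nodes are evaluated top to bottom.
DIALOG_NODES = [
    {
        "name": "weather_with_city",
        "condition": lambda intent, entities: intent == "get_weather" and "city" in entities,
        "response": "Here is the forecast for {city}.",
    },
    {
        "name": "weather_missing_city",
        "condition": lambda intent, entities: intent == "get_weather",
        "response": "Which city would you like the forecast for?",
    },
    {
        "name": "fallback",
        "condition": lambda intent, entities: True,
        "response": "Sorry, I did not understand that.",
    },
]


def next_response(intent, entities, nodes=DIALOG_NODES):
    """Return (node name, response text) for the first node whose condition holds."""
    for node in nodes:
        if node["condition"](intent, entities):
            # Fill entity values into the scripted wording.
            return node["name"], node["response"].format(**entities)
    raise LookupError("no dialog node matched")


print(next_response("get_weather", {"city": "London"}))
print(next_response("get_weather", {}))
```

Every new user journey means another hand-written node and another condition to keep consistent with the rest.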
Deprecation of the Dialog Flow
What if we could deprecate the manual dialog flow creation process? Imagine defining your intents, entities and scripts, and then having your “call flow” created automatically through supervised machine learning. A good example is the Rasa approach, where the state machine is deprecated and superseded by ML. ML allows you to ditch state machines. A rule-based approach is very rigid and happy-path oriented; it is built from imagined user journeys. Machine learning can pick up patterns from training data and predict user behavior. Complex environments benefit from this.
Rasa Core is not complicated to use. It uses intents, entities and previous actions to determine state and predict the next action. Intent classification and entity extraction form the basis of the conversation.
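The idea of predicting the next action from the intent, the entities and the previous action can be sketched without any framework. The example below is not Rasa's API or its actual policy; it is a simple frequency table learned from a few made-up example conversations, and the story format, intent names and action names are all assumptions. It only shows that the flow can be derived from examples rather than drawn by hand.

```python
from collections import Counter, defaultdict

# Hypothetical training stories: each turn is (intent, entities present, action taken).
STORIES = [
    [("greet", (), "utter_greet"), ("get_weather", ("city",), "utter_forecast")],
    [("greet", (), "utter_greet"), ("get_weather", (), "utter_ask_city"),
     ("inform", ("city",), "utter_forecast")],
    [("get_weather", (), "utter_ask_city"), ("inform", ("city",), "utter_forecast")],
]


def train(stories):
    """Count which action followed each (previous action, intent, entities) state."""
    counts = defaultdict(Counter)
    for story in stories:
        previous_action = "start"
        for intent, entities, action in story:
            state = (previous_action, intent, tuple(sorted(entities)))
            counts[state][action] += 1
            previous_action = action
    return counts


def predict_next_action(counts, previous_action, intent, entities, fallback="utter_fallback"):
    """Return the most frequent action seen for this state, or a fallback if unseen."""
    state = (previous_action, intent, tuple(sorted(entities)))
    if state not in counts:
        return fallback
    return counts[state].most_common(1)[0][0]


model = train(STORIES)
print(predict_next_action(model, "utter_greet", "get_weather", ["city"]))
print(predict_next_action(model, "utter_greet", "get_weather", []))
```

A lookup table like this cannot generalize to states it has never seen, which is exactly where a trained ML model earns its keep.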
If you want to visualize your conversation and see the state machine tree as it is generated, there is a nifty visualization tool.
As chatbot development tools evolve, the deprecation of the state machine (Dialog Flow) will become commonplace.
The Fourth Pillar: Script
Scripts are the wording: the messages you display to the user during the course of the conversation to direct the dialog and to inform the user. The script is often neglected because it is seen as the easy part of the chatbot development process. The underlying reason may be that the script is often addressed at the end of the process and, not being technical in nature, is seen as menial.
The importance of the script lies in the fact that it informs the user of what the next step is, what options are available at that particular point of the conversation, or what is expected of the user. A breakdown in the conversation is often due to the dialog not being accurate. Multiple dialogs can be sent, combining messages. On inaction from the user, follow-up explanatory messages can be sent.
Deprecation of the Script
The manual creation of a script or message for each step of the conversation can be deprecated and replaced by natural-language generation (NLG).
NLG can be viewed as the opposite of natural-language understanding (NLU). With NLU, the system needs to structure and interpret the user input sentence to produce a machine representation; with NLG, the system needs to make decisions about how to put a concept into words, in a sense almost unstructuring the data. NLG needs to choose a specific, self-consistent textual representation from many potential representations, whereas NLU generally tries to produce a single, normalized representation of the idea expressed.
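The "one representation, many possible wordings" side of NLG can be sketched with templates. This is the simplest possible form of NLG, not a learned model, and the frame layout, template wording and function names below are assumptions made for illustration.

```python
import random

# Hypothetical templates: several surface forms for the same underlying meaning.
TEMPLATES = {
    "forecast": [
        "The forecast for {city} on {date} is {condition}.",
        "Expect {condition} weather in {city} on {date}.",
        "{city} will be {condition} on {date}.",
    ],
}


def candidates(frame):
    """Return every wording the templates allow for this machine representation."""
    return [template.format(**frame["slots"]) for template in TEMPLATES[frame["act"]]]


def realize(frame, rng=random):
    """Choose one specific wording out of the many potential ones."""
    template = rng.choice(TEMPLATES[frame["act"]])
    return template.format(**frame["slots"])


# A single, normalized representation of the idea, as NLU might produce it.
frame = {"act": "forecast", "slots": {"city": "London", "date": "Friday", "condition": "sunny"}}
print(realize(frame))
```

NLU runs in the other direction: many different user phrasings collapse into one frame like the one above.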
NLG can be compared to the process humans use when they turn ideas into writing or speech.
In this video I trained a TensorFlow model on 180,000 news headlines. The objective was to have the model auto-generate fictitious news headlines based on keywords or phrases. The accuracy was astounding, considering the small data sample.
Given a sufficient amount of training data, a model like this should be able to yield specific, self-consistent, contextual textual responses based on intents and categories of user input.
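To give a feel for generating text from a seed word, here is a toy sketch. It is deliberately not the TensorFlow model described above: it swaps in a simple bigram (Markov chain) generator, and the four headlines are made up for the example. The principle is similar in spirit, though: learn which words tend to follow which, then generate from a keyword.

```python
import random
from collections import defaultdict

# Made-up headlines standing in for a real training corpus.
HEADLINES = [
    "markets rally as rates fall",
    "markets slide as rates rise",
    "city council approves new budget",
    "council rejects new transport plan",
]


def build_bigram_model(headlines):
    """Map each word to the list of words that followed it in the corpus."""
    model = defaultdict(list)
    for headline in headlines:
        words = ["<s>"] + headline.split() + ["</s>"]
        for current, following in zip(words, words[1:]):
            model[current].append(following)
    return model


def generate(model, seed_word=None, max_words=12, rng=random):
    """Generate a headline, starting from seed_word when the model has seen it."""
    word = seed_word if seed_word in model else "<s>"
    output = [] if word == "<s>" else [word]
    while len(output) < max_words:
        word = rng.choice(model[word])
        if word == "</s>":  # end-of-headline marker reached
            break
        output.append(word)
    return " ".join(output)


model = build_bigram_model(HEADLINES)
print(generate(model, "markets"))
```

With only four headlines the output mostly recombines the originals; the quality of generated text depends heavily on how much training data there is.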