The new opportunities in AI come down to affecting user behavior. Already, Facebook, Google, and Amazon, among others, deploy armies of scientists to keep us clicking, scrolling, and engaged with their ad funnels. And then there’s the whole politics of influence, which brings up an interesting development: the same AI tools that these larger companies use are now available to anyone who wants them.
App and web developers can now use these, but the most impactful applications exist on IoT devices. The impact happens because we are more influenced by physical things: haptics, colors, sounds, smells, heat, and movement. These can’t be replicated in apps.
While one might conjure up some runaway AI using these to manipulate humanity to do its bidding, if we’re transparent with users, we can potentially nudge them towards their stated goals using our devices and AI.
If our goal is to improve an index such as user happiness, we first need to measure it. Today, several tools are available to do this, and some are becoming even better at it than humans:
- Characteristic detection
- Language tone analysis
- Emotion detection
These tools can be combined to create new interactions and applications that weren’t possible before, and can even provide insights we weren’t capable of seeing. To gain these insights, the device needs one or more microphones, voice interaction, and potentially a camera.
Look Who’s Talking
Speech recognition services are normally associated with returning text (“speech to text” APIs). However, they can also provide a wealth of information about the user. This information can either be gleaned in real time during the user interaction or cached for future analysis. Through voice alone, developers can characterize the speaker by gathering:
- Gender of the speaker
- Speaker’s language
- Speaker’s age
- Biometric identification of the speaker
On top of this, it’s possible to detect if one or multiple people are speaking.
A savvy UX developer might then set the gender, accent, cadence, or other speech features of the text-to-speech engine to match those of the user. This can have the effect of putting the user more at ease. By identifying which user is present, it’s also possible to tailor the content specifically to that user or to load their profile.
Companies that offer APIs to do identification and classification include Microsoft, Alchemy, Kaggle, and others. Their business models vary from micropennies per API call, to a flat fee, to a per-device license.
The next step of analysis is understanding the more subtle meaning of what someone is saying. Natural language understanding breaks a statement down into context, identifying both the intent of the user and the entities being referred to. Sentiment analysis, by contrast, picks apart the user’s choice of words. Several services now offer the ability to analyze text and return different aspects of the language used.
IBM Watson is one such service. If you feed it text, it will return multiple aspects of the person’s use of language and personality:
- The Big Five (Agreeableness, Conscientiousness, Extraversion, Emotional range, Openness)
Google offers a service called sentiment analysis, and Bing/Azure provides the same capability as Text Analytics. Others include Qemotion, Text2Data, and Opentext.
One of the limitations of these services is the amount of text needed to do the analysis. Watson, for example, needs at least 100 words. This is typically longer than what someone would use in commanding a device by voice or in a typical input.
There are a few ways a device maker could address this. First, there’s a way that could make users a bit uneasy: continuously recording and transcribing the conversation. The limitations of this method are that the service would also need to diarize the conversation if more than one person were speaking, and that continuous transcription is typically prone to error. Other, slightly less creepy sources are voicemail transcriptions or voice messages on apps like WhatsApp.
Another method is to accumulate the utterances over time and then send them for sentiment analysis once a minimum length has been reached. The plus side is that this is fairly easy to implement. The downside is that it doesn’t provide real-time analysis, and because the time between samples, the content, and the context can all differ, the analysis might be skewed.
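The accumulation approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `UtteranceBuffer`, its `analyze` callback, and `fake_analyze` are hypothetical names, and the callback stands in for whatever sentiment service the device actually calls. The 100-word default mirrors the Watson minimum mentioned above.

```python
from typing import Callable, List, Optional


class UtteranceBuffer:
    """Collects short utterances until there are enough words to analyze."""

    def __init__(self, analyze: Callable[[str], dict], min_words: int = 100):
        self.analyze = analyze      # hypothetical callback wrapping a sentiment service
        self.min_words = min_words  # minimum text length the service requires
        self._utterances: List[str] = []
        self._word_count = 0

    def add(self, utterance: str) -> Optional[dict]:
        """Store one utterance; return an analysis once the threshold is met."""
        words = utterance.split()
        if not words:
            return None
        self._utterances.append(" ".join(words))
        self._word_count += len(words)
        if self._word_count < self.min_words:
            return None
        # Enough text has built up: send it all at once and start over.
        text = " ".join(self._utterances)
        self._utterances.clear()
        self._word_count = 0
        return self.analyze(text)


def fake_analyze(text: str) -> dict:
    """Stand-in for a real service call; it only reports the word count."""
    return {"words": len(text.split())}


buffer = UtteranceBuffer(fake_analyze, min_words=5)
first = buffer.add("turn on the lights")  # 4 words, still below the threshold
second = buffer.add("play some jazz")     # 7 words in total, so analysis runs
```

Note that the buffer ignores how much time passed between utterances, which is exactly the source of skew described above. A real device would probably also want to expire old utterances.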
The other approach is to fuse sentiment data from other sources with the voice interaction. For example, if someone has just sent an angry text message or written a loving email, we could get a clear idea into their state of mind and then tune a response to a voice request as a result of this knowledge.
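One possible way to fuse readings from several sources is a weighted average in which older readings count for less. The sketch below assumes each source already produces a score between -1 and 1. The `SentimentReading` type, the `fuse` function, and the ten-minute half-life are all illustrative choices, not part of any particular service.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SentimentReading:
    source: str         # e.g. "sms", "email", "voice" (illustrative labels)
    score: float        # assumed scale: -1.0 (negative) to 1.0 (positive)
    age_seconds: float  # how long ago the reading was taken


def fuse(readings: Iterable[SentimentReading], half_life: float = 600.0) -> float:
    """Weighted average where a reading's weight halves every `half_life` seconds."""
    total = 0.0
    weight_sum = 0.0
    for reading in readings:
        weight = 0.5 ** (reading.age_seconds / half_life)
        total += weight * reading.score
        weight_sum += weight
    # With no readings at all, fall back to a neutral score.
    return total / weight_sum if weight_sum else 0.0


# An angry text sent just now outweighs a pleasant email from ten minutes ago.
mood = fuse([
    SentimentReading("sms", -0.8, age_seconds=0),
    SentimentReading("email", 0.4, age_seconds=600),
])
```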
At CES four years ago, I remember being introduced to Beyond Verbal’s emotion detection. It was fantastic to see how this technology could seemingly identify different speakers’ emotions in real time. You can check out their demo here:
Today, a few other companies can do this through an API as well as through embedded software. These include Affectiva, EmoVoice, and Vokaturi. In addition to voice, some APIs now use machine-learning vision to provide real-time emotion data and personality information.
Bing, for example, provides age, gender, and emotion estimates based on facial analysis. Any device with a camera could take stills, upload them to the API, and continuously feed the information back to any apps running in parallel on the device. Perhaps there could be triggers based on negative emotion detection?
In putting these features together, there is some low-hanging fruit for helping technology manipulate us into meeting our goals. These goals might be explicitly stated by the user or extrapolated by the device.
The first application is matching. When I was cold-calling prospects in southern Kentucky, I’d occasionally catch myself adopting a drawl and slowing down my speech when they answered the phone. The subconscious effort was to make myself more relatable to the person I was speaking with.
There isn’t a big barrier for AIs to detect and do the same thing in our voice interactions. Cadence, gender, and tone can be matched very quickly. Based on sentiment analysis, we can also adapt the terseness of the interaction. Are the user’s responses short? Then our responses should be short as well.
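As a rough illustration of matching, the sketch below mirrors the user's speaking rate and terseness. Everything in it is an assumption made for the example: the `SpeechStyle` fields, the 110 to 190 words-per-minute limits, and the five-word cut-off are not taken from any real text-to-speech engine.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeechStyle:
    words_per_minute: int  # speaking rate for the device's voice
    verbose: bool          # whether to use the long form of a reply


def match_style(user_utterances: List[str], user_wpm: float) -> SpeechStyle:
    """Mirror the user's pace and terseness, within assumed limits."""
    # Clamp the rate so the device never sounds unnaturally slow or fast.
    wpm = int(min(max(user_wpm, 110), 190))
    lengths = [len(u.split()) for u in user_utterances if u.strip()]
    average = sum(lengths) / len(lengths) if lengths else 0.0
    # Assumed cut-off: users averaging under five words get short replies.
    return SpeechStyle(words_per_minute=wpm, verbose=average >= 5)


def render(style: SpeechStyle, short: str, long: str) -> str:
    """Pick the reply variant that fits the matched style."""
    return long if style.verbose else short


style = match_style(["lights on", "stop"], user_wpm=220)
reply = render(style, short="Done.", long="Okay, the lights are now on.")
```

Here the terse, fast-talking user gets the short reply at the capped speaking rate.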
The second application is creating reactions to negative emotions. When detecting negative emotions, a system can try various responses to mitigate the negativity:
- Playing music the user likes
- Changing the color of lights
- Initially increasing and then reducing speech volume
- Changing the acknowledgement tones
- Changing the language used in a response
The challenge is that developers now have to juggle a matrix of inputs and responses. For example, if Amazon were to enable emotion detection as part of the Alexa Skills Kit, it would pass the user’s request along with the user’s primary and secondary emotions to the Skill builder. The Skill builder would then have to create responses not only for the user’s request but also for their emotion.
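One way to keep that matrix manageable is a lookup table keyed by request and emotion, with a neutral fallback so that not every cell has to be filled in. The sketch below is purely hypothetical: the intent names, the emotion labels, and the `respond` function are invented for illustration and are not part of the Alexa Skills Kit.

```python
from typing import Dict, Tuple

# Keys are (intent, emotion). Both vocabularies are made up for illustration.
RESPONSES: Dict[Tuple[str, str], str] = {
    ("play_music", "neutral"): "Playing your playlist.",
    ("play_music", "sad"): "Here is something a little more upbeat.",
    ("set_timer", "neutral"): "Timer set.",
    ("set_timer", "angry"): "Done.",
}

FALLBACK = "Sorry, I can't help with that yet."


def respond(intent: str, emotion: str) -> str:
    """Prefer an emotion-specific reply, then the neutral one, then a fallback."""
    for key in ((intent, emotion), (intent, "neutral")):
        if key in RESPONSES:
            return RESPONSES[key]
    return FALLBACK


sad_reply = respond("play_music", "sad")     # emotion-specific cell exists
happy_reply = respond("set_timer", "happy")  # no such cell, so neutral is used
```

With the fallback in place, a developer only writes emotion-specific replies where they add value, and the table grows gradually rather than all at once.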
There is an opportunity for developers to create automated responses based on emotion and to start layering on adaptations as we learn more about the user’s state of mind. This is where we can apply machine learning to understand which adaptations have the most impact on the user’s state of mind.
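A simple starting point for learning which adaptations help is an epsilon-greedy bandit: mostly use the adaptation with the best average result so far, and occasionally try another one. This is a swapped-in textbook technique, not something the services above provide. The `AdaptationBandit` class and the reward (assumed here to be the change in detected mood after the adaptation) are hypothetical.

```python
import random
from typing import Dict, List, Optional


class AdaptationBandit:
    """Epsilon-greedy choice among adaptations, scored by mood improvement."""

    def __init__(self, adaptations: List[str], epsilon: float = 0.1,
                 rng: Optional[random.Random] = None):
        self.epsilon = epsilon            # fraction of choices spent exploring
        self.rng = rng or random.Random()
        self.counts: Dict[str, int] = {a: 0 for a in adaptations}
        self.values: Dict[str, float] = {a: 0.0 for a in adaptations}

    def choose(self) -> str:
        """Explore at random with probability epsilon, otherwise exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.counts))
        return max(self.values, key=self.values.get)

    def update(self, adaptation: str, reward: float) -> None:
        """Fold one observed reward into that adaptation's running mean."""
        self.counts[adaptation] += 1
        n = self.counts[adaptation]
        self.values[adaptation] += (reward - self.values[adaptation]) / n


# epsilon=0.0 disables exploration so this small example is deterministic.
bandit = AdaptationBandit(["music", "lights", "shorter_replies"], epsilon=0.0)
bandit.update("lights", 0.5)  # mood improved noticeably after a light change
bandit.update("music", 0.2)
bandit.update("lights", 0.1)
best = bandit.choose()
```

In practice the reward signal would be noisy, and a device would want to keep separate estimates per user, since an adaptation that calms one person may irritate another.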