How I built a robot me
Siri-ously
Bots are hot. Perhaps it’s because chat (via Slack) has finally taken hold of company communications; perhaps it’s because everyone’s realized that nobody installs apps. Maybe some engineers are hankering after the cozy days of the command line. Perhaps it’s all of the above.
Late last year, I decided it would be fun to build a bot engine. There were a few around, but none did exactly what I wanted: a personal Siri that would represent me to the world, and allow me (and only me) to ask questions about my personal information.
Most importantly: I had to be able to add new capabilities super-easily, because this was a side project. (Machine learning algorithms were out of scope.) I wanted to talk to it. And I wanted it to be accessible via the web.
Building understanding
A lot of bot engines — particularly the ones designed for Slack — simply listen for specific requests by performing string matching operations. That means you have to give them an exact command, with no nuance to your language. “Tell me a joke” might work; “please tell me a really funny joke” might not.
In the context of a command line like Slack, that’s probably fine. However, when you’re speaking, you need to be able to add more variance to your language.
I decided to try to boil every query down to a signature: a form that could be stored internally, but that would also allow for a certain amount of freedom in phrasing.
A Part-Of-Speech Tagger (regrettably abbreviated as POS Tagger) takes a sentence and assigns a part of speech to each word. For example, “cat” is tagged as a noun, and “cats” is tagged as noun-plural. The sentence “the cat sat on the mat” is split up into a data structure that looks something like:
DT: the, NN: cat, VB: sat, IN: on, DT: the, NN: mat
The individual part of speech labels here correspond to “determiner”, “noun”, “verb” and “preposition” respectively.
Amazingly, Stanford makes its Natural Language Processing libraries available to everyone, including its POS Tagger. I was able to build a library on top of it that boils any text request down to a simplified data structure.
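To make that concrete, here's a minimal sketch of the idea in TypeScript. The tiny hard-coded lexicon and the function names are mine, standing in for Stanford's tagger; the real thing covers the whole language.

```typescript
type TaggedWord = { word: string; tag: string };

// Stand-in lexicon. Stanford's POS Tagger does this properly; the
// lookup table here just illustrates the shape of the output.
const LEXICON: Record<string, string> = {
  the: "DT", cat: "NN", sat: "VB", on: "IN", mat: "NN",
};

// Tag each word in a sentence with a part of speech.
function tag(sentence: string): TaggedWord[] {
  return sentence
    .toLowerCase()
    .replace(/[^a-z\s]/g, "") // strip punctuation
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .map((word) => ({ word, tag: LEXICON[word] ?? "NN" }));
}

// Boil a request down to a simplified signature by keeping only the
// parts of speech that carry the meaning (here, nouns and verbs).
function signature(sentence: string): string {
  const KEEP = new Set(["NN", "VB"]);
  return tag(sentence)
    .filter((t) => KEEP.has(t.tag))
    .map((t) => t.word)
    .join(" ");
}

console.log(tag("The cat sat on the mat."));
// [ { word: "the", tag: "DT" }, { word: "cat", tag: "NN" }, ... ]
console.log(signature("The cat sat on the mat.")); // "cat sat mat"
```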
From there, I created a simple plugin structure that matches requests to the parts of a question that really matter. Each plugin attaches a microservice to a small, weighted request signature; when the bot matches an incoming request to a signature, it passes the important information to that service as a snippet of JSON. The service returns a small snippet of JSON in response, and the bot reads the answer out.
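Here's roughly what that plugin structure might look like. The field names, scoring scheme, and endpoints below are my guesses at the shape of it, not Benbot's actual code.

```typescript
// Hypothetical sketch: each capability registers a weighted
// signature, the engine scores an incoming request against every
// signature, and the winner's microservice gets the important parts
// of the question as JSON.
type Plugin = {
  signature: string[]; // important words to look for
  weight: number;      // tie-breaker between competing plugins
  endpoint: string;    // microservice that produces the answer
};

const plugins: Plugin[] = [
  { signature: ["where", "you"], weight: 1.0, endpoint: "https://example.com/location" },
  { signature: ["tell", "joke"], weight: 0.8, endpoint: "https://example.com/jokes" },
];

// Fraction of the signature's words present in the request, scaled
// by the plugin's weight.
function score(requestWords: string[], plugin: Plugin): number {
  const hits = plugin.signature.filter((w) => requestWords.includes(w)).length;
  return (hits / plugin.signature.length) * plugin.weight;
}

async function answer(request: string): Promise<string> {
  const words = request.toLowerCase().split(/\s+/);
  const best = plugins
    .map((p) => ({ p, s: score(words, p) }))
    .sort((a, b) => b.s - a.s)[0];
  if (!best || best.s === 0) return "I don't have an answer for that";

  // Pass the important information along as a snippet of JSON...
  const response = await fetch(best.p.endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: request, words }),
  });
  // ...and read the service's JSON snippet back out.
  const { text } = await response.json();
  return text;
}
```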
Okay, that’s pretty technical. Non-engineers are wondering what I’m talking about, and NLP experts are shaking their heads at me for building a hopelessly naïve implementation. But the important thing to know is: it’s really easy. It now takes no more than 5 minutes to add a new capability to my bot. And it works.
Talk to me
Did you know the web has a speech recognition API? (Strictly, it's the Web Speech API, which lives in its own specification rather than in HTML5 proper.) Chrome is the only major browser to implement it so far, but you can build an application that will listen to your voice and convert it to text. You can also build an application that will read text back to you.
Because my bot was intended to be an approximation of me, I decided to use the British male voice. It sounds a lot like Matt Berry. I do not. Whatever; my website talks to me, so there.
Benbot listens using the browser's speech recognition API. Once the browser has detected that you've stopped talking, I take the full text of the query and pass it back to the server, which attempts to match it against the available request signatures. If it can't find a match, it replies, “I don't have an answer for that”. I should come up with something wittier.
In the example above, my bot knows I checked into the Gourmet Ghetto because I saved the checkin on my own website, which is acting as a kind of personal, public API for information I choose to share.
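For the curious, the browser side of that flow condenses to something like the sketch below. The /query endpoint is my stand-in for the bot's server route, and only the webkit-prefixed constructor exists in Chrome today.

```typescript
// Listen with Chrome's webkit-prefixed speech recognition API, send
// the transcript to the server, and read the answer back out loud.
declare const webkitSpeechRecognition: any;

const recognition = new webkitSpeechRecognition();
recognition.lang = "en-GB";

recognition.onresult = async (event: any) => {
  // Fires once the browser decides you've stopped talking.
  const transcript = event.results[0][0].transcript;
  const response = await fetch("/query", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: transcript }),
  });
  const { text } = await response.json();

  // Speak the answer with a British voice, if one is installed.
  const utterance = new SpeechSynthesisUtterance(text);
  const british = speechSynthesis.getVoices().find((v) => v.lang === "en-GB");
  if (british) utterance.voice = british;
  speechSynthesis.speak(utterance);
};

recognition.start();
```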
Type to me
As I mentioned, only Chrome understands speech right now. That's sub-optimal. For one thing, it means I can't talk to my bot from my iPhone, where every browser is required to use Safari's WebKit engine; for another, I'd like everyone else to be able to use it, too.
If you access bot.werd.io from a browser that can't handle speech, the area of the page that would have shown your transcribed speech becomes an editable field, using the contenteditable attribute. When you press enter, your query is sent to the server through the same callback as if you had spoken it.
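The fallback itself is only a few lines. In the sketch below, sendQuery() is a hypothetical stand-in for that shared callback.

```typescript
// If the browser can't do speech recognition, turn the transcript
// area into an editable field and submit the query on Enter.
declare function sendQuery(query: string): void; // hypothetical shared callback

const area = document.getElementById("transcript")!;

if (!("webkitSpeechRecognition" in window)) {
  area.contentEditable = "true";
  area.addEventListener("keydown", (event: KeyboardEvent) => {
    if (event.key === "Enter") {
      event.preventDefault(); // don't insert a newline
      sendQuery(area.textContent ?? "");
      area.textContent = "";
    }
  });
}
```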
Cut me some Slack
Sometimes, a browser isn’t enough. Because the signature-matching query algorithm is separated from the interface, I can build entirely new ways to talk to my bot. For example, I built a Twilio integration that lets me talk to it via text message (pictured), and a Slack integration. It would be trivial to add new kinds of interfaces in the future.
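As a sketch of how thin such an interface can be, here's what the Twilio side might look like using Express. answerQuery() stands in for the shared signature-matching engine; everything else follows Twilio's webhook convention of posting the message text in a "Body" field and expecting TwiML back.

```typescript
import express from "express";

declare function answerQuery(query: string): Promise<string>; // hypothetical shared engine

// Minimal XML escaping so an answer can't break the TwiML envelope.
const escapeXml = (s: string) =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

const app = express();
app.use(express.urlencoded({ extended: false }));

// Twilio POSTs incoming text messages here and expects TwiML
// describing the reply.
app.post("/sms", async (req, res) => {
  const answer = await answerQuery(req.body.Body ?? "");
  res
    .type("text/xml")
    .send(`<Response><Message>${escapeXml(answer)}</Message></Response>`);
});

app.listen(3000);
```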
Towards better questions and answers
Bots are fun, but there's a lot of user experience work to be done before they're truly usable. The problem is similar to the Infocom text adventure games of the eighties: there's a world of possibilities, but it's hard to know which commands to use and when. Just as text adventures grew into graphical point-and-click adventures, and command line interfaces for normal people grew into WIMP interfaces (windows, icons, menus, and pointers), we're seeing platforms like Slack add new, simpler interfaces for interacting with bots. When you can add new functionality all the time, discoverability is key.
This is a problem that’s being solved. But bots provide another key opportunity: as you can see in the screenshot above, every answer — no matter which underlying platform is used to gather the information — is returned in a simple conversational text format. That means you can easily pipe answers together to get new, integrated information that you couldn’t get from a single platform alone. Example:
“How much is a flight to London for a week in September?”
“The cheapest round-trip flight from San Francisco to London for a week in September is $919, flying September 7th to September 14th.”
“Can I afford that?”
“Yes.”
“Buy it please.”
As a conversation continues, the bot picks up contextual data that it can use to give better answers to subsequent queries. Not only are you building a more detailed profile overall; you're also asking more focused questions within the scope of the conversation.
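Here's one way that context could work, purely as an illustration; the shape of the data is my invention, not a real implementation.

```typescript
// Each answer can deposit structured entities (a price, a
// destination, dates) into the conversation's context, and follow-up
// queries like "Can I afford that?" resolve against it.
type Context = Map<string, string | number>;

function absorb(context: Context, entities: Record<string, string | number>) {
  for (const [key, value] of Object.entries(entities)) {
    context.set(key, value);
  }
}

const context: Context = new Map();

// "How much is a flight to London for a week in September?"
// The flight service's structured answer enriches the context.
absorb(context, { destination: "London", price: 919, departs: "September 7th" });

// "Can I afford that?" The word "that" resolves to the remembered
// price, which a budgeting service could check against your accounts.
// (The 1200 here is an arbitrary stand-in for a budget lookup.)
const price = context.get("price") as number; // 919
console.log(price <= 1200 ? "Yes." : "No.");
```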
Finally, what if everyone had their own personal bot? What if those bots could talk to each other and act as our personal agents? What if our bots could mutually pick a time for us to meet, or a restaurant, or a movie to see, based on our preferences as a group?
What’s next?
My bot is a fun side project. (I also briefly used the engine at Known, for a product that didn’t see the light of day.) It’s an easy way to find out where I am or ask me common questions that I might not be around to answer.
I’ll probably make it easier to figure out what you can ask it (although I’ll keep some easter eggs, because those are fun). I might build some new interfaces. I also have an idea for a way for any website to publish services that can be consumed by any bot — but that’s something for another time.
I’m curious: what do you think I should do with it? What would you do with your own robot agent? Let me know in a response, either here or on your own site.