13th August 2024

Build and deploy a multimodal AI Chatbot with Gemini and Twilio

Joe Hainstock
Developer


In this post, Zing Developer Joe gives us the lowdown on a side project he's been working on: an integration of Google's Gemini AI into Twilio's messaging platform. Joe demonstrates how to create a chatbot and "train" it using a company's FAQ page. Plus, he shows how to validate the AI's responses to help prevent "prompt hacking", a problem that many AI products face. Let's dive in.

The New World of Generative AI and Google’s Gemini

The world has been forever changed by the emergence of generative AI and large language models, and the competition between different providers has been heating up. One model I've been using a lot of late is Google's Gemini. The AI chatbot, previously known as Bard, has come a long way since its release in March 2023.

Having been broadly overshadowed by the colossal ChatGPT, Gemini has slowly been improving and expanding to the point where I now use it as my go-to generative AI. I admit I have been a Google fanboy for a long time, but there are tangible benefits to using Gemini, from its easy accessibility to its ability to access the internet.

Google held their 2024 keynote in May, and the core theme of the speech was improvements to their AI offerings. Something that piqued my interest was the upgrade of Gemini's context window to 1.5 million tokens. What does this mean? It means you can provide Gemini with up to 1.5 million "units of information", such as words, parts of words, or other pieces of data. Gemini is also "natively multimodal", meaning you can input more than just text: images, audio, and even video can be uploaded to Google and used as part of those 1.5 million tokens.

It was this information that sparked my idea of integrating Gemini with Twilio. I wanted to build a chatbot that could be "trained" on some data and use that data as the context for its responses. So, I got to work creating some fake data on which my bot could be "trained".

Zingy Burgers

I ended up using Gemini itself to create a few pages for a made-up company, "Zingy Burgers". The FAQ page covers general questions, plus questions about food & drinks and ordering & delivery. I had Gemini create the "Zingy Burgers Menu", which includes food, drinks, sides, and shakes. I even asked Gemini to add an ingredients section for a special secret burger sauce, making it highlight potential allergens.

I converted the generated content into a PDF file, and then went to Google AI Studio to test things out! I uploaded the file and prompted Gemini to answer the upcoming questions using this file, then proceeded to ask it questions, which it answered pretty well. It was time to move to code, and thankfully Google provides a "Get Code" button, which gave me a great starting point.

Using Gemini’s API

Google’s AI Studio

I adapted the provided code, adding it to the existing Express API I've been developing for my other AI projects. Without going into extreme detail, the code starts by setting up the AI model, then uploads the FAQ file to Google's cloud so it can be used as context.


Typescript code using @google/generative-ai package.
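The setup in the screenshot boils down to something like the following sketch. The model name, file name, and environment variable are my assumptions for illustration, not the exact values from my code:

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager } from "@google/generative-ai/server";

// Both clients authenticate with the same API key from Google AI Studio.
const apiKey = process.env.GEMINI_API_KEY!;
const genAI = new GoogleGenerativeAI(apiKey);
const fileManager = new GoogleAIFileManager(apiKey);

async function setup() {
  // Upload the FAQ PDF to Google's cloud so it can be referenced as context.
  const upload = await fileManager.uploadFile("zingy-burgers-faq.pdf", {
    mimeType: "application/pdf",
    displayName: "Zingy Burgers FAQ",
  });

  // Create the model that will answer questions against the uploaded file;
  // the uploaded file's URI is later passed into the chat as fileData.
  const model = genAI.getGenerativeModel({ model: "gemini-1.5-pro" });
  return { model, file: upload.file };
}
```

This is setup/configuration code, so running it requires a real API key and the PDF on disk.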

In the configuration I set a couple of parameters I thought would help improve the quality of responses. I set the temperature (how creative you want the response to be) low, because I only wanted accurate responses and wanted to reduce any potential hallucinations. I also capped the maximum returned tokens at 2000, as I didn't want overly long responses from my bot.
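For reference, those two parameters live in the model's generation config; the exact temperature value here is illustrative:

```typescript
// Passed to getGenerativeModel() or startChat(). A low temperature keeps the
// bot factual and reduces hallucinations; maxOutputTokens caps reply length.
const generationConfig = {
  temperature: 0.2,      // low creativity, favour accuracy
  maxOutputTokens: 2000, // no overly long responses
};
```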

Gemini’s API provides a way to pass in the history of the conversation you’ve had with the bot, and luckily, with the way I planned to integrate with Twilio, I would also have access to the message history through the Conversations API. So, I set each message sent by Twilio to the “model” role and each message sent by the customer to the “user” role. Finally, the next question is sent to the model via chatSession.sendMessage(req.body.nextQuestion).
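To sketch that mapping (the TwilioMessage shape and the helper name are my own inventions, but the role mapping follows what's described above):

```typescript
// Minimal shape of a message pulled from Twilio's Conversations API;
// "author" distinguishes the bot from the customer.
interface TwilioMessage {
  author: string; // e.g. "bot", or the customer's identity
  body: string;
}

// Gemini chat turns use role "model" for the bot and "user" for the human.
interface GeminiTurn {
  role: "model" | "user";
  parts: { text: string }[];
}

function toGeminiHistory(messages: TwilioMessage[], botAuthor = "bot"): GeminiTurn[] {
  return messages.map((m) => ({
    role: m.author === botAuthor ? "model" : "user",
    parts: [{ text: m.body }],
  }));
}

// The mapped history then seeds the chat session, roughly:
// const chatSession = model.startChat({ history: toGeminiHistory(messages) });
// const result = await chatSession.sendMessage(req.body.nextQuestion);
```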

I hooked up the new WebChat 3.0 demo on my Twilio account to point at Twilio Studio, where a simple loop calls my function, sends the response, and then waits for a reply.

The Zingy Burgers Chatbot

Let’s have a look at some of the results:

Screenshot 1

OK, that’s a good start. It seems to embellish a bit at the end, but honestly this is roughly what I was aiming for. I wanted to keep testing its ability to retain context, so let’s keep going:

Screenshot 2

Again, these results look very good to me, and I was pretty satisfied with how my bot was responding given very few instructions. But I figured I must be able to break a bot this basic, so I set about trying to break my own work.

Screenshot 3

Ah... not what I want my chatbot to be able to do, as I doubt the customers of Zingy Burgers want to be offered Python code for downloading a PDF file. So, I had to refine the returned responses and validate whether each one actually serves as a good answer.

How can you do this? Well, there is a concept when working with LLMs called prompt chaining. The general idea is that you feed the answer from one prompt into another, either to improve the quality of the response, to format it, or, in our case, to validate that the response matches the use case. I sent the response into a separate model and chat session with Gemini, with specific system instructions asking it to judge whether the response is relevant, based upon the FAQ page. Here’s the prompt I created for validation:


Is the following a valid answer based solely upon the FAQ PDF supplied? I want to stop potentially incorrect answers, or answers that go beyond FAQ answers.

Check the answers against the file provided above; if you see anything in the answers which is not in that file, respond with false.

If the answers do not sound professional or are written from the perspective of a character, respond with false.

Please answer with only the word true, or the word false. If you are in any level of doubt, respond with false. The answer you must judge is: ${result.response.text()}

Now, if this session responds with "true", we know Gemini thinks it has created a valid response, and we can send it back out to the customer. I gave this new approach a first pass:
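The chaining step can be sketched with two small helpers (the function names are my own): one builds the second-stage prompt by interpolating the first model's answer, exactly where the prompt above uses ${result.response.text()}, and one interprets the validator's reply strictly, so anything other than the literal word "true" counts as a failed validation:

```typescript
// Builds the validation prompt for the second Gemini session, interpolating
// the first model's answer where the original uses ${result.response.text()}.
function buildValidationPrompt(answer: string): string {
  return (
    "Is the following a valid answer based solely upon the FAQ PDF supplied? " +
    "Check the answers against the file provided above; if you see anything " +
    "which is not in that file, respond with false. " +
    "Please answer with only the word true, or the word false. " +
    "If you are in any level of doubt, respond with false. " +
    `The answer you must judge is: ${answer}`
  );
}

// Strict parse of the validator's reply: only an exact "true" passes.
function parseVerdict(reply: string): boolean {
  return reply.trim().toLowerCase() === "true";
}
```

The strict parse matters because the validator is instructed to answer with a single word; any chattier reply is safest to treat as a rejection.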

Screenshot 4

Clearly, I needed to improve the message my API was sending back, but I was happy that Gemini decided its initial response wasn’t valid and returned false. I changed the message to a boilerplate fallback response, so now the interaction looks like this:
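As a tiny sketch of that fallback (the wording of the boilerplate message here is my own, not the exact text from the screenshot):

```typescript
// Boilerplate reply sent whenever the validation chain rejects an answer.
const FALLBACK =
  "Sorry, I can only help with questions about Zingy Burgers. Please ask me something from our FAQ!";

// Returns the model's answer only when the validation session approved it.
function guardedReply(candidate: string, approved: boolean): string {
  return approved ? candidate : FALLBACK;
}
```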

Screenshot 5

Much better! Now any malicious actors who might be trying to break my chatbot will be repelled!

The speed at which I was able to effectively “train” my chatbot on the dataset was remarkable, and I only used around 15k of the 1.5 million available tokens. I also didn’t use any audio or video, which could further improve the responses or make them more personalised.

I’ll be revisiting this project in the future to see if I can include much more to improve the responses. I want to turn this chatbot into a plugin where you can upload your own files through Twilio Flex; uploading would “train” the chatbot, which would reference the new files when composing responses. This plugin would work for any FAQ page, rather than just the hard-coded Zingy Burgers FAQ. So, stay tuned to see the future of Twilio x Gemini!
