Karl Daniel

I Built an AI to Order Chinese Takeaway

We're increasingly seeing AI applied to all sorts of interesting areas. Today I can share something truly groundbreaking... ordering a succulent Chinese meal. This is AI manifest.

The system uses Twilio for telephony, ElevenLabs for the AI voice agent, Firecrawl for data extraction, and MCP Go for integration. Together, these services automate the phone ordering process.

Find the code on GitHub here.

Setting Up Twilio

First, we need to configure Twilio with a phone number. The process is straightforward - a trial account is fine for testing, though it can only call verified numbers. To call any other number, you'll need to upgrade the account to lift that restriction.

Twilio Phone Numbers

Ideally select a phone number suited to your local area.

UK numbers require basic identity verification for regulatory compliance. Nothing gets exposed publicly; it's just bureaucratic box-ticking. Once your regulatory bundle is approved, select and purchase a number.

Ideally, route the call through a Twilio location close to where the number is being used. Twilio offers several locations, and picking a nearby one helps cut down on latency. That's all the Twilio setup we need for now.

Creating the ElevenLabs Agent

ElevenLabs handles the agent's behaviour and voice synthesis. You can train a custom voice (as I did) or use one of their presets - ideally something conversational, given the domain we're using it in. The free credits disappear quickly during testing, so I subscribed for more.

Once signed up, select "Conversational AI" from the dropdown and navigate to "Agents" to create a new blank agent. Name it whatever you like, select your voice, then configure the system prompt. This is the one I've been using with some success:

## Personality
You are a customer calling a restaurant to place a takeaway order.
You are polite, patient, and clear about your order.
You speak in a friendly, natural way and use polite language like "please" and "thank you."
You are enthusiastic about food and show appreciation when staff are helpful.

## Environment
You are calling the restaurant over the phone.
This is a voice conversation, so speak clearly and avoid visual references.
Restaurant staff may be busy, so be concise but complete.
Background noise is normal - be prepared to repeat information if needed.
The current time is {{system__time_utc}}.

## Tone
- **Friendly and upbeat:** "Hi there!" "That sounds great!" "Perfect!"
- **Patient and understanding:** "No worries" "That's fine" "I understand"
- **Clear and organized:** Speak one item at a time, pause between items
- **Naturally conversational:** Use contractions, casual but polite language
- **Appreciative:** "Thank you so much" "I appreciate it" "That's wonderful"

## Goal
Your primary objective is to place this specific order for collection: {{order_items}}

Follow these steps in order:
1. Wait for the restaurant greeting, then respond with a friendly greeting
2. State clearly: "I'd like to place an order for collection, please"
3. Read through your order items one by one, speaking clearly
4. Listen for staff confirmation and respond appropriately
5. Provide your name when asked: "The name is {{customer_name}}"
6. Ask about pickup time if not provided: "What time would be good for pickup?"
7. Thank the staff politely and end the call

## Guardrails
- Stay focused on placing your specific order
- Do not ask for information unrelated to your order
- Avoid inappropriate or offensive requests
- If the restaurant is closed or cannot fulfill the order, politely end the call
- Do not make up menu items or prices
- Keep the conversation professional but friendly
- If asked for your phone number, politely decline.

## Dynamic Variables
- **Customer Name:** {{customer_name}}
- **Order Items:** {{order_items}}

**Remember:** You are a real customer with a specific order to place. Be natural, friendly, and focused on completing this transaction successfully.

The {{order_items}} and {{customer_name}} variables will be populated by our MCP server with the actual order details.
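ElevenLabs performs this substitution itself at call time, but it can be handy to preview the filled-in prompt locally, for example when iterating on wording. The following is a minimal sketch of that idea and not part of the project's code: `renderPrompt` and the sample values are hypothetical, and placeholders with no supplied value (such as `{{system__time_utc}}`) are deliberately left untouched.

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholder matches {{name}} tokens such as {{customer_name}}.
var placeholder = regexp.MustCompile(`\{\{\s*([A-Za-z0-9_]+)\s*\}\}`)

// renderPrompt replaces every {{name}} that has a value in vars and
// leaves unknown placeholders exactly as they were written.
func renderPrompt(tmpl string, vars map[string]string) string {
	return placeholder.ReplaceAllStringFunc(tmpl, func(tok string) string {
		name := placeholder.FindStringSubmatch(tok)[1]
		if v, ok := vars[name]; ok {
			return v
		}
		return tok
	})
}

func main() {
	// Illustrative template and values only.
	tmpl := "Place this order for collection: {{order_items}}. " +
		"The name is {{customer_name}}. The current time is {{system__time_utc}}."
	fmt.Println(renderPrompt(tmpl, map[string]string{
		"order_items":   "1 x crispy duck, 1 x dry noodles",
		"customer_name": "Karl",
	}))
}
```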

For optimal performance, I've configured the agent with Gemini 2.0 Flash and ElevenLabs' Flash voice model - both prioritising speed over raw capability. The temperature is set to 0.1 to ensure predictable, instruction-following behaviour rather than creative interpretation.

The prompt itself is very much a work in progress. It can be iterated on and refined (for example, using evals) to catch different behaviours and to suit particular models. For now, though, this example is a sufficient foundation.

AI Agent Setup

The Karl-Bot 3000

The next step is to connect the agent to Twilio. Select "Phone Number" from the menu and import your number using the account SID, auth token, and phone number found in your Twilio console.

Building the Order MCP

The MCP server orchestrates the entire flow of menu discovery, order creation, and call initiation. The implementation involves three key steps:

First, I used Firecrawl to extract my local Chinese restaurant's menu. Through their playground, I scraped the website and had it return structured JSON. A quick pass through an LLM cleaned up the data into a consistent format.

This menu data gets embedded directly in the MCP server, giving both the user interface and the AI agent full context. When the agent needs to handle substitutions or modifications, it has complete knowledge of available options.
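As a rough illustration of what "embedded" could look like in Go, the sketch below keeps the cleaned-up menu as a JSON constant and searches it by name. Everything here is an assumption for illustration: the `MenuItem` shape, the `loadMenu` and `searchMenu` helpers, and the two sample entries are hypothetical stand-ins, not the real scraped data or the project's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// MenuItem is a hypothetical shape for one cleaned-up menu entry.
type MenuItem struct {
	Name     string `json:"name"`
	Category string `json:"category"`
}

// menuJSON stands in for the scraped menu; the entries are placeholders.
const menuJSON = `[
  {"name": "Crispy Duck", "category": "Duck"},
  {"name": "Dry Noodles", "category": "Noodles"}
]`

// loadMenu parses the embedded JSON into menu items.
func loadMenu() ([]MenuItem, error) {
	var items []MenuItem
	err := json.Unmarshal([]byte(menuJSON), &items)
	return items, err
}

// searchMenu returns items whose name contains query, ignoring case.
func searchMenu(items []MenuItem, query string) []MenuItem {
	var out []MenuItem
	q := strings.ToLower(query)
	for _, it := range items {
		if strings.Contains(strings.ToLower(it.Name), q) {
			out = append(out, it)
		}
	}
	return out
}

func main() {
	items, err := loadMenu()
	if err != nil {
		panic(err)
	}
	for _, it := range searchMenu(items, "noodles") {
		fmt.Printf("%s (%s)\n", it.Name, it.Category)
	}
}
```

A lookup like this could back a menu-search tool on the MCP side, while the same data is passed to the agent as context.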

AI Order MCP

Dry noodles, always

The flow works like this: users query the menu through the MCP interface, build their order, then trigger the ElevenLabs API to initiate the phone call. The agent receives the order details through those template variables and proceeds with remarkable autonomy - handling questions, clarifications, and even suggesting alternatives when items aren't available - thanks to the embedded knowledge of the menu.
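The call-initiation step might look roughly like the sketch below, which builds (but does not send) the HTTP request carrying the two template variables. Treat the details as assumptions to check against the ElevenLabs API reference: the endpoint path, the JSON field names, and the `xi-api-key` header reflect my understanding of their Twilio outbound-call API, while `buildPayload`, `newCallRequest`, and all IDs and numbers are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// outboundCallURL is an assumption; confirm it in the ElevenLabs docs.
const outboundCallURL = "https://api.elevenlabs.io/v1/convai/twilio/outbound-call"

// clientData carries the variables that fill the prompt placeholders.
type clientData struct {
	DynamicVariables map[string]string `json:"dynamic_variables"`
}

// callRequest mirrors the assumed request body for an outbound call.
type callRequest struct {
	AgentID            string     `json:"agent_id"`
	AgentPhoneNumberID string     `json:"agent_phone_number_id"`
	ToNumber           string     `json:"to_number"`
	ClientData         clientData `json:"conversation_initiation_client_data"`
}

// buildPayload serialises the order details into the request body.
func buildPayload(agentID, phoneID, to, name, items string) ([]byte, error) {
	return json.Marshal(callRequest{
		AgentID:            agentID,
		AgentPhoneNumberID: phoneID,
		ToNumber:           to,
		ClientData: clientData{DynamicVariables: map[string]string{
			"customer_name": name,
			"order_items":   items,
		}},
	})
}

// newCallRequest prepares the authenticated POST without sending it.
func newCallRequest(apiKey string, payload []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, outboundCallURL, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("xi-api-key", apiKey)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// All identifiers below are placeholders.
	payload, err := buildPayload("agent_123", "phone_456", "+440000000000", "Karl", "1 x dry noodles")
	if err != nil {
		panic(err)
	}
	req, err := newCallRequest("YOUR_API_KEY", payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)
	fmt.Println(string(payload))
	// Sending is omitted on purpose: http.DefaultClient.Do(req) would place a real call.
}
```

Keeping payload construction separate from sending makes it easy to inspect exactly what the agent will receive before any phone rings.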

The Lag Factor

You'll notice some latency in the demo call. While ElevenLabs Flash delivers impressively fast text-to-speech at around 75ms, the conversation lag comes from everything else in the chain - call transmission, LLM processing, and the distributed nature of the infrastructure.

The problem is geography. Our humble takeaway order bounces between Twilio's call routing, ElevenLabs' primarily US-based infrastructure, and Google's Gemini servers scattered across data centres. In the demo, Gemini was taking about a second to respond - reasonable for an LLM, though painfully slow for natural conversation flow.
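To make the arithmetic concrete, here is a small sketch that sums a per-turn latency budget and reports the slowest stage. It uses only the two figures quoted above (roughly 75ms for text-to-speech and about a second for Gemini); call transport and network hops are left out because I don't have measurements for them, so the real total will be higher. The `stage` type and `slowest` helper are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// stage is one hop in the conversational round trip.
type stage struct {
	name string
	d    time.Duration
}

// slowest returns the summed latency and the stage that contributes most.
func slowest(stages []stage) (total time.Duration, worst stage) {
	for _, s := range stages {
		total += s.d
		if s.d > worst.d {
			worst = s
		}
	}
	return total, worst
}

func main() {
	// Only the two figures mentioned in the post; add measured stages as needed.
	stages := []stage{
		{"text-to-speech (ElevenLabs Flash)", 75 * time.Millisecond},
		{"LLM response (Gemini)", time.Second},
	}
	total, worst := slowest(stages)
	share := float64(worst.d) / float64(total) * 100
	fmt.Printf("total %v, slowest: %s (%.0f%%)\n", total, worst.name, share)
}
```

On those two numbers alone, the LLM accounts for roughly 93% of the delay, which is why a faster or closer model matters far more here than a faster voice.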

As this technology matures, the infrastructure will inevitably consolidate. You could eliminate much of the latency today by building a custom agent and running everything from the same region - or even more so by running the LLMs locally. For proving the concept works, the current setup does the job.

Beyond Takeaway

This experiment, while not likely to be practical for most in its current form, highlights something I've been exploring as Software 3.0: the evolution towards more intentional, context-aware systems that understand not just commands but desires or outcomes.

Imagine AI with comprehensive local menu knowledge transforming food ordering from research into expression. Say "I'm craving crispy duck tonight" and the system knows your preferences, finds the highest-rated local restaurant, and factors in your dietary requirements. We're leveraging the AI's entire knowledge corpus to reduce decision fatigue in purely transactional interactions where human touch adds negligible value.

Of course, phone calls themselves aren't optimal for data transmission - we've developed far more efficient protocols (think apps like Deliveroo). However, voice agents serve as universal translators, bridging the gap between our digital intentions and our often still-analogue realities. They're adapters for a world still catching up to its own technological possibilities.

What may just be a playful experiment today hints at tomorrow's practical applications. The voice synthesis ordering my noodles could assist someone with speech difficulties. The conversational models navigating restaurant calls might handle any routine interaction that interrupts focused work. It's early days, but their flexibility suggests a future where AI quietly eliminates daily friction without replacing meaningful human connection.

Sometimes the most valuable innovations are the ones that make life slightly less annoying.

Anyway, that's all for now. Time for dinner.

#ai #development #elevenlabs #mcp #twilio