Tim Draper's AI Assistant Is Losing Him Deals. I Rebuilt It to Show You Exactly Why.

Jan 08, 2026

As a heads up, I usually post more strategic content; however, this post will be somewhat technical (though it’s highly important information for AI enthusiasts and new developers).

A few weeks ago, PitchBook called Tim Draper’s (the legendary crypto VC) AI pitch qualifier “glacially slow” with voice features that “malfunction and freeze.” Draper himself admits it’s “flaky” and hallucinating.

While the AI Tim Draper chat and voice bots are still works in progress, this raises key questions about AI and its impact on productivity.

Why This Matters Right Now

Text and voice chatbots are huge unlocks for knowledge work.

VCs like Draper. Operators like Hormozi. Consultancies like McKinsey. Funds like Draper’s Pantera Capital.

They’re all racing to deploy custom AI agents because they have:

Proprietary knowledge that’s valuable
1:1 communication that doesn’t scale
Deal flow they need to qualify 24/7

Whether enterprises are using Microsoft Copilot or OpenAI’s enterprise solutions, or SMBs (including Tim Draper) using custom builds, the results are often disappointing.

This undoubtedly plays into the skepticism toward AI.

In addition to knowledge work, AI-generated text and voice has and will revolutionize how consumer and luxury brands communicate with their clients.

In the world of fashion, more of the interfaces will be digital, whether that’s experimental in-store UIs like digital screens or holograms, automated luxury concierges, or digital brand representatives like digital avatars. It’s important these behave in accordance with brand standards. Especially in the luxury market, where one mistake can lead to client churn.

While text and voice chatbots are the most mundane examples, they use the same principles that a futuristic hologram of Audrey Hepburn walking down Tiffany’s 5th Avenue store will use in the future (I write more about luxury brand building here). This is why it’s key to understand the product design from the ground up.

Ralph Lauren launched its chatbot in late 2025. The app is basically a chatbot with access to Ralph’s catalogue. Not bad but not great. It makes navigation more convenient but doesn’t offer anything truly unique yet. It also doesn’t yet accurately reflect customer needs like pricing and doesn’t remember your previous chats. It will become great when it can plug into user wardrobes, preferences, and anticipate desires.

Three iPhones laid out on a desk show different frames of the "Ask Ralph" experience in the Ralph Lauren app. — Source: BoF

So why do these chat and voice bots still offer a subpar experience? I built an MVP of a text and voice chat system to illustrate the technical choices that separate good agents from great AI agents.

The Real Problem with Tim Draper’s AI Agent

Many think the chat experience is the actual product; in fact, it’s the pitch qualification.

Only ~6 promising pitches per week reach Draper’s team. The system is autonomously making high-stakes filtering decisions.

This poses key concerns.

When the architecture fails:

A legitimately exceptional startup pitches the bot
Slow responses, surface questions, and hallucinations frustrate the founder
They pitch the next VC instead

You just lost a billion-dollar deal because your qualification system failed.

Why This Happened

Below is an MVP built in n8n + Supabase + Gemini in a few hours to illustrate the critical decision points most teams get wrong:

1. Prompt Engineering

Every additional word in your system prompt increases context window size, which directly impacts latency. However, it’s arguably the most important thing to get right.

My workflow shows the common pipeline of user message → AI agent → vector store → memory → structured output → response. Each step compounds latency.

The AI agent node has two key non-negotiable inputs: the LLM model (I used the cheap and fast Gemini 2.5 Flash model here) and the prompt.

AI model selection from my agentic workflow

The first question is which LLM model to use. The smarter the LLM model you choose, the slower the responses will typically be.

Models are thinking (slow, expensive but more accurate) or non-thinking (fast, cheap but less accurate).

According to OpenAI, reasoning models are best suited for:

Navigating ambiguous tasks
Finding a needle in a haystack
Finding relationships and nuance across a large dataset
Multistep agentic planning
Visual reasoning
Reviewing, debugging, and improving code quality
Evaluation and benchmarking for other model responses

After the model is selected, arguably the most important part of the process is the prompt. Below is the system prompt I used. It has some good best practices like negative prompting (what not to do), flow steps, output format, overall goal, and expectations. However, had I written the prompt again, I’d probably make it more concise.

SYSTEM (Tim Draper Twin)

You are “Draper Twin”: Tim Draper’s digital twin for founders. Voice: optimistic, blunt, fast, investor-brained. You ask sharp follow-ups and push for clarity.
You are currently chatting with a user over a messaging channel (e.g., WhatsApp/Instagram).

Hard boundaries (must always follow)
- You do NOT provide financial, legal, tax, or professional advice.
- You do NOT solicit money or accept investments.
- You do NOT promise meetings, funding, intros, or outcomes.
- You do NOT claim to be the real Tim Draper. You are a digital twin: “Tim’s brain — not his signature.”
- If asked to invest / meet / write a check: refuse politely, explain boundary, then continue coaching.

Primary job
Help founders refine ideas into investor-ready pitches by stress-testing: problem, wedge, ICP, GTM, pricing, traction, unit economics, retention, moat, team, competition, timing. Deliver practical next steps.

Operating modes (infer from context, or ask once)
- PITCH: simulate a partner meeting. Be skeptical. Ask for numbers. Challenge assumptions.
- BUILD: execution coach. Convert ideas into a 2–4 week plan with experiments + metrics.
- LEARN: teach frameworks + examples in Draper style.

Conversation flow (default)
1) Disclaimer only when needed (see Start behavior).
2) One sentence reflecting the pitch back (to prove understanding).
3) 5–8 high-leverage questions (bullets). Prioritize numbers.
4) Give 3 concrete recommendations or next actions.
5) Ask for the missing specifics needed to continue.

If user hasn’t shared a pitch yet
Offer three options and proceed anyway:
- “Pitch me in 60 seconds”
- “Help me build a plan”
- “Teach me Draper-style VC thinking”

What to collect (ask in this order)
- Who is the customer (ICP) + painful job-to-be-done
- Why now (timing catalyst)
- Wedge (first product) + differentiation
- Pricing + sales motion
- Traction: users, revenue, retention, pipeline, conversion
- Unit economics: CAC, payback, gross margin, churn
- Moat + competition
- Team edge
- Funding ask (optional) and runway (only if user volunteers)

Output style rules
- Keep it short. No long essays.
- Use bullets and numbers.
- Be direct; call out fuzzy thinking.
- When uncertain, say what you need to know.
- Never hallucinate facts. If you don’t know, say you don’t know.
- No brand impersonation beyond “digital twin” framing.

RAG / context usage
You may be given “CONTEXT” snippets. Use them to match style and factual grounding.
- If context conflicts, prefer user-provided facts.
- Cite sources only if explicitly provided in context; otherwise don’t fabricate citations.

MCP Memory Protocol (Get Memory + Save Memory)

You have two tools:
- 'Get Memory': retrieves stored memories for this user/thread.
- 'Save Memory': stores a new memory entry.

On EVERY new user message, do this in order:
1) Call Get Memory for the current user/thread. Read the returned memories carefully.
2) Use those memories in your reply. If a fact is already stored, do NOT ask for it again.
3) Store user message using the tool 'Save Memory'.
   Noteworthy info includes:
   - identity/preferences (name, pronouns, tone/format preferences)
   - company/product/market/stage
   - goals, constraints, timelines, budgets (if shared)
   - traction, pricing, GTM, key metrics, team details
4) If noteworthy info exists AND it is not already in memory (or it updates/overrides prior info), call Save Memory.
   - 'Save Memory' content must be compact and explicit.
   - Prefer one memory per distinct fact.
   - Include user_id/thread_id if available.
   - Include a timestamp if available.
5) Never tell the user you used Get Memory or Save Memory. Continue normally.

Memory safety rules:
- Do NOT store highly sensitive data (passwords, bank numbers, government IDs, precise home address, etc.).
- If the user asks to delete/forget, comply via your workflow and stop using the deleted facts.

Memory output format for Save Memory (recommended)
- category: (Identity | Preferences | Company | Product | Market | Traction | Pricing | GTM | Team | Goals | Constraints)
- fact: a single clear sentence
- confidence: 0.0–1.0
- source: “user”
- ts: ISO timestamp (if available)

Start behavior (intro handled externally)
- The platform will send a fixed intro message before the model responds:
  "What's up, founder? Want to pitch me in 60 seconds, build a plan, or learn some Draper-style VC thinking? Let's go!
  (Quick note: I’m Tim Draper’s digital twin: education only, no investment advice.)"
- DO NOT repeat this intro or disclaimer on the first assistant turn.
- Only restate the disclaimer if the user asks for investment/meeting, requests professional advice, or seems confused about limitations.

Safety/refusal template (investment/meeting/advice)
“I can’t make investment decisions or give professional advice. I’m here to help you sharpen the pitch. Let’s stress-test it: …”

Pitch scoring (internal behavior)
Always maintain a mental score (0–100) based on clarity + market size + wedge + traction + economics + team. Don’t mention a numeric score unless asked.

USER MESSAGE FORMAT YOU MAY RECEIVE
- USER MESSAGE: ...
- CONTEXT: ...
- HISTORY: ...
- MEMORIES: ...
You respond with the best Draper Twin reply.

The tradeoffs with the prompt are key. If you trim instructions for speed, you risk losing accuracy.

Most implementations try to solve everything with the prompt. That’s why they’re slow. There’s also evidence of diminishing returns regarding prompt length (which also applies to video prompting, e.g., Sora).

Prompt engineering is a science unto itself, which is why OpenAI, Anthropic, and other companies provide prompt optimizers or guides on best practices.

PRO TIP: Use markdown or XML format to optimize LLM ability to understand the prompt. I didn’t use it here, but it’s a good practice to implement.

2. RAG Architecture

While LLMs offer surprisingly good insights out of the box when prompted correctly, they don't have access to company or person-specific private knowledge. This is why RAG (Retrieval-Augmented Generation) is used. In the visual below, it's part of the "Knowledge Store & Tools."

However, if your vector store lacks valuable context or chunks information poorly, the AI fills gaps by hallucinating.

I used OpenAI embeddings with a rather basic chunking strategy (i.e., breaking big texts into small pieces so AI can grab the most relevant bits for smarter answers).

To get the most out of your RAG, you should consider these strategies:

1. Reranking: Retrieve 20-30 chunks, rerank top 5 with cross-encoder. Single biggest accuracy gain (~15-20%).

2. Query transformation: Rephrase user query into 3-5 variations OR use HyDE (generate hypothetical answer, search with that). Captures different phrasings of same intent.

Query Rephrasing Prompt Example:

"Rephrase this query 3 different ways:
'{user_query}'
Return only the rephrased queries, one per line."

Cost: ~100 tokens, adds 200-500ms latency

HyDE Method Prompt Example:

"Write a detailed answer to: '{user_query}'
Don't worry if you don't know - write what a good answer would look like."

Cost: ~200-500 tokens, adds 500ms-1s latency

3. Semantic chunking: Split at natural boundaries (paragraphs/sentences), not fixed tokens. Add 10-20% overlap. Include metadata.

Regardless, the most important factor is to supply RAG with a quality knowledge base and use proper prompt engineering to extract the right info.

Garbage in, hallucinations out.

PRO TIP: It’s not always necessary to use RAG. If the knowledge base primarily contains publicly available information, for example via YouTube (if Tim Draper had a podcast which he consistently published on YouTube), you may not even have to create a RAG, as Google (which owns both Gemini and YouTube) already has a pretty great knowledge base as is.

Simple query on Tim Draper’s thoughts via Gemini

3. Fine-Tuning (Or Lack Thereof)

Without 50-100 high-quality Tim Draper Q&A pairs (OpenAI recommends starting with 50 well-crafted examples), the model can’t truly replicate his thinking.

Most teams skip fine-tuning entirely. They think better prompts compensate. While great prompts go a long way, they don’t go ALL the way.

Fine-tuning is usually done using Q&A pairs, which you can either:

Manually create training data from actual Draper conversations/content (or from the man himself)
Synthetically generate it with another AI to mimic his style

Both work. Synthetic training data works surprisingly well and is often the more practical option.

4. Document Processing

Tim Draper’s chatbot is, crucially, still missing a PDF attachment element, which is a huge part of startup pitching. In other words, its user info customization is very limited. For a pitch qualifier, the system MUST accept and analyze pitch deck PDFs.

PRO TIP: LLMs process text extracted via OCR from DOC/DOCX and TXT files better than PDFs; however, PDFs may contain important information not present in simple text chats.

5. Information Storage & Extraction

Proper data storage is critical for both the conversational experience and the qualification pipeline. Without structured storage, you can’t track conversation quality, extract insights, or automate pitch evaluation (or other actions based on user inputs).

For my build, I used Supabase as the backend database for the chatbot. It stores and retrieves key variables like chat ID, user messages, previous AI responses, and more.

The bare minimum variables stored should be user ID (e.g., via connecting with Google Sign-In (OAuth 2.0) or at least manual email address submission) and user messages (or summarized startup idea). Good-to-haves are AI responses to evaluate how well the LLM agent performs.

As of this writing, Tim’s chatbot doesn’t seem to use Google Sign-In or email identification.

Equally important is the qualification process of the founder ideas. Naturally, any efficiency gains from the AI agent would be gone if Tim Draper had to manually read through all the user inputs.

So either hard-coded rules are used to filter, or more likely, another AI process is used (either in parallel or separately) to evaluate the ideas. AI is actually pretty good at this. It can qualify anything from pitches to leads in real estate (see next section) in real time.

My simple “Extract Best Pitches” workflow runs on a schedule, pulls all conversation logs, analyzes them with a separate LLM, flags high-potential startups automatically, and updates the database. You could then send the approved ideas/pitches via Gmail or another email provider to Tim.

The critical architecture question here is:

(a) Store ALL conversation logs, then filter with a separate process?
(b) Instruct the AI to extract only relevant info during the conversation?
(c) Only save the summaries of the promising pitches?

Option (a) is safest because (b) and (c) require extremely precise prompt engineering. But (a) creates cost and latency problems at scale. Overall, (c) is the most elegant solution but gives the AI only one shot to identify quality pitches in real-time, no reruns.

6. Voice Infra Selection

Backend provider choice matters more than people realize.

PitchBook reports Draper’s system uses TalkMe plus “undisclosed companies.”

Currently, there are several high-quality third-party voice AI platforms like Vapi, which doesn’t require you to write code but allows in-depth developer customization. I used Vapi for a real estate qualification AI agent demo I created (you can talk to it while credits last). The difference in stability and latency versus generic providers is significant.

Regardless, the same architectural principles apply for voice as text: system prompt length, fine-tuning, and context management all affect voice latency and answer quality.

Trust is the Most Important Currency

So why all this technical talk? Namely, because these new customer interfaces can and will have a huge effect on company P&L and stock prices if publicly traded. Down the line, a poorly rolled out AI initiative could have a similar effect to a poorly rolled out marketing campaign, which directly impacts the bottom line.

Poor implementation can actively damage your brand:

Users get nothing from the interaction and are disincentivized to engage
The experience leaves a bad impression that harms the company behind it
The customer moves on to the next brand or VC
The chat backend fails to learn about the customer (e.g., Tim Draper’s system could fail to select the right pitches for review)

Currently, and for the near future, AI’s biggest challenge is gaining human trust. Without trust, we find LLMs of little use. If we have to double-check that each answer is correct, it negates the efficiencies AI promises. If AI doesn’t offer novel information or experiences, then it’s just a useless gimmick.

Tying this back directly to the chatbots, trust is created in three main ways:

Accuracy (Self-explanatory)
Personalization (Remembering our preferences creates more relevant and trustworthy answers. It also elicits psychological trust by bringing a humanized element to the conversation)
Speed (Even if the answer is valuable, slowness creates low perceived intelligence, which in turn is often associated with low trust)

Ultimately, there’s a huge delta between basic “out of the box” builds and thoughtful AI agent/workflow builds.

Three Takeaways

Building this was mostly tradeoffs: prompt length vs. latency, logging vs. extraction, hallucination control without fine-tuning, and minimum viable context for scoring.

Questions I had to answer:

How do I balance prompt length with response speed?
Should I optimize for comprehensive logging or real-time extraction?
How do I prevent hallucinations without fine-tuning?
What’s the minimum viable context for accurate pitch evaluation?

The three key takeaways:

1. Every performance issue has a technical root cause

“Glacially slow” = system prompt length + context window management
“Flaky” = RAG quality + insufficient fine-tuning + provider choice
Hallucinations = missing training data + weak output structure

2. The backend determines everything

How you store conversation logs, extract insights, and score opportunities matters more than the conversational UX. Most teams get this backwards, they obsess over the chat interface and ignore the qualification pipeline.

3. Quality is not optional

A mediocre AI agent means: (a) users disengage, (b) your brand suffers, (c) you miss the exact opportunities you built this to capture.

What separates an effective system from a failing one isn’t just the AI model but the architecture that supports it.

Working on AI or digital asset strategy? I help operators and brands identify high-leverage opportunities, then build them. Strategic consulting + productized automations.

If any of this resonates, reach out on LinkedIn.

Join for emerging tech insights from strategy to implementation. Trusted by Dior, JPMorgan, and Nike. Because after all, Jeff Bezos found the idea for Amazon via a newsletter just like this.

Neural Foundry

Jan 9

The reranking insight is underrated. Most RAG implementations just grab top-k chunks and hope for the best, but that 15-20% accuracy gain from cross-encoder reranking is exactly where chatbot quality jumps from "kinda helpful" to actually useful. Been working with vector stores for a bit and the latency tradeoff you mentioned (200-500ms) is totally worth it when the alternative is hallucinating nonsense that tanks user trust. Also the point about logging vs real-time extraction is spot on, teams always underestimate how mcuh storage costs explode at scale.

1 reply by Jurgis Pocius

1 more comment...

The Internet Economy

Discussion about this post

Ready for more?