The Dream Was Simple
I wanted to say "spent 15 dollars on coffee" and have it just... work. No opening an app, no tapping through screens, no picking categories from endless dropdowns. Just speak and move on.
Turns out, making something feel simple requires solving a lot of complex problems behind the scenes.
Why Existing Apps Fail
I've tried most expense tracking apps. They all have the same problem: too many steps between "I bought something" and "it's recorded."
- Open the app
- Tap "Add Expense"
- Enter the amount
- Pick a category
- Maybe add a note
- Save
By step 3, I've already forgotten what else I bought that day. The friction isn't in any single step—it's in all of them combined.
Voice should eliminate this entirely. But voice done poorly is worse than no voice at all. Anyone who's yelled "NAVIGATE HOME" at their car three times knows this pain.
Building the Pipeline
After months of iteration, here's what happens when you tap the microphone in eqva money:
Step 1: Recording
The app captures audio using AAC-LC encoding at 44.1 kHz. Nothing fancy here—just clean audio that's small enough to upload quickly. You see a real-time waveform so you know it's listening.
Step 2: Transcription
The audio goes to Groq's Whisper API (whisper-large-v3). This is OpenAI's Whisper model, but Groq runs it fast. Really fast. What used to take seconds now takes milliseconds.
The result: your spoken words as text.
Step 3: The Hard Part—Understanding Intent
Here's where it gets interesting. "Spent fifteen dollars on coffee" is easy. But real human speech is messy:
- "I spent... no wait, it was twenty dollars, at Starbucks"
- "Paid like 50 bucks for groceries, maybe 52"
- "Fifteen dollars for coffee and then another forty for lunch"
People correct themselves. They're uncertain. They mention multiple expenses in one breath.
I use Groq's LLaMA model to parse this chaos into structured data:
```json
{
  "amount": 20,
  "currency": "USD",
  "category": "dining-out",
  "merchant": "Starbucks",
  "reason": "coffee"
}
```
The model is specifically prompted to detect corrections ("no wait", "actually", "I mean") and use only the final value. This one detail eliminated so many duplicate entries.
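The correction logic can be sketched as a small post-processing pass. This is a hypothetical illustration, not the actual implementation—the real work happens in the LLM prompt, which also handles spelled-out numbers like "twenty"—and the function name and regexes are mine:

```python
import re
from typing import Optional

# Markers that signal the speaker is overriding an earlier value.
CORRECTION_MARKERS = re.compile(r"\b(no wait|actually|i mean)\b", re.IGNORECASE)
AMOUNT = re.compile(r"\d+(?:\.\d+)?")

def final_amount(utterance: str) -> Optional[float]:
    """Return the amount spoken after the last correction marker, if any."""
    # split() with a capturing group keeps the markers; the last element
    # is everything said after the final correction.
    tail = CORRECTION_MARKERS.split(utterance)[-1]
    match = AMOUNT.search(tail)
    if match:
        return float(match.group())
    # No amount after the correction: fall back to the last amount anywhere.
    amounts = AMOUNT.findall(utterance)
    return float(amounts[-1]) if amounts else None
```

Only the value after the final "no wait" survives, which is exactly what prevents the duplicate-entry problem.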
The Currency Problem
This one took me a while to figure out.
When someone says "fifty dollars," which dollars? US? Australian? Canadian? Singapore?
The answer depends on context. If your default currency is AUD, "dollars" probably means Australian dollars. If you're in India and say "dollars," you probably mean USD because you're explicitly not saying "rupees."
I built a currency context system that maps 59 currency families. When the LLM processes your speech, it knows your default currency and makes intelligent guesses:
- Default currency is INR → "dollars" = USD (you'd say "rupees" for INR)
- Default currency is USD → "dollars" = USD
- Default currency is AED → "dollars" = USD (you'd say "dirhams" for AED)
It's not perfect, but it's right about 95% of the time. And you can always edit before saving.
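The resolution rule above can be sketched as a lookup. A minimal sketch of the idea, not the app's actual 59-family table; the names and the tiny mapping are illustrative:

```python
# Currencies whose everyday unit is called "dollars".
DOLLAR_FAMILY = {"USD", "AUD", "CAD", "SGD", "NZD"}

# A few currencies the speaker would name explicitly (illustrative subset).
NAMED_UNITS = {"rupees": "INR", "dirhams": "AED", "euros": "EUR", "pounds": "GBP"}

def resolve_currency(spoken_word: str, default_currency: str) -> str:
    """Map a spoken currency word + the user's default currency to an ISO code."""
    word = spoken_word.lower()
    if word in ("dollar", "dollars", "bucks"):
        # If your default is itself a dollar currency, "dollars" means that one.
        # Otherwise (INR, AED, ...) you'd have said "rupees"/"dirhams", so assume USD.
        return default_currency if default_currency in DOLLAR_FAMILY else "USD"
    return NAMED_UNITS.get(word, default_currency)
```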
Category Intelligence
Nobody says "I spent money on food and beverage, subcategory caffeinated drinks." They say "coffee" or "Starbucks" or "my morning fix."
I mapped over 100 natural language variations to 23 app categories:
- "uber", "lyft", "taxi", "cab" → Transportation
- "netflix", "spotify", "hulu" → Subscriptions
- "coffee", "starbucks", "cafe" → Dining Out
- "rent", "mortgage", "lease" → Housing
The LLM does the initial categorization, and this mapping catches common patterns it might miss.
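One way to combine the two passes looks like this. The table below is a tiny illustrative slice of the 100+ variations, and the function name is hypothetical:

```python
# Illustrative slice of the keyword -> category table.
CATEGORY_KEYWORDS = {
    "transportation": {"uber", "lyft", "taxi", "cab"},
    "subscriptions": {"netflix", "spotify", "hulu"},
    "dining-out": {"coffee", "starbucks", "cafe"},
    "housing": {"rent", "mortgage", "lease"},
}

def map_category(llm_guess: str, utterance: str) -> str:
    """Keyword pass that catches patterns the LLM's first guess might miss."""
    words = set(utterance.lower().split())
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & keywords:
            return category
    return llm_guess  # no keyword hit: trust the model
```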
Payment Source Matching
This was a surprise feature that emerged from user testing.
People say things like "paid with my HDFC card" or "put it on the Amex." If you've added your payment sources to the app, eqva money automatically matches these mentions.
The matching algorithm scores potential matches based on:
- Exact name matches (60 points)
- Bank name mentions (40 points)
- Card type keywords (30 points)
- Last 4 digits if mentioned (50 points)
So "paid with my Chase Sapphire" correctly links to your Chase Sapphire card without you having to select it manually.
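The weights are the ones listed above; the matcher around them is a hypothetical sketch (the `PaymentSource` fields and substring checks are my assumptions, not the app's code):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PaymentSource:
    name: str       # e.g. "Chase Sapphire"
    bank: str       # e.g. "Chase"
    card_type: str  # e.g. "credit"
    last4: str      # e.g. "4242"

def score(source: PaymentSource, mention: str) -> int:
    """Score one payment source against the spoken mention."""
    text = mention.lower()
    pts = 0
    if source.name.lower() in text:
        pts += 60   # exact name match
    if source.bank.lower() in text:
        pts += 40   # bank name mentioned
    if source.card_type.lower() in text:
        pts += 30   # card type keyword
    if source.last4 and source.last4 in text:
        pts += 50   # last 4 digits mentioned
    return pts

def best_match(sources: List[PaymentSource], mention: str) -> Optional[PaymentSource]:
    """Pick the highest-scoring source, or None if nothing matched at all."""
    scored = [(score(s, mention), s) for s in sources]
    top_score, top = max(scored, key=lambda pair: pair[0], default=(0, None))
    return top if top_score > 0 else None
```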
The State Machine
Behind the scenes, every voice recording moves through a state machine:
idle → recording → processing → extracting → reviewing → saving → idle
Each state has its own UI and error handling. If transcription fails, you can retry. If extraction produces something weird, you can edit. If you change your mind, you can cancel.
The "reviewing" state is crucial. You see exactly what the AI extracted before anything is saved. Swipe to edit, swipe to delete, tap to confirm. Trust but verify.
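The state machine can be sketched as a transition table plus a guard. The exact transitions here (retry, edit, cancel paths) are my reading of the description above, not the app's source:

```python
# Legal transitions; values are the states reachable from each key.
TRANSITIONS = {
    "idle": {"recording"},
    "recording": {"processing", "idle"},           # cancel returns to idle
    "processing": {"extracting", "recording"},     # retry failed transcription
    "extracting": {"reviewing", "recording"},
    "reviewing": {"saving", "extracting", "idle"}, # confirm, edit, or cancel
    "saving": {"idle"},
}

class VoiceEntry:
    def __init__(self) -> None:
        self.state = "idle"

    def advance(self, next_state: str) -> None:
        """Move to next_state, refusing anything the table doesn't allow."""
        if next_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {next_state}")
        self.state = next_state
```

Encoding the transitions as data keeps each state's UI and error handling from accidentally jumping somewhere illegal.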
What I Learned from Speech to Note
Building Speech to Note taught me that voice interfaces need three things:
- Speed: If processing takes more than 2-3 seconds, it feels broken
- Accuracy: 90% accuracy isn't good enough—that's one error every ten entries
- Graceful recovery: When it fails (and it will), make fixing easy
eqva money inherits all these lessons. Groq's infrastructure handles speed. The LLM prompt engineering handles accuracy. The review screen handles recovery.
What's Next
Voice is never "done." I'm working on:
- Better accent handling: Testing with more diverse speech patterns
- Multi-language support: Starting with Spanish and Hindi
- Learning your patterns: If you always buy coffee at the same place, pre-fill accordingly
The goal remains the same: speak naturally, get accurate results, move on with your life.
This is part of my "Building in Public" series where I share the technical decisions behind eqva money.