
6 posts tagged with "Gen AI at AAA"

AAA


aaa

· One min read

First GenAI hire brought in to architect and scale the enterprise GenAI program, partnered with product leadership to launch 7 product lines and deliver $4.8M in annual operational savings. Architected & scaled the flagship roadside assistance platform from 0 to 1, serving 60K+ DAUs, and driving a 44% increase in accuracy across AI-powered interactions.

The Engineering That Made 60,000 Emergency Requests Possible

· 5 min read

In emergency assistance, latency is more than a metric. It is a liability.

We were building a multi-agent AI system for roadside assistance. Voice agents, supervisor agents, dispatch agents - all orchestrating in real-time to help stranded drivers.

On paper, the initial architecture looked clean. Each agent had its own database connection, authentication flow, and lifecycle. Clear separation of concerns. Easy to reason about and easy to ship.

But the math didn’t hold for a system designed to serve 60,000 daily users.

Where the Math Breaks

Before a single byte of useful data moves, every database connection pays a fixed cost:

TCP Handshake: ~30ms
TLS Negotiation: ~50-100ms
Auth & Session Setup: ~30-50ms
Total overhead: ~150-200ms per connection.

In our initial design, a single user emergency could trigger 20 sequential agent actions. If each agent opened its own connection, that’s 3-4 seconds of pure network overhead per user. In emergency assistance, 4 seconds is an eternity.

At full theoretical load, over a million connection attempts daily would push memory requirements into terabyte scale.

Even after accounting for realistic concurrency, connection overhead alone would consume a significant portion of available memory, exhausting our database threads and memory buffers long before we hit peak traffic.
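The arithmetic behind those estimates can be checked in a few lines. This is only a sketch built from the figures quoted above; the constant and function names are illustrative, and the daily total assumes one emergency per user per day, which is a simplification.

```javascript
// Per-connection setup cost quoted above, in milliseconds (low and high estimates).
const PER_CONNECTION_OVERHEAD_MS = { low: 150, high: 200 };

// Sequential agent actions triggered by one emergency, each opening its own connection.
const SEQUENTIAL_ACTIONS = 20;

// Total setup overhead for one workflow when no connection is shared.
function workflowOverheadMs(actions, perConnection) {
  return {
    low: actions * perConnection.low,
    high: actions * perConnection.high,
  };
}

const overhead = workflowOverheadMs(SEQUENTIAL_ACTIONS, PER_CONNECTION_OVERHEAD_MS);
console.log(`${overhead.low / 1000}-${overhead.high / 1000} s of setup per workflow`);
// prints "3-4 s of setup per workflow"

// Assumption: one emergency per user per day.
const DAILY_USERS = 60000;
const dailyConnections = DAILY_USERS * SEQUENTIAL_ACTIONS; // 1,200,000 connection attempts
```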

With independent connections per agent, failure cascades through the system like this:

  1. Agents compete for connections
  2. Connection pools saturate
  3. Requests queue behind slower queries
  4. Timeouts trigger retries
  5. Retries amplify load

The Pivot: From Isolation to Agentic Concurrency

For inspiration, I looked at the problem through the lens of autonomous vehicle coordination, where self-driving cars share a unified, real-time state of the map they are driving on.

Similarly, our agents don’t act in isolation but as a swarm. They are part of the same workflow, operating on the same user context, within the same time window.

When a user says "I have a flat tire," the Voice Agent, Context Agent, and Dispatch Agent all need data simultaneously. Here, the database connection becomes a shared highway.

If they fight for separate connections from a standard pool, they create head-of-line blocking. They wait. They timeout. They fail.

Standard connection pooling helps, but it doesn’t solve the core issue: Agentic Concurrency.

The Solution: Intelligent Multiplexing

We built a shared connection layer that allowed multiple agents to pipeline queries over a single, persistent TLS tunnel. We moved from a "connection-per-request" model to a "session-per-workflow" model.

1. Super-connection (multiplexing) 

Multiple agents can send queries over the same TCP/TLS tunnel simultaneously. When the first agent (Voice) connects, it establishes the TLS handshake and auth. Subsequent agents (Context, Dispatch) don’t open new connections; they borrow the existing secure channel.
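A minimal sketch of the borrowing behavior, under stated assumptions: `WorkflowConnections` and `fakeOpenConnection` are hypothetical names, and the stand-in connection only counts handshakes. Whether queries can truly share one channel depends on the database driver and protocol, which the post does not specify.

```javascript
// Hypothetical registry: one shared connection per emergency workflow.
class WorkflowConnections {
  constructor(openConnection) {
    this.openConnection = openConnection; // pays TCP + TLS + auth once per call
    this.bySession = new Map();
  }

  // The first caller for a session triggers the handshake. Later callers get
  // the same (possibly still pending) connection promise, so nobody reconnects.
  acquire(sessionId) {
    if (!this.bySession.has(sessionId)) {
      this.bySession.set(sessionId, this.openConnection(sessionId));
    }
    return this.bySession.get(sessionId);
  }
}

// Stand-in for a real driver: counts how many handshakes were paid.
let handshakes = 0;
async function fakeOpenConnection(sessionId) {
  handshakes += 1;
  return { sessionId, query: async (sql) => `result of ${sql}` };
}

async function demo() {
  const connections = new WorkflowConnections(fakeOpenConnection);
  // Voice, Context, and Dispatch agents all ask at the same time.
  const [voice, context, dispatch] = await Promise.all([
    connections.acquire("session-1"),
    connections.acquire("session-1"),
    connections.acquire("session-1"),
  ]);
  return { sameConnection: voice === context && context === dispatch, handshakes };
}
```

Storing the pending promise rather than the resolved connection is what keeps three simultaneous callers from each starting their own handshake.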

2. Query Pipelining (Removing Head-of-Line Blocking)

Standard pools are FIFO (First-In, First-Out). If Agent A is slow, Agent B waits.

With asynchronous query pipelining over the shared connection, agents send their database requests into a prioritized queue on the client side. The database processes them as fast as it can, returning results out of order if necessary.

As a result, agents no longer block each other. A slow "log this event" query doesn’t delay a critical "get user location" query.
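One way a client-side prioritized queue could look, as a sketch rather than the production design. `PrioritizedPipeline`, its `maxInFlight` limit, and the numeric priorities are all assumptions; `send` stands for whatever ships a query over the shared connection.

```javascript
// Hypothetical client-side queue: a lower priority number means more urgent.
class PrioritizedPipeline {
  constructor(send, maxInFlight = 4) {
    this.send = send; // async function that ships one query over the shared connection
    this.maxInFlight = maxInFlight;
    this.inFlight = 0;
    this.pending = [];
  }

  submit(query, priority) {
    return new Promise((resolve, reject) => {
      this.pending.push({ query, priority, resolve, reject });
      // Most urgent first; ties keep arrival order because Array.prototype.sort is stable.
      this.pending.sort((a, b) => a.priority - b.priority);
      this.drain();
    });
  }

  drain() {
    while (this.inFlight < this.maxInFlight && this.pending.length > 0) {
      const job = this.pending.shift();
      this.inFlight += 1;
      this.send(job.query)
        .then(job.resolve, job.reject)
        .finally(() => {
          this.inFlight -= 1;
          this.drain(); // a finished query frees a slot for the next one
        });
    }
  }
}

// Demo: a slow logging query and a fast location query share the pipeline.
async function demo() {
  const fakeSend = (query) =>
    new Promise((resolve) => setTimeout(() => resolve(query), query.startsWith("log") ? 30 : 1));
  const pipeline = new PrioritizedPipeline(fakeSend, 2);
  const finished = [];
  await Promise.all([
    pipeline.submit("log this event", 5).then((q) => finished.push(q)),
    pipeline.submit("get user location", 0).then((q) => finished.push(q)),
  ]);
  return finished; // the location query finishes first
}
```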

3. Intelligent Lifecycle Management

We tied the connection lifecycle to the user’s emergency session, not to an individual agent’s execution time. This prevents resource leaks while maximizing reuse during the critical window of the emergency.
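The session-scoped lifecycle can be sketched with a try/finally wrapper. `withEmergencySession`, `openConnection`, and `runWorkflow` are hypothetical names; the only assumption about the connection is that it exposes an async `close()`.

```javascript
// Hypothetical helper: the connection lives exactly as long as the emergency session.
async function withEmergencySession(openConnection, runWorkflow) {
  const connection = await openConnection();
  try {
    // Every agent in the workflow receives the same connection.
    return await runWorkflow(connection);
  } finally {
    // Runs on success and on failure, so a crashed agent cannot leak the connection.
    await connection.close();
  }
}
```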

The Impact:

Metric           | Before (Isolated)              | After (Multiplexed)           | Improvement
Round Trips      | 120-150 per workflow           | 4-7 per workflow              | 95% Reduction
Latency Overhead | ~4,000ms per workflow          | ~20ms amortized               | 99.5% Reduction
Memory Usage     | ~40MB per concurrent workflow  | ~2MB per concurrent workflow  | 95% Reduction

We didn’t just save time. We changed the complexity class of the system.

Before: O(agents × users × connection overhead)
After: O(active workflows × shared connections)

The Human Outcome

Engineering decisions stay abstract until they hit the road. Here’s what those milliseconds actually bought us:

  • 2-3 Minutes Faster Response: By shaving 4 seconds off every interaction, we accelerated the entire dispatch chain. In urban traffic, that’s the difference between a tow truck arriving before rush hour peaks or after.
  • 99% Fewer Failures: Connection storms cause timeouts. Timeouts cause retries. Retries cause cascading failures. By stabilizing the network layer, we stabilized the user experience.
  • Scalability Without Panic: When we hit regional disasters (spikes to 10x normal load), the system scaled linearly. We didn’t need to throw hardware at the problem; we had already solved it in software.

The Lesson: Complexity is a Choice

It’s tempting to let architecture drift. To let each microservice own its own connections, its own configs, its own chaos. It feels "clean" in the short term.

But in high-stakes systems, local decisions compound.

Treating connections as shared, high-cost primitives made the system faster and viable at scale.

Why We Chose the Hard Path of Building Our AI Stack

· 6 min read

It was 2024. The pressure to buy was immense.

Every week, a new vendor would pitch us. The message was seductive in its simplicity: “You don’t want to build this since it’s hard and messy. Let us handle it.”

Honestly, building from the ground up can be expensive. It keeps you up at night.

Yet, we chose to build anyway.

Looking back from 2025, I know this wasn’t the obvious choice. In fact, for most companies, it would have been a mistake. But for us, in that specific window of time, it was the only way to survive. Here is why we did it, what it cost us, and what we learned about the true price of ownership.

The Landscape Was Empty

What people forget about early 2024 is how little actually existed.

LangChain was evolving weekly. Bedrock had just launched. Most “production-ready” demos were smoke and mirrors, often just polished abstractions of basic Q&A.

We had a specific problem: a sixteen-step emergency roadside workflow. We needed a system that could route between specialized agents without losing context, trigger frontend behaviors like GPS detection, and render maps—all without forcing us into a rigid API contract before we even knew what the product looked like.

The vendors couldn’t do this, because they were building for the average user. They were optimized for horizontal chatbots. We needed deep, vertical orchestration. The open-source ecosystem wasn’t ready either.

We stood at a crossroads. Wait for the market to mature (and lose our first-mover advantage), or build it ourselves (and risk burning our runway).

We chose the risk.

What We Built Instead, and What Broke Along the Way

At the time, everyone treated LLMs as chat interfaces. We had to treat them as orchestration primitives.

We built a multi-agent system where agents could share full conversational context. This meant a specialist agent could take over mid-conversation without the user repeating themselves. It sounds simple now, but in 2024, it was fragile. It broke often. We spent weeks debugging context windows that leaked memory and prompts that hallucinated steps.

But when it worked, it unlocked a lot.

We built an orchestrator-supervisor agent with one main job: deciding where the conversation should go.

General query? → Knowledge base agent.
Emergency? → Specialized roadside agent with execution tools.
(and other use cases)

This allowed us to model an entire workflow that completed tasks, while also answering questions.
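The routing decision above can be sketched as a small lookup. Everything named here is hypothetical: the agent names are placeholders, and the keyword check merely stands in for however the supervisor actually classified intent, which the post does not describe.

```javascript
// Hypothetical routing table; the agent names are placeholders.
const ROUTES = {
  emergency: "roadside-agent",
  general: "knowledge-base-agent",
};

// Stand-in for the supervisor's decision. Keyword matching only keeps the
// sketch runnable; a real supervisor would likely ask a model.
function classifyIntent(message) {
  const emergencyHints = ["flat tire", "stranded", "tow", "accident", "won't start"];
  const text = message.toLowerCase();
  return emergencyHints.some((hint) => text.includes(hint)) ? "emergency" : "general";
}

// The supervisor hands the full conversation to whichever agent it picks,
// so the user never has to repeat themselves.
function route(conversation) {
  const latest = conversation[conversation.length - 1];
  const intent = classifyIntent(latest.text);
  return { agent: ROUTES[intent], context: conversation };
}
```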

The Protocol That Didn't Exist

The system needed to trigger browser behaviors: requesting GPS access, rendering maps, and displaying nearby service options.

A traditional approach would require tightly coupled API contracts between backend and frontend. We avoided that entirely, because it would have slowed both us and our agents to a crawl.

Instead, we embedded structured markers within the model’s output stream. In this inelegant design, the frontend listened for these signals and reacted in real time, without any rigid schema, versioned contracts, or coordination overhead. Hence, the interface became responsive to the model, not dependent on it.
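To make the idea concrete, here is a sketch of a frontend parser for such markers. The marker syntax `[[ui:action]]` and the names `createStreamParser`, `onText`, and `onAction` are invented for illustration; the post does not say what the real markers looked like.

```javascript
// Assumed marker syntax: [[ui:ACTION]] embedded in the model's text stream.
const MARKER = /\[\[ui:([a-z_]+)\]\]/g;

function createStreamParser(onText, onAction) {
  let buffer = "";
  return {
    push(chunk) {
      buffer += chunk;
      // Hold back a trailing partial marker (e.g. "[[ui:requ" or a lone "[")
      // until the rest of it arrives in a later chunk.
      let cut = buffer.length;
      const open = buffer.lastIndexOf("[[");
      if (open !== -1 && buffer.indexOf("]]", open) === -1) cut = open;
      else if (buffer.endsWith("[")) cut = buffer.length - 1;
      const ready = buffer.slice(0, cut);
      buffer = buffer.slice(cut);

      let last = 0;
      for (const match of ready.matchAll(MARKER)) {
        if (match.index > last) onText(ready.slice(last, match.index));
        onAction(match[1]); // e.g. "request_gps"
        last = match.index + match[0].length;
      }
      if (last < ready.length) onText(ready.slice(last));
    },
    // Call when the stream closes so held-back text is not lost.
    end() {
      if (buffer) onText(buffer);
      buffer = "";
    },
  };
}
```

A caller would pass one callback that appends text to the chat bubble and another that maps action names to browser behavior, such as prompting for location access.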

The solution was brittle as we were coupling our UI logic to the output of a probabilistic model. Any seasoned engineer would raise an eyebrow. In 2024, elegance was a luxury. We traded engineering purity for product velocity. We accepted the technical debt because it allowed us to ship a feature that felt magical to users—weeks before our competitors could even draft their API specs.

(Note: Today, with mature frameworks and standardized JSON modes, we would approach this differently. But back then, this hack was our lifeline.)

The Fear of Lock-In

While we hacked the UI layer, we refused to hack the model layer.

We were terrified of getting stuck with a single provider. In 2024, vendors weren’t just selling tools; they were trying to own our stack. They pushed their models, their embeddings, their proprietary formats. We knew that if we leaned too hard into one provider, we’d lose our ability to pivot.

So, unlike the UI layer, we built a clean, thin abstraction layer for model access that served as our escape hatch.

We encapsulated model access and prompt handling behind a strict interface. This meant that when Model A became too expensive, or Model B suddenly got smarter at reasoning, we could switch traffic in hours.

While competitors were locked into six-month migration projects to switch providers, we were testing new models in production on a Tuesday afternoon. That agility wasn’t a feature; it was our survival mechanism.
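A thin abstraction of this kind might look like the sketch below. `ModelGateway`, the `providers` registry, and the `complete` method are hypothetical; real adapters would wrap each vendor's SDK behind the same method.

```javascript
// Hypothetical provider adapters; a real one would wrap a vendor SDK call.
const providers = {
  "model-a": { complete: async (prompt) => `A says: ${prompt}` },
  "model-b": { complete: async (prompt) => `B says: ${prompt}` },
};

// The only surface the agents see. Which provider answers is a config detail.
class ModelGateway {
  constructor(registry, activeName) {
    this.registry = registry;
    this.use(activeName);
  }

  // Switching providers is a one-line change that no agent code notices.
  use(name) {
    if (!this.registry[name]) throw new Error(`Unknown provider: ${name}`);
    this.activeName = name;
  }

  complete(prompt) {
    return this.registry[this.activeName].complete(prompt);
  }
}
```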

The Trade-off

This wasn’t the obvious choice.

Building meant owning complexity: agent coordination and orchestration, prompt design, streaming behavior, optimization through evaluation and observability, and constant iteration in an unstable ecosystem. It required time, focus, and a willingness to operate without established patterns.

Along the way, we made mistakes, grew the team and still burned out, and disrupted our work-life balance.

For many teams, buying would have been the right decision. It definitely saves a lot of time and sanity. If your use case is standard, please, just buy.

In our case, the requirements were too specific and the pace of change was too fast. The constraints made the decision clear.

The Outcome: Autonomy Over Savings

Because we built, we bought ourselves something money can’t easily purchase: autonomy.

Yes, we saved ~$500k in vendor license fees. But building isn't free either. It is paid for in engineering salaries and late nights.

The real financial win was the efficiency of spend. We optimized compute in ways vendors never would have allowed, because their margins depended on inefficiency.

The bigger win was avoiding the hidden tax of dependence:

  • No waiting for external roadmaps.
  • No negotiating contracts during peak traffic.
  • No hitting rate limits that killed our user experience.

We scaled on our own terms: our accounts, our limits, our decisions.

The choice between building and buying.

Although buying optimizes for speed today, building, when done for the right reasons, optimizes for control tomorrow.

Spending the complexity upfront gave us leverage and a foundation that is entirely ours.

If you’re sitting on the fence today, ask yourself:

  • Is this feature your core differentiator?
  • Is this something you cannot easily delegate to a vendor?
  • Would a vendor become a bottleneck for your compliance needs?

If the answer is "no" to any of these, buy.
Don’t be afraid to build if your survival depends on specificity and speed.

Nineteen Lambdas, Zero Deviation, and 4k Developer Hours Saved

· 5 min read

Nineteen Lambdas, Zero Deviation

We had nineteen Lambda functions. They all did different things. They all behaved the same way.

That wasn't an accident. It was enforcement.

The problem started like this: five functions, then eight, then twelve. Each written by a different person on a different day. One logged errors with stack traces. Another logged nothing at all.
One retried failed API calls three times. Another gave up immediately.
One returned CORS headers on every response. Another forgot them on errors—which meant browsers silently swallowed failures and left users staring at a frozen screen.

The fixes were easy. The pattern was not.

We were building a system where

  • every new function meant re-learning how to do the basics.
  • on-call required memorizing nineteen different behaviors.
  • bugs fixed in one place would still exist in countless others.

So I built a framework.


The Framework

Every new Lambda extended it. Override generateResponse() with your business logic. Maybe initializeHandler() if you needed setup. That's it.

The parent class handled everything else.

  • event parsing
  • logger initialization
  • execution timing
  • retry logic with exponential backoff
  • error classification
  • response formatting with CORS headers
  • request tracing via X-Ace-RequestId that linked frontend errors to backend logs—automatically present on every response, success or failure.

A developer writing a new action group didn't think about any of this.

  • If they forgot to log something, it was logged.
  • If they threw an error, it was caught and classified.
  • If a transient failure occurred, it retried automatically.

All they had to do now was write the code for the feature's business logic, and the framework completed the rest.
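A stripped-down sketch of what such a parent class could look like. Only `generateResponse()`, `initializeHandler()`, and the `X-Ace-RequestId` header come from the post; `BaseHandler`, `handle`, `format`, the wildcard CORS origin, and the fallback request ID are assumptions, and retries and error classification are left out to keep it short.

```javascript
// Sketch of the parent class. Names other than generateResponse(),
// initializeHandler(), and X-Ace-RequestId are hypothetical.
class BaseHandler {
  async initializeHandler() {} // optional setup hook

  async generateResponse(_body) {
    throw new Error("generateResponse() must be overridden");
  }

  async handle(event) {
    const requestId = (event.headers && event.headers["X-Ace-RequestId"]) || `req-${Date.now()}`;
    const startedAt = Date.now();
    try {
      await this.initializeHandler();
      const body = event.body ? JSON.parse(event.body) : {};
      const result = await this.generateResponse(body);
      return this.format(200, result, requestId);
    } catch (error) {
      // Every failure is logged with the request ID, whether or not the subclass logs.
      console.error(JSON.stringify({ requestId, error: error.message }));
      return this.format(500, { error: error.message }, requestId);
    } finally {
      console.log(JSON.stringify({ requestId, durationMs: Date.now() - startedAt }));
    }
  }

  // Same headers on success and failure, so browsers never swallow an error response.
  format(statusCode, payload, requestId) {
    return {
      statusCode,
      headers: { "Access-Control-Allow-Origin": "*", "X-Ace-RequestId": requestId },
      body: JSON.stringify(payload),
    };
  }
}

// A new Lambda only supplies its business logic.
class GreetingHandler extends BaseHandler {
  async generateResponse(body) {
    return { message: `Hello, ${body.name}` };
  }
}
```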

The first Lambda written this way worked on the first deploy.
So did the second. So did the next five, and the ones after that.


Two Arrays That Did the Work of Ten Engineers

The classification lived in an error-handling library. Two arrays.

// Errors that will fail the same way again: never retry.
const no_retry_exceptions = [
  'ValidationException',
  'AccessDeniedException',
  'ResourceNotFoundException'
];

// Transient errors: worth retrying with backoff.
const retry_exceptions = [
  'ThrottlingException',
  'ServiceQuotaExceededException',
  'InternalServerException',
  'ConflictException',
  'DependencyFailedException'
];

That's it.

Validation errors don’t retry, because invalid input won’t fix itself.
Throttling errors retry once, with backoff, because hammering an overloaded service only makes it worse.

When a new error type appeared, we updated the arrays once.
Nineteen Lambdas updated instantly.

No hunting. No meetings. No “did we get them all?”
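How the two arrays might drive a retry wrapper, as a sketch. `isRetryable`, `withRetry`, and their options are hypothetical names; the sketch assumes each error carries its type in `error.name`, and it treats unlisted errors as non-retryable, which is a guess about the default.

```javascript
const no_retry_exceptions = ["ValidationException", "AccessDeniedException", "ResourceNotFoundException"];
const retry_exceptions = [
  "ThrottlingException",
  "ServiceQuotaExceededException",
  "InternalServerException",
  "ConflictException",
  "DependencyFailedException",
];

// Assumption: errors not in either list are not retried.
function isRetryable(error) {
  if (no_retry_exceptions.includes(error.name)) return false;
  return retry_exceptions.includes(error.name);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries a retryable failure up to maxRetries times, doubling the delay each time.
async function withRetry(operation, { maxRetries = 1, baseDelayMs = 100 } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await operation();
    } catch (error) {
      if (!isRetryable(error) || attempt >= maxRetries) throw error;
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```

Because every handler calls the same wrapper, adding a new exception name to one array changes the behavior of all of them at once.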


Twenty-One Lines That Made a Class of Bugs Extinct

In MongoDB’s $set operator, a nested object value doesn’t merge with the stored one; it replaces it. You think you’re updating a field. You’re actually deleting everything else.

For instance, if you had { user: { name: "John", phone: "555-1234" } } and you ran { $set: { user: { phone: "555-9999" } } }, you'd get { user: { phone: "555-9999" } }. The name vanished. Permanently.

This is documented. It's also a trap every MongoDB developer steps in eventually.

I wrote flattenObject. Twenty-one lines. It turned nested objects into dot notation—{ "user.phone": "555-9999" }.

Now $set updated exactly one field.

Then I put it in the shared MongoDbClient. Every Lambda that wrote to the database inherited this protection automatically. No one had to remember the rule, and no one had to lose data again.
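The original twenty-one lines are not shown, so the following is only a sketch of the same idea. How arrays, dates, and empty objects should be treated is an assumption; here they are written as whole values rather than flattened.

```javascript
// Sketch of the idea, not the original implementation.
function flattenObject(value, prefix = "", out = {}) {
  for (const [key, child] of Object.entries(value)) {
    const path = prefix ? `${prefix}.${key}` : key;
    const isPlainObject =
      child !== null &&
      typeof child === "object" &&
      !Array.isArray(child) &&
      Object.getPrototypeOf(child) === Object.prototype;
    if (isPlainObject && Object.keys(child).length > 0) {
      flattenObject(child, path, out); // descend and extend the dotted path
    } else {
      // Arrays, dates, nulls, and empty objects are written as whole values.
      out[path] = child;
    }
  }
  return out;
}

const update = { $set: flattenObject({ user: { phone: "555-9999" } }) };
// update is { $set: { "user.phone": "555-9999" } }
```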


Cold Starts That Stopped Mattering

Creating AWS clients is expensive. TLS handshakes. Credential resolution. Connection pools. Three hundred milliseconds here, eight hundred there.

On a cold start, that time adds up.

So we initialized clients at module scope. If that failed, we lazily initialized inside the handler on the first request. That first request is slightly slower. It doesn't crash. Every request after that is fast.
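The eager-then-lazy pattern can be sketched as a small wrapper. `eagerWithLazyFallback` and `createDbClient` are hypothetical names; the factory stands for whatever constructs the real client.

```javascript
// Tries to build the client eagerly at module load. If that fails, the first
// request builds it instead, so a bad cold start slows one request rather than crashing.
function eagerWithLazyFallback(createClient) {
  let client = null;
  try {
    client = createClient(); // module scope: runs once per cold start
  } catch (error) {
    console.warn(`Eager init failed, will retry on first request: ${error.message}`);
  }
  return function getClient() {
    if (client === null) client = createClient(); // lazy path, first request only
    return client;
  };
}

// Usage, with `createDbClient` as a placeholder factory:
//   const getDb = eagerWithLazyFallback(createDbClient);  // at module scope
//   exports.handler = async (event) => { const db = getDb(); /* use db */ };
```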

We reduced p95 latency by ~800ms by removing an entire class of latency.


What Nineteen+ Lambdas Look Like Now

We have nineteen. The pattern held.

A new developer joined and added an action group. He wrote his business logic in data structures already being passed to the framework, and shipped in a few hours.

He didn’t have to ask about:

  • logging
  • retries
  • error handling
  • CORS
  • request IDs

His code worked on the first deploy because of the framework.

Most teams optimize for flexibility early. They pay for it later: in inconsistency, bugs, and cognitive load.

I did the opposite. I constrained everything that didn’t need variation. When behavior is standardized, correctness compounds.

Conclusion

Two weeks to build the framework.
Thousands of hours saved since.
Near-zero data-loss events from third-party clients.
Entire classes of bugs eliminated.
New engineers productive on day one.
Velocity 100x.

I build systems where the right behavior is the default, and everything else is secondary.

P.S. Across the system’s lifetime, this pattern has saved an estimated ~4,000 developer hours—and continues to compound by eliminating repeated fixes, debugging effort, and inconsistent behavior across services.

The Invisible Integration Architecture

· 3 min read

When most developers see tickets on an agile board, they see tasks. I see leverage, often starting from point -1.

Our company had dozens of digital properties: membership portals, insurance dashboards, travel booking engines, branch locators. Each operated as its own kingdom: different stacks, teams, release cycles, and definitions of “urgent.”

Every new capability required negotiating entry into each one. This is the kind of friction that masquerades as process in high-bureaucracy environments: discovery calls, integration debates, version mismatches, bandwidth constraints.

A simple feature, like a universal point of ingestion to our generative AI roadside assistant, turned into months of coordination. One property would launch. The rest would wait.

I realized we were building a product that required adoption.

The problem was territorial.

The Architecture of a Parasite

I built the platform to behave like a parasite. Without asking for permission or negotiating entry, it attaches itself where traffic already exists and begins operating within the host, without requiring the host to change.

Technically, this meant a single script that:

  • initializes its own DOM root if none exists
  • detects the environment via domain/runtime signals
  • routes to the correct backend without configuration
  • operates independently of the host’s stack or release cycle
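A sketch of the first three behaviors in the list above. The domains, API URLs, and the names `resolveBackend` and `ensureRoot` are all hypothetical; the post gives none of the real values.

```javascript
// Hypothetical mapping from host domains to backends.
const BACKENDS = [
  { suffix: "insurance.example.com", api: "https://api.example.com/insurance" },
  { suffix: "travel.example.com", api: "https://api.example.com/travel" },
];
const DEFAULT_API = "https://api.example.com/default";

// Picks a backend from the page's hostname alone, so the host supplies no configuration.
function resolveBackend(hostname) {
  const host = hostname.toLowerCase();
  const match = BACKENDS.find(({ suffix }) => host === suffix || host.endsWith(`.${suffix}`));
  return match ? match.api : DEFAULT_API;
}

// Creates the widget's own root element if the host page has none.
function ensureRoot(doc, id = "assistant-root") {
  let root = doc.getElementById(id);
  if (!root) {
    root = doc.createElement("div");
    root.id = id;
    doc.body.appendChild(root);
  }
  return root;
}

// In a browser: ensureRoot(document); resolveBackend(window.location.hostname);
```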

Handling session context across properties without direct integration required reconstructing user state from fragmented signals—cookies, local storage, URL parameters—and resolving them via backend identity mapping, all without relying on host-provided authentication hooks.

The Trade-offs That Made It Viable

This model prioritizes survival and sacrifices flexibility.

  • No per-team customization surface to slow decisions.
  • Limited visibility into internals (UI, state, and logic were self-contained).
  • Reduced control for host teams (environment inference via domain-based routing).

It also required strong guardrails to avoid interfering with host behavior.

But in early-stage systems—especially in uncertain domains like GenAI—distribution is the primary constraint. Removing friction mattered more than enabling control.

This reduced integration from a multi-week process to a ~30-second decision.

The Outcome Economics

Over the next six months, the assistant appeared on most properties.

There were no integration meetings, no onboarding docs, and almost no support requests. Teams adopted it independently because the cost to try was negligible and rollback was trivial.

"Add this script tag."
"That's it?"
"That's it."

What normally takes weeks of coordination became a same-day decision. Adoption worked because the value was immediately visible to end-users.

The Hidden Complexity

This simplicity was deliberate. Descoping became the core design tool. Every configuration option, like theming or customization, was a potential bottleneck disguised as flexibility. So it was removed.

All updates shipped centrally and became instantly available everywhere, without requiring any action from host teams.

The system knew how to find what it needed. The host didn’t need to know anything.

The Lesson

A platform requires upfront investment but compounds in value through adoption. Its scalability is determined by how easily it propagates.

Most platforms behave like guests: waiting to be invited, configured, and maintained. They fail when they require effort to adopt.

This one behaved differently. It attached itself, adapted quietly, and delivered value unobtrusively.

Closing Thought

If you want something to spread across an organization, don’t optimize for configurability. Optimize for inevitability.

The most effective systems don’t demand adoption. They make it unavoidable.

That’s the kind of leverage I aim for: build once, and remove the need for others to build at all.

Why Most RAG Pipelines Fail in Production (and the Fix)

· 2 min read

Over the past few months, I had the opportunity to refine a RAG ingestion pipeline that governs how our AI assistant’s knowledge base evolves. It involves syncing content from S3 into OpenSearch, updating embeddings, and promoting new knowledge safely into production.

The core challenge wasn’t ingestion itself but retrieval consistency: keeping answers from drifting apart across queries when some of them were grounded in stale context. In early iterations, the system would occasionally serve a mix of stale and fresh data, leading to subtle hallucinations where responses looked correct but contradicted each other across queries. This made failures hard to detect and even harder to debug.

My initial approach used a cron-triggered Lambda. It broke under real-world conditions: concurrent runs collided, partial failures corrupted the index, and transient errors required manual retries. A teammate spent 2–3 hours weekly recovering failed syncs and auditing responses. More importantly, traditional backend reliability metrics (job success/failure) didn’t capture the real problem: semantic correctness.

I redesigned the workflow as an AWS Step Functions state machine that enforces mutual exclusion, queues overlapping runs, and performs atomic index promotion. Instead of incrementally updating the live index, the system builds a complete new snapshot and only switches the agent once the entire pipeline (ingestion, chunking, embedding) succeeds. This trades off some freshness latency and compute cost for strong consistency, which proved to be the more critical lever for correctness.
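The build-then-promote idea can be illustrated without Step Functions or OpenSearch. This in-memory sketch is not the production state machine: `index`, `runExclusively`, and `buildAndPromote` are hypothetical names, the promise chain stands in for the mutual exclusion and queueing, and the `live` pointer stands in for whatever the agent reads from (in OpenSearch, an index alias could play that role).

```javascript
// In-memory stand-in for the search cluster: named snapshots plus one "live" pointer.
const index = { snapshots: new Map(), live: null };

// Serializes runs: an overlapping run waits for the previous one to settle.
let tail = Promise.resolve();
function runExclusively(task) {
  const result = tail.then(task);
  tail = result.catch(() => {}); // a failed run must not block the next one
  return result;
}

// chunk(doc) returns an array of strings; embed(text) returns a promise of a vector.
function buildAndPromote(name, documents, { chunk, embed }) {
  return runExclusively(async () => {
    const entries = [];
    for (const doc of documents) {
      for (const piece of chunk(doc)) {
        entries.push({ text: piece, vector: await embed(piece) });
      }
    }
    // Reached only if every step above succeeded.
    index.snapshots.set(name, entries);
    index.live = name; // single pointer swap: readers see the old or the new snapshot, never a mix
    return name;
  });
}
```

If any chunking or embedding step throws, the function rejects before touching `index`, so the previously promoted snapshot keeps serving queries.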

To measure impact, we tracked stale-answer incidents using a mix of regression queries and manual audits, rather than relying solely on pipeline health metrics. This reduced inconsistent-answer cases by ~93% and eliminated routine manual intervention, allowing the team to shift focus from operational recovery to improving model quality.

If I were to rebuild it today, I’d extend the system with AI-native safeguards: an automated evaluation harness to detect semantic drift before promotion, versioned rollbacks for instant recovery, and embedding quality monitoring to catch degradation over time.

The key insight from this work is that in production RAG systems, data consistency often matters more than data freshness, and the ingestion pipeline itself becomes part of the model’s behavior surface, so it needs to be designed and evaluated with the same rigor as the model.