1. Incident Query ➡ Type in a “Car Issue Query” describing the problem you are having with your car. Use anything relevant: noises, temperature, behavior, etc.
2. Run ML Diagnosis ➡ Click “Run ML Diagnosis” and the RAG Model Search Platform will “vectorize” your query and compare it against our table of known car problems below, retrieving the most similar problems and their associated causes and fixes (using cosine similarity between your query vector and our database of problem-description vectors).
3. Add New Car Issue Documentation ➡ Add a new problem to our database of known Problems/Causes/Fixes! Once a unique car issue is added, our “RAG Model” can reference it when searching for matches to new queries.
4. Historical Incident Logs ➡ The “Historical Incident Logs” table below automatically updates and stores your new Problem/Cause/Fix entry so it can be referenced by future queries.

Capital One - RAG Model Search Platform Rapid Prototype Planning & Development

Presentation - 2/11/2026

The Problem

  • The Problem → At Capital One we have no way to quickly access our huge database of “Incident and Resolution Details”, which is needed most during time-sensitive job and infrastructure failures
  • C1 has a database of millions of “Historical Incidents” that contain extremely useful information very specific to our tech stack
  • All of this information is “in-house”, i.e. not available online or anywhere else
  • All incident information is stored on the “ServiceNow” platform, which means:
    • A terrible UI that only supports basic table filtering (no complex querying, matches have to be exact, no natural-language questions)
    • Extremely slow filtering to find information
    • 3rd-party software we cannot update ourselves
Demo webapp query and response UI
Prototype UI: natural-language query → diagnosis & resolution response.

The Proposed Solution

The Proposed Solution → Develop a RAG Search Platform that allows complex natural-language querying and response, delivers much faster search times, and provides drastically more accurate search results.

What is a RAG Model?

Technical Definition: A RAG (Retrieval-Augmented Generation) model is an AI framework that optimizes the output of a Large Language Model (LLM) by allowing it to reference useful contextual knowledge before generating a response.

The three steps, and their specific use case at C1:

Retrieval: When you ask a question, the system searches an external database to find specific, relevant snippets of information. At C1:
  • The user submits a question in natural-language text to our platform
  • The question text is “vectorized” into a vector of weights, giving a unique numerical representation of the question
  • The question vector is passed to our vector database of historical incidents, and we obtain the closest-matching incidents and their respective root-cause and resolution details

Augmentation: The system “augments” your original prompt by attaching the retrieved facts to it, giving the AI “open-book” notes to read. At C1: the raw text of the user question and the retrieved historical information is passed to the Gemini LLM via API (along with a standard prompt text).

Generation: The LLM processes both your question and the added context to generate a final answer that is factually grounded in the provided data. At C1: the Gemini LLM can use this high-value, niche “in-house” technical information to return a much more accurate and faster Diagnosis and Solution Response to the technical problem!
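The three stages can be sketched end to end. This is a minimal, self-contained toy: the bag-of-words embedding stands in for the real sentence transformer, the two-entry incident list stands in for the vector database, and the final Gemini API call is left as a comment.

```python
import numpy as np

# Toy "vector DB" of historical incidents (stand-in for the real database)
incidents = [
    "Engine stutters and loses power when accelerating",
    "Door latch cable gets frozen stuck in cold weather",
]

# Toy bag-of-words embedding over the incident vocabulary
# (stand-in for the sentence transformer used by the platform).
VOCAB = sorted({w for t in incidents for w in t.lower().split()})

def embed(text: str) -> np.ndarray:
    words = text.lower().split()
    vec = np.array([float(words.count(w)) for w in VOCAB])
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

db = np.stack([embed(t) for t in incidents])

# Retrieval: rows are unit length, so a plain dot product
# against the query vector is the cosine similarity
def retrieve(question: str) -> str:
    return incidents[int(np.argmax(db @ embed(question)))]

# Augmentation: attach the retrieved incident to the user's question
def build_prompt(question: str) -> str:
    return (f"Context:\n{retrieve(question)}\n\n"
            f"Question: {question}\nAnswer using only the context above.")

# Generation: `prompt` would then be sent to the Gemini API
prompt = build_prompt("Car loses power while accelerating")
```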

Core Benefits

  • Improved Response Accuracy: By forcing the AI to stick to the provided relevant and verified information, it is less likely to "make things up".
  • Up-to-Date Information: You can update the AI's knowledge instantly by simply adding new documents to the database, rather than going through the long process of retraining the entire model.
  • Cost-Effective: It is significantly cheaper and faster to implement a RAG pipeline than to fine-tune or retrain a custom model for a specific business use case.

Why Particularly Useful at C1?

  • As a bank and private company we have strict regulations on keeping information “in house”
  • We have access to Gemini LLM but it cannot access or train on any internal documentation
  • We have lots of useful technical information that we want to share across internal organizations but not outside of Capital One (i.e. we can’t post it online; we have to build our own internal search platform)

First Steps: Make a Plan and Assign Team Roles

Challenges:
  • Very limited time → We need to make a prototype and get approval in Q4 ASAP if we want to prioritize this as a key "goal" project in 2026
  • Limited resources → We can't use "heavy" AWS resources until we get VP approval to prioritize the project
  • Solution → Just like with data processing, it's much faster if we coordinate and distribute the workload!
  • Team Member Assigned Tasks:
    1. Manager → Leveraged his experience at AWS to:
       • Determine the general architecture and infrastructure requirements
       • Plan for long-term scalability
    2. Business Lead → Leveraged his finance background to:
       • Provide projected cost estimates based on an assessment of architecture requirements at scale
       • Complete all access requests needed to obtain as much Historical Incident Data as possible in a short time frame
    3. Cyber Security Specialist → Leveraged his cybersecurity skillset to:
       • Implement SSO authentication to ensure only authenticated Capital One employees have access to the platform
    4. Myself (Lead Engineer for this project) → Built the prototype, as detailed below

    Develop a Prototype Model to Demonstrate Value to our Organization

    • A bare-bones (but functional) model was needed ASAP to prove the RAG model's efficacy and utility
    • We needed to present this working model at the VP level (along with a scalability outline, cost estimates, etc.)
    • ⟶ Without VP approval, no funding or dedicated time to develop a Production-Level Platform will be granted
    • My background in website design/deployment and NLP research in grad school made me a great fit to build a working bare-bones WebApp Search Platform to demonstrate to upper management that this is a project worth dedicating time to.

    My Roadmap: Functional Prototype on Short Deadline

    Step 1: Outline RAG Model and Architecture for a Functional end-to-end Platform

    The model needed to work end-to-end to properly demonstrate the functionality and efficacy of a RAG model.

    Outlining a simple end-to-end architecture and workflow diagram was critical to ending up with a final result that would be functional (see diagram to the right).

    To reduce prototype development time and cost, I looked for opportunities to simplify architecture components that were not needed at small-scale:

    • Run a “light-weight” sentence-transformer model on a low-cost t2.medium EC2 instance
    • Perform simple cosine similarity calculations locally on the EC2 instance
    • Store vectors in Amazon S3 Vectors (C1 is completely migrated to the AWS Cloud, so no approval was needed for storing incident info)
    • Use smaller (but high-quality) datasets that were easier to manage but would still produce reliable results
    RAG architecture workflow diagram
    End-to-end workflow: Web UI → vector search → Gemini prompt augmentation → response.

    My Roadmap: Data, Vectors, Similarity (Steps 2–4)

    Step 2: Generate Data

    • I used PySpark to handle the initial large-scale raw historical incident data
    • Deployed the PySpark job to AWS Elastic Container Service to further improve data-processing efficiency
    • PySpark was able to quickly read millions of incident entries and remove any that were “less than perfect” (e.g. the problem description was too short, not enough detail on how it was resolved, etc.)
    • Removed any unnecessary columns to reduce the complexity of the final dataset for testing
    • Augmented the data by finding entries with matching “problem descriptions” but differing “resolution details” (merged these into one entry and kept the resolution details from both)
    • The final result was a “Golden Dataset” that contained only useful, distinct, high-quality information and was small enough to easily manipulate and run RAG model tests quickly
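The filter-and-merge rules behind the “Golden Dataset” can be sketched in plain Python. The real job applied the equivalent logic with PySpark on ECS; field names and length thresholds here are illustrative assumptions.

```python
# Illustrative cleanup rules; thresholds and field names are assumptions.
MIN_DESC_LEN = 40  # drop incidents whose problem description is too short
MIN_FIX_LEN = 20   # drop incidents with too little resolution detail

def build_golden_dataset(incidents):
    merged = {}
    for inc in incidents:
        desc = inc["problem"].strip()
        fix = inc["resolution"].strip()
        # Remove "less than perfect" entries
        if len(desc) < MIN_DESC_LEN or len(fix) < MIN_FIX_LEN:
            continue
        key = desc.lower()
        if key in merged:
            # Same problem, different resolution: merge into one
            # entry and keep the resolution details from both
            if fix not in merged[key]["resolution"]:
                merged[key]["resolution"] += " | " + fix
        else:
            merged[key] = {"problem": desc, "resolution": fix}
    return list(merged.values())
```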

    Step 3: Vectorize Data

    • I selected “all-MiniLM-L6-v2” as my sentence transformer because it is designed specifically for resource-constrained environments
    • It also requires only the CPU-only version of PyTorch, which is roughly 10x smaller
    • This allowed me to vectorize all user questions locally on EC2 in milliseconds

    Step 4: Calculating Similarity

    • The user question submitted via the frontend UI is vectorized by the sentence transformer
    • I then opted for a simple cosine similarity check against our vector DB, storing the indices of the top 6 matching historical incidents
    • Some data cleanup was required to improve accuracy (e.g. keeping the text of all entries at approximately similar lengths, adding padding when necessary, etc.)
    • The number of embeddings stored per entry was also tweaked to find the optimal balance of accuracy and performance time
    • Despite its small size, the “all-MiniLM-L6-v2” sentence transformer consistently generated accurate results: submitted question text reliably returned historical incidents with similar key words and descriptions
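The top-6 retrieval step is essentially a dot product plus an argsort. A minimal NumPy sketch, using random unit vectors in place of the real stored embeddings:

```python
import numpy as np

# Toy stand-in for the vector DB: 100 incidents, 384-dim, L2-normalized rows
rng = np.random.default_rng(0)
vector_db = rng.normal(size=(100, 384))
vector_db /= np.linalg.norm(vector_db, axis=1, keepdims=True)

def top_k_matches(query_vec, k=6):
    """Indices of the k most similar incidents by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = vector_db @ q                # cosine similarity per incident
    return np.argsort(scores)[::-1][:k]   # highest similarity first

query = rng.normal(size=384)
top_idx = top_k_matches(query)
```

The returned indices are then used to look up the matching incidents' root-cause and resolution text.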
    Cosine similarity code snippet
    Cosine similarity scoring (lightweight retrieval).
    Embedding vectors array example
    Stored embeddings (vector database contents).
    Historical incidents table example
    Example incident table: problem → cause → resolution.

    My Roadmap to Achieve Functional Prototype on Short Deadline (Contd.)

    Step 5: Tying our Model to a WebApp Platform

    For the sake of development time, I wrote the frontend in HTML for static elements and JavaScript for dynamic functionality (e.g. updating tables, calling backend functions, etc.).

    The backend was written in Python, with Flask as the web framework bridging the frontend and backend.

    To keep the platform fast and user friendly, I used AJAX requests to handle shorter tasks while longer tasks were still completing.

    AJAX request JavaScript code
    Frontend JavaScript: AJAX calls to backend endpoints and UI updates.

    AJAX for Async Processing

    For example: an AJAX request is first made to call the function that submits our prompt to the Gemini API (it usually takes several seconds to get a response). Because AJAX allows asynchronous processing, we call functions to update our results table while waiting for Gemini to return a response.
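On the backend side, those AJAX calls land on Flask endpoints. A minimal sketch of such an endpoint follows; the `/diagnose` route name and the canned match data are illustrative assumptions, not the platform's actual API.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/diagnose", methods=["POST"])
def diagnose():
    # The frontend's AJAX call POSTs the user's question as JSON
    question = request.get_json(force=True).get("query", "")
    # Stand-in for the real pipeline: vectorize the question,
    # run the cosine-similarity search, then prompt Gemini.
    matches = [{"problem": "Engine stutters when accelerating",
                "fix": "Replace the fuel filter and spark plugs"}]
    return jsonify({"query": question, "matches": matches})

# app.run(port=5000) would start the development server
```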

    My Roadmap to Achieve Functional Prototype on Short Deadline (Contd.)

    

    Demo UI screenshot: “BTIR Machine Learning Analysis / Incident Diagnostic & Root Cause Extraction Engine” (status line: “System ready. Awaiting input for analysis...”).

    Historical Incident Logs

    Match % (Cosine Similarity) | Problem Description | Root Cause | Fix / Resolution
    n/a | My Old Honda Accord Won't Start! | It's an old car! | Jiggle the keys and then turn!
    n/a | F150 has door that won't open in freezing cold | Door latch cable gets frozen stuck | Bring to dealership and a technician will install the improved cable design free of charge!
    n/a | Engine stutters and loses power when accelerating | Clogged fuel filter or failing spark plugs | Replace the fuel filter and install new spark plugs