1. Incident Query ➡ Type in a “Car Issue Query” describing the problem you are having with your car. Use anything relevant: noises, temperature, behavior, etc.
2. Run ML Diagnosis ➡ Click “Run ML Diagnosis” and the RAG Model Search Platform will “vectorize” your query and compare it against our table of known car problems below, retrieving the most similar problems and their associated causes and fixes (using cosine similarity between your query vector and our database of problem-description vectors).
3. Add New Car Issue Documentation ➡ Add a new problem to our database of known Problems/Causes/Fixes! Once a unique car issue is added, our “RAG Model” can reference it when searching for matches to new queries.
4. Historical Incident Logs ➡ The “Historical Incident Logs” table below automatically updates and stores your new Problem/Cause/Fix entry so it can be referenced by future queries.

Capital One - RAG Model Search Platform Rapid Prototype Planning & Development

Presentation - 2/11/2026

The Problem

  • The Problem → At Capital One we have no way to quickly access our huge database of “Incident and Resolution Details”, which is needed most during time-sensitive job and infrastructure failures
  • C1 has a database of millions of “Historical Incidents” that contain extremely useful information very specific to our tech stack
  • All of this information is “in-house”, i.e. not available online or anywhere else
  • All incident information is stored on the “ServiceNow” platform, which means:
    • A terrible UI that only supports basic table filtering (no complex querying, matches have to be exact, no natural-language questions)
    • Extremely slow filtering to find information
    • 3rd-party software we cannot update ourselves
Demo webapp query and response UI
Prototype UI: natural-language query → diagnosis & resolution response.

The Proposed Solution

The Proposed Solution → Develop a RAG Search Platform that allows complex natural-language querying and response, delivers much faster search times, and provides drastically more accurate search results.

What is a RAG Model?

Technical Definition: A RAG (Retrieval-Augmented Generation) model is an AI framework that optimizes the output of a Large Language Model (LLM) by allowing it to reference useful contextual knowledge before generating a response.

The three steps, and their specific use case at C1:

Retrieval: When you ask a question, the system searches an external database to find specific, relevant snippets of information. At C1:
  • The user submits a question in natural-language text to our platform
  • The question text is “vectorized” into a vector of weights, giving a unique numerical representation of the question
  • The question vector is passed to our vector database of historical incidents, and we obtain the closest-matching incidents and their respective root-cause and resolution details

Augmentation: The system “augments” your original prompt by attaching the retrieved facts to it, giving the AI “open-book” notes to read. At C1: the raw text of the user question and the retrieved historical information is passed to the Gemini LLM via API (along with a standard prompt text).

Generation: The LLM processes both your question and the added context to generate a final answer that is factually grounded in the provided data. At C1: the Gemini LLM can use this high-value, niche “in-house” technical information to return a much more accurate and faster Diagnosis and Solution Response to the technical problem!
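The three stages can be sketched end to end. This is a minimal, self-contained toy: the bag-of-words embedding stands in for the real sentence transformer, the two-entry incident list stands in for the vector database, and the final Gemini API call is left as a comment.

```python
import numpy as np

# Toy "vector DB" of historical incidents (stand-in for the real database)
incidents = [
    "Engine stutters and loses power when accelerating",
    "Door latch cable gets frozen stuck in cold weather",
]

# Toy bag-of-words embedding over the incident vocabulary
# (stand-in for the sentence transformer used by the platform).
VOCAB = sorted({w for t in incidents for w in t.lower().split()})

def embed(text: str) -> np.ndarray:
    words = text.lower().split()
    vec = np.array([float(words.count(w)) for w in VOCAB])
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

db = np.stack([embed(t) for t in incidents])

# Retrieval: rows are unit length, so a plain dot product
# against the query vector is the cosine similarity
def retrieve(question: str) -> str:
    return incidents[int(np.argmax(db @ embed(question)))]

# Augmentation: attach the retrieved incident to the user's question
def build_prompt(question: str) -> str:
    return (f"Context:\n{retrieve(question)}\n\n"
            f"Question: {question}\nAnswer using only the context above.")

# Generation: `prompt` would then be sent to the Gemini API
prompt = build_prompt("Car loses power while accelerating")
```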

Core Benefits

  • Improved Response Accuracy: By forcing the AI to stick to the provided relevant and verified information, it is less likely to "make things up".
  • Up-to-Date Information: You can update the AI's knowledge instantly by simply adding new documents to the database, rather than going through the long process of retraining the entire model.
  • Cost-Effective: It is significantly cheaper and faster to implement a RAG pipeline than to fine-tune or retrain a custom model for a specific business use case.

Why Particularly Useful at C1?

  • As a bank and private company we have strict regulations on keeping information “in house”
  • We have access to Gemini LLM but it cannot access or train on any internal documentation
  • We have lots of useful technical information that we want to share across internal organizations but not outside of Capital One (i.e. we can’t post it online; we have to build our own internal search platform)

First Steps: Make a Plan and Assign Team Roles

Challenges:
  • Very limited time → We need to make a prototype and get approval in Q4 ASAP if we want to prioritize this as a key "goal" project in 2026
  • Limited resources → We can't use "heavy" AWS resources until we get VP approval to prioritize the project
  • Solution → Just like with data processing, it's much faster if we coordinate and distribute the workload!
  • Team Member Assigned Tasks:
    1. Manager → Leveraged his experience at AWS to:
       • Determine the general architecture and infrastructure requirements
       • Plan for long-term scalability
    2. Business Lead → Leveraged his finance background to:
       • Provide projected cost estimates based on an assessment of architecture requirements at scale
       • Complete all access requests needed to obtain as much Historical Incident Data as possible in a short time frame
    3. Cyber Security Specialist → Leveraged his cybersecurity skillset to:
       • Implement SSO authentication to ensure only authenticated Capital One employees have access to the platform
    4. Myself (Lead Engineer for this project) → Built the prototype, as detailed below

    Develop a Prototype Model to Demonstrate Value to our Organization

    • A bare-bones (but functional) model was needed ASAP to prove the RAG model's efficacy and utility
    • We needed to present this working model at the VP level (along with a scalability outline, cost estimates, etc.)
    • ⟶ Without VP approval, no funding or dedicated time to develop a Production-Level Platform will be granted
    • My background in website design/deployment and NLP research in grad school made me a great fit to build a working bare-bones WebApp Search Platform to demonstrate to upper management that this is a project worth dedicating time to.

    My Roadmap: Functional Prototype on Short Deadline

    Step 1: Outline RAG Model and Architecture for a Functional end-to-end Platform

    The model needed to work end-to-end to properly demonstrate the functionality and efficacy of a RAG model.

    Outlining a simple end-to-end architecture and workflow diagram was critical to ending up with a final result that would be functional (see diagram to the right).

    To reduce prototype development time and cost, I looked for opportunities to simplify architecture components that were not needed at small-scale:

    • Run a “light-weight” sentence-transformer model on a low-cost t2.medium EC2 instance
    • Perform simple cosine similarity calculations locally on the EC2 instance
    • Store vectors in Amazon S3 Vectors (C1 is completely migrated to the AWS Cloud, so no approval was needed for storing incident info)
    • Use smaller (but high-quality) datasets that were easier to manage but would still produce reliable results
    RAG architecture workflow diagram
    End-to-end workflow: Web UI → vector search → Gemini prompt augmentation → response.

    My Roadmap: Data, Vectors, Similarity (Steps 2–4)

    Step 2: Generate Data

    • I used PySpark to handle the initial large-scale raw historical incident data
    • Deployed the PySpark job to AWS Elastic Container Service to further improve data-processing efficiency
    • PySpark was able to quickly read millions of incident entries and remove any that were “less than perfect” (e.g. the problem description was too short, not enough detail on how it was resolved, etc.)
    • Removed any unnecessary columns to reduce the complexity of the final dataset for testing
    • Augmented the data by finding entries with matching “problem descriptions” but differing “resolution details” (merged these into one entry and kept the resolution details from both)
    • The final result was a “Golden Dataset” that contained only useful, distinct, high-quality information and was small enough to easily manipulate and run RAG model tests quickly
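The filter-and-merge rules behind the “Golden Dataset” can be sketched in plain Python. The real job applied the equivalent logic with PySpark on ECS; field names and length thresholds here are illustrative assumptions.

```python
# Illustrative cleanup rules; thresholds and field names are assumptions.
MIN_DESC_LEN = 40  # drop incidents whose problem description is too short
MIN_FIX_LEN = 20   # drop incidents with too little resolution detail

def build_golden_dataset(incidents):
    merged = {}
    for inc in incidents:
        desc = inc["problem"].strip()
        fix = inc["resolution"].strip()
        # Remove "less than perfect" entries
        if len(desc) < MIN_DESC_LEN or len(fix) < MIN_FIX_LEN:
            continue
        key = desc.lower()
        if key in merged:
            # Same problem, different resolution: merge into one
            # entry and keep the resolution details from both
            if fix not in merged[key]["resolution"]:
                merged[key]["resolution"] += " | " + fix
        else:
            merged[key] = {"problem": desc, "resolution": fix}
    return list(merged.values())
```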

    Step 3: Vectorize Data

    • I selected “all-MiniLM-L6-v2” as my sentence transformer because it is designed specifically for resource-constrained environments
    • It also requires only the CPU-only version of PyTorch, which is roughly 10x smaller
    • This allowed me to vectorize all user questions locally on EC2 in milliseconds

    Step 4: Calculating Similarity

    • The user question submitted via the frontend UI is vectorized by the sentence transformer
    • I then opted for a simple cosine similarity check against our vector DB, storing the indices of the top 6 matching historical incidents
    • Some data cleanup was required to improve accuracy (e.g. keeping the text of all entries at approximately similar lengths, adding padding when necessary, etc.)
    • The number of embeddings stored per entry was also tweaked to find the optimal balance of accuracy and performance time
    • Despite its small size, the “all-MiniLM-L6-v2” sentence transformer consistently generated accurate results: submitted question text reliably returned historical incidents with similar key words and descriptions
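The top-6 retrieval step is essentially a dot product plus an argsort. A minimal NumPy sketch, using random unit vectors in place of the real stored embeddings:

```python
import numpy as np

# Toy stand-in for the vector DB: 100 incidents, 384-dim, L2-normalized rows
rng = np.random.default_rng(0)
vector_db = rng.normal(size=(100, 384))
vector_db /= np.linalg.norm(vector_db, axis=1, keepdims=True)

def top_k_matches(query_vec, k=6):
    """Indices of the k most similar incidents by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = vector_db @ q                # cosine similarity per incident
    return np.argsort(scores)[::-1][:k]   # highest similarity first

query = rng.normal(size=384)
top_idx = top_k_matches(query)
```

The returned indices are then used to look up the matching incidents' root-cause and resolution text.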
    Cosine similarity code snippet
    Cosine similarity scoring (lightweight retrieval).
    Embedding vectors array example
    Stored embeddings (vector database contents).
    Historical incidents table example
    Example incident table: problem → cause → resolution.

    My Roadmap to Achieve Functional Prototype on Short Deadline (Contd.)

    Step 5: Tying our Model to a WebApp Platform

    For the sake of development time, I wrote the frontend in HTML for static elements and JavaScript for dynamic functionality (e.g. updating tables, calling backend functions, etc.).

    The backend was written in Python, with Flask as the web framework bridging the frontend and backend.

    To keep the platform fast and user friendly, I used AJAX requests to handle shorter tasks while longer tasks were still completing.

    AJAX request JavaScript code
    Frontend JavaScript: AJAX calls to backend endpoints and UI updates.

    AJAX for Async Processing

    For example: an AJAX request is first made to call the function that submits our prompt to the Gemini API (it usually takes several seconds to get a response). Because AJAX allows asynchronous processing, we call functions to update our results table while waiting for Gemini to return a response.
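On the backend side, those AJAX calls land on Flask endpoints. A minimal sketch of such an endpoint follows; the `/diagnose` route name and the canned match data are illustrative assumptions, not the platform's actual API.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/diagnose", methods=["POST"])
def diagnose():
    # The frontend's AJAX call POSTs the user's question as JSON
    question = request.get_json(force=True).get("query", "")
    # Stand-in for the real pipeline: vectorize the question,
    # run the cosine-similarity search, then prompt Gemini.
    matches = [{"problem": "Engine stutters when accelerating",
                "fix": "Replace the fuel filter and spark plugs"}]
    return jsonify({"query": question, "matches": matches})

# app.run(port=5000) would start the development server
```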

    My Roadmap to Achieve Functional Prototype on Short Deadline (Contd.)

    

    Demo UI screenshot: “BTIR Machine Learning Analysis / Incident Diagnostic & Root Cause Extraction Engine” (status line: “System ready. Awaiting input for analysis...”).

    Historical Incident Logs

    Match % (Cosine Similarity) | Problem Description | Root Cause | Fix / Resolution
    n/a | My Old Honda Accord Won't Start! | It's an old car! | Jiggle the keys and then turn!
    n/a | F150 has door that won't open in freezing cold | Door latch cable gets frozen stuck | Bring to dealership and a technician will install the improved cable design free of charge!
    n/a | Engine stutters and loses power when accelerating | Clogged fuel filter or failing spark plugs | Replace the fuel filter and install new spark plugs