PromptForge | Architecture Masterclass

Enterprise RAG

Connecting AI to private IP without breaching security.

You do not secure AI by asking the AI to be secure. You secure AI by wrapping the generation engine in rigid, deterministic software architecture.

The Fine-Tuning Fallacy

When developers are tasked with making an LLM "know" their company data, their first instinct is almost universally wrong: "Let's fine-tune an open-source model on our internal documents."

Fine-tuning alters the parametric memory of a model, the actual weights and biases of the neural network. It is designed to teach a model how to behave (e.g., talk like a pirate, output JSON). It is an inefficient and unreliable way to teach a model factual knowledge, and it risks degrading existing capabilities (catastrophic forgetting).

The Parametric Trap

  • The Deletion Problem: If an HR policy changes, you cannot simply `DELETE` a neuron. The outdated fact is baked into the weights.
  • Hallucination: Models interpolate facts. A fine-tuned model might blend the 2022 and 2024 revenue numbers together.
  • No RBAC: A neural network cannot check user roles. If it knows the CEO's salary, it will tell anyone who asks.

The Non-Parametric Solution

  • RAG Architecture: Retrieval-Augmented Generation relies on external memory (a database).
  • CRUD Operations: You can instantly update, delete, or add documents to the Vector DB without touching the AI.
  • Strict Permissions: You filter the database query before the AI ever sees the data.
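The contrast with parametric memory can be made concrete: a vector store is just a database, so retiring an outdated fact is an ordinary delete. A minimal in-memory sketch (the `VectorStore` class here is a toy illustration, not a real library):

```python
class VectorStore:
    """Toy non-parametric memory: documents live in a plain dict."""

    def __init__(self):
        self.docs = {}  # id -> (vector, text)

    def upsert(self, doc_id, vector, text):
        self.docs[doc_id] = (vector, text)

    def delete(self, doc_id):
        # The "deletion problem" disappears: an outdated fact is removed
        # instantly, with no retraining of any weights.
        self.docs.pop(doc_id, None)

store = VectorStore()
store.upsert("hr_policy_v1", [0.1, 0.9], "PTO accrues at 10 days/year.")
store.upsert("hr_policy_v2", [0.2, 0.8], "PTO accrues at 15 days/year.")
store.delete("hr_policy_v1")   # retire the stale policy
print(sorted(store.docs))      # ['hr_policy_v2']
```

A real deployment swaps the dict for a managed Vector DB, but the operational property is identical: CRUD on facts without touching the model.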

Phase I

Boundary Security & APIs

Before data is ingested, you must ensure your LLM endpoint is legally bound not to train on your payloads. If you use standard consumer APIs, your corporate intellectual property becomes part of the global training dataset.

Anti-Pattern: Consumer Tier Endpoint

POST https://api.openai.com/v1/chat/completions
// Risk: Payloads may be retained (commonly ~30 days for abuse monitoring) and, depending on the tier and terms of service, used for model training.

Production Standard: Enterprise Tier Endpoint

POST https://{company-name}.openai.azure.com/openai/deployments/gpt-4/...
// Secure: Reachable only inside your private network (e.g., via a private endpoint), backed by a Zero Data Retention (ZDR) legal agreement.
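As a sketch, the enterprise request differs only in where it is sent and how it authenticates. The resource name, deployment name, and API version below are placeholder assumptions; substitute your own:

```python
import os

# Placeholder values; replace with your actual Azure resource and deployment.
RESOURCE = "contoso"        # -> https://contoso.openai.azure.com
DEPLOYMENT = "gpt-4"
API_VERSION = "2024-02-01"  # assumed version string; check your portal

def azure_chat_url(resource: str, deployment: str, api_version: str) -> str:
    """Build the enterprise-tier endpoint URL (Azure OpenAI convention)."""
    return (
        f"https://{resource}.openai.azure.com/openai/deployments/"
        f"{deployment}/chat/completions?api-version={api_version}"
    )

# Azure OpenAI authenticates with an 'api-key' header read from the
# environment, not the consumer-tier 'Authorization: Bearer' scheme.
headers = {"api-key": os.environ.get("AZURE_OPENAI_API_KEY", "")}

url = azure_chat_url(RESOURCE, DEPLOYMENT, API_VERSION)
```

Keeping the URL construction and key handling in one audited code path makes it hard for a developer to accidentally point a workload at the consumer endpoint.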

Phase II

Semantic Chunking & Overlap

LLMs have finite Context Windows. You cannot pass a 5,000-page operational manual into a prompt. The data must be broken into "chunks." However, naive chunking (e.g., splitting strictly every 500 words) will slice a sentence in half, destroying the semantic meaning.

The Overlap Strategy

To preserve context, professional ETL pipelines implement Semantic Chunking with Overlap: each new chunk begins by repeating the last 50-100 tokens of the previous one. This ensures that a concept bridging two paragraphs is not lost.

Chunk 1 (Tokens 0-500): "...the core requirement for the new API integration is that it utilizes OAuth2. The refresh token expires every 24 hours."
Chunk 2 (Tokens 450-950): "The refresh token expires every 24 hours. Upon expiration, the client must request a new token using the secret key..."
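The windowing above can be implemented with a simple sliding window. A minimal sketch (operating on a pre-tokenized list; a real pipeline would use the embedding model's own tokenizer and split on semantic boundaries):

```python
def chunk_with_overlap(tokens, chunk_size=500, overlap=50):
    """Slide a window over the token list; each chunk repeats the
    last `overlap` tokens of its predecessor."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # final window already covers the tail
    return chunks

tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_with_overlap(tokens, chunk_size=500, overlap=50)
# Windows: tokens 0-500, 450-950, 900-1200
assert chunks[1][:50] == chunks[0][-50:]  # overlap preserved
```

With `chunk_size=500` and `overlap=50` the window advances 450 tokens at a time, reproducing the 0-500 / 450-950 boundaries shown above.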

Phase III

Vector Mathematics & Search

How do we search these chunks? Keyword search (BM25) fails when vocabularies don't match (e.g., user searches "vacation," document says "PTO"). We solve this using Embeddings.

An embedding model (like `text-embedding-3-small`) translates a text chunk into a high-dimensional vector (e.g., an array of 1,536 numbers). These numbers represent the semantic location of that concept in multi-dimensional space.

// Vector representation of the word "King"
[ 0.0124, -0.0553, 0.9921, 0.1124, -0.5512, 0.8841, 0.0012, -0.7712, ... 1528 more dimensions ]

Cosine Similarity

When a user asks a question, the question is also vectorized. The database then scores every document vector against the question vector using Cosine Similarity or a Dot Product. The vectors that are "closest" together in space (highest similarity) are returned as the most relevant chunks.
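Cosine similarity itself is a few lines of arithmetic: the dot product of two vectors divided by the product of their magnitudes. A toy sketch with 3-dimensional vectors and made-up values (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors: "vacation" and "PTO" point the same way; "database" does not.
vacation = [0.9, 0.1, 0.0]
pto      = [0.8, 0.2, 0.1]
database = [0.0, 0.1, 0.9]

print(cosine_similarity(vacation, pto))       # high (close to 1.0)
print(cosine_similarity(vacation, database))  # low  (close to 0.0)
```

This is why the "vacation" vs. "PTO" vocabulary mismatch that defeats BM25 is harmless here: the two phrases land near each other in embedding space regardless of spelling.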

Phase IV

Grounding the Prompt

Once the top chunks are retrieved, they are injected into a rigid System Prompt. The AI is effectively given an open-book test and explicitly commanded not to use its own parametric training data.

PRODUCTION PROMPT
You are an internal corporate assistant.
Your task is to answer the user's question using ONLY the information provided in the <context> tags below. 

# STRICT CONSTRAINTS:
1. If the answer cannot be deduced from the context, you must output EXACTLY: "I do not have enough context to answer this." Do NOT guess.
2. You must cite your source using the [DocID] attached to the context chunk.

<context>
{injected_vector_chunks_go_here}
</context>

User Query: {user_input}
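Assembling that prompt is plain string formatting. A sketch, assuming the retrieved chunks arrive as (doc_id, text) pairs from the vector search:

```python
SYSTEM_TEMPLATE = """You are an internal corporate assistant.
Your task is to answer the user's question using ONLY the information provided in the <context> tags below.

# STRICT CONSTRAINTS:
1. If the answer cannot be deduced from the context, you must output EXACTLY: "I do not have enough context to answer this." Do NOT guess.
2. You must cite your source using the [DocID] attached to the context chunk.

<context>
{context}
</context>

User Query: {query}"""

def build_grounded_prompt(chunks, user_input):
    """chunks: list of (doc_id, text) pairs; each chunk is prefixed
    with its [DocID] so the model can cite it."""
    context = "\n\n".join(f"[{doc_id}] {text}" for doc_id, text in chunks)
    return SYSTEM_TEMPLATE.format(context=context, query=user_input)

prompt = build_grounded_prompt(
    [("HR-042", "PTO accrues at 15 days per year.")],
    "How much vacation do I get?",
)
```

Keeping the template as a single audited constant, with only the context and query interpolated, prevents per-feature prompt drift.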

Phase V

Role-Based Access Control (RBAC)

If you stop at Phase IV, you have built a massive security vulnerability. The AI can summarize any document in the Vector DB. To prevent a junior developer from querying the CFO's strategic layoff plans, you must implement Metadata Filtering.

When creating embeddings during Phase II, you must attach metadata (JSON objects) to every chunk. Before the Vector DB performs the Cosine Similarity search, it executes a hard SQL-like pre-filter.

1. Database Storage

{
  "id": "chunk_992",
  "vector": [0.12, -0.4...],
  "metadata": {
    "doc_type": "payroll",
    "clearance_level": 5,
    "department": "finance"
  }
}

2. Secure Query Execution

# JWT token confirms the user is clearance Level 2

index.query(
    vector=user_query_embedding,
    top_k=5,
    filter={
        "clearance_level": {"$lte": 2}
    },
)

Because the metadata filter is applied at the database level, the CFO's document is never mathematically evaluated and is completely invisible to the LLM when the junior developer asks a question.
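Critically, the filter must be derived server-side from the verified token, never from anything the client sends in the request body. A sketch, assuming the JWT has already been signature-verified and decoded into a claims dict (the claim names here are illustrative assumptions):

```python
def build_rbac_filter(claims: dict) -> dict:
    """Translate verified JWT claims into a vector-DB metadata filter.
    The end user never sees or supplies this filter."""
    return {
        "clearance_level": {"$lte": claims["clearance_level"]},
        "department": {"$in": claims["departments"]},
    }

# Claims extracted from a verified token for a junior developer:
claims = {"sub": "jdoe", "clearance_level": 2, "departments": ["engineering"]}
filter_ = build_rbac_filter(claims)
```

The resulting `filter_` is what gets passed to the query shown above; a level-5 payroll document from the finance department fails both predicates and is never scored.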