DP-800 Study Guide
Developing AI-Enabled Database Solutions
Exam DP-800 at a Glance
Understand what this certification is, who it's for, and how the exam is structured before diving into content.
What is DP-800?
Exam DP-800 (Developing AI-Enabled Database Solutions) is a Microsoft Associate-level certification exam that validates expertise in integrating AI directly within SQL-based data solutions across the Microsoft SQL platform family: Azure SQL Database, SQL Server 2025, and SQL in Microsoft Fabric.
The exam focuses on the new generation of SQL AI capabilities: native vector types, in-database embedding generation, semantic search, and retrieval-augmented generation (RAG), all executed using T-SQL without migrating data to a separate vector database.
Target Audience
- Database developers integrating AI into SQL-based applications
- Data engineers building AI-augmented pipelines on Azure SQL or Microsoft Fabric
- Solution architects designing secure, scalable AI-enabled database solutions
- Professionals experienced with T-SQL, GitHub-based CI/CD, and foundational AI/ML concepts
Exam Structure
| Skill Area | Topics | Approx. Weight |
|---|---|---|
| 1. AI-Ready Database Design | Schema design, T-SQL for AI, data types | ~20% |
| 2. Vectors & Embeddings | VECTOR type, DiskANN index, VECTOR_DISTANCE | ~25% |
| 3. RAG Workflows | Retrieval-Augmented Generation, LLM grounding | ~20% |
| 4. T-SQL AI Functions | AI_GENERATE_EMBEDDINGS, external models, OpenAI | ~20% |
| 5. Security, Scaling & DevOps | Encryption, RBAC, CI/CD, Data API Builder | ~15% |
Recommended Prerequisites
- Solid T-SQL knowledge (queries, stored procedures, indexes, performance tuning)
- Familiarity with Azure SQL Database or SQL Server
- Basic understanding of AI/ML concepts: embeddings, vectors, LLMs
- Experience with GitHub and CI/CD pipelines
- Knowledge of database security best practices
The exam includes scenario-based questions requiring you to choose the most appropriate T-SQL approach for AI tasks. Focus on understanding when and why to use each feature, not just syntax. Practice with the Microsoft Learn labs before sitting the exam.
Designing and Developing AI-Ready Database Solutions
Learn how to design schemas, choose appropriate data types, and write advanced T-SQL that efficiently supports AI workloads.
1.1 Schema Design for AI Workloads
AI-ready schemas differ from traditional schemas: they must efficiently store high-dimensional vector data, support partial-text search, and handle semi-structured (JSON) responses from LLMs.
Key Design Principles
- Store embeddings close to source data: add a VECTOR column to the same table as the text being represented, avoiding expensive joins at query time.
- Use JSON columns for LLM metadata: store prompt/response logs, token counts, and model metadata as NVARCHAR(MAX) with JSON validation.
- Chunk large documents: split long text into overlapping chunks (e.g., 512 tokens) and store each chunk as a separate row with its embedding and a reference to the parent document ID.
- Choose appropriate vector dimensions: match the dimension to your embedding model (e.g., 1536 for OpenAI text-embedding-ada-002, 3072 for text-embedding-3-large).
Example: AI-Ready Document Table
-- Store document chunks with their embeddings
CREATE TABLE dbo.DocumentChunks (
ChunkId INT IDENTITY PRIMARY KEY,
DocumentId INT NOT NULL,
ChunkIndex INT NOT NULL, -- order within the document
ChunkText NVARCHAR(MAX) NOT NULL,
TokenCount INT NOT NULL,
Embedding VECTOR(1536), -- OpenAI ada-002 dimension
CreatedAt DATETIME2 DEFAULT SYSUTCDATETIME(),
CONSTRAINT fk_doc FOREIGN KEY (DocumentId)
REFERENCES dbo.Documents(DocumentId)
);
-- Index for fast retrieval by document
CREATE INDEX ix_chunk_doc ON dbo.DocumentChunks (DocumentId, ChunkIndex);
Chunking Strategy
Chunk size affects retrieval quality. Common strategies include:
- Fixed-size chunks with overlap (e.g., 512 tokens with 10-20% overlap): simple, predictable, and easy to batch.
- Sentence or paragraph boundaries: preserves semantic units at the cost of variable chunk sizes.
- Recursive splitting: split on sections first, then paragraphs, then sentences, until each piece fits the size budget.
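The fixed-size-with-overlap strategy can be sketched in a few lines of Python. This is a minimal illustration; `chunk_tokens` and the 512/64 sizes are hypothetical choices, not values mandated by the exam or the platform.

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token sequence into fixed-size chunks with overlap."""
    step = size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already covers the tail
    return chunks

tokens = list(range(1200))  # stand-in for real token IDs
chunks = chunk_tokens(tokens)
print(len(chunks))   # 3 windows cover 1200 tokens
print(chunks[1][0])  # 448: the second window starts at 512 - 64
```

Each chunk would become one row in dbo.DocumentChunks, with ChunkIndex preserving the original order.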
1.2 Semi-Structured Data and JSON in T-SQL
AI applications frequently receive JSON from LLM APIs. SQL Server provides first-class JSON support for parsing, querying, and storing JSON data.
-- Store LLM interaction logs as JSON
CREATE TABLE dbo.LLMInteractions (
InteractionId UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY,
SessionId NVARCHAR(128) NOT NULL,
RequestPayload NVARCHAR(MAX),
ResponsePayload NVARCHAR(MAX),
ModelName NVARCHAR(128),
TokensUsed INT,
DurationMs INT,
CreatedAt DATETIME2 DEFAULT SYSUTCDATETIME(),
CONSTRAINT chk_request_json CHECK (ISJSON(RequestPayload) = 1),
CONSTRAINT chk_response_json CHECK (ISJSON(ResponsePayload) = 1)
);
-- Query JSON fields with JSON_VALUE and JSON_QUERY
SELECT
InteractionId,
JSON_VALUE(ResponsePayload, '$.choices[0].message.content') AS Answer,
JSON_VALUE(ResponsePayload, '$.usage.total_tokens') AS TotalTokens
FROM dbo.LLMInteractions
WHERE ModelName = 'gpt-4o';
-- Aggregate with OPENJSON for multi-value arrays
SELECT msg.role, msg.content
FROM dbo.LLMInteractions
CROSS APPLY OPENJSON(RequestPayload, '$.messages')
WITH (role NVARCHAR(50) '$.role', content NVARCHAR(MAX) '$.content') AS msg
WHERE InteractionId = '...';
1.3 Advanced T-SQL Techniques for AI Workloads
Batch Processing with MERGE
When generating embeddings in bulk, use MERGE to insert new chunks and update changed ones atomically:
-- @stagingChunks: a previously declared table variable (or TVP) of freshly chunked rows
MERGE dbo.DocumentChunks AS target
USING @stagingChunks AS source
ON target.DocumentId = source.DocumentId
AND target.ChunkIndex = source.ChunkIndex
WHEN MATCHED AND target.ChunkText != source.ChunkText THEN
UPDATE SET
ChunkText = source.ChunkText,
Embedding = NULL, -- mark for re-embedding
CreatedAt = SYSUTCDATETIME()
WHEN NOT MATCHED BY TARGET THEN
INSERT (DocumentId, ChunkIndex, ChunkText, TokenCount)
VALUES (source.DocumentId, source.ChunkIndex, source.ChunkText, source.TokenCount);
Filtered Indexes for Unembedded Rows
-- Efficiently find rows that still need embedding
CREATE INDEX ix_needs_embedding
ON dbo.DocumentChunks (ChunkId)
WHERE Embedding IS NULL;
Filtered indexes on WHERE Embedding IS NULL make finding unembedded rows cheap, avoiding full table scans during batch embedding pipelines.
Vector Data and Embeddings in SQL
Master the VECTOR data type, vector indexing with DiskANN, and semantic similarity search, all natively within SQL Server 2025 and Azure SQL.
2.1 The VECTOR Data Type
SQL Server 2025 and Azure SQL introduce the VECTOR native data type, enabling storage of high-dimensional floating-point arrays
(embeddings) as a first-class database column. Internally stored as compact binary, it presents as a JSON array for input/output.
Syntax
-- Declare a vector column: VECTOR(dimensions [, precision])
-- Default precision is float32; float16 available in preview
embedding VECTOR(1536) -- float32 (default)
embedding VECTOR(1536, float16) -- half-precision (more compact)
-- Supported dimension range: 1 to 1998
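The storage impact of the precision choice is easy to estimate. A back-of-envelope sketch, assuming 4 bytes per float32 component and 2 per float16 and ignoring any per-value storage overhead:

```python
# Approximate in-row storage for a VECTOR column; assumes 4 bytes per
# float32 component and 2 per float16 (per-value overhead ignored).
def vector_bytes(dimensions, precision="float32"):
    per_component = {"float32": 4, "float16": 2}[precision]
    return dimensions * per_component

print(vector_bytes(1536))              # 6144 bytes per embedding
print(vector_bytes(1536, "float16"))   # 3072 bytes: half the footprint
print(vector_bytes(1536) * 1_000_000)  # roughly 6 GB for a million float32 rows
```

Halving precision halves both storage and I/O per vector, which is why float16 matters for large corpora.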
Creating Tables with Vectors
CREATE TABLE dbo.Products (
ProductId INT IDENTITY PRIMARY KEY,
Name NVARCHAR(255) NOT NULL,
Description NVARCHAR(MAX),
Category NVARCHAR(100),
Embedding VECTOR(1536) -- text-embedding-ada-002
);
Inserting Vector Data
Vectors are inserted as JSON arrays (the database converts them internally):
-- Direct insert from JSON array
INSERT INTO dbo.Products (Name, Description, Embedding)
VALUES (
'Surface Pro 11',
'A lightweight 2-in-1 laptop for professionals.',
'[0.0234, -0.1127, 0.0891, ...]' -- 1536 float values as JSON array
);
-- From a stored procedure that calls Azure OpenAI
DECLARE @emb VECTOR(1536);
EXEC dbo.GetEmbedding @text = N'Surface Pro 11 laptop', @embedding = @emb OUTPUT;
UPDATE dbo.Products
SET Embedding = @emb
WHERE ProductId = 1;
2.2 Vector Distance Functions
SQL Server 2025 provides VECTOR_DISTANCE() to compute similarity between two vectors:
| Metric | Syntax | Range | Use Case |
|---|---|---|---|
| Cosine | 'cosine' | 0 (identical) to 2 (opposite) | Best for NLP / text similarity. Ignores magnitude. |
| Euclidean | 'euclidean' | 0 to ∞ | Geometric distance. Good when magnitude matters. |
| Dot Product | 'dot' | -∞ to ∞ (higher = more similar) | Fast; useful when vectors are L2-normalized. |
| Negative Dot | 'neg_dot' | -∞ to ∞ | Inverted for ORDER BY (lower = more similar). |
-- Semantic similarity search: top-5 most similar products
DECLARE @queryEmb VECTOR(1536);
EXEC dbo.GetEmbedding @text = N'portable ultrabook for developers', @embedding = @queryEmb OUTPUT;
SELECT TOP (5)
ProductId,
Name,
VECTOR_DISTANCE('cosine', Embedding, @queryEmb) AS Distance
FROM dbo.Products
WHERE Embedding IS NOT NULL
ORDER BY Distance ASC; -- lower cosine distance = more similar
When your embeddings are L2-normalized (unit length), cosine distance and dot product are equivalent for ranking: cosine distance equals 1 minus the dot product. Azure OpenAI embedding models return normalized vectors by default, so dot product search is slightly faster.
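The equivalence is easy to verify numerically. A small stdlib-only Python check (`normalize`, `dot`, and `cosine_distance` are illustrative helpers, not SQL functions):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit (L2) length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cosine_distance(a, b):
    return 1 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = normalize([0.2, -0.5, 0.8])
b = normalize([0.1, -0.4, 0.9])
# On unit vectors the denominator is 1, so cosine distance collapses
# to 1 - dot(a, b): both metrics rank results identically.
assert abs(cosine_distance(a, b) - (1 - dot(a, b))) < 1e-12
print(round(cosine_distance(a, b), 6))
```

Because only the ranking matters for top-K retrieval, either metric returns the same result set on normalized embeddings.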
2.3 Vector Indexes (DiskANN)
Without an index, similarity search requires computing the distance to every row, an O(n) brute-force scan. SQL Server 2025 introduces DiskANN (Disk-based Approximate Nearest Neighbor) indexes, which provide sub-linear search time at the cost of approximate (not exact) results.
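For reference, the exact scan that the index approximates can be sketched in Python (`brute_force_top_k` is an illustrative helper, not a SQL function). DiskANN returns roughly this ordering without touching every row:

```python
import math

def cosine_distance(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1 - num / den

def brute_force_top_k(query, rows, k):
    """Exact k-NN: score every row, sort, take k. Cost grows linearly with n."""
    scored = [(cosine_distance(query, emb), row_id) for row_id, emb in rows]
    return sorted(scored)[:k]

rows = [(1, [1.0, 0.0]), (2, [0.0, 1.0]), (3, [0.9, 0.1])]
top = brute_force_top_k([1.0, 0.0], rows, k=2)
print([row_id for _, row_id in top])  # [1, 3]: the two closest rows
```

An ANN index trades a small chance of missing a true neighbor for avoiding this full scan on millions of rows.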
Creating a Vector Index
-- Create a DiskANN vector index for cosine similarity
CREATE VECTOR INDEX vix_products_embedding
ON dbo.Products (Embedding)
WITH (
METRIC = 'cosine', -- distance metric to optimize for
TYPE = 'DiskANN' -- only supported type currently
);
Using VECTOR_SEARCH for Index-Accelerated Search
-- Using the VECTOR_SEARCH function (leverages the DiskANN index)
DECLARE @queryEmb VECTOR(1536);
EXEC dbo.GetEmbedding @text = N'portable laptop for travel', @embedding = @queryEmb OUTPUT;
SELECT p.ProductId, p.Name, vs.distance
FROM VECTOR_SEARCH(
TABLE = dbo.Products AS p,
COLUMN = Embedding,
SIMILAR_TO = @queryEmb,
METRIC = 'cosine',
TOP_N = 10
) AS vs
ORDER BY vs.distance; -- columns of alias p are available directly; no extra join needed
DiskANN produces approximate nearest neighbors. For most RAG and recommendation use cases this is acceptable. For applications requiring exact results (e.g., duplicate detection), use brute-force VECTOR_DISTANCE() without the index.
DiskANN Index Parameters
- METRIC: cosine, euclidean, or dot. Must match the metric your search queries use.
- TYPE: only DiskANN is supported, a graph-based ANN algorithm from Microsoft Research.
- Monitor index health with sys.dm_db_index_physical_stats.
2.4 Vector Operations and Utility Functions
-- VECTOR_NORM: compute the L2 norm of a vector
SELECT VECTOR_NORM(Embedding, 'norm2') AS EuclideanNorm
FROM dbo.Products WHERE ProductId = 1;
-- VECTOR_NORMALIZE: return unit-length vector
SELECT VECTOR_NORMALIZE(Embedding) AS NormalizedEmb
FROM dbo.Products WHERE ProductId = 1;
-- VECTOR_DIMENSIONS: get the dimension count
SELECT VECTOR_DIMENSIONS(Embedding) AS Dims FROM dbo.Products WHERE ProductId = 1;
-- Cast between float32 and float16
SELECT CAST(Embedding AS VECTOR(1536, float16)) AS CompactEmb
FROM dbo.Products WHERE ProductId = 1;
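The math behind these utility functions is straightforward. A pure-Python sketch of the norm and normalize operations (illustrative helpers mirroring the T-SQL built-ins, not the built-ins themselves):

```python
import math

def vector_norm(v):
    """L2 (Euclidean) norm: the value VECTOR_NORM(v, 'norm2') computes."""
    return math.sqrt(sum(x * x for x in v))

def vector_normalize(v):
    """Unit-length copy of v: the value VECTOR_NORMALIZE returns."""
    n = vector_norm(v)
    return [x / n for x in v]

v = [3.0, 4.0]
print(vector_norm(v))       # 5.0
print(vector_normalize(v))  # [0.6, 0.8]
print(len(v))               # dimension count, like VECTOR_DIMENSIONS
```

Normalizing embeddings once at write time is what enables the faster dot-product search described in Section 2.2.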
Implementing RAG Workflows
Build Retrieval-Augmented Generation pipelines that ground LLM outputs with your SQL data, reducing hallucinations and enabling domain-specific AI applications.
3.1 What is RAG?
Retrieval-Augmented Generation (RAG) is an architectural pattern that enhances LLM responses by fetching relevant context from an external data source before generating an answer. Instead of relying solely on the model's pre-trained knowledge, RAG supplies the LLM with up-to-date, domain-specific data retrieved from your database.
3.2 RAG Architecture in Azure SQL
The typical Azure SQL RAG pipeline consists of two phases:
Phase 1: Indexing (Offline)
- Load source documents (PDFs, articles, product descriptions) into dbo.Documents.
- Chunk documents into manageable segments (512-1024 tokens with overlap).
- Generate embeddings for each chunk via Azure OpenAI (text-embedding-3-small or text-embedding-ada-002).
- Store chunks + embeddings in dbo.DocumentChunks with the VECTOR column.
- Build a DiskANN index on the embedding column for fast retrieval.
Phase 2: Querying (Online)
- User submits a natural language question.
- Generate a query embedding for the user's question using the same embedding model.
- Execute vector similarity search in Azure SQL to retrieve top-K relevant chunks.
- Concatenate retrieved chunks into a context prompt.
- Send user question + context to Azure OpenAI GPT model and return the answer.
3.3 Complete T-SQL RAG Implementation
-- Step 1: Create the knowledge base tables
CREATE TABLE dbo.KBDocuments (
DocId INT IDENTITY PRIMARY KEY,
Title NVARCHAR(500) NOT NULL,
Source NVARCHAR(500),
CreatedAt DATETIME2 DEFAULT SYSUTCDATETIME()
);
CREATE TABLE dbo.KBChunks (
ChunkId INT IDENTITY CONSTRAINT PK_KBChunks PRIMARY KEY,
DocId INT NOT NULL REFERENCES dbo.KBDocuments(DocId),
ChunkIndex INT NOT NULL, -- order within the parent document
ChunkText NVARCHAR(MAX) NOT NULL,
Embedding VECTOR(1536),
CONSTRAINT uq_chunk UNIQUE (DocId, ChunkIndex)
);
-- Vector index for fast retrieval
CREATE VECTOR INDEX vix_kb_embedding ON dbo.KBChunks (Embedding)
WITH (METRIC = 'cosine', TYPE = 'DiskANN');
-- Step 2: Stored procedure to retrieve relevant context
CREATE OR ALTER PROCEDURE dbo.RAGRetrieve
@Question NVARCHAR(MAX),
@TopK INT = 5,
@Context NVARCHAR(MAX) OUTPUT
AS
BEGIN
SET NOCOUNT ON;
-- Generate embedding for the question (calls Azure OpenAI via EXTERNAL MODEL)
DECLARE @queryEmb VECTOR(1536);
SET @queryEmb = AI_GENERATE_EMBEDDINGS(@Question
USING MODEL AzureOpenAIEmbeddings);
-- Retrieve top-K similar chunks
SELECT TOP (@TopK)
c.ChunkText,
d.Title,
VECTOR_DISTANCE('cosine', c.Embedding, @queryEmb) AS Score
INTO #relevant
FROM dbo.KBChunks c
JOIN dbo.KBDocuments d ON d.DocId = c.DocId
WHERE c.Embedding IS NOT NULL
ORDER BY Score ASC;
-- Build context string
SELECT @Context = STRING_AGG(
N'[Source: ' + Title + N']' + CHAR(10) + ChunkText,
CHAR(10) + CHAR(10)
)
FROM #relevant;
END;
-- Step 3: End-to-end RAG query stored procedure
CREATE OR ALTER PROCEDURE dbo.AskQuestion
@Question NVARCHAR(MAX),
@Answer NVARCHAR(MAX) OUTPUT
AS
BEGIN
SET NOCOUNT ON;
-- Retrieve grounding context
DECLARE @context NVARCHAR(MAX);
EXEC dbo.RAGRetrieve @Question = @Question, @Context = @context OUTPUT;
-- Build system + user prompt
DECLARE @prompt NVARCHAR(MAX) = N'You are a helpful assistant. '
+ N'Answer the user question ONLY using the provided context. '
+ N'If the context does not contain the answer, say "I don''t know."'
+ CHAR(10) + CHAR(10)
+ N'CONTEXT:' + CHAR(10) + @context;
-- Call Azure OpenAI GPT model
DECLARE @request NVARCHAR(MAX) = JSON_OBJECT(
'messages': JSON_ARRAY(
JSON_OBJECT('role': 'system', 'content': @prompt),
JSON_OBJECT('role': 'user', 'content': @Question)
),
'max_tokens': 1000
);
DECLARE @response NVARCHAR(MAX);
EXEC sp_invoke_external_rest_endpoint
@url = N'https://YOUR_OAI.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-02-01',
@method = 'POST',
@payload = @request,
@response = @response OUTPUT;
-- Return only the assistant's message text, not the raw HTTP response
SET @Answer = JSON_VALUE(@response, '$.result.choices[0].message.content');
END;
3.4 Hybrid Search (Vector + Full-Text)
For best retrieval quality, combine vector similarity search with traditional keyword (full-text) search. This is called hybrid search:
-- Enable Full-Text Search on chunk text
-- (KEY INDEX must name the table's primary key index; here assumed PK_KBChunks)
CREATE FULLTEXT CATALOG ft_kb AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.KBChunks (ChunkText) KEY INDEX PK_KBChunks;
-- Hybrid search: fuse the full-text keyword rank with the cosine vector rank
DECLARE @Question NVARCHAR(1000) = N'...'; -- the user's search text
DECLARE @queryEmb VECTOR(1536);
SET @queryEmb = AI_GENERATE_EMBEDDINGS(@Question USING MODEL AzureOpenAIEmbeddings);
SELECT TOP 10
c.ChunkId,
c.ChunkText,
-- Reciprocal Rank Fusion of keyword and vector scores
(1.0 / (60 + ROW_NUMBER() OVER (ORDER BY ft.RANK DESC))) +
(1.0 / (60 + ROW_NUMBER() OVER (ORDER BY VECTOR_DISTANCE('cosine', c.Embedding, @queryEmb)))) AS RRFScore
FROM dbo.KBChunks c
INNER JOIN CONTAINSTABLE(dbo.KBChunks, ChunkText, @Question) AS ft
ON c.ChunkId = ft.[KEY]
ORDER BY RRFScore DESC;
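The RRF formula used above, 1/(k + rank) summed over each ranking with k = 60, can be sketched in isolation (Python, with hypothetical chunk IDs):

```python
def rrf(keyword_ranking, vector_ranking, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) across both rankings.
    Each ranking lists chunk IDs best-first; k=60 is the conventional constant."""
    scores = {}
    for ranking in (keyword_ranking, vector_ranking):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword = ["c3", "c1", "c7"]  # hypothetical full-text order, best first
vector = ["c1", "c5", "c3"]   # hypothetical vector order, best first
print(rrf(keyword, vector))   # c1 wins: ranked well by both retrievers
```

Because RRF uses only ranks, it needs no score normalization between the two retrievers, which is why it is a popular fusion choice.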
Hybrid search outperforms pure vector search for queries that contain specific product codes, names, or acronyms: exact keywords that semantic embeddings may not capture well. Use hybrid search as your default RAG retrieval strategy.
T-SQL AI Functions and Azure OpenAI Integration
Use built-in AI functions, External Models, and REST integrations to call LLMs and embedding services directly from T-SQL.
4.1 External Models
SQL Server 2025 introduces External Models β a mechanism to register AI model endpoints (Azure OpenAI, local ONNX models, etc.) and invoke them from T-SQL using standard functions.
-- Register an Azure OpenAI embedding model
CREATE EXTERNAL MODEL AzureOpenAIEmbeddings
WITH (
LOCATION = 'https://YOUR_RESOURCE.openai.azure.com',
API_TYPE = 'azure_openai',
API_KEY_SECRET = 'AZURE_OPENAI_KEY',
DEPLOYMENT = 'text-embedding-ada-002',
TASK = 'text-embedding'
);
-- Register a chat completion model
CREATE EXTERNAL MODEL GPT4oChat
WITH (
LOCATION = 'https://YOUR_RESOURCE.openai.azure.com',
API_TYPE = 'azure_openai',
API_KEY_SECRET = 'AZURE_OPENAI_KEY',
DEPLOYMENT = 'gpt-4o',
TASK = 'chat-completion'
);
API keys are stored securely using database secrets. Never hard-code API keys in plain T-SQL. Use CREATE DATABASE SCOPED CREDENTIAL or Azure Managed Identity where possible.
4.2 AI_GENERATE_EMBEDDINGS
Generate embeddings from text using a registered external model:
-- Generate embedding for a single string
SELECT AI_GENERATE_EMBEDDINGS(
N'machine learning and artificial intelligence'
USING MODEL AzureOpenAIEmbeddings
) AS Embedding;
-- Generate and store embeddings in bulk
UPDATE dbo.DocumentChunks
SET Embedding = AI_GENERATE_EMBEDDINGS(ChunkText USING MODEL AzureOpenAIEmbeddings)
WHERE Embedding IS NULL;
4.3 AI_GENERATE_TEXT (Chat Completions)
-- Generate text using a chat model
SELECT AI_GENERATE_TEXT(
'Summarize this product review in one sentence: ' + ReviewText
USING MODEL GPT4oChat
) AS Summary
FROM dbo.ProductReviews
WHERE ReviewId = 42;
-- Batch summarization with CROSS APPLY
SELECT
r.ReviewId,
r.ProductId,
g.GeneratedSummary
FROM dbo.ProductReviews r
CROSS APPLY (
SELECT AI_GENERATE_TEXT(
N'Classify the sentiment (Positive/Negative/Neutral) and summarize: '
+ r.ReviewText
USING MODEL GPT4oChat
) AS GeneratedSummary
) g
WHERE r.CreatedAt > DATEADD(day, -7, SYSUTCDATETIME());
4.4 Calling Azure OpenAI via sp_invoke_external_rest_endpoint
For more control over the HTTP request (custom headers, retry logic), use sp_invoke_external_rest_endpoint:
The example below uses a placeholder API key for illustration. In production, always authenticate via Azure Managed Identity (see Section 5.1). Also verify the latest Azure OpenAI API version at aka.ms/oai/docs.
DECLARE @payload NVARCHAR(MAX) = JSON_OBJECT(
'model' : 'gpt-4o',
'messages' : JSON_ARRAY(
JSON_OBJECT('role': 'system',
'content': 'You are an expert SQL tutor.'),
JSON_OBJECT('role': 'user',
'content': 'Explain what a vector index is in SQL Server 2025.')
),
'max_tokens' : 500,
'temperature': 0.2
);
DECLARE @response NVARCHAR(MAX);
EXEC sp_invoke_external_rest_endpoint
@url = N'https://YOUR_RESOURCE.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-02-01', -- verify latest API version at aka.ms/oai/docs
@method = 'POST',
@headers = '{"Content-Type":"application/json","api-key":"YOUR_API_KEY"}', -- In production use Managed Identity; never hardcode keys
@payload = @payload,
@response = @response OUTPUT;
-- Parse the JSON response
SELECT JSON_VALUE(@response, '$.result.choices[0].message.content') AS Answer;
4.5 Common AI Task Patterns
Sentiment Analysis
CREATE OR ALTER FUNCTION dbo.GetSentiment(@text NVARCHAR(MAX))
RETURNS NVARCHAR(20) AS
BEGIN
DECLARE @result NVARCHAR(MAX);
SET @result = AI_GENERATE_TEXT(
N'Respond with only one word: Positive, Negative, or Neutral. '
+ N'Sentiment of: ' + @text
USING MODEL GPT4oChat
);
RETURN LTRIM(RTRIM(@result));
END;
Entity Extraction
-- Extract named entities as JSON from support tickets
SELECT
TicketId,
AI_GENERATE_TEXT(
N'Extract a JSON object with keys "product", "issue", "priority" from: '
+ TicketText
USING MODEL GPT4oChat
) AS ExtractedEntities
FROM dbo.SupportTickets
WHERE Status = 'Open';
Translation
-- Translate product descriptions to English
UPDATE dbo.Products
SET DescriptionEN = AI_GENERATE_TEXT(
N'Translate to English: ' + Description
USING MODEL GPT4oChat
)
WHERE Language != 'en' AND DescriptionEN IS NULL;
Know the difference between AI_GENERATE_EMBEDDINGS (returns a VECTOR for similarity search) and AI_GENERATE_TEXT (returns NVARCHAR from a chat/completion model). Also understand when to use sp_invoke_external_rest_endpoint vs built-in AI functions.
Security, Scaling, and DevOps
Design AI-enabled database solutions that are secure, compliant, scalable, and deployed with CI/CD best practices.
5.1 Securing AI-Enabled Solutions
Managed Identity for External Models
Prefer Azure Managed Identity over API keys for authenticating Azure SQL to Azure OpenAI. This eliminates secret rotation overhead and follows the principle of least privilege.
-- Register external model using Managed Identity
CREATE EXTERNAL MODEL AzureOpenAIEmbeddings
WITH (
LOCATION = 'https://YOUR_RESOURCE.openai.azure.com',
API_TYPE = 'azure_openai',
AUTHENTICATION = 'managed_identity', -- No API key needed
DEPLOYMENT = 'text-embedding-ada-002',
TASK = 'text-embedding'
);
Row-Level Security for Multi-Tenant AI
In multi-tenant applications, ensure users only retrieve embeddings from their own data:
-- Create security policy using RLS
CREATE SCHEMA Security;
GO
CREATE FUNCTION Security.RLS_TenantFilter(@TenantId INT)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN SELECT 1 AS Result
WHERE @TenantId = CAST(SESSION_CONTEXT(N'TenantId') AS INT);
GO
CREATE SECURITY POLICY TenantIsolation
ADD FILTER PREDICATE Security.RLS_TenantFilter(TenantId)
ON dbo.KBChunks,
ADD BLOCK PREDICATE Security.RLS_TenantFilter(TenantId)
ON dbo.KBChunks AFTER INSERT;
-- Set tenant context for each session
EXEC sp_set_session_context @key = N'TenantId', @value = 42;
Encrypting Sensitive AI Outputs
-- Use Always Encrypted for sensitive LLM outputs (PII in responses)
-- Define column encryption in the table DDL
CREATE TABLE dbo.SensitiveResponses (
ResponseId INT IDENTITY PRIMARY KEY,
UserId INT,
UserQuery NVARCHAR(MAX),
LLMResponse NVARCHAR(MAX)
ENCRYPTED WITH (
COLUMN_ENCRYPTION_KEY = CEK_AI_Outputs,
ENCRYPTION_TYPE = Randomized,
ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
)
);
Auditing AI Interactions
-- Log all LLM calls with user context for audit trails
CREATE TABLE dbo.AIAuditLog (
LogId BIGINT IDENTITY PRIMARY KEY,
UserId NVARCHAR(128) DEFAULT SUSER_SNAME(),
AppName NVARCHAR(256) DEFAULT APP_NAME(),
Operation NVARCHAR(50),
InputHash VARBINARY(32), -- SHA2 of input (avoid storing PII)
TokensUsed INT,
ModelName NVARCHAR(128),
CreatedAt DATETIME2 DEFAULT SYSUTCDATETIME()
);
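The InputHash column can be populated application-side before logging. A minimal Python sketch (`input_hash` is an illustrative helper; SHA-256 yields exactly the 32 bytes that VARBINARY(32) expects):

```python
import hashlib

def input_hash(prompt_text):
    """SHA-256 digest of the prompt: 32 bytes, matching VARBINARY(32).
    Lets audits correlate repeated prompts without storing the PII itself."""
    return hashlib.sha256(prompt_text.encode("utf-8")).digest()

digest = input_hash("What is my account balance?")
print(len(digest))        # 32 bytes
print(digest.hex()[:16])  # hex prefix, convenient for display
```

Hashing rather than storing the prompt keeps the audit trail useful (duplicate detection, volume analysis) without retaining sensitive input.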
5.2 Performance and Scaling
Batch Embedding Generation
-- Process embeddings in configurable batches to avoid timeouts
CREATE OR ALTER PROCEDURE dbo.BatchGenerateEmbeddings
@BatchSize INT = 100
AS
BEGIN
SET NOCOUNT ON;
DECLARE @processed INT = 0;
WHILE EXISTS (SELECT 1 FROM dbo.KBChunks WHERE Embedding IS NULL)
BEGIN
UPDATE TOP (@BatchSize) dbo.KBChunks
SET Embedding = AI_GENERATE_EMBEDDINGS(ChunkText USING MODEL AzureOpenAIEmbeddings)
WHERE Embedding IS NULL;
SET @processed += @@ROWCOUNT;
-- Yield to avoid blocking; adjust delay as needed
WAITFOR DELAY '00:00:01';
END;
PRINT CAST(@processed AS VARCHAR) + ' rows embedded.';
END;
Monitoring Vector Search Performance
-- Monitor vector index health
SELECT
OBJECT_NAME(i.object_id) AS TableName,
i.name AS IndexName,
p.index_type_desc,
p.avg_fragmentation_in_percent,
p.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') p
JOIN sys.indexes i ON i.object_id = p.object_id AND i.index_id = p.index_id
WHERE i.name LIKE 'vix_%';
5.3 CI/CD for AI-Enabled Databases
GitHub Actions Workflow for SQL Deployment
# .github/workflows/deploy-sql.yml
name: Deploy AI Database Changes
on:
push:
branches: [main]
paths: ['sql/**']
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run SQL migrations
uses: azure/sql-action@v2
with:
connection-string: ${{ secrets.AZURE_SQL_CONNECTION_STRING }}
path: './sql/migrations/*.sql'
action: 'deploy'
- name: Run schema validation
uses: azure/sql-action@v2
with:
connection-string: ${{ secrets.AZURE_SQL_CONNECTION_STRING }}
path: './sql/tests/validate_schema.sql'
action: 'run'
- name: Rebuild vector indexes if schema changed
uses: azure/sql-action@v2
with:
connection-string: ${{ secrets.AZURE_SQL_CONNECTION_STRING }}
path: './sql/maintenance/rebuild_vector_indexes.sql'
action: 'run'
Database Project (dacpac) Best Practices
- Use SQL Server Database Projects (SSDT / SQL Projects extension for VS Code) to version-control schema objects, including EXTERNAL MODEL definitions.
- Store the CREATE EXTERNAL MODEL DDL in source control without secrets; inject secrets via deployment parameters or Key Vault references.
- Use pre-deployment and post-deployment scripts for data migrations and embedding backfills.
- Test schema changes against a staging Azure SQL database before deploying to production.
Enable Microsoft Defender for Azure SQL to detect anomalous query patterns, including potential prompt injection attacks via SQL queries that route untrusted user input to LLMs. Always sanitize and parameterize user inputs before passing them to AI functions.
Data API Builder
Expose SQL data as REST and GraphQL APIs for AI-driven applications without writing custom middleware.
6.1 What is Data API Builder?
Data API Builder (DAB) is a Microsoft open-source tool that automatically generates REST and GraphQL APIs on top of existing Azure SQL, SQL Server, MySQL, PostgreSQL, and Cosmos DB databases. It allows AI applications (LangChain agents, custom GPT actions, etc.) to query SQL data via standard HTTP APIs without requiring application-layer middleware.
6.2 Configuration
// dab-config.json
{
"$schema": "https://dataapibuilder.azureedge.net/schemas/v1.2.0/dab.draft.schema.json",
"data-source": {
"database-type": "mssql",
"connection-string": "@env('AZURE_SQL_CONNECTION_STRING')"
},
"runtime": {
"rest": { "enabled": true, "path": "/api" },
"graphql": { "enabled": true, "path": "/graphql" }
},
"entities": {
"Product": {
"source": "dbo.Products",
"permissions": [{ "role": "anonymous", "actions": ["read"] }]
},
"SearchProducts": {
"source": {
"object": "dbo.SemanticProductSearch",
"type": "stored-procedure",
"parameters": { "query": "" }
},
"rest": { "methods": ["GET"] },
"permissions": [{ "role": "authenticated", "actions": ["execute"] }]
}
}
}
6.3 Exposing Vector Search via Custom Actions
-- Create a stored procedure DAB can expose as a REST endpoint
CREATE OR ALTER PROCEDURE dbo.SemanticProductSearch
@query NVARCHAR(MAX),
@topK INT = 5
AS
BEGIN
SET NOCOUNT ON;
DECLARE @emb VECTOR(1536);
SET @emb = AI_GENERATE_EMBEDDINGS(@query USING MODEL AzureOpenAIEmbeddings);
SELECT TOP (@topK)
ProductId,
Name,
Description,
VECTOR_DISTANCE('cosine', Embedding, @emb) AS Score
FROM dbo.Products
WHERE Embedding IS NOT NULL
ORDER BY Score ASC;
END;
After mapping this stored procedure in dab-config.json, it becomes available as:
GET /api/SearchProducts?query=portable+laptop&topK=5
6.4 Authentication and Authorization
// Role-based access in dab-config.json
"entities": {
"KBChunks": {
"source": "dbo.KBChunks",
"permissions": [
{ "role": "anonymous", "actions": [] },
{ "role": "reader", "actions": ["read"] },
{ "role": "contributor", "actions": ["create", "read", "update"] },
{ "role": "admin", "actions": ["*"] }
]
}
}
Data API Builder endpoints can be registered as OpenAPI actions in Microsoft Copilot Studio or as Custom GPT actions in OpenAI, enabling conversational AI agents to query your SQL database with natural language, backed by vector search and your real data.