AI Embeddings - How They Work
AI embeddings are numerical representations of text that capture semantic meaning. They power semantic search by converting words, sentences, or documents into vectors (arrays of numbers) that machines can compare mathematically.
What Are Embeddings?
The Core Concept
Embeddings map text to points in high-dimensional space where similar meanings are close together:
// Text → Vector (simplified to 3 dimensions for visualization)
"cat" → [0.8, 0.2, 0.1]
"kitten" → [0.7, 0.3, 0.2] // Close to "cat"
"dog" → [0.6, 0.4, 0.1] // Close to "cat" (both animals)
"car" → [0.1, 0.2, 0.9] // Far from "cat"
// In reality: 384-1536 dimensions
"cat" → [0.23, -0.45, 0.67, ..., 0.12] // 768 numbers
Why Vectors?
Vectors enable mathematical operations on meaning:
// Distance between concepts
distance("cat", "kitten") = 0.15 // Very similar
distance("cat", "car") = 0.92 // Very different
// Semantic operations
king - man + woman ≈ queen
Paris - France + Germany ≈ Berlin
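In practice, "distance" is usually measured with cosine similarity (distance ≈ 1 − similarity). A minimal TypeScript sketch using the toy 3-dimensional vectors from above (the helper here is illustrative, not a library function):
// Cosine similarity: 1.0 = identical direction, ~0.0 = unrelated
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
const cat = [0.8, 0.2, 0.1];
const kitten = [0.7, 0.3, 0.2];
const car = [0.1, 0.2, 0.9];
console.log(cosineSimilarity(cat, kitten)); // ≈ 0.98 (very similar)
console.log(cosineSimilarity(cat, car));    // ≈ 0.27 (very different)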
How Embeddings Are Generated
Transformer Models
Modern embeddings use transformer neural networks trained on billions of text examples:
Input: "The cat sat on the mat"
1. Tokenization:
["The", "cat", "sat", "on", "the", "mat"]
2. Token Embeddings:
Each word → initial vector
3. Transformer Layers (attention):
- "cat" looks at context: "The ___ sat on"
- Understands "cat" is subject, "sat" is action
- Updates embedding based on surrounding words
4. Pooling (averaging):
Combine token embeddings → single sentence embedding
Output: [0.123, -0.456, 0.789, ..., 0.234] // 768 dimensions
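You can inspect step 1 yourself with the tokenizer API from Transformers.js. A rough sketch (the model ID is the one used later in this guide; the exact output format may vary by library version):
import { AutoTokenizer } from '@xenova/transformers';
// Load the tokenizer that pairs with the embedding model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/all-MiniLM-L6-v2');
// Tokenize the example sentence; special tokens like [CLS]/[SEP] are added automatically
const { input_ids } = await tokenizer('The cat sat on the mat');
console.log(input_ids); // tensor of token IDs, one per (sub)word piece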
Training Process
Embeddings are trained to bring similar texts together:
Training data pairs:
✅ "search engine" ↔ "how to find information" (similar)
❌ "search engine" ↔ "cooking pasta" (different)
Loss function pushes similar pairs together:
distance(similar_pair) → minimize
distance(different_pair) → maximize
After training on billions of examples:
- Model learns language patterns
- Captures semantic relationships
- Understands synonyms, concepts
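The training objective itself is simple to state. Here is a non-trainable TypeScript sketch of a margin-based triplet loss (illustrative only; real models use batched variants of this idea and backpropagate through the network):
// Loss is zero only when the positive sits closer to the anchor
// than the negative does, by at least `margin`
function tripletLoss(
  anchor: number[],
  positive: number[],
  negative: number[],
  margin = 0.5
): number {
  const dist = (a: number[], b: number[]) =>
    Math.sqrt(a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0));
  return Math.max(0, dist(anchor, positive) - dist(anchor, negative) + margin);
}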
Xenova Transformers
What Is It?
Xenova Transformers is a JavaScript port of Hugging Face's Transformers library. It runs transformer models entirely in the browser or Node.js, enabling local embedding generation without API calls.
Key Features:
- Pure JavaScript/TypeScript
- Runs in browser and Node.js
- Uses ONNX Runtime for performance
- No server or API required
- Privacy-friendly (all processing local)
Installation
npm install @xenova/transformers
# or
pnpm add @xenova/transformers
Basic Usage
import { pipeline } from '@xenova/transformers';
// Load embedding model (downloads ~90MB on first run)
const embedder = await pipeline(
'feature-extraction',
'Xenova/all-MiniLM-L6-v2'
);
// Generate embedding
const text = "How do I deploy a web application?";
const output = await embedder(text, {
pooling: 'mean', // Average token embeddings
normalize: true // Normalize to unit length
});
// Result: Float32Array of 384 dimensions
const embedding = Array.from(output.data);
console.log(embedding);
// [0.123, -0.456, 0.789, ..., 0.234]
Popular Models
| Model | Dimensions | Size | Speed | Quality |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 90 MB | Fast | Good |
| all-mpnet-base-v2 | 768 | 420 MB | Medium | Better |
| bge-small-en-v1.5 | 384 | 133 MB | Fast | Excellent |
| bge-base-en-v1.5 | 768 | 436 MB | Medium | Excellent |
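Swapping models is just a matter of changing the model ID passed to pipeline(). The Xenova/* IDs below are the ONNX conversions published on the Hugging Face Hub (double-check the exact ID and its output dimension before committing to one):
import { pipeline } from '@xenova/transformers';
// 384 dimensions: small and fast
const small = await pipeline('feature-extraction', 'Xenova/bge-small-en-v1.5');
// 768 dimensions: larger download, better quality
const base = await pipeline('feature-extraction', 'Xenova/all-mpnet-base-v2');
// Note: embeddings from different models are not comparable with each other,
// and the dimension stored in your database must match the model you query with.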
Batch Processing
Process multiple texts efficiently:
const embedder = await pipeline(
'feature-extraction',
'Xenova/all-MiniLM-L6-v2'
);
const texts = [
"How to deploy a web application?",
"Debugging JavaScript errors",
"Best practices for async code"
];
// Generate embeddings for all texts concurrently (one pipeline call per text)
const embeddings = await Promise.all(
texts.map(text => embedder(text, {
pooling: 'mean',
normalize: true
}))
);
// Use embeddings
embeddings.forEach((output, i) => {
console.log(`Text ${i}: ${texts[i]}`);
console.log(`Embedding: [${Array.from(output.data).slice(0, 3).join(', ')}, ...]`);
});
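Transformers.js pipelines also accept an array of inputs, which lets the runtime process them in a single batched forward pass. A sketch, assuming the pooled output is a single tensor of shape [texts.length, dimensions]:
// Alternative: pass the whole array in one call
const batched = await embedder(texts, { pooling: 'mean', normalize: true });
console.log(batched.dims);     // e.g. [3, 384]
console.log(batched.tolist()); // nested array, one embedding per input text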
How Astro Vault Uses Embeddings
Indexing Content
// scripts/index-content.ts
import { indexContent } from '@logan/libsql-search';
import { getTursoClient } from '../src/lib/turso';
const client = getTursoClient();
// Articles from markdown files
const articles = [
{
slug: 'deployment-guide',
title: 'How to Deploy Web Applications',
content: 'Step-by-step guide to deploying...',
tags: ['deployment', 'devops'],
},
// ... more articles
];
// Generate embeddings and store in database
// Uses Xenova Transformers under the hood
await indexContent(
client,
'articles', // Table name
articles, // Content to index
'local', // Embedding provider (Xenova)
768 // Embedding dimensions
);
What happens:
- Each article's text is passed to Xenova Transformers, running locally
- Model generates 768-dimensional embedding vector
- Vector stored in LibSQL database alongside article text
- Vector index created for fast similarity search
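Conceptually, the storage boils down to a table with a vector column plus a vector index. A rough sketch using libSQL's native vector types (the column and index names are assumptions for illustration, not the schema @logan/libsql-search actually creates):
// Illustrative schema only; the library manages its own tables
await client.execute(`
  CREATE TABLE IF NOT EXISTS articles (
    slug      TEXT PRIMARY KEY,
    title     TEXT,
    content   TEXT,
    tags      TEXT,
    embedding F32_BLOB(768)  -- libSQL native vector column (768 dimensions)
  )
`);
// Vector index for fast approximate nearest-neighbor search
await client.execute(`
  CREATE INDEX IF NOT EXISTS articles_embedding_idx
  ON articles (libsql_vector_idx(embedding))
`);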
Searching
// src/pages/api/search.json.ts
import { searchArticles } from '@logan/libsql-search';
// User searches: "how do I push my app to production?"
const query = "how do I push my app to production?";
// 1. Convert query to embedding (Xenova)
// 2. Search database for similar vectors (LibSQL)
const results = await searchArticles(
client,
'articles',
query,
'local', // Use Xenova for embedding
10 // Return top 10 results
);
// Results ranked by similarity:
// [
// { title: "Deploying to Production", similarity: 0.89 },
// { title: "CI/CD Pipeline Setup", similarity: 0.82 },
// { title: "Docker Deployment Guide", similarity: 0.78 }
// ]
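For intuition, the equivalent raw query against the schema sketched above looks roughly like this (an assumption about the general shape, not the exact SQL the library emits; queryEmbedding is the vector produced from the user's query text):
const rows = await client.execute({
  sql: `
    SELECT title,
           1 - vector_distance_cos(embedding, vector32(?)) AS similarity
    FROM articles
    ORDER BY similarity DESC
    LIMIT 10
  `,
  // vector32() parses a JSON-style array string into a libSQL vector
  args: [JSON.stringify(queryEmbedding)],
});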
Embedding Model Architecture
Transformer Components
Input: "JavaScript async programming"
┌────────────────────────────────────────────┐
│ 1. Tokenization                            │
│    ["JavaScript", "async", "programming"]  │
└────────────────────────────────────────────┘
                      ↓
┌────────────────────────────────────────────┐
│ 2. Token Embeddings                        │
│    [vec₁, vec₂, vec₃]                      │
└────────────────────────────────────────────┘
                      ↓
┌────────────────────────────────────────────┐
│ 3. Self-Attention Layers (x6-12)           │
│    - Each token attends to others          │
│    - Captures context relationships        │
│    - Updates embeddings iteratively        │
└────────────────────────────────────────────┘
                      ↓
┌────────────────────────────────────────────┐
│ 4. Pooling Layer                           │
│    - Mean pool token embeddings            │
│    - Produces single vector                │
└────────────────────────────────────────────┘
                      ↓
┌────────────────────────────────────────────┐
│ 5. Normalization                           │
│    - Scale to unit length                  │
│    - Enables cosine similarity             │
└────────────────────────────────────────────┘
                      ↓
[0.23, -0.45, ..., 0.67] // 768D
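Steps 4 and 5 are simple enough to write out by hand. A minimal sketch of mean pooling plus L2 normalization over per-token vectors (toy arrays, not the model's real tensors):
// tokenVectors: one vector per token, produced by the transformer layers
function meanPoolAndNormalize(tokenVectors: number[][]): number[] {
  const dims = tokenVectors[0].length;
  // 4. Mean pooling: average each dimension across all tokens
  const pooled = new Array(dims).fill(0);
  for (const vec of tokenVectors) {
    for (let d = 0; d < dims; d++) pooled[d] += vec[d] / tokenVectors.length;
  }
  // 5. Normalization: unit length makes dot product equal cosine similarity
  const norm = Math.sqrt(pooled.reduce((sum, v) => sum + v * v, 0));
  return pooled.map(v => v / norm);
}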
Self-Attention Mechanism
How models understand context:
// Sentence: "The bank by the river is steep"
// Word: "bank"
// Without context:
"bank" → [financial institution OR river edge]
// With self-attention:
"bank" pays attention to:
"river" (high weight: 0.8)
"steep" (high weight: 0.6)
"The" (low weight: 0.1)
// Result: "bank" embedding emphasizes river edge meaning
// Different from "bank" in "I went to the bank to deposit money"
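You can observe this effect directly with sentence embeddings: the same word in different contexts yields different vectors. A quick experiment reusing the embedder from the Basic Usage section and the cosineSimilarity sketch from earlier (actual numbers depend on the model):
const riverBank = await embedder('The bank by the river is steep', {
  pooling: 'mean', normalize: true,
});
const moneyBank = await embedder('I went to the bank to deposit money', {
  pooling: 'mean', normalize: true,
});
// Both sentences contain "bank", but the contexts differ,
// so the similarity should be noticeably below 1.0
console.log(cosineSimilarity(
  Array.from(riverBank.data),
  Array.from(moneyBank.data)
));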
Advanced: Model Fine-Tuning
Why Fine-Tune?
Pre-trained models are general-purpose. Fine-tuning specializes them:
// General model
query: "exception handling"
results: "error handling", "try-catch", "debugging" // Good
// Fine-tuned for docs
query: "exception handling"
results: "error handling patterns", "exception hierarchy",
"custom exceptions" // Better - understands your domain
Fine-Tuning Process
// 1. Prepare training pairs (similar documents)
const trainingData = [
{
anchor: "How to deploy applications",
positive: "Deployment guide for web apps", // Similar
negative: "Database query optimization" // Different
},
// ... thousands more
];
# 2. Train with the sentence-transformers library
#    (Python; export to ONNX afterwards so Xenova can load it)
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('all-MiniLM-L6-v2')
examples = [InputExample(texts=[d['anchor'], d['positive'], d['negative']])
            for d in training_data]  # the pairs prepared in step 1
loader = DataLoader(examples, shuffle=True, batch_size=16)
model.fit(train_objectives=[(loader, losses.TripletLoss(model))], epochs=3)
model.save('custom-model')

# 3. Export to ONNX format (e.g. with Hugging Face Optimum:
#    optimum-cli export onnx --model ./custom-model ./custom-model-onnx)
// 4. Use in Xenova
const embedder = await pipeline(
'feature-extraction',
'./custom-model'
);
Embedding Providers Comparison
Local (Xenova Transformers)
// Astro Vault default
await indexContent(client, 'articles', articles, 'local', 768);
Pros:
- ✅ Free and unlimited
- ✅ Privacy-friendly (no data sent to APIs)
- ✅ No API keys required
- ✅ Works offline
- ✅ Consistent performance
Cons:
- ❌ Slower (100-200ms per embedding)
- ❌ Uses CPU/GPU resources
- ❌ Initial model download (~400MB)
- ❌ Lower quality than latest cloud models
OpenAI
import { OpenAI } from 'openai';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const response = await openai.embeddings.create({
model: 'text-embedding-3-small', // or text-embedding-3-large
input: text,
});
const embedding = response.data[0].embedding; // 1536 dimensions
Pros:
- ✅ High quality embeddings
- ✅ Fast API (20-50ms)
- ✅ No local compute needed
- ✅ Regularly improved
Cons:
- ❌ Costs money ($0.02 per 1M tokens)
- ❌ Privacy concerns (data sent to OpenAI)
- ❌ Requires API key
- ❌ Rate limits (3000 RPM)
Google Gemini
import { GoogleGenerativeAI } from '@google/generative-ai';
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: 'embedding-001' });
const result = await model.embedContent(text);
const embedding = result.embedding.values; // 768 dimensions
Pros:
- ✅ High quality embeddings
- ✅ Generous free tier
- ✅ Fast API
- ✅ Integrated with Google Cloud
Cons:
- ❌ Privacy concerns
- ❌ Requires API key
- ❌ Rate limits
- ❌ Smaller model selection
Performance Considerations
Model Size vs Speed vs Quality
Small models (384D):
- Size: ~90 MB
- Speed: 50ms per embedding
- Quality: Good for general search
- Use case: Documentation, blogs
Medium models (768D):
- Size: ~400 MB
- Speed: 150ms per embedding
- Quality: Better semantic understanding
- Use case: E-commerce, support docs
Large models (1536D):
- Size: API-only (OpenAI, Cohere)
- Speed: 20-50ms (API)
- Quality: Excellent
- Use case: Complex search, RAG applications
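The latency figures above are ballpark numbers; measuring on your own hardware is easy. A rough benchmark sketch using the embedder from the Basic Usage section:
// Warm up once so model loading isn't counted in the measurement
await embedder('warm up', { pooling: 'mean', normalize: true });
const runs = 20;
const start = performance.now();
for (let i = 0; i < runs; i++) {
  await embedder(`sample query number ${i}`, { pooling: 'mean', normalize: true });
}
console.log(`avg ${((performance.now() - start) / runs).toFixed(1)} ms per embedding`);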
Caching Strategies
// Cache embeddings to avoid recomputation
import { pipeline } from '@xenova/transformers';
import { LRUCache } from 'lru-cache';

const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

const embeddingCache = new LRUCache<string, Float32Array>({
  max: 1000, // Cache up to 1000 embeddings
});

async function getEmbedding(text: string): Promise<Float32Array> {
  const cached = embeddingCache.get(text);
  if (cached) return cached;

  const embedding = await embedder(text, {
    pooling: 'mean',
    normalize: true
  });

  embeddingCache.set(text, embedding.data);
  return embedding.data;
}
Batch Processing for Speed
// Process 100 articles

// ❌ Slow: Sequential
for (const article of articles) {
  await generateEmbedding(article.content); // 100 * 150ms ≈ 15s
}

// ✅ Faster: Batched (concurrent within each batch)
const batchSize = 10;
for (let i = 0; i < articles.length; i += batchSize) {
  const batch = articles.slice(i, i + batchSize);
  await Promise.all(
    batch.map(a => generateEmbedding(a.content))
  ); // best case ≈ 10 batches * 150ms = 1.5s, if the runtime can run a batch in parallel
}
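The generateEmbedding helper used above (and in the debugging examples below) isn't defined elsewhere in this guide; a minimal version wrapping the Xenova pipeline might look like this:
import { pipeline } from '@xenova/transformers';

const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

async function generateEmbedding(text: string): Promise<number[]> {
  const output = await embedder(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data as Float32Array);
}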
Debugging Embeddings
Visualizing Embeddings
// Reduce 768D to 2D for visualization using t-SNE (or UMAP)
import TSNE from 'tsne-js'; // tsne-js exposes the class as its default export

const embeddings = [
  await generateEmbedding("JavaScript tutorial"),
  await generateEmbedding("Python tutorial"),
  await generateEmbedding("Web deployment"),
  // ... more
];

// Reduce to 2D
const tsne = new TSNE({ dim: 2, perplexity: 30 });
tsne.init({ data: embeddings, type: 'dense' });
tsne.run();
const points2D = tsne.getOutputScaled();

// Plot points - similar concepts cluster together
console.log(points2D);
// [[0.2, 0.8], [0.3, 0.7], [8.1, 9.2], ...]
//  JS tutorial  Py tutorial  Deployment
//  (close together)          (far away)
Testing Similarity
import { cosineSimilarity } from '@logan/libsql-search';
// Test if embeddings make sense
const tests = [
["cat", "kitten"], // Should be high (>0.8)
["cat", "dog"], // Should be medium (~0.6)
["cat", "car"], // Should be low (<0.3)
["deploy", "deployment"], // Should be very high (>0.9)
];
for (const [word1, word2] of tests) {
const emb1 = await generateEmbedding(word1);
const emb2 = await generateEmbedding(word2);
const similarity = cosineSimilarity(emb1, emb2);
console.log(`"${word1}" vs "${word2}": ${similarity.toFixed(2)}`);
}
Use Cases
✅ Perfect For
Semantic Search: Find documents by meaning
query: "how do I make my site faster?"
finds: "performance optimization", "speed improvements", "caching strategies"
Recommendation Systems: "More like this"
article: "React Hooks Tutorial"
recommends: "Modern React Patterns", "State Management Guide", "Custom Hooks"
Content Deduplication: Find similar content
checkDuplicate("Getting started with React",
"React beginner's guide")
similarity: 0.92 // Likely duplicate
Question Answering: Match questions to answers
question: "What's the difference between let and var?"
answer: "Variable scoping in JavaScript" (from knowledge base)
⚠️ Not Ideal For
Exact Matching: Use full-text search instead
// Bad: Semantic search for exact strings
query: "React.useState"
// Good: Exact match (or full-text search)
WHERE function_name = 'React.useState'
Numeric/Date Filtering: Use database indexes
// Bad: Embedding-based filtering
query: "articles from 2023"
// Good: SQL WHERE clause (combines with vector ranking; see the sketch further below)
WHERE created_at >= '2023-01-01'
Very Large Scale: Use specialized vector DBs
// < 1M documents: LibSQL works great
// > 10M documents: Consider Pinecone, Weaviate, Qdrant
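That said, exact filters and semantic ranking combine well: filter with ordinary SQL first, then order the remaining rows by vector similarity. A hedged sketch against the illustrative schema from the indexing section (column names such as created_at and the queryEmbedding variable are assumptions):
const filtered = await client.execute({
  sql: `
    SELECT title,
           1 - vector_distance_cos(embedding, vector32(?)) AS similarity
    FROM articles
    WHERE created_at >= '2023-01-01'   -- cheap exact filter first
    ORDER BY similarity DESC
    LIMIT 10
  `,
  args: [JSON.stringify(queryEmbedding)],
});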
Resources
- Xenova Transformers: huggingface.co/docs/transformers.js
- Sentence Transformers: sbert.net
- Transformer Models: huggingface.co/models
- Embeddings Explained: openai.com/blog/introducing-text-and-code-embeddings
- Vector Search Theory: pinecone.io/learn/vector-embeddings