Full-Text Search in Databases

search full-text postgresql mysql indexing

Full-Text Search in Databases

Full-text search is a database feature that enables fast, ranked text searching with linguistic awareness. It's 10-100x faster than LIKE queries and understands language concepts like stemming, stop words, and relevance ranking.

How Full-Text Search Works

Inverted Index

Instead of scanning documents, full-text search uses an inverted index:

Traditional table:
doc1 → "javascript async programming"
doc2 → "async functions in javascript"
doc3 → "python async programming"

Inverted index:
"javascript" → [doc1, doc2]
"async"      → [doc1, doc2, doc3]
"programming" → [doc1, doc3]
"functions"   → [doc2]
"python"      → [doc3]

Lookup Process:

-- Search for "javascript async"
1. Lookup "javascript" → [doc1, doc2]
2. Lookup "async"      → [doc1, doc2, doc3]
3. Intersection        → [doc1, doc2]
4. Rank by relevance   → doc2 (both terms in title), doc1

Lexing and Stemming

Full-text search preprocesses text:

Input: "The developer is developing a JavaScript application"

1. Tokenization: ["The", "developer", "is", "developing", "a", "JavaScript", "application"]

2. Lowercasing: ["the", "developer", "is", "developing", "a", "javascript", "application"]

3. Stop word removal: ["developer", "developing", "javascript", "application"]

4. Stemming: ["develop", "develop", "javascript", "applic"]

Index: develop(2), javascript(1), applic(1)

PostgreSQL Full-Text Search

Setup

-- Add tsvector column for indexed search
ALTER TABLE articles
ADD COLUMN search_vector tsvector;

-- Create GIN index for fast lookups
CREATE INDEX idx_search_vector
ON articles USING GIN(search_vector);

-- Populate search vector
UPDATE articles
SET search_vector =
  setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
  setweight(to_tsvector('english', coalesce(content, '')), 'B') ||
  setweight(to_tsvector('english', coalesce(tags, '')), 'C');

Basic Search

-- Simple search
SELECT title, content
FROM articles
WHERE search_vector @@ to_tsquery('english', 'javascript');

-- Multiple terms (AND)
SELECT title, content
FROM articles
WHERE search_vector @@ to_tsquery('english', 'javascript & async');

-- Multiple terms (OR)
SELECT title, content
FROM articles
WHERE search_vector @@ to_tsquery('english', 'javascript | typescript');

-- Phrase search
SELECT title, content
FROM articles
WHERE search_vector @@ phraseto_tsquery('english', 'async function');

Ranking Results

-- Rank by relevance
SELECT
  title,
  ts_rank(search_vector, query) as rank
FROM articles,
  to_tsquery('english', 'javascript & async') query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT 10;

-- Advanced ranking (prefer title matches)
SELECT
  title,
  ts_rank_cd(search_vector, query) as rank
FROM articles,
  to_tsquery('english', 'javascript') query
WHERE search_vector @@ query
ORDER BY rank DESC;

Highlighting Results

-- Show matching snippets
SELECT
  title,
  ts_headline('english', content, query, 'MaxWords=20, MinWords=10') as snippet
FROM articles,
  to_tsquery('english', 'javascript') query
WHERE search_vector @@ query;

-- Output:
-- "Learn <b>JavaScript</b> async patterns and best practices..."

Auto-Update with Triggers

-- Automatically update search_vector on insert/update
CREATE FUNCTION articles_search_vector_update() RETURNS trigger AS $$
BEGIN
  NEW.search_vector :=
    setweight(to_tsvector('english', coalesce(NEW.title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(NEW.content, '')), 'B') ||
    setweight(to_tsvector('english', coalesce(NEW.tags, '')), 'C');
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER articles_search_vector_trigger
BEFORE INSERT OR UPDATE ON articles
FOR EACH ROW
EXECUTE FUNCTION articles_search_vector_update();

MySQL Full-Text Search

Setup (InnoDB)

-- Create full-text index
CREATE FULLTEXT INDEX idx_fulltext
ON articles(title, content);

-- Or add to existing table
ALTER TABLE articles
ADD FULLTEXT INDEX idx_fulltext(title, content);

Basic Search

-- Natural language mode (default)
SELECT title, content,
  MATCH(title, content) AGAINST('javascript async') as relevance
FROM articles
WHERE MATCH(title, content) AGAINST('javascript async')
ORDER BY relevance DESC;

-- Boolean mode (AND, OR, NOT)
SELECT title, content
FROM articles
WHERE MATCH(title, content)
  AGAINST('+javascript +async' IN BOOLEAN MODE);

-- Exclude terms
SELECT title, content
FROM articles
WHERE MATCH(title, content)
  AGAINST('javascript -typescript' IN BOOLEAN MODE);

-- Phrase search
SELECT title, content
FROM articles
WHERE MATCH(title, content)
  AGAINST('"async function"' IN BOOLEAN MODE);

Relevance Tuning

-- Minimum word length (default: 4)
-- In my.cnf:
-- ft_min_word_len = 3

-- Stop words configuration
-- Custom stop words list
SET GLOBAL innodb_ft_user_stopword_table = 'mydb/my_stopwords';

SQLite Full-Text Search (FTS5)

Setup

-- Create FTS5 virtual table
CREATE VIRTUAL TABLE articles_fts USING fts5(
  title,
  content,
  tags,
  content='articles',
  content_rowid='id'
);

-- Populate from main table
INSERT INTO articles_fts(rowid, title, content, tags)
SELECT id, title, content, tags FROM articles;

-- Trigger to keep in sync
CREATE TRIGGER articles_ai AFTER INSERT ON articles BEGIN
  INSERT INTO articles_fts(rowid, title, content, tags)
  VALUES (NEW.id, NEW.title, NEW.content, NEW.tags);
END;

CREATE TRIGGER articles_au AFTER UPDATE ON articles BEGIN
  UPDATE articles_fts
  SET title = NEW.title, content = NEW.content, tags = NEW.tags
  WHERE rowid = NEW.id;
END;

CREATE TRIGGER articles_ad AFTER DELETE ON articles BEGIN
  DELETE FROM articles_fts WHERE rowid = OLD.id;
END;

Search

-- Basic search
SELECT title, content
FROM articles_fts
WHERE articles_fts MATCH 'javascript';

-- Boolean operators
SELECT title, content
FROM articles_fts
WHERE articles_fts MATCH 'javascript AND async';

-- Phrase search
SELECT title, content
FROM articles_fts
WHERE articles_fts MATCH '"async function"';

-- Column-specific search
SELECT title, content
FROM articles_fts
WHERE articles_fts MATCH 'title:javascript';

-- With ranking
SELECT
  articles.title,
  articles.content,
  bm25(articles_fts) as rank
FROM articles_fts
JOIN articles ON articles.id = articles_fts.rowid
WHERE articles_fts MATCH 'javascript'
ORDER BY rank;

Performance Comparison

Benchmark (1 million documents)

Method	Query Time	Index Size	Setup
LIKE '%term%'	2000ms	0 MB	None
PostgreSQL FTS	10ms	50 MB	GIN index
MySQL FTS	15ms	40 MB	FULLTEXT index
SQLite FTS5	12ms	45 MB	Virtual table

Indexing Time

Database	Initial Index	Incremental Update
PostgreSQL	30s (1M docs)	10ms per doc
MySQL	45s (1M docs)	15ms per doc
SQLite	25s (1M docs)	8ms per doc

Advanced Features

Fuzzy Matching (PostgreSQL)

-- Using pg_trgm extension
CREATE EXTENSION pg_trgm;

CREATE INDEX idx_trigram ON articles USING GIN (content gin_trgm_ops);

-- Similarity search
SELECT title, similarity(content, 'javascrpt') as sim
FROM articles
WHERE content % 'javascrpt'  -- Finds "javascript" despite typo
ORDER BY sim DESC;

Custom Dictionaries (PostgreSQL)

-- Create custom dictionary for tech terms
CREATE TEXT SEARCH DICTIONARY tech_dict (
  TEMPLATE = synonym,
  SYNONYMS = tech_synonyms
);

-- tech_synonyms file:
-- js → javascript
-- ts → typescript
-- py → python

-- Use in configuration
CREATE TEXT SEARCH CONFIGURATION tech_english (COPY = english);
ALTER TEXT SEARCH CONFIGURATION tech_english
  ALTER MAPPING FOR asciiword WITH tech_dict, english_stem;

Boosting Fields

-- PostgreSQL: Weight title 4x, content 1x
UPDATE articles
SET search_vector =
  setweight(to_tsvector('english', title), 'A') ||  -- Weight: 1.0
  setweight(to_tsvector('english', content), 'D');  -- Weight: 0.1

-- Ranks title matches much higher

Limitations of Full-Text Search

1. Language Understanding

-- Doesn't understand synonyms
-- Search "car" won't find "automobile", "vehicle"

-- Doesn't understand concepts
-- Search "deploy application" won't find "push to production"

2. No Semantic Understanding

-- Can't understand user intent
-- "best laptop for programming" → searches for exact words
-- Doesn't know "best" implies ranking, "for" is just a preposition

3. Exact Text Matching

-- Finds documents with the words, not the meaning
-- "python programming" finds docs with both words
-- But "learn python" might be more relevant even without "programming"

4. No Cross-Language Search

-- English stemmer doesn't help with other languages
-- Need separate configurations for each language

5. Complex Ranking

-- Ranking is based on term frequency and document length
-- Can't incorporate:
--   - User behavior (clicks, time on page)
--   - Document popularity
--   - Freshness
--   - Business logic

When to Use Full-Text Search

✅ Good Use Cases

Documentation sites: Search API references, guides
Blog search: Find articles by keywords
E-commerce: Basic product search
Internal tools: Employee directory, knowledge base
Content management: Find pages, articles

❌ When to Consider Alternatives

Natural language: "how do I deploy my app?" → Use semantic search
Typo tolerance: Heavy misspellings → Use Algolia/Elasticsearch
Multi-language: Global sites → Use dedicated search service
Complex ranking: ML-based relevance → Use Elasticsearch or semantic search
Scale: Billions of documents → Use distributed search

Full-Text vs Semantic Search

Feature	Full-Text	Semantic
Speed	10-20ms	50-100ms
Understanding	Keywords	Meaning
Typos	Limited	Better
Synonyms	Manual	Automatic
Setup	Built-in	Requires ML model
Resource usage	Low	Medium

Hybrid Approach

-- Combine both for best results
WITH fulltext_results AS (
  SELECT id, ts_rank(search_vector, query) * 2 as score
  FROM articles
  WHERE search_vector @@ to_tsquery('javascript')
),
semantic_results AS (
  SELECT id, vector_distance_cos(embedding, query_vector) as score
  FROM articles
  ORDER BY score DESC
  LIMIT 20
)
SELECT articles.*, (COALESCE(f.score, 0) + COALESCE(s.score, 0)) as final_score
FROM articles
LEFT JOIN fulltext_results f ON articles.id = f.id
LEFT JOIN semantic_results s ON articles.id = s.id
WHERE f.id IS NOT NULL OR s.id IS NOT NULL
ORDER BY final_score DESC;

Resources

PostgreSQL FTS: postgresql.org/docs/current/textsearch
MySQL FTS: dev.mysql.com/doc/refman/fulltext-search
SQLite FTS5: sqlite.org/fts5
Full-Text Search Explained: blog.crunchydata.com/postgres-full-text-search