# ADR-003: Vector Database and Semantic Search Architecture

## Status

Proposed

## Context

### Current State

ADR-001 introduced token-based keyword search with relevance ranking, which improved upon simple substring matching. However, this approach still has fundamental limitations:

1. **Lexical Matching Only**: Requires exact word matches (e.g., "automobile" won't match "car")
2. **No Semantic Understanding**: Cannot understand intent or context (e.g., "how to bake bread" won't match "bread recipe")
3. **Language Barriers**: Poor support for synonyms, related terms, or multilingual content
4. **No Cross-Content Search**: Cannot find related content across different apps (notes, files, calendar)
5. **Scaling Issues**: Performance degrades with large content collections

### User Needs

LLM-powered applications (Claude via MCP) benefit significantly from semantic search capabilities:

- **Context Discovery**: Find relevant information based on meaning, not just keywords
- **Knowledge Retrieval**: Retrieve contextually relevant notes/files for task completion
- **Cross-Referencing**: Connect related information across different content types
- **Natural Language Queries**: Support conversational search patterns

### Technical Requirements

1. **Multi-User Environment**: OAuth-based with per-user isolation and permissions
2. **Multi-Tenant**: Single deployment serving multiple users with strict data isolation
3. **Real-Time Search**: Sub-second query latency for good UX
4. **Large Content**: Support for documents, PDFs, and images with text extraction
5. **Privacy**: No external API calls for sensitive content (optionally self-hosted)
6. **Hybrid Search**: Combine semantic and keyword search for best results

## Decision

We will implement **semantic search using a vector database** with the following architecture:

### Core Components

1. **Vector Database**: Qdrant as an external sidecar service
2. **Embedding Strategy**: Configurable (OpenAI API / local models / self-hosted)
3. **Search Pattern**: Hybrid search (semantic + keyword fusion)
4. **Multi-Tenancy**: Single collection with user_id filtering
5. **Authorization**: Dual-phase (vector search + Nextcloud API verification)
6. **Sync Strategy**: Background worker with incremental updates (see ADR-002)

### Architecture Diagram

```
┌──────────────────────────────────────────────────────────────┐
│                     User Request (OAuth)                      │
│                  "find notes about baking"                    │
└───────────────────────────────┬──────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────┐
│              MCP Server (Semantic Search Tool)                │
│                                                               │
│  1. Generate query embedding                                  │
│  2. Search vector DB (user_id filter)                         │
│  3. Verify permissions via Nextcloud API                      │
│  4. Return ranked results                                     │
└───────────┬─────────────────────────────┬────────────────────┘
            │                             │
            ▼                             ▼
┌──────────────────────┐      ┌──────────────────────────────────┐
│  Embedding Service   │      │   Qdrant Vector Database         │
│  - OpenAI API        │      │                                  │
│  - Local Model       │      │  Collection: nextcloud_content   │
│  - Self-hosted       │      │  - User-filtered vectors         │
└──────────────────────┘      │  - Metadata for auth             │
                              │  - HNSW index                    │
                              └──────────────────────────────────┘
                                              ▲
                                              │ Indexing
                                              │
                              ┌───────────────┴──────────────────┐
                              │   Background Sync Worker         │
                              │   (see ADR-002 for auth)         │
                              │                                  │
                              │  1. Fetch user content           │
                              │  2. Generate embeddings          │
                              │  3. Upsert to Qdrant             │
                              │  4. Update metadata              │
                              └──────────────────────────────────┘
```
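To make the request path concrete, the following is a minimal, illustrative sketch of the four steps in the diagram; it is not the final tool implementation, and it borrows the helper names (`get_client`, `embedding_service`, `qdrant_client`, `models`, `HTTPStatusError`) from the implementation sections below.

```python
# Illustrative sketch of the four steps above (not the final tool implementation).
async def semantic_search_notes(query: str, ctx: Context, limit: int = 10) -> list[dict]:
    client = get_client(ctx)  # Nextcloud client bound to the OAuth user

    # 1. Generate the query embedding
    query_embedding = await embedding_service.embed(query)

    # 2. Vector search, scoped to the requesting user
    hits = await qdrant_client.search(
        collection_name="nextcloud_content",
        query_vector=query_embedding,
        query_filter=models.Filter(must=[
            models.FieldCondition(
                key="user_id",
                match=models.MatchValue(value=client.username),
            ),
        ]),
        limit=limit,
    )

    # 3. Re-check permissions against the Nextcloud API (dual-phase authorization)
    results = []
    for hit in hits:
        try:
            note = await client.notes.get_note(int(hit.payload["content_id"]))
            results.append({"note": note, "score": hit.score})
        except HTTPStatusError as e:
            if e.response.status_code == 403:
                continue  # stale index entry; the user lost access
            raise

    # 4. Return ranked, permission-verified results
    return results
```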
## Implementation Details

### 1. Vector Database Selection: Qdrant

After evaluating multiple options, we select **Qdrant** for the following reasons:

**Qdrant Advantages:**

- ✅ Native async Python client (`qdrant-client`)
- ✅ Efficient multi-tenancy via filtered search (no collection-per-user needed)
- ✅ Built-in hybrid search support (dense + sparse vectors)
- ✅ HNSW index with excellent performance
- ✅ Lightweight Docker deployment
- ✅ Persistent storage with snapshots
- ✅ API key authentication
- ✅ Active development and documentation

**Comparison with Alternatives:**

| Feature | Qdrant | Chroma | Weaviate | pgvector |
|---------|--------|--------|----------|----------|
| Async Python | ✅ | ⚠️ Sync | ✅ | ✅ |
| Multi-tenant filtering | ✅ | ⚠️ Limited | ✅ | ✅ |
| Hybrid search | ✅ | ❌ | ✅ | ⚠️ Manual |
| Docker deployment | ✅ Easy | ✅ Easy | ⚠️ Complex | ⚠️ Postgres |
| Memory usage | ✅ Low | ⚠️ Medium | ⚠️ High | ✅ Low |
| Maturity | ✅ Production | ⚠️ Young | ✅ Production | ✅ Mature |

**Decision**: Qdrant provides the best balance of features, performance, and ease of deployment.

### 2. Embedding Strategy: Tiered Approach

Support multiple embedding backends with automatic fallback:

```python
class EmbeddingService:
    """Unified interface for embedding generation"""

    def __init__(self):
        self.provider = self._detect_provider()

    def _detect_provider(self) -> EmbeddingProvider:
        """Auto-detect the available embedding provider"""
        # Tier 1: OpenAI API (best quality, requires API key)
        if os.getenv("OPENAI_API_KEY"):
            return OpenAIEmbedding(
                model=os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small"),
                api_key=os.getenv("OPENAI_API_KEY")
            )

        # Tier 2: Self-hosted embedding service (good quality, privacy-preserving)
        if os.getenv("EMBEDDING_SERVICE_URL"):
            return HTTPEmbedding(
                url=os.getenv("EMBEDDING_SERVICE_URL"),
                model=os.getenv("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5")
            )

        # Tier 3: Local model (fallback, CPU-only)
        logger.warning("No cloud/hosted embeddings available, using local model")
        return LocalEmbedding(
            model=os.getenv("LOCAL_EMBEDDING_MODEL", "all-MiniLM-L6-v2")
        )

    async def embed(self, text: str) -> list[float]:
        """Generate an embedding vector for text"""
        return await self.provider.embed(text)

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        """Generate embeddings for multiple texts (optimized)"""
        return await self.provider.embed_batch(texts)
```

#### 2.1 OpenAI Embeddings (Tier 1)

```python
class OpenAIEmbedding(EmbeddingProvider):
    """OpenAI embedding API"""

    def __init__(self, model: str, api_key: str):
        self.client = AsyncOpenAI(api_key=api_key)
        self.model = model
        self.dimension = 1536 if "3-small" in model else 3072  # Model-dependent

    async def embed(self, text: str) -> list[float]:
        response = await self.client.embeddings.create(
            model=self.model,
            input=text
        )
        return response.data[0].embedding

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        # OpenAI supports batches of up to 2048 inputs
        response = await self.client.embeddings.create(
            model=self.model,
            input=texts
        )
        return [item.embedding for item in response.data]
```

**Costs**: text-embedding-3-small: $0.02 per 1M tokens (~4M characters)

- 10,000 notes × 500 words avg = ~$0.10 to index
- Searches are extremely cheap (~$0.00002 per query)
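As a rough pre-flight check on these numbers, token counts (and therefore cost) can be estimated locally. A minimal sketch, assuming the `cl100k_base` tokenizer used by the `text-embedding-3` models and the hypothetical helper name `estimate_indexing_cost`:

```python
import tiktoken

PRICE_PER_MILLION_TOKENS = 0.02  # text-embedding-3-small, USD


def estimate_indexing_cost(texts: list[str]) -> float:
    """Rough one-off embedding cost for a batch of documents (sketch)."""
    encoder = tiktoken.get_encoding("cl100k_base")
    total_tokens = sum(len(encoder.encode(text)) for text in texts)
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
```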
#### 2.2 Self-Hosted Embeddings (Tier 2)

```python
class HTTPEmbedding(EmbeddingProvider):
    """Self-hosted embedding service (Infinity, TEI, Ollama)"""

    def __init__(self, url: str, model: str):
        self.client = httpx.AsyncClient()
        self.url = url
        self.model = model
        self.dimension = 384  # Model-dependent (bge-small: 384, bge-base: 768)

    async def embed(self, text: str) -> list[float]:
        response = await self.client.post(
            f"{self.url}/embeddings",
            json={"input": text, "model": self.model}
        )
        response.raise_for_status()
        return response.json()["data"][0]["embedding"]
```

**Self-Hosted Options**:

- **Infinity**: Lightweight, OpenAI-compatible API, GPU support
- **Text Embeddings Inference (TEI)**: HuggingFace official, optimized, Rust-based
- **Ollama**: Easy setup, multi-model support, CPU/GPU

#### 2.3 Local Embeddings (Tier 3)

```python
class LocalEmbedding(EmbeddingProvider):
    """Local embedding using sentence-transformers (CPU fallback)"""

    def __init__(self, model: str):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer(model)
        self.dimension = self.model.get_sentence_embedding_dimension()

    async def embed(self, text: str) -> list[float]:
        # Run in a thread pool to avoid blocking the event loop
        loop = asyncio.get_event_loop()
        embedding = await loop.run_in_executor(
            None, self.model.encode, text
        )
        return embedding.tolist()
```

**Recommended Local Models**:

- `all-MiniLM-L6-v2`: 384 dims, fast, good quality
- `all-mpnet-base-v2`: 768 dims, slower, better quality
- `paraphrase-multilingual-MiniLM-L12-v2`: Multilingual support

### 3. Vector Database Schema

```python
# Qdrant collection configuration
collection_config = {
    "collection_name": "nextcloud_content",
    "vectors_config": {
        "size": 384,          # Embedding dimension (model-dependent)
        "distance": "Cosine"  # Cosine similarity for semantic search
    },
    "optimizers_config": {
        "indexing_threshold": 10000  # Start indexing after 10k vectors
    },
    "hnsw_config": {
        "m": 16,             # Number of edges per node (balance speed/accuracy)
        "ef_construct": 100  # Quality of index construction
    }
}

# Payload schema (metadata)
payload_schema = {
    "user_id": str,            # Required: owner of content
    "content_type": str,       # "note", "file", "calendar_event"
    "content_id": str,         # Source ID (note_id, file_path, event_id)
    "title": str,              # Searchable title
    "excerpt": str,            # First 200 chars for preview
    "category": str,           # Optional: category/folder
    "mime_type": str,          # Optional: file MIME type
    "shared_with": list[str],  # Optional: list of user_ids with access
    "tags": list[str],         # Optional: user tags
    "created_at": int,         # Unix timestamp
    "modified_at": int,        # Unix timestamp
    "indexed_at": int          # Unix timestamp (for sync tracking)
}
```

#### 3.1 Multi-Tenancy via Filtering

```python
# User-specific search with filtering
search_results = await qdrant_client.search(
    collection_name="nextcloud_content",
    query_vector=query_embedding,
    query_filter=models.Filter(
        must=[
            # User owns the content OR it's shared with them
            models.Filter(
                should=[
                    models.FieldCondition(
                        key="user_id",
                        match=models.MatchValue(value=current_user_id)
                    ),
                    models.FieldCondition(
                        key="shared_with",
                        match=models.MatchAny(any=[current_user_id])
                    )
                ]
            ),
            # Optional: filter by content type
            models.FieldCondition(
                key="content_type",
                match=models.MatchValue(value="note")
            )
        ]
    ),
    limit=20,
    score_threshold=0.7  # Only return confident matches
)
```
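Putting the schema together, the following is a minimal bootstrap sketch for creating the collection and the keyword payload indexes used by the filters above. It assumes a recent `qdrant-client`; `ensure_collection` is a hypothetical helper name, not part of the existing codebase.

```python
from qdrant_client import AsyncQdrantClient, models


async def ensure_collection(qdrant: AsyncQdrantClient) -> None:
    """Create the collection and payload indexes if they don't exist yet (sketch)."""
    if not await qdrant.collection_exists("nextcloud_content"):
        await qdrant.create_collection(
            collection_name="nextcloud_content",
            vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
            hnsw_config=models.HnswConfigDiff(m=16, ef_construct=100),
            optimizers_config=models.OptimizersConfigDiff(indexing_threshold=10000),
        )

    # Keyword indexes on the fields used by the multi-tenant filters
    for field in ("user_id", "content_type", "shared_with"):
        await qdrant.create_payload_index(
            collection_name="nextcloud_content",
            field_name=field,
            field_schema=models.PayloadSchemaType.KEYWORD,
        )
```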
### 4. Hybrid Search Implementation

Combine semantic and keyword search for best results:

```python
@mcp.tool()
@require_scopes("notes:read")
async def nc_notes_hybrid_search(
    query: str,
    ctx: Context,
    limit: int = 10,
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3
) -> SearchNotesResponse:
    """
    Hybrid search combining semantic understanding with keyword precision.

    Args:
        query: Natural language search query
        limit: Maximum results to return
        semantic_weight: Weight for semantic similarity (0-1)
        keyword_weight: Weight for keyword matching (0-1)
    """
    client = get_client(ctx)
    username = client.username

    # Run searches in parallel
    semantic_task = asyncio.create_task(
        semantic_search(query, username, limit=limit * 2)
    )
    keyword_task = asyncio.create_task(
        keyword_search(query, username, limit=limit * 2)
    )
    semantic_results, keyword_results = await asyncio.gather(
        semantic_task, keyword_task
    )

    # Fusion: combine and rerank results
    fused_results = reciprocal_rank_fusion(
        semantic_results,
        keyword_results,
        semantic_weight=semantic_weight,
        keyword_weight=keyword_weight
    )

    # Verify permissions via Nextcloud API (dual-phase authorization)
    verified_results = []
    for result in fused_results[:limit * 2]:  # Get extra for filtering
        try:
            note = await client.notes.get_note(result["note_id"])
            verified_results.append({
                "note": note,
                "score": result["score"],
                "match_type": result["match_type"]  # "semantic", "keyword", "both"
            })
            if len(verified_results) >= limit:
                break
        except HTTPStatusError as e:
            if e.response.status_code == 403:
                continue  # User lost access
            raise

    return SearchNotesResponse(
        results=verified_results,
        query=query,
        total_found=len(verified_results),
        search_method="hybrid"
    )


def reciprocal_rank_fusion(
    semantic_results: list[dict],
    keyword_results: list[dict],
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3,
    k: int = 60  # RRF constant
) -> list[dict]:
    """
    Reciprocal Rank Fusion for combining search results.

    RRF is more robust than score normalization because it only
    depends on ranks, not absolute scores.
    """
    # Build rank maps
    semantic_ranks = {r["note_id"]: i for i, r in enumerate(semantic_results)}
    keyword_ranks = {r["note_id"]: i for i, r in enumerate(keyword_results)}

    # Get all unique note IDs
    all_note_ids = set(semantic_ranks.keys()) | set(keyword_ranks.keys())

    # Calculate fused scores
    fused = []
    for note_id in all_note_ids:
        # RRF formula: score = sum(weight_i / (k + rank_i))
        semantic_score = 0
        keyword_score = 0
        match_type = []

        if note_id in semantic_ranks:
            semantic_score = semantic_weight / (k + semantic_ranks[note_id])
            match_type.append("semantic")

        if note_id in keyword_ranks:
            keyword_score = keyword_weight / (k + keyword_ranks[note_id])
            match_type.append("keyword")

        fused.append({
            "note_id": note_id,
            "score": semantic_score + keyword_score,
            "match_type": "+".join(match_type)
        })

    # Sort by fused score
    fused.sort(key=lambda x: x["score"], reverse=True)
    return fused
```
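A toy illustration of the robustness claim: the two result lists below carry scores on completely different scales (hypothetical cosine similarities versus BM25-style scores), yet the fusion depends only on rank positions.

```python
# Hypothetical results on incompatible score scales
semantic = [{"note_id": 1, "score": 0.91}, {"note_id": 2, "score": 0.88}]
keyword = [{"note_id": 2, "score": 14.2}, {"note_id": 3, "score": 9.7}]

for result in reciprocal_rank_fusion(semantic, keyword):
    print(result["note_id"], round(result["score"], 4), result["match_type"])
# 2 0.0165 semantic+keyword   <- found by both searches, ranked first
# 1 0.0117 semantic
# 3 0.0049 keyword
```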
""" # Tokenize (approximate with words for simplicity) tokens = content.split() if len(tokens) <= self.chunk_size: # Document fits in single chunk return [(content, metadata)] chunks = [] start = 0 while start < len(tokens): end = start + self.chunk_size chunk_tokens = tokens[start:end] chunk_text = " ".join(chunk_tokens) # Add chunk metadata chunk_metadata = { **metadata, "chunk_index": len(chunks), "chunk_start": start, "chunk_end": end, "is_chunk": True } chunks.append((chunk_text, chunk_metadata)) # Move to next chunk with overlap start = end - self.overlap return chunks # Usage in sync worker async def index_document(doc: Document, user_id: str): """Index a document with chunking""" chunker = DocumentChunker(chunk_size=512, overlap=50) chunks = chunker.chunk_document( content=doc.content, metadata={ "user_id": user_id, "content_type": "file", "content_id": doc.path, "title": doc.title, "mime_type": doc.mime_type } ) # Generate embeddings in batch chunk_texts = [chunk[0] for chunk in chunks] embeddings = await embedding_service.embed_batch(chunk_texts) # Upsert all chunks points = [] for (chunk_text, chunk_metadata), embedding in zip(chunks, embeddings): points.append( models.PointStruct( id=str(uuid.uuid4()), vector=embedding, payload={ **chunk_metadata, "excerpt": chunk_text[:200] # Preview } ) ) await qdrant_client.upsert( collection_name="nextcloud_content", points=points ) ``` ### 6. Background Sync Worker ```python # nextcloud_mcp_server/sync/vector_indexer.py class VectorIndexer: """Indexes content into vector database""" def __init__( self, qdrant_client: AsyncQdrantClient, embedding_service: EmbeddingService, auth_provider: SyncAuthProvider # From ADR-002 ): self.qdrant = qdrant_client self.embeddings = embedding_service self.auth = auth_provider async def sync_user_notes(self, user_id: str): """Sync all notes for a user""" # Get authenticated client for user client = await self.auth.get_user_client(user_id) # Fetch all notes notes = await client.notes.list_notes() logger.info(f"Syncing {len(notes)} notes for {user_id}") # Check which notes need updating existing_ids = await self._get_indexed_note_ids(user_id) notes_to_update = [ n for n in notes if f"note_{n.id}" not in existing_ids or n.modified > existing_ids[f"note_{n.id}"] ] if not notes_to_update: logger.info(f"All notes up-to-date for {user_id}") return # Generate embeddings in batch contents = [f"{n.title}\n\n{n.content}" for n in notes_to_update] embeddings = await self.embeddings.embed_batch(contents) # Prepare points for upsert points = [] for note, embedding in zip(notes_to_update, embeddings): points.append( models.PointStruct( id=f"note_{note.id}", vector=embedding, payload={ "user_id": user_id, "content_type": "note", "content_id": str(note.id), "note_id": note.id, "title": note.title, "excerpt": note.content[:200], "category": note.category, "created_at": note.created, "modified_at": note.modified, "indexed_at": int(time.time()) } ) ) # Upsert to Qdrant await self.qdrant.upsert( collection_name="nextcloud_content", points=points ) logger.info(f"Indexed {len(points)} notes for {user_id}") async def _get_indexed_note_ids(self, user_id: str) -> dict[str, int]: """Get map of note_id -> modified_at for indexed notes""" # Query Qdrant for existing notes scroll_result = await self.qdrant.scroll( collection_name="nextcloud_content", scroll_filter=models.Filter( must=[ models.FieldCondition( key="user_id", match=models.MatchValue(value=user_id) ), models.FieldCondition( key="content_type", 
### 6. Background Sync Worker

```python
# nextcloud_mcp_server/sync/vector_indexer.py

class VectorIndexer:
    """Indexes content into the vector database"""

    def __init__(
        self,
        qdrant_client: AsyncQdrantClient,
        embedding_service: EmbeddingService,
        auth_provider: SyncAuthProvider  # From ADR-002
    ):
        self.qdrant = qdrant_client
        self.embeddings = embedding_service
        self.auth = auth_provider

    async def sync_user_notes(self, user_id: str):
        """Sync all notes for a user"""
        # Get an authenticated client for the user
        client = await self.auth.get_user_client(user_id)

        # Fetch all notes
        notes = await client.notes.list_notes()
        logger.info(f"Syncing {len(notes)} notes for {user_id}")

        # Check which notes need updating (keys are content_id strings)
        existing_ids = await self._get_indexed_note_ids(user_id)
        notes_to_update = [
            n for n in notes
            if str(n.id) not in existing_ids
            or n.modified > existing_ids[str(n.id)]
        ]

        if not notes_to_update:
            logger.info(f"All notes up-to-date for {user_id}")
            return

        # Generate embeddings in batch
        contents = [f"{n.title}\n\n{n.content}" for n in notes_to_update]
        embeddings = await self.embeddings.embed_batch(contents)

        # Prepare points for upsert
        points = []
        for note, embedding in zip(notes_to_update, embeddings):
            points.append(
                models.PointStruct(
                    # Qdrant point IDs must be integers or UUIDs; derive a stable
                    # UUID so re-syncs overwrite the same point
                    id=str(uuid.uuid5(uuid.NAMESPACE_URL, f"note-{user_id}-{note.id}")),
                    vector=embedding,
                    payload={
                        "user_id": user_id,
                        "content_type": "note",
                        "content_id": str(note.id),
                        "note_id": note.id,
                        "title": note.title,
                        "excerpt": note.content[:200],
                        "category": note.category,
                        "created_at": note.created,
                        "modified_at": note.modified,
                        "indexed_at": int(time.time())
                    }
                )
            )

        # Upsert to Qdrant
        await self.qdrant.upsert(
            collection_name="nextcloud_content",
            points=points
        )
        logger.info(f"Indexed {len(points)} notes for {user_id}")

    async def _get_indexed_note_ids(self, user_id: str) -> dict[str, int]:
        """Get a map of content_id -> modified_at for indexed notes"""
        # Query Qdrant for existing notes (scroll returns points + next offset)
        points, _next_offset = await self.qdrant.scroll(
            collection_name="nextcloud_content",
            scroll_filter=models.Filter(
                must=[
                    models.FieldCondition(
                        key="user_id",
                        match=models.MatchValue(value=user_id)
                    ),
                    models.FieldCondition(
                        key="content_type",
                        match=models.MatchValue(value="note")
                    )
                ]
            ),
            with_payload=["content_id", "modified_at"],
            limit=10000
        )

        return {
            point.payload["content_id"]: point.payload["modified_at"]
            for point in points
        }

    async def delete_note(self, user_id: str, note_id: int):
        """Remove a deleted note from the index"""
        await self.qdrant.delete(
            collection_name="nextcloud_content",
            points_selector=models.FilterSelector(
                filter=models.Filter(
                    must=[
                        models.FieldCondition(
                            key="user_id",
                            match=models.MatchValue(value=user_id)
                        ),
                        models.FieldCondition(
                            key="note_id",
                            match=models.MatchValue(value=note_id)
                        )
                    ]
                )
            )
        )
```

### 7. Configuration

#### 7.1 Environment Variables

```bash
# Vector Database
QDRANT_URL=http://qdrant:6333
QDRANT_API_KEY=<secure-api-key>
QDRANT_COLLECTION=nextcloud_content

# Embedding Strategy (choose one)

# Option 1: OpenAI
OPENAI_API_KEY=sk-...
OPENAI_EMBEDDING_MODEL=text-embedding-3-small  # or text-embedding-3-large

# Option 2: Self-hosted
EMBEDDING_SERVICE_URL=http://embeddings:7997
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5

# Option 3: Local (fallback, no config needed)

# Search Configuration
SEMANTIC_SEARCH_ENABLED=true
HYBRID_SEARCH_DEFAULT_SEMANTIC_WEIGHT=0.7
HYBRID_SEARCH_DEFAULT_KEYWORD_WEIGHT=0.3
SEARCH_SCORE_THRESHOLD=0.7

# Sync Configuration
VECTOR_SYNC_INTERVAL=300  # seconds
VECTOR_SYNC_BATCH_SIZE=100
```

#### 7.2 Docker Compose

```yaml
services:
  # Vector Database
  qdrant:
    image: qdrant/qdrant:latest
    restart: always
    ports:
      - "127.0.0.1:6333:6333"  # REST API
      - "127.0.0.1:6334:6334"  # gRPC
    volumes:
      - qdrant_storage:/qdrant/storage
    environment:
      - QDRANT__SERVICE__API_KEY=${QDRANT_API_KEY}
      - QDRANT__SERVICE__HTTP_PORT=6333
      - QDRANT__SERVICE__GRPC_PORT=6334

  # Embedding Service (optional - for self-hosted)
  embeddings:
    image: michaelf34/infinity:latest
    restart: always
    ports:
      - "127.0.0.1:7997:7997"
    volumes:
      - embedding_models:/app/.cache
    environment:
      - MODEL_ID=BAAI/bge-small-en-v1.5
      - BATCH_SIZE=32
      - ENGINE=torch  # or optimum for better CPU performance
    # Optional: GPU support
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # MCP Server with vector search
  mcp:
    build: .
    command: ["--transport", "streamable-http"]
    depends_on:
      - app
      - qdrant
      - embeddings  # optional
    environment:
      # ... existing env vars ...
      - SEMANTIC_SEARCH_ENABLED=true
      - QDRANT_URL=http://qdrant:6333
      - QDRANT_API_KEY=${QDRANT_API_KEY}
      # Choose embedding strategy
      - EMBEDDING_SERVICE_URL=http://embeddings:7997
      # OR
      # - OPENAI_API_KEY=${OPENAI_API_KEY}

  # Vector Sync Worker
  mcp-vector-sync:
    build: .
    command: ["python", "-m", "nextcloud_mcp_server.sync.vector_indexer"]
    depends_on:
      - app
      - qdrant
      - embeddings  # optional
    environment:
      # Nextcloud + Auth (from ADR-002)
      - NEXTCLOUD_HOST=http://app:80
      - ENABLE_OFFLINE_ACCESS=true
      - TOKEN_ENCRYPTION_KEY=${TOKEN_ENCRYPTION_KEY}
      # Vector Database
      - QDRANT_URL=http://qdrant:6333
      - QDRANT_API_KEY=${QDRANT_API_KEY}
      # Embeddings
      - EMBEDDING_SERVICE_URL=http://embeddings:7997
    volumes:
      - sync-tokens:/app/data

volumes:
  qdrant_storage:
  embedding_models:
  sync-tokens:
```
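A sketch of how the server side might wire these variables together at startup; `build_search_backend` is a hypothetical helper name, and `EmbeddingService` is the tiered provider from §2.

```python
import os

from qdrant_client import AsyncQdrantClient


def build_search_backend() -> tuple[AsyncQdrantClient, EmbeddingService]:
    """Construct the vector-search dependencies from environment variables (sketch)."""
    qdrant = AsyncQdrantClient(
        url=os.environ["QDRANT_URL"],          # e.g. http://qdrant:6333
        api_key=os.getenv("QDRANT_API_KEY"),   # optional in local development
    )
    embeddings = EmbeddingService()  # auto-detects OpenAI / self-hosted / local tier
    return qdrant, embeddings
```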
### 8. Performance Optimization

#### 8.1 Indexing Performance

```python
# Batch embedding generation
async def embed_batch_chunked(
    texts: list[str],
    batch_size: int = 100
) -> list[list[float]]:
    """Generate embeddings in chunks to avoid memory issues"""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        batch_embeddings = await embedding_service.embed_batch(batch)
        embeddings.extend(batch_embeddings)
        await asyncio.sleep(0.1)  # Rate limiting
    return embeddings


# Parallel upsert with batching
async def upsert_points_batched(
    points: list[models.PointStruct],
    batch_size: int = 100
):
    """Upsert points in batches"""
    for i in range(0, len(points), batch_size):
        batch = points[i:i + batch_size]
        await qdrant_client.upsert(
            collection_name="nextcloud_content",
            points=batch,
            wait=False  # Don't wait for indexing
        )
```

#### 8.2 Search Performance

```python
# Search with tuned HNSW parameters for better accuracy
search_results = await qdrant_client.search(
    collection_name="nextcloud_content",
    query_vector=query_embedding,
    query_filter=user_filter,
    limit=20,
    with_payload=True,
    with_vectors=False,  # Don't return vectors (saves bandwidth)
    search_params=models.SearchParams(
        hnsw_ef=128,  # Higher = more accurate but slower
        exact=False   # Use the HNSW index
    )
)
```

#### 8.3 Caching

```python
# Cache embeddings for common queries
import hashlib
import json


def cache_key(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()


async def embed_with_cache(text: str) -> list[float]:
    """Generate an embedding with Redis caching"""
    key = cache_key(text)

    # Check the Redis cache
    cached = await redis.get(f"embedding:{key}")
    if cached:
        return json.loads(cached)

    # Generate the embedding
    embedding = await embedding_service.embed(text)

    # Cache for 1 hour
    await redis.setex(
        f"embedding:{key}",
        3600,
        json.dumps(embedding)
    )
    return embedding
```

### 9. Monitoring and Metrics

```python
# Prometheus metrics
from prometheus_client import Counter, Histogram, Gauge

# Search metrics
semantic_search_count = Counter(
    'semantic_search_total',
    'Total semantic searches',
    ['user_id', 'content_type']
)

semantic_search_latency = Histogram(
    'semantic_search_duration_seconds',
    'Semantic search latency',
    ['phase']  # 'embedding', 'vector_search', 'verification'
)

# Indexing metrics
documents_indexed = Counter(
    'documents_indexed_total',
    'Total documents indexed',
    ['user_id', 'content_type']
)

index_queue_size = Gauge(
    'index_queue_size',
    'Number of documents waiting to be indexed'
)

# Usage
async def semantic_search(query: str, user_id: str):
    semantic_search_count.labels(user_id=user_id, content_type='note').inc()

    with semantic_search_latency.labels(phase='embedding').time():
        embedding = await embed(query)

    with semantic_search_latency.labels(phase='vector_search').time():
        results = await qdrant.search(...)

    with semantic_search_latency.labels(phase='verification').time():
        verified = await verify_access(results)

    return verified
```

## Consequences

### Benefits

1. **Semantic Understanding**
   - Find content by meaning, not just keywords
   - Support for natural language queries
   - Cross-lingual search potential
   - Better context discovery for LLMs

2. **User Experience**
   - More relevant search results
   - Discover related content across apps
   - Fast sub-second query latency
   - Hybrid search combines the best of both worlds

3. **Architecture**
   - External sidecar (doesn't bloat the MCP server)
   - Configurable embedding backend (cloud/self-hosted/local)
   - Multi-tenant with strict isolation
   - Scales horizontally (Qdrant cluster)

4. **Privacy & Security**
   - Self-hosted option available
   - Dual-phase authorization enforces permissions
   - Vector DB is a cache, not the source of truth
   - Per-user audit trail

5. **Developer Experience**
   - Simple async Python API
   - Comprehensive monitoring
   - Clear upgrade path (better embeddings, reranking)

### Limitations

1. **Complexity**
   - Additional infrastructure (Qdrant + embeddings)
   - More monitoring required
   - Embedding generation latency
   - Initial indexing time for large collections

2. **Cost**
   - Storage: ~4KB per document (embedding + metadata)
   - Compute: embedding generation (API costs or GPU)
   - Memory: Qdrant keeps vectors in RAM for speed

3. **Operational**
   - Index maintenance and updates
   - Embedding model versioning
   - Handling deleted/moved content
   - Cold-start indexing for new users

4. **Search Accuracy**
   - Quality depends on the embedding model
   - May miss exact keyword matches (mitigated by hybrid search)
   - Cultural/domain-specific terms may not embed well
   - Requires tuning score thresholds

### Performance Characteristics

| Metric | Target | Notes |
|--------|--------|-------|
| Search latency | <200ms | Embedding + vector search + verification |
| Indexing throughput | >100 docs/sec | With batch embeddings |
| Memory per 10k docs | ~40MB | Qdrant vectors + metadata |
| Disk per 10k docs | ~40MB | Persistent storage |
| Search accuracy | >90% | At score_threshold=0.7 |

### Cost Estimates

**Small Deployment** (10 users, 1,000 notes each):

- Initial indexing: 10,000 notes × $0.00002 = $0.20 (OpenAI)
- Monthly searches: 1,000 queries × $0.00002 = $0.02
- Infrastructure: Qdrant (~40MB RAM), embeddings (optional)
- **Total**: ~$0.25/month (API) or self-hosted (negligible)

**Medium Deployment** (100 users, 500 notes each):

- Initial indexing: 50,000 notes × $0.00002 = $1.00
- Monthly searches: 10,000 queries × $0.00002 = $0.20
- Infrastructure: Qdrant (~200MB RAM)
- **Total**: ~$1.20/month or self-hosted

**Self-Hosted** (any size):

- GPU instance: ~$0.50/hour (~$360/month for 24/7)
- Or CPU-only: negligible cost, slower embeddings

### Future Enhancements

1. **Multimodal Search**
   - Image embeddings (CLIP)
   - PDF/document layout understanding
   - Audio transcription + embedding

2. **Advanced Ranking**
   - Cross-encoder reranking
   - Learning-to-rank models
   - User feedback signals

3. **Query Understanding**
   - Query expansion
   - Spell correction
   - Entity extraction

4. **Performance**
   - Query result caching
   - Approximate nearest neighbor improvements
   - Quantization for reduced memory

5. **Features**
   - Saved searches
   - Search analytics
   - Recommended content

## Alternatives Considered

### Alternative 1: Elasticsearch/OpenSearch

**Approach**: Use a traditional full-text search engine with a vector plugin

**Pros**:
- Mature ecosystem
- Excellent keyword search
- Rich query DSL

**Cons**:
- Heavy infrastructure (JVM-based)
- Complex setup and tuning
- Vector search is a plugin/add-on (not native)
- Higher resource usage

**Decision**: Rejected; Qdrant is purpose-built for vectors

### Alternative 2: ChromaDB

**Approach**: Embedded or client-server vector database

**Pros**:
- Simple Python API
- Easy to get started
- Good for prototyping

**Cons**:
- Sync-only Python client (no async)
- Limited multi-tenancy features
- Less mature than Qdrant
- Scaling concerns

**Decision**: Rejected; async and multi-tenancy are critical

### Alternative 3: Weaviate

**Approach**: Full-featured vector database with GraphQL

**Pros**:
- Very feature-rich
- Built-in vectorization
- Good documentation

**Cons**:
- More complex architecture
- Higher resource usage
- GraphQL adds complexity
- Overkill for our use case

**Decision**: Rejected; Qdrant provides a better balance

### Alternative 4: pgvector (PostgreSQL Extension)

**Approach**: Add vector search to an existing PostgreSQL instance

**Pros**:
- Leverages existing PostgreSQL expertise
- Transactional consistency
- Mature database ecosystem

**Cons**:
- This deployment uses MariaDB (would need PostgreSQL)
- Performance not as optimized as a purpose-built vector DB
- Manual hybrid search implementation
- HNSW index limitations

**Decision**: Rejected; a dedicated vector DB is a better fit

### Alternative 5: Pinecone / Vertex AI Vector Search

**Approach**: Managed cloud vector database

**Pros**:
- Fully managed
- Excellent performance
- No infrastructure management

**Cons**:
- Cloud-only (no self-hosting)
- Recurring costs
- Vendor lock-in
- Data leaves premises

**Decision**: Rejected; self-hosting is important for privacy

## Related Decisions

- ADR-001: Enhanced Note Search (establishes the need for better search)
- ADR-002: Vector Sync Authentication (defines how sync workers authenticate)
- [Future] ADR-004: Content Extraction and Document Processing
- [Future] ADR-005: Cross-App Semantic Search

## References

- [Qdrant Documentation](https://qdrant.tech/documentation/)
- [Sentence Transformers](https://www.sbert.net/)
- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
- [Hybrid Search with RRF](https://qdrant.tech/articles/hybrid-search/)
- [HNSW Algorithm](https://arxiv.org/abs/1603.09320)
- [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf)
