How is vector search different from Elasticsearch full-text search?

Elasticsearch uses BM25, a probabilistic keyword relevance algorithm: it scores documents based on term frequency and inverse document frequency. It is excellent for exact keyword matches and scales to billions of documents, but it cannot understand semantic similarity — a query for "tensile strength" will not retrieve documents that only use "ultimate stress" or "breaking load." Vector search encodes meaning geometrically: semantically similar text is close in vector space regardless of exact wording. Modern production systems combine both (Elasticsearch 8.x supports kNN vector fields natively), getting the semantic power of embeddings with the exact-match reliability of BM25.
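To make the keyword limitation concrete, here is a minimal BM25 sketch in plain Python. It is not Elasticsearch's implementation: the whitespace tokenizer, the two sample sentences, and the function names are assumptions for illustration, and the parameters k1 = 1.2 and b = 0.75 are the commonly used defaults.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive lowercase whitespace tokenizer; real analyzers also stem and strip punctuation.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Illustrative documents: the second uses synonyms only.
docs = [
    "tensile strength of the bolt was verified by test",
    "ultimate stress and breaking load were recorded for each specimen",
]
scores = bm25_scores("tensile strength", docs)
```

The first document gets a positive score, while the second scores exactly zero because it shares no term with the query, which is the gap that embeddings are meant to close.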

How many documents can a vector database handle, and what does it cost?

Pinecone's serverless tier handles collections up to 10 billion vectors. A typical engineering firm with 100,000 documents averaging 20 pages generates roughly 4–8 million chunks, well within the capacity of any production vector database. Cost estimate: embedding 5 million chunks of about 400 tokens each (roughly 2 billion tokens) with text-embedding-3-small is a one-time cost of about $40 at OpenAI's list price of $0.02 per million tokens. Storage in Pinecone serverless costs approximately $0.50–$2/month for a 5M-vector collection. Query costs are $4 per million queries. For a firm searching 10,000 times per month, total monthly cost is well under $50; the main cost driver is engineering time to build the ingestion pipeline.
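The chunk-count estimate above can be reproduced with a few lines of arithmetic. This sketch assumes 2 to 4 chunks per page (the ratio implied by the 4–8 million figure), 1536-dimensional float32 vectors (the default output size of text-embedding-3-small), and ignores index and metadata overhead. The function name estimate_corpus is illustrative.

```python
def estimate_corpus(num_docs: int, pages_per_doc: int, chunks_per_page: float,
                    dims: int, bytes_per_value: int = 4) -> dict:
    """Rough sizing: chunk count and raw vector payload, excluding index overhead."""
    chunks = int(num_docs * pages_per_doc * chunks_per_page)
    raw_bytes = chunks * dims * bytes_per_value
    return {"chunks": chunks, "raw_vector_gb": raw_bytes / 1e9}


# Assumed bounds: 2 and 4 chunks per page for 100,000 documents of 20 pages each.
low = estimate_corpus(100_000, 20, 2, dims=1536)
high = estimate_corpus(100_000, 20, 4, dims=1536)
print(low["chunks"], high["chunks"])  # 4000000 8000000
```

Swapping in your own page counts, chunking ratio, and embedding dimension gives a first-order figure to compare against a provider's current pricing page.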

Can vector search work with scanned PDFs and legacy engineering documents?

Yes, with an OCR preprocessing step. Scanned PDFs need optical character recognition before text can be extracted and embedded. Azure Document Intelligence (cloud, $1.50 per 1,000 pages), AWS Textract, and Google Document AI are the most accurate commercial options for complex engineering layouts with tables, diagrams, and handwritten annotations. Open-source Tesseract 5 is free but less accurate on engineering drawings and degraded scans. OCR accuracy typically drops for hand-drawn dimensions, stamp text, and multi-column table layouts — plan for manual correction for high-priority documents.

Should we self-host our vector database or use a managed cloud service?

For most engineering firms, a managed service (Pinecone, Weaviate Cloud, Qdrant Cloud) is the right choice: no infrastructure maintenance, automatic scaling, and SLA-backed uptime. Self-hosting Qdrant or Weaviate via Docker or Kubernetes is appropriate when: (1) data residency requirements prohibit cloud storage of project documents; (2) you have high query volume that makes cloud pricing unfavorable at scale; (3) your firm has an existing Kubernetes environment and DevOps capability. pgvector in a self-managed PostgreSQL database is the simplest self-hosted option for smaller corpora and teams already managing PostgreSQL.

How do we keep the vector index current as documents are revised?

Vector databases support upsert operations: when a document is revised, delete old chunks by document ID metadata filter and insert new chunks. This is typically automated in the document ingestion pipeline: (1) a file watcher or SharePoint webhook triggers on document change events; (2) the pipeline re-processes the changed document; (3) old vectors are deleted and new vectors are inserted. Revision metadata (revision number, issue date) stored in chunk metadata allows the retrieval system to prefer the latest revision. For archived documents, you may want to retain old revisions for historical queries — filter by is_current_revision: true in production queries and allow explicit revision queries for historical research.
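The delete-then-insert flow can be sketched with a small in-memory stand-in for a vector database. Everything here is hypothetical (the Chunk and InMemoryIndex classes, the replace_document helper, the chunk ID format); a real system would call its database client's delete-by-filter and upsert operations instead.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str          # assumed format: "<doc_id>:r<revision>:<n>"
    doc_id: str
    revision: int
    vector: list[float]
    is_current_revision: bool = True


@dataclass
class InMemoryIndex:
    """Hypothetical stand-in for a vector database collection."""
    chunks: dict[str, Chunk] = field(default_factory=dict)

    def delete_by_doc_id(self, doc_id: str) -> int:
        stale = [cid for cid, c in self.chunks.items() if c.doc_id == doc_id]
        for cid in stale:
            del self.chunks[cid]
        return len(stale)

    def upsert(self, new_chunks: list[Chunk]) -> None:
        for c in new_chunks:
            self.chunks[c.chunk_id] = c


def replace_document(index: InMemoryIndex, doc_id: str,
                     new_chunks: list[Chunk], keep_history: bool = False) -> None:
    if keep_history:
        # Retain old revisions but flag them so production queries can filter them out.
        for c in index.chunks.values():
            if c.doc_id == doc_id:
                c.is_current_revision = False
    else:
        index.delete_by_doc_id(doc_id)
    index.upsert(new_chunks)
```

Including the revision in the chunk ID is a design choice that keeps a new revision from silently overwriting retained history when keep_history is enabled.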

AI & Automation · 9 min read · September 1, 2025

🤖 Vector Databases and Semantic Search for Engineering Document Retrieval

A deep dive into vector databases: how they work, which platforms to choose, and how to build semantic search over engineering specs, standards, and project archives that finds what keyword search misses.

Why Keyword Search Fails Engineering Document Libraries

Every engineering firm has a document management problem: decades of project files, specifications, calculation packages, and standards, all nominally searchable, none practically findable. Traditional keyword search (exact string matching or BM25-style term-frequency scoring) fails because engineering documents use inconsistent terminology. "High-strength bolt," "structural bolt," "A325 bolt," and "F3125 Grade A325 bolt" can all refer to the same fastener, yet a keyword search for one phrase can miss or bury documents that use the others. An engineer spends an hour hunting before giving up and redoing the work from scratch.

Semantic search using vector embeddings solves this: queries and documents are encoded into dense numerical vectors in a high-dimensional space where semantic similarity corresponds to geometric proximity. A query for "tensile bolt capacity" retrieves documents about "bolt strength in tension" even though the two phrases share only the word "bolt." This single capability transforms document management from a filing problem into a knowledge retrieval asset.

How Vector Embeddings Work

An embedding model (a transformer neural network) converts text into a vector of floating-point numbers, typically 768 to 4096 dimensions. The model is trained so that semantically similar texts produce vectors that are close together in this high-dimensional space, measured by cosine similarity or dot product.
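Cosine similarity itself is simple to compute. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings have hundreds or thousands of dimensions and come from a model, not from hand-picked numbers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, not model output).
query_vec = [0.9, 0.1, 0.3]
related_vec = [0.8, 0.2, 0.4]    # points in nearly the same direction as the query
unrelated_vec = [0.1, 0.9, 0.0]  # points in a different direction

sim_related = cosine_similarity(query_vec, related_vec)
sim_unrelated = cosine_similarity(query_vec, unrelated_vec)
```

When vectors are normalized to unit length, the dot product and cosine similarity give the same ranking, which is why many databases let you choose either metric.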

For engineering documents, the most effective embedding models are:

text-embedding-3-large (OpenAI, 3072-dim): best overall quality for English-language engineering text; $0.13 per million tokens.
bge-large-en-v1.5 (Beijing Academy of Artificial Intelligence, open source, 1024-dim): a strong open-source performer on the MTEB benchmark; runs on a single GPU or CPU.
e5-mistral-7b-instruct (Microsoft, open source, 4096-dim): strong on long-document retrieval; excellent for multi-page specification sections.
Cohere Embed v3 (Cohere): competitive with OpenAI, with multilingual support through the embed-multilingual-v3.0 model; useful for international projects with mixed-language documents.

Comparing Vector Database Platforms

The market has consolidated around several production-ready options:

Pinecone: fully managed, serverless vector database. Fastest time-to-deployment; no infrastructure management. Supports metadata filtering, hybrid search (BM25 + dense), and 10B+ vector scale. Best for firms that want operational simplicity. Cost: ~$0.096/GB/month for storage + query pricing.
Weaviate: open-source, self-hostable or cloud. Strong multi-modal support (text + image vectors in the same index). GraphQL and REST APIs. Well-suited for engineering firms with mixed document and image corpora (drawings + reports). BM25 hybrid search built in.
Qdrant: open-source, Rust-based (high performance). Excellent filtering efficiency: it can filter by project ID, document type, and revision date at query time with little performance penalty. gRPC API for low-latency applications.
pgvector: PostgreSQL extension adding vector column types and ANN index (HNSW or IVFFlat). Best choice if your document metadata is already in PostgreSQL — single database for relational and vector data, no new infrastructure.
Chroma: lightweight, open-source, embeds in Python. Ideal for prototyping and small corpora (under 1M documents). Not production-hardened at scale.
Milvus: high-performance distributed vector database, open source. Handles billion-scale collections; used by large enterprises with massive document archives.

Hybrid Search: The Production Standard

Pure semantic search (dense retrieval only) outperforms keyword search for conceptual queries but can miss exact technical terms where exact match matters: standard and section designations (ASTM A36, AISC W18x97), drawing numbers, and specific values. Production engineering document search systems use hybrid search, combining BM25 keyword relevance with dense vector similarity scores using Reciprocal Rank Fusion (RRF) or a learned linear combination.

Hybrid search consistently outperforms either method alone on engineering document benchmarks. Pinecone, Weaviate, and Elasticsearch (with the dense_vector field type and knn search) all support hybrid retrieval natively. For custom implementations, FAISS (Facebook AI Similarity Search) can be paired with a separate BM25 index, with the two result lists merged by a fusion or re-ranking step.
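Reciprocal Rank Fusion needs only the rank positions from each retriever, not their raw scores. A minimal sketch, assuming two already-ranked lists of hypothetical document IDs and the commonly used constant k = 60:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each appearance at rank r adds 1 / (k + r) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results from a BM25 query and a dense vector query.
bm25_ranking = ["spec-A36", "calc-017", "rpt-204"]
dense_ranking = ["calc-017", "rpt-311", "spec-A36"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print([doc_id for doc_id, _ in fused])
# ['calc-017', 'spec-A36', 'rpt-311', 'rpt-204']
```

Documents that appear in both lists rise to the top, and because only ranks are used there is no need to normalize BM25 scores against cosine similarities.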

Metadata Filtering: Essential for Engineering Document Management

Engineering documents require strict access control and scoped retrieval. A query about "compressive strength specification" should only return documents from the relevant project and discipline. Vector databases support metadata pre-filtering or post-filtering:

Pre-filtering (recommended): restrict the candidate set by metadata (project ID, document type, revision status, discipline) before ANN search. Qdrant and Weaviate implement this efficiently with payload indexes.
Post-filtering: run ANN search across all vectors, then discard non-matching metadata. Simpler to implement but wastes compute and can return fewer-than-k results when many candidates are filtered out.
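The difference between the two strategies shows up even in a toy example. The sketch below substitutes a brute-force dot-product search for a real ANN index, and the item dictionaries, project IDs, and function names are invented for illustration.

```python
from typing import Callable


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: list[float], items: list[dict], k: int) -> list[dict]:
    # Brute-force nearest neighbours; a real system would use an ANN index here.
    return sorted(items, key=lambda it: dot(query, it["vector"]), reverse=True)[:k]


def pre_filter_search(query: list[float], items: list[dict],
                      predicate: Callable[[dict], bool], k: int) -> list[dict]:
    # Restrict the candidate set first, then search within it.
    return top_k(query, [it for it in items if predicate(it)], k)


def post_filter_search(query: list[float], items: list[dict],
                       predicate: Callable[[dict], bool], k: int) -> list[dict]:
    # Search everything, then discard results that fail the filter.
    return [it for it in top_k(query, items, k) if predicate(it)]


def in_project_p100(item: dict) -> bool:
    return item["project"] == "P-100"


items = [
    {"id": "a", "project": "P-100", "vector": [0.9, 0.1]},
    {"id": "b", "project": "P-200", "vector": [0.8, 0.2]},
    {"id": "c", "project": "P-200", "vector": [0.7, 0.3]},
    {"id": "d", "project": "P-100", "vector": [0.1, 0.9]},
]
query = [1.0, 0.0]

pre = pre_filter_search(query, items, in_project_p100, k=2)
post = post_filter_search(query, items, in_project_p100, k=2)
print([it["id"] for it in pre], [it["id"] for it in post])  # ['a', 'd'] ['a']
```

Post-filtering returns a single result for k = 2 because one of the two nearest neighbours belongs to another project, which is the fewer-than-k behaviour described above.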

Metadata schema for an engineering document system should include: project number, document type (spec, drawing, calculation, report), discipline (structural, electrical, mechanical, civil), revision number, issue date, author, and access control group.
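One way to keep that schema consistent across the ingestion pipeline is to define it once as a typed record. This is a sketch only: the class and field names are assumptions, and the enum values simply mirror the lists in the paragraph above.

```python
from dataclasses import dataclass, asdict
from datetime import date
from enum import Enum


class DocumentType(str, Enum):
    SPEC = "spec"
    DRAWING = "drawing"
    CALCULATION = "calculation"
    REPORT = "report"


class Discipline(str, Enum):
    STRUCTURAL = "structural"
    ELECTRICAL = "electrical"
    MECHANICAL = "mechanical"
    CIVIL = "civil"


@dataclass(frozen=True)
class ChunkMetadata:
    project_number: str
    document_type: DocumentType
    discipline: Discipline
    revision_number: str
    issue_date: date
    author: str
    access_control_group: str

    def to_payload(self) -> dict:
        """Flatten to plain strings so the record can be stored as filterable metadata."""
        payload = asdict(self)
        payload["document_type"] = self.document_type.value
        payload["discipline"] = self.discipline.value
        payload["issue_date"] = self.issue_date.isoformat()
        return payload


# Hypothetical example record.
meta = ChunkMetadata("2024-017", DocumentType.SPEC, Discipline.STRUCTURAL,
                     "C", date(2025, 3, 14), "J. Doe", "structural-team")
payload = meta.to_payload()
```

Storing the issue date as an ISO 8601 string keeps it sortable as text, though databases with native date or numeric range filters may prefer a timestamp.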

Building a Semantic Search MVP for Your Engineering Firm

A minimal viable product can be built in a weekend using open-source tools:

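Ingest: extract text from PDFs with PyMuPDF (fitz); chunk into 400-token segments with 50-token overlap using LangChain's RecursiveCharacterTextSplitter. By default this splitter measures length in characters, so construct it with from_tiktoken_encoder or from_huggingface_tokenizer to count tokens.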
Embed: batch-embed chunks using bge-large-en-v1.5 via the sentence-transformers library. Free, runs locally.
Index: insert embeddings + metadata into Chroma (local) or Qdrant (self-hosted via Docker).
Query interface: build a simple Streamlit or Gradio web app: text box → embed query → ANN search → return top-5 chunks with document source and page number.
Upgrade path: swap Chroma for Pinecone as the corpus grows; add an LLM answer-generation layer (RAG) once retrieval quality is validated.
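The chunking in the ingest step is a sliding window with overlap. The sketch below is a library-free illustration, not LangChain's splitter: it treats whitespace-separated words as a stand-in for model tokens, and the function name chunk_tokens is an assumption.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 400, overlap: int = 50) -> list[list[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Stand-in for extracted page text: 1,000 placeholder "tokens".
tokens = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [400, 400, 300]
```

The last 50 tokens of each chunk repeat as the first 50 of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.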

Re-ranking for Higher Precision

Initial vector retrieval returns top-k candidates quickly but is not perfectly ranked. A re-ranking step using a cross-encoder model (which jointly encodes query + candidate for higher accuracy) improves precision significantly:

Retrieve top-20 candidates with ANN search (fast)
Re-rank with cross-encoder/ms-marco-MiniLM-L-6-v2 or Cohere Rerank API (more accurate)
Return top-5 re-ranked results to the user
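These three steps can be sketched as a single function. To stay runnable without model downloads, the example uses a word-overlap scorer as a hypothetical stand-in for a cross-encoder; in practice score_pair would call a real re-ranking model, and the candidate texts here are invented.

```python
from typing import Callable


def retrieve_then_rerank(query: str, ann_ranked: list[str],
                         score_pair: Callable[[str, str], float],
                         retrieve_k: int = 20, final_k: int = 5) -> list[str]:
    # Stage 1: keep the top candidates from the (already ranked) ANN results.
    shortlist = ann_ranked[:retrieve_k]
    # Stage 2: rescore each (query, candidate) pair jointly and sort by the new score.
    rescored = sorted(shortlist, key=lambda text: score_pair(query, text), reverse=True)
    return rescored[:final_k]


def word_overlap_score(query: str, text: str) -> float:
    # Hypothetical stand-in for a cross-encoder: fraction of query words found in the text.
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


ann_results = [
    "bolt pretension procedure",
    "anchor bolt tensile capacity table",
    "weld inspection report",
    "tensile capacity of anchor rods",
]
top = retrieve_then_rerank("anchor bolt tensile capacity", ann_results,
                           word_overlap_score, final_k=2)
print(top)
# ['anchor bolt tensile capacity table', 'tensile capacity of anchor rods']
```

Keeping the scorer as a plain callable makes it easy to swap in a local cross-encoder or a hosted re-ranking API without touching the pipeline.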

Re-ranking adds 100–300 ms latency but typically improves Mean Reciprocal Rank (MRR) by 15–25% on engineering document benchmarks. Cohere Rerank and Jina Reranker v2 are the leading commercial options.
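Mean Reciprocal Rank is straightforward to measure on your own labelled queries, which is the most reliable way to check whether re-ranking helps on your corpus. A minimal sketch follows; the document IDs and relevance labels are fabricated for illustration and are not benchmark results.

```python
def mean_reciprocal_rank(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant result per query (0 if none is found)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)


# Illustrative data: two queries, each with one known-relevant document.
relevant = [{"d1"}, {"d5"}]
before_rerank = [["d2", "d1"], ["d3", "d4", "d5"]]
after_rerank = [["d1", "d2"], ["d5", "d3", "d4"]]

mrr_before = mean_reciprocal_rank(before_rerank, relevant)  # (1/2 + 1/3) / 2
mrr_after = mean_reciprocal_rank(after_rerank, relevant)    # (1 + 1) / 2
```

A few dozen real queries from your engineers, each tagged with the document they were looking for, is usually enough to see whether a change moves this number.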

Topics covered

vector database, semantic search, Pinecone, Weaviate, Qdrant, pgvector, embeddings, ANN search, FAISS, Chroma, Milvus, engineering document management, similarity search, dense retrieval, hybrid search, BM25
