VectorDB Tutorial — A Beginner’s Guide
A Vector Database (VectorDB) is designed to store, index, and search vector data, and is widely used in machine learning and AI applications. Vector data consists of numerical representations (embeddings) of objects such as text, images, or audio, which can be used for similarity search, clustering, and other tasks.
Core Concepts of VectorDB
Vector Representations: Data such as text, images, or audio is encoded as high-dimensional numerical vectors (embeddings) that capture its semantic features.
Similarity Search: Given a query vector, the database finds the stored vectors closest to it under a distance metric such as Euclidean (L2) distance or cosine similarity (a minimal sketch follows this list).
Indexing: Index structures (flat, tree-based, graph-based, or inverted-file) organize the vectors so that nearest-neighbor queries do not require scanning the entire collection.
Dimensionality Reduction: Techniques such as PCA or product quantization shrink vectors to save storage and speed up search, at a small cost in accuracy.
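To make "similarity search" concrete before any VectorDB is involved, here is a minimal sketch in plain NumPy that brute-forces the nearest neighbors of a query under L2 distance; the sizes and random data are arbitrary placeholders.
import numpy as np
rng = np.random.default_rng(0)
corpus = rng.random((1000, 128)).astype("float32")   # 1,000 stored vectors, 128 dimensions each
query = rng.random((1, 128)).astype("float32")       # one query vector
distances = np.linalg.norm(corpus - query, axis=1)   # L2 distance to every stored vector
top5 = np.argsort(distances)[:5]                     # indices of the 5 closest vectors
print("Closest indices:", top5)
print("Their distances:", distances[top5])
A real VectorDB does the same thing conceptually, but uses index structures so it does not have to touch every stored vector.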
VectorDB Architecture
Data Ingestion Layer: Accepts raw data or precomputed embeddings, validates them, and converts them into the internal vector format.
Indexing Layer: Builds and maintains the index structures used to accelerate similarity search.
Storage Layer: Persists vectors, indexes, and associated metadata in memory or on disk.
Query Processing Layer: Parses incoming queries, searches the index, applies any metadata filters, and ranks results by distance.
API Layer: Exposes the database to applications through client libraries or network endpoints (a toy end-to-end sketch follows this list).
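The sketch below is a toy, purely in-memory illustration of how these layers fit together; the class and method names are invented for this tutorial, and real systems such as Milvus split these responsibilities across separate components.
import numpy as np
class ToyVectorDB:
    # Toy illustration only: ingestion, storage, and query in one class,
    # with brute-force search standing in for a real index.
    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype="float32")   # storage layer
        self.ids = []
    def ingest(self, ids, vectors):                           # data ingestion layer
        vectors = np.asarray(vectors, dtype="float32")
        assert vectors.shape[1] == self.dim
        self.vectors = np.vstack([self.vectors, vectors])
        self.ids.extend(ids)
    def query(self, vector, k=5):                             # query processing layer
        distances = np.linalg.norm(self.vectors - vector, axis=1)
        order = np.argsort(distances)[:k]
        return [(self.ids[i], float(distances[i])) for i in order]
db = ToyVectorDB(dim=4)
db.ingest(["doc-a", "doc-b"], [[0, 0, 0, 1], [1, 0, 0, 0]])
print(db.query(np.array([0.9, 0.0, 0.0, 0.1], dtype="float32"), k=2))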
Use Cases of VectorDB
Recommendation Systems: Retrieve items whose embeddings are closest to a user's profile or recently viewed items.
Image and Video Retrieval: Find visually similar images or frames using embeddings from vision models.
Natural Language Processing (NLP): Power semantic search, question answering, and retrieval-augmented generation over text embeddings.
Anomaly Detection: Flag points whose nearest neighbors are unusually far away.
Biometric Identification: Match face, fingerprint, or voice embeddings against an enrolled gallery.
Genomics and Bioinformatics: Compare sequence or expression embeddings to find related genes or samples.
Example Technologies and Tools
FAISS (Facebook AI Similarity Search): A C++/Python library from Meta AI for exact and approximate similarity search over dense vectors, with optional GPU support.
Annoy (Approximate Nearest Neighbors Oh Yeah): A lightweight library from Spotify that builds forests of random-projection trees and supports memory-mapped, read-only indexes (a short Annoy sketch follows this list).
HNSW (Hierarchical Navigable Small World): A graph-based approximate nearest-neighbor algorithm with strong speed/recall trade-offs, implemented in libraries such as hnswlib and used inside many vector databases.
Milvus: An open-source, distributed vector database that manages storage, indexing, and querying of vectors at scale and supports multiple index types.
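For contrast with the FAISS walkthrough below, here is a small sketch using Annoy (pip install annoy); the dimension, tree count, and random data are arbitrary.
from annoy import AnnoyIndex
import numpy as np
d = 128
annoy_index = AnnoyIndex(d, "angular")               # "angular" distance is closely related to cosine similarity
rng = np.random.default_rng(0)
for i in range(1000):
    annoy_index.add_item(i, rng.random(d))           # add a vector with ID i
annoy_index.build(10)                                # build 10 random-projection trees
ids, dists = annoy_index.get_nns_by_vector(rng.random(d), 5, include_distances=True)
print(ids, dists)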
Steps to Use VectorDB
1. Install the Required Libraries
You’ll need a library like faiss (Facebook AI Similarity Search) or Annoy (Approximate Nearest Neighbors Oh Yeah) to work with vectors. Let's use faiss for this example.
pip install faiss-cpu
2. Import the Library
import faiss
import numpy as np
3. Create Some Vector Data
Let’s create some sample vectors.
# Generate random vectors
d = 128 # Dimension of vectors
nb = 1000 # Number of vectors
np.random.seed(1234)
vectors = np.random.random((nb, d)).astype('float32')
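In practice the vectors usually come from an embedding model rather than a random generator. The sketch below uses the sentence-transformers library as one example; the model name and its 384-dimensional output are assumptions, and any embedding model that returns fixed-size vectors works the same way.
# Optional: real embeddings instead of random vectors (pip install sentence-transformers)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")      # assumed model; produces 384-dimensional vectors
texts = ["a red sports car", "a crimson coupe", "a bowl of soup"]
text_vectors = model.encode(texts).astype("float32")
print(text_vectors.shape)                            # (3, 384) -> set d = 384 when building the index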
4. Build the Index
We’ll build an index to store these vectors and make similarity searches efficient.
index = faiss.IndexFlatL2(d) # L2 distance
index.add(vectors) # Add vectors to the index
print(f"Number of vectors in the index: {index.ntotal}")
5. Perform a Similarity Search
Now, let’s search for the top 5 vectors similar to a query vector.
# Generate a random query vector
query_vector = np.random.random((1, d)).astype('float32')
# Perform the search
k = 5 # Number of nearest neighbors
distances, indices = index.search(query_vector, k)
print("Nearest neighbors:")
print(indices)
print("Distances:")
print(distances)
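IndexFlatL2 ranks results by Euclidean distance. If your embeddings are meant to be compared by cosine similarity, one common pattern (a sketch, not the only option) is to L2-normalize the vectors and use an inner-product index.
cos_index = faiss.IndexFlatIP(d)                     # IP = inner product; on unit vectors this equals cosine similarity
normalized = vectors.copy()
faiss.normalize_L2(normalized)                       # in-place L2 normalization
cos_index.add(normalized)
q = query_vector.copy()
faiss.normalize_L2(q)
scores, ids = cos_index.search(q, k)                 # higher score = more similar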
6. Advanced Usage: Adding and Removing Vectors
You can add more vectors to the index or remove vectors if needed.
# Add more vectors
more_vectors = np.random.random((500, d)).astype('float32')
index.add(more_vectors)
print(f"Number of vectors after adding more: {index.ntotal}")
# Removing vectors is only supported by some index types.
# One option is to wrap a fresh index in an IndexIDMap so each vector carries an explicit ID.
base_index = faiss.IndexFlatL2(d)
index_with_ids = faiss.IndexIDMap(base_index)
index_with_ids.add_with_ids(more_vectors, np.arange(nb, nb + 500))
# Another option is to rebuild a plain index without the unwanted vectors.
# Note: this rebuild keeps only the original 1,000 vectors, minus the three removed here.
indices_to_remove = np.array([0, 1, 2], dtype='int64')
mask = np.isin(np.arange(nb), indices_to_remove, invert=True)
vectors_to_keep = vectors[mask]
index = faiss.IndexFlatL2(d)
index.add(vectors_to_keep)
print(f"Number of vectors after removal: {index.ntotal}")
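If you went the ID-mapped route, FAISS can also drop vectors in place with remove_ids (not every index type supports this); the IDs removed below are arbitrary examples.
ids_to_drop = np.array([nb, nb + 1], dtype='int64')  # two of the IDs added above
index_with_ids.remove_ids(ids_to_drop)
print(index_with_ids.ntotal)                         # 498 of the 500 added vectors remain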
Output
Here is sample output you might see when executing the main steps above (the exact indices and distances will differ, since the vectors are random):
Number of vectors in the index: 1000
Nearest neighbors:
[[729 594 134 349 870]]
Distances:
[[17.123356 17.326893 17.376125 17.495504 17.497896]]
Number of vectors after adding more: 1500
Number of vectors after removal: 997
You’ve now learned the basics of using a Vector Database with faiss. This tutorial covered creating vectors, building an index, and performing similarity searches. You can extend this knowledge to more complex tasks and larger datasets.
For more advanced tutorials on embeddings and VectorDB, refer to my GitHub:
Relationship Between Embeddings and VectorDB
Storage and Retrieval: Embedding models turn raw data into vectors; the VectorDB stores those vectors and retrieves them efficiently by ID or by similarity.
Indexing: The VectorDB indexes the embeddings so that nearest-neighbor queries stay fast as the collection grows.
Similarity Search: Queries are embedded with the same model, and the VectorDB returns the stored embeddings closest to the query vector.
Applications: Together they power semantic search, recommendations, retrieval-augmented generation, and other similarity-driven features.
Example Workflow
Generate Embeddings: Run your data (text, images, etc.) through an embedding model to obtain fixed-size vectors.
Store Embeddings in VectorDB: Insert the vectors, together with IDs and any metadata, into the vector database.
Query and Retrieve Similar Items: Embed the query with the same model, search the database, and map the returned IDs back to the original items (a compact sketch follows this list).
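A compact sketch of this workflow, reusing faiss from the tutorial above; the embedding model name is an assumption, and any model that produces fixed-size vectors would slot in the same way.
# End-to-end: generate embeddings -> store them -> query for similar items
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")      # assumed embedding model
docs = ["how to reset my password", "billing and invoices", "delete my account"]
doc_vectors = model.encode(docs).astype("float32")   # 1. generate embeddings
index = faiss.IndexFlatL2(doc_vectors.shape[1])
index.add(doc_vectors)                               # 2. store embeddings in the index
query_vectors = model.encode(["I forgot my login"]).astype("float32")
distances, ids = index.search(query_vectors, 2)      # 3. query and retrieve similar items
print([docs[i] for i in ids[0]])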
Embeddings and VectorDBs work hand in hand to enable efficient storage, management, and retrieval of high-dimensional vector data. Embeddings provide a powerful way to represent data in a numerical format, while VectorDBs offer the necessary tools to manage and search these embeddings at scale, making them crucial components in modern AI and machine learning applications.
Follow me to gain more knowledge on LLMs!