AI Embedding with Vector Database

Dhiraj Patra
9 min read · Nov 25, 2023


Photo by Karolina Grabowska

Embedding, in the context of machine learning and natural language processing, refers to the representation of objects, such as words or sentences, in a continuous vector space. The goal of embedding is to capture semantic relationships, similarities, and contextual information between words or entities, making it easier for machine learning models to understand and process them. Here’s a breakdown of embedding with examples, categories, and context:

Embeddings, in the realm of natural language processing, serve as numerical representations that measure the relatedness of text strings. These embeddings find versatile applications, including:

1. Search: Ranking results based on their relevance to a given query string.

2. Clustering: Grouping text strings together based on their similarity.

3. Recommendations: Recommending items with text strings closely related to the user’s preferences.

4. Anomaly Detection: Identifying outliers with minimal textual relatedness.

5. Diversity Measurement: Analyzing distributions of similarity to assess diversity.

6. Classification: Categorizing text strings by their closest-matching label.

Essentially, an embedding is a list of floating-point numbers arranged in a vector. The degree of relatedness between two vectors is determined by the distance between them. Short distances signify high relatedness, while longer distances indicate lower relatedness. OpenAI’s text embeddings play a crucial role in various tasks, facilitating a nuanced understanding of textual relationships and enabling applications ranging from search algorithms to anomaly detection.
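To make the distance idea concrete, here is a minimal sketch (assuming NumPy, which the article does not otherwise use) comparing toy embedding vectors with cosine similarity:

```python
import numpy as np

# Toy embedding vectors; real embeddings typically have hundreds or thousands of dimensions.
vec_a = np.array([0.12, 0.85, 0.31])   # e.g., "happy"
vec_b = np.array([0.10, 0.80, 0.35])   # e.g., "joyful" (points in a similar direction)
vec_c = np.array([-0.70, 0.05, 0.60])  # e.g., "gloomy" (points elsewhere)

def cosine_similarity(a, b):
    """Cosine similarity: close to 1.0 means near-identical direction, lower means less related."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vec_a, vec_b))  # close to 1.0 -> highly related
print(cosine_similarity(vec_a, vec_c))  # much lower -> weakly related
```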

Types of Embeddings:

Word Embeddings:
Definition: Word embeddings represent words as vectors in a continuous vector space, where semantically similar words are closer together.
Example: Word2Vec, GloVe, FastText.
Context: Useful for various NLP tasks like sentiment analysis, machine translation, and named entity recognition. (A small Word2Vec sketch follows this list.)

Sentence Embeddings:
Definition: Sentence embeddings capture the overall meaning of a sentence in a continuous vector representation.
Example: Universal Sentence Encoder, BERT embeddings.
Context: Beneficial for tasks like document similarity, text classification, and clustering.

Document Embeddings:
Definition: Document embeddings represent an entire document as a vector, summarizing its content.
Example: Doc2Vec, BERT-based document embeddings.
Context: Useful for tasks like document retrieval, topic modeling, and document clustering.

Entity Embeddings:
Definition: Entity embeddings represent entities (e.g., products, users) as vectors, capturing their features and relationships.
Example: Embeddings for product recommendations, user embeddings.
Context: Applied in collaborative filtering, recommendation systems, and knowledge graph embeddings.
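As a hedged illustration of word embeddings (the list above mentions Word2Vec), here is a small sketch using the gensim library on a toy corpus. The corpus and parameters are purely illustrative; a corpus this small will not produce meaningful vectors in practice:

```python
from gensim.models import Word2Vec

# A toy corpus of pre-tokenized sentences (far too small for real use).
sentences = [
    ["the", "movie", "was", "happy", "and", "joyful"],
    ["a", "happy", "joyful", "celebration"],
    ["the", "film", "felt", "sad", "and", "gloomy"],
    ["a", "sad", "gloomy", "evening"],
]

# Train word vectors: each word is mapped to a 50-dimensional vector.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200, seed=42)

print(model.wv["happy"][:5])                   # first few dimensions of one word vector
print(model.wv.most_similar("happy", topn=3))  # nearest words in the learned space
```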
Categories of Embeddings:

Pre-trained Embeddings:
Definition: Embeddings trained on large corpora and then used as a starting point for specific tasks.
Example: Word2Vec, GloVe pre-trained embeddings.
Context: Saves computational resources and is effective for downstream tasks with limited data.

Contextual Embeddings:
Definition: Embeddings that consider the context of words or entities in a sentence.
Example: BERT, ELMo.
Context: Captures nuances and context-specific meanings in natural language. (A small sentence-embedding sketch follows this list.)

Domain-specific Embeddings:
Definition: Embeddings trained on domain-specific data, catering to the unique characteristics of a particular field.
Example: Medical embeddings, legal document embeddings.
Context: Improves performance on tasks within a specific domain.
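For pre-trained, contextual sentence embeddings, here is a hedged sketch using the sentence-transformers library with the all-MiniLM-L6-v2 model (an assumption; the article does not prescribe a specific library):

```python
from sentence_transformers import SentenceTransformer, util

# A pre-trained sentence embedding model (downloaded on first use).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The customer was happy with the product.",
    "The buyer was joyful about the purchase.",
    "The server crashed during the deployment.",
]
embeddings = model.encode(sentences)

# Semantically close sentences get a higher cosine similarity.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high: both express satisfaction
print(util.cos_sim(embeddings[0], embeddings[2]))  # low: unrelated topics
```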
Contextual Examples:

Word Embeddings:
Context: In sentiment analysis, the words “happy” and “joyful” should have similar embeddings, as both convey positive sentiment.

Sentence Embeddings:
Context: For document clustering, sentence embeddings should reflect the overall theme of a document, helping group similar documents.

Entity Embeddings:
Context: In e-commerce, embeddings for similar products should be close in vector space, aiding recommendation systems.

Contextual Embeddings:
Context: In machine translation, contextual embeddings help capture the different meanings of words in source and target languages.

Embeddings play a crucial role in enhancing the capabilities of machine learning models to understand and process complex relationships within data. They have become a fundamental component in various natural language processing applications, contributing to the success of many state-of-the-art models.
Here is a general example of how you might use the OpenAI API to obtain embeddings for a piece of text:

1. API Setup:

- Obtain API credentials from the OpenAI platform.

- Install the OpenAI Python library (if not installed already).

```bash
pip install openai
```

2. Example Code:

- Use the OpenAI API to generate embeddings.

```python
import openai

# Set your OpenAI API key
openai.api_key = 'YOUR_API_KEY'

# Your input text for which you want embeddings
input_text = "A sample text for embedding generation."

# Use the OpenAI Embeddings API (legacy openai<1.0 SDK style) to get the embedding
response = openai.Embedding.create(
    model="text-embedding-ada-002",  # OpenAI's embedding model
    input=input_text,
)

# Extract the embedding (a list of floats) from the response
embedding = response["data"][0]["embedding"]

# Do something with the obtained embedding
print(len(embedding), embedding[:5])
```

You can get the latest documentation here https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

Below is an example adapted from that page, which reads data from a data source (a pandas DataFrame) and generates embeddings for each row.

```python
from openai import OpenAI
import pandas as pd

client = OpenAI()

def get_embedding(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

# df is a pandas DataFrame with a 'combined' text column (e.g., loaded from a reviews CSV)
df['ada_embedding'] = df.combined.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))
df.to_csv('output/embedded_1k_reviews.csv', index=False)
```
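Once the embeddings are saved, a simple (if not scalable, as discussed below) way to use them is to load the CSV back and rank rows against a query embedding. This hedged sketch reuses the get_embedding() helper above and assumes the DataFrame has the 'combined' text column from the OpenAI example:

```python
import numpy as np
import pandas as pd

# Load the embeddings saved above; they come back as strings and must be parsed.
df = pd.read_csv("output/embedded_1k_reviews.csv")
df["ada_embedding"] = df.ada_embedding.apply(eval).apply(np.array)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embed a search query with the same model, then rank all rows by similarity.
query_embedding = np.array(get_embedding("great battery life"))
df["similarity"] = df.ada_embedding.apply(lambda e: cosine_similarity(query_embedding, e))
print(df.sort_values("similarity", ascending=False).head(3)[["combined", "similarity"]])
```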

Vector database

A vector database is a type of database that stores data as vectors of numbers. This makes it possible to efficiently perform operations on vectors, such as similarity searches and clustering. Vector databases are becoming increasingly popular for applications that require real-time search and analysis of large amounts of data, such as natural language processing and image recognition.

One example of a vector database is PostgreSQL with the pgvector extension. This extension allows users to store and query vectors of numbers directly in PostgreSQL. The pgvector extension is compatible with all PostgreSQL data types, and it can be used to perform a variety of operations on vectors, such as similarity searches, clustering, and dimensionality reduction.

Another example of a vector database is Milvus. Milvus is a high-performance, open-source vector database designed for large-scale similarity search and is used in production by a variety of companies.
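For completeness, here is a hedged sketch of what working with Milvus looks like through its Python client (pymilvus). The collection name, dimension, and toy vectors are illustrative, and the code follows the pattern of the Milvus quickstart rather than anything specific to this article:

```python
import numpy as np
from pymilvus import MilvusClient

# Milvus Lite stores everything in a local file; a server deployment is used the same way.
client = MilvusClient("milvus_demo.db")
client.create_collection(collection_name="demo_collection", dimension=8)

# Toy 8-dimensional vectors standing in for real embeddings.
rng = np.random.default_rng(0)
docs = ["first document", "second document", "third document"]
client.insert(
    collection_name="demo_collection",
    data=[{"id": i, "vector": rng.random(8).tolist(), "text": text} for i, text in enumerate(docs)],
)

# Approximate nearest-neighbour search for a query vector.
query_vector = rng.random(8).tolist()
results = client.search(
    collection_name="demo_collection",
    data=[query_vector],
    limit=2,
    output_fields=["text"],
)
print(results)
```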

OpenAI embeddings are vector embeddings produced by neural network models such as text-embedding-ada-002. They represent text as vectors of numbers and can be stored in vector databases to perform similarity searches and clustering.

To use OpenAI embeddings with pgvector, you will first need to install the OpenAI Python library and enable the pgvector extension in PostgreSQL. You will also need to create a table in PostgreSQL to store your embeddings. Once you have done this, you can use pgvector to perform similarity searches and clustering on your OpenAI embeddings.

Here is an example of how to use pgvector to perform a similarity search over stored OpenAI embeddings:

```sql
SELECT *
FROM my_table
ORDER BY vector_column <=> vector_query
LIMIT 5;
```

This query returns the five rows in my_table whose vector_column is closest to vector_query by cosine distance (the <=> operator); smaller distances mean higher similarity.

Embeddings in Practice: Considerations for Large-Scale Applications

While embedding techniques have gained significant traction in the realm of natural language processing and machine learning, their practical implementation often poses challenges, especially when dealing with large datasets and complex workflows. Storing embeddings in a simple CSV file and performing similarity calculations using Python libraries may suffice for small-scale projects. However, as the volume and complexity of data grow, limitations arise:

  1. Scalability: Storing and managing embeddings in a CSV file become unwieldy and inefficient as the number of documents and embeddings increases. This approach is not scalable to handle large-scale datasets, which require more robust and performant storage solutions.
  2. Dynamic Embeddings: Embeddings are often dynamic entities that need to be updated or deleted as new information emerges or old information becomes irrelevant. Manually managing these changes in a CSV file is not only cumbersome but also prone to errors.
  3. Language Agnosticism: Python, while a powerful tool, is not the only language used for data science and machine learning tasks. Embeddings should be accessible and usable across different programming languages and environments to cater to a broader range of users.

To address these limitations, consider employing dedicated embedding storage solutions such as vector databases or embedding indices. These solutions offer several advantages over CSV files:

  1. Efficient Storage and Retrieval: Vector databases are optimized for storing and retrieving vectors of numbers, making them ideal for managing large collections of embeddings. They provide efficient search and retrieval capabilities, enabling quick identification of similar embeddings.
  2. Dynamic Updates: Vector databases and embedding indices support dynamic updates, allowing for seamless insertion, deletion, and modification of embeddings as needed. This eliminates the need for manual manipulation of CSV files, reducing the risk of errors and streamlining the workflow.
  3. Language Interoperability: Many vector databases and embedding indices offer APIs and connectors that support various programming languages, including Python, Java, and C++. This enables developers to work with embeddings in their preferred language environment.

In addition to employing appropriate storage solutions, consider using embedding APIs or frameworks that provide high-level abstractions for working with embeddings. These tools can simplify common embedding tasks, such as similarity search, clustering, and dimensionality reduction, making it easier to integrate embeddings into your applications.

By carefully considering these factors and choosing the right tools for your specific needs, you can effectively leverage embeddings in large-scale applications, unlocking their full potential to enhance your machine learning and natural language processing tasks.

You can use the open-source pgvector extension, available on GitHub here: https://github.com/pgvector/pgvector

Harnessing PostgreSQL for Vector Embedding Storage and Retrieval

Leveraging the power of PostgreSQL, the pgvector extension enables seamless storage and retrieval of vector embeddings directly within the database. Embark on a hands-on exploration of this extension.

Enabling the Vector Extension

Commence by activating the Vector extension. In Supabase, access the web portal and navigate to Database → Extensions. Alternatively, execute the following SQL command:

```sql
create extension vector;
```

Creating a Table for Document and Embedding Storage

Construct a table to accommodate documents and their corresponding embeddings:

```sql
create table documents (
  id bigserial primary key,
  content text,
  embedding vector(1536)
);
```

The pgvector extension introduces a novel data type called vector. In the provided code, a column named embedding is created with the vector data type. The specified vector size determines the number of dimensions the vector encompasses. Since OpenAI’s text-embedding-ada-002 model generates 1536 dimensions, this value is utilized for the vector size.

Additionally, a text column named content is created to store the original document text that produced the embedding. Depending on the specific use case, a reference (URL or foreign key) to a document could be stored instead.

Introducing a Function for Similarity Search

To perform similarity searches over these embeddings, a dedicated function is crafted:

```sql
create or replace function match_documents (
  query_embedding vector(1536),
  match_threshold float,
  match_count int
)
returns table (
  id bigint,
  content text,
  similarity float
)
language sql stable
as $$
  select
    documents.id,
    documents.content,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where 1 - (documents.embedding <=> query_embedding) > match_threshold
  order by similarity desc
  limit match_count;
$$;
```

The pgvector extension introduces three new operators for calculating similarity:

| Operator | Description |
| --- | --- |
| `<->` | Euclidean distance |
| `<#>` | Negative inner product |
| `<=>` | Cosine distance |

OpenAI recommends employing cosine similarity for their embeddings, so that approach is adopted here.
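As a quick, self-contained illustration (using tiny 3-dimensional vector literals rather than real embeddings), the three operators can be compared directly:

```sql
select
  '[1,2,3]'::vector <-> '[4,5,6]'::vector as euclidean_distance,
  '[1,2,3]'::vector <#> '[4,5,6]'::vector as negative_inner_product,
  '[1,2,3]'::vector <=> '[4,5,6]'::vector as cosine_distance;
```

Note that the match_documents() function above converts cosine distance into a similarity score with 1 - (embedding <=> query_embedding).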

Invoking the match_documents() Function

The match_documents() function can now be invoked by providing the embedding, similarity threshold, and match count. This will return a list of all documents that match the specified criteria. Since Postgres manages this process, the application code remains remarkably simple.
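As a hedged sketch of that application code (assuming the legacy openai 0.x SDK and psycopg2; the question, threshold, and match count are illustrative placeholders), the query side might look like this:

```python
import openai
import psycopg2

openai.api_key = "<YOUR_OPENAI_KEY>"

conn = psycopg2.connect(
    host="<YOUR_DATABASE_HOST>",
    database="<YOUR_DATABASE_NAME>",
    user="<YOUR_DATABASE_USERNAME>",
    password="<YOUR_DATABASE_PASSWORD>",
)
cursor = conn.cursor()

# Embed the user's question with the same model used for the stored documents.
question = "What does the refund policy say?"
query_embedding = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=question.replace("\n", " "),
)["data"][0]["embedding"]

# Pass the embedding as a pgvector text literal, e.g. '[0.01,0.02,...]'.
embedding_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
cursor.execute(
    "select id, content, similarity from match_documents(%s::vector(1536), %s, %s)",
    (embedding_literal, 0.78, 5),
)
for doc_id, content, similarity in cursor.fetchall():
    print(doc_id, round(similarity, 3), content[:80])
conn.close()
```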

Indexing for Enhanced Performance

As the embedding table grows, consider adding an index to accelerate queries. Vector indexes are particularly crucial when ordering results because vectors are not grouped by similarity, making finding the closest match using a sequential scan resource-intensive.

Each distance operator necessitates a specific index type. Since ordering by cosine distance is desired, the vector_cosine_ops index is employed. A suitable starting value for the number of lists is 4 * sqrt(table_rows):

SQL

create index on documents using ivfflat (embedding vector_cosine_ops)
with
(lists = 100);

For further details on indexing, consult the pgvector GitHub page.

You can get more details about the Python library for pgvector here: https://github.com/pgvector/pgvector-python

Here is Python code for generating embeddings and storing them in Postgres:

```python
import numpy as np
import openai
import psycopg2
from pgvector.psycopg2 import register_vector


def generate_embeddings():
    # Replace with your OpenAI API key
    openai.api_key = "<YOUR_OPENAI_KEY>"
    # Connect to your PostgreSQL database
    conn = psycopg2.connect(
        host="<YOUR_DATABASE_HOST>",
        database="<YOUR_DATABASE_NAME>",
        user="<YOUR_DATABASE_USERNAME>",
        password="<YOUR_DATABASE_PASSWORD>",
    )
    # Register the pgvector type so numpy arrays can be passed as vector values
    register_vector(conn)
    # Create a cursor for executing SQL queries
    cursor = conn.cursor()
    # Prepare a SQL query to insert documents and embeddings
    insert_query = """
        INSERT INTO documents (content, embedding)
        VALUES (%s, %s)
    """
    # Load documents from your custom function
    documents = get_documents()
    # Iterate over the documents
    for document in documents:
        # Replace newlines with spaces for better embedding results
        input_text = document.replace("\n", " ")
        # Generate an embedding for the document
        embedding = openai.Embedding.create(
            model="text-embedding-ada-002",
            input=input_text,
        )
        # Extract the embedding vector
        embedding_vector = np.array(embedding["data"][0]["embedding"])
        # Insert the document and embedding into the database
        cursor.execute(insert_query, (document, embedding_vector))
    # Commit the changes to the database
    conn.commit()
    # Close the database connection
    conn.close()


# Call the function to generate embeddings
generate_embeddings()
```

You can get more details about vector databases here https://www.ibm.com/topics/vector-database
