Understanding the world of AI can sometimes feel like learning a new language. One term you may come across is 'embeddings.' Embeddings are essentially numerical representations of data, whether that data is words, images, or anything else, that AI models use to understand the complexities and nuances of our world. This article aims to unpack the concept of embeddings, explain why and how a vector database can be useful, and dive into the specifics of vectors and cosine similarity, all in terms we can all comprehend.
Embeddings: An Overview
First things first: what exactly are embeddings? In machine learning, embeddings are a way to convert discrete data, such as words, into continuous numeric vectors. Essentially, they transform a large, high-dimensional dataset into a lower-dimensional format that a machine can understand and process. Each element of the vector captures a bit of the item's 'meaning' in the dataset.
For instance, in the realm of Natural Language Processing (NLP), which is the field concerned with enabling machines to understand human language, words are transformed into numeric vectors via word embeddings. The idea is that words with similar meanings will have similar vectors, and these vectors can capture complex relationships like synonyms, antonyms, and more.
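For instance, here is a rough C# sketch of requesting an embedding from OpenAI's embeddings API. The endpoint and model name ("text-embedding-ada-002") reflect OpenAI's documentation at the time of writing, so check the current docs before relying on them:

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

class EmbeddingRequestExample
{
    static async Task Main()
    {
        using var client = new HttpClient();
        // Read the API key from an environment variable rather than hard-coding it.
        client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue(
            "Bearer", Environment.GetEnvironmentVariable("OPENAI_API_KEY"));

        // Ask the embeddings endpoint to turn a piece of text into a vector.
        var body = JsonSerializer.Serialize(new { input = "apple", model = "text-embedding-ada-002" });
        var response = await client.PostAsync(
            "https://api.openai.com/v1/embeddings",
            new StringContent(body, Encoding.UTF8, "application/json"));

        // The response carries the embedding as an array of floats.
        using var json = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        var embedding = json.RootElement.GetProperty("data")[0].GetProperty("embedding");
        Console.WriteLine($"Embedding has {embedding.GetArrayLength()} dimensions.");
    }
}

The returned vector (1,536 dimensions for this model) is exactly the kind of object a vector database is built to store and search, which brings us to the next topic.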
The Role of a Vector Database
A vector database provides a way to store these embeddings. It can index large numbers of high-dimensional vectors and supports efficient similarity search. But why might someone need to use a vector database with OpenAI?
The first reason is to supplement the model's knowledge. OpenAI's models, like GPT-4, have a "knowledge cutoff": a date after which the model knows nothing about world events or new information. For instance, GPT-4's knowledge cutoff is September 2021. By using a vector database, one can feed more recent data into the model's prompts to keep its answers up to date.
The second reason is to incorporate proprietary or private data. OpenAI's models are trained on large, public datasets, and they don't have access to specific, confidential, or proprietary information. By leveraging a vector database, one can supply the model with their own unique datasets at query time, without retraining it.
Here are some examples of vector databases: Pinecone, Weaviate, Milvus, Qdrant, and Chroma. Libraries such as FAISS provide similar similarity-search capabilities without being full databases.
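To make the idea concrete, here is a minimal sketch of what a vector store does at its core: it holds labeled vectors and returns the stored items most similar to a query vector. The class name and design here are invented for illustration; a real vector database replaces this brute-force scan with specialized approximate nearest-neighbor indexes to stay fast at scale.

using System;
using System.Collections.Generic;
using System.Linq;

// A toy, in-memory stand-in for a vector database (illustration only).
class TinyVectorStore
{
    private readonly List<(string Id, float[] Vector)> _items = new();

    // Assumes vectors are normalized to unit length before insertion,
    // so a plain dot product equals cosine similarity.
    public void Add(string id, float[] unitVector) => _items.Add((id, unitVector));

    // Returns the ids of the topK stored vectors most similar to the query.
    public IEnumerable<string> Search(float[] unitQuery, int topK) =>
        _items.OrderByDescending(item => Dot(unitQuery, item.Vector))
              .Take(topK)
              .Select(item => item.Id);

    private static float Dot(float[] a, float[] b)
    {
        float sum = 0;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }
}

Storing unit-length vectors and ranking by plain dot product is a common design choice, because for normalized vectors the dot product and cosine similarity (explained below) are the same number.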
Vectors and Cosine Similarity: The Basics
Now, let's delve into vectors and cosine similarity. Imagine a vector as an arrow pointing in a specific direction in a multi-dimensional space. In the context of embeddings, the direction and length of the vector capture certain aspects of the word or object's meaning.
Cosine similarity, on the other hand, is a measure of how similar two vectors are. It's called 'cosine' similarity because it calculates the cosine of the angle between the two vectors. If the vectors point in the same direction, the angle is 0 degrees and the cosine is 1, indicating maximum similarity. If the vectors are orthogonal, sharing nothing in common, the angle is 90 degrees and the cosine is 0; vectors pointing in opposite directions give a cosine of -1.
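Concretely, cosine similarity is the dot product of the two vectors divided by the product of their lengths:

cosine_similarity(A, B) = (A · B) / (|A| × |B|)

As a tiny worked example in two dimensions, take A = (1, 0) and B = (1, 1). The dot product is 1×1 + 0×1 = 1, |A| = 1, and |B| = √2 ≈ 1.41, so the similarity is 1 / 1.41 ≈ 0.71, exactly the cosine of the 45-degree angle between the two arrows.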
C# Pseudocode Examples
Let's look at some simple C# examples of how you might work with embeddings.
Here's a simplified example of how you might create a word embedding using a hypothetical library:
// Assume WordEmbedding is a hypothetical class that creates 300-dimensional embeddings
WordEmbedding embedding = new WordEmbedding(300);
// Generate the embedding vector for a word
float[] wordVector = embedding.Generate("apple");
Next, let's see how you might use cosine similarity to compare two word embeddings:
float[] wordVector1 = embedding.Generate("apple");
float[] wordVector2 = embedding.Generate("orange");
// Calculate the cosine similarity between the two vectors
float similarity = CosineSimilarity(wordVector1, wordVector2);
In this example, 'CosineSimilarity' is a hypothetical function that calculates the cosine similarity between two vectors.
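For completeness, here is one straightforward way such a function could be written in plain C#: a direct translation of the dot-product formula above, not a library API.

// One possible implementation of cosine similarity for float arrays.
static float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0, magA = 0, magB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];   // accumulate the dot product
        magA += a[i] * a[i];  // squared magnitude of a
        magB += b[i] * b[i];  // squared magnitude of b
    }
    return dot / (MathF.Sqrt(magA) * MathF.Sqrt(magB));
}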
Remember, these examples are simplified. Generating real word embeddings in practice means leveraging an existing model or service, such as OpenAI's embeddings API, rather than writing the algorithms yourself.
How OpenAI Uses Embeddings in Simple Terms
To be as helpful as possible, let's break this down a little further and be specific about how you could use embeddings to do something like provide sports facts that happened after ChatGPT's September 2021 training cutoff. The flow has three parts: generate embeddings for your collection of recent facts and store them in a vector database; when a user asks a question, generate an embedding for the question and look up the most similar stored facts; then include those facts as context in the question you send to ChatGPT, as sketched below.
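Here is a rough C# sketch of that flow, reusing the hypothetical embedding helper and the TinyVectorStore from earlier. Every name here is illustrative, not a real API, and we assume embedding.Generate returns unit-length vectors so the store's dot-product ranking behaves like cosine similarity.

// 1. One-time setup: embed each up-to-date fact and store it.
// (recentSportsFacts stands in for your own collection of facts.)
var store = new TinyVectorStore();
foreach (string fact in recentSportsFacts)
    store.Add(fact, embedding.Generate(fact));

// 2. At question time, embed the user's question the same way.
string question = "Who won the most recent championship?";
float[] questionVector = embedding.Generate(question);

// 3. Retrieve the stored facts most similar to the question.
var relevantFacts = store.Search(questionVector, topK: 3);

// 4. Prepend those facts so ChatGPT has the context it needs.
string prompt = "Using the following facts:\n" +
                string.Join("\n", relevantFacts) +
                "\n\nAnswer this question: " + question;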
That is a super simple example, but you can see how information stored in a vector database can be used to modify the question you ask ChatGPT, supplying the context it needs to get the answer right. Whether that context is facts from after September 2021 or the product specs for your company's products, this is how you use embeddings to get correct answers from ChatGPT.
Final Thoughts
In summary, embeddings are a powerful tool in machine learning and AI, transforming complex, high-dimensional data into a format that machines can understand and process. Vector databases allow us to supplement an AI model's knowledge with more recent, unique, or proprietary data, broadening its understanding and potential applications. Understanding vectors and cosine similarity is key to leveraging these tools effectively, and with the right grasp of these concepts, we're all better equipped to harness the power of AI. There are surely other interesting uses for vector databases too, such as powering website search along the lines of Elasticsearch. In any event, it's good to know about 'embeddings' and how they are used.