Working with Vector Databases: Weaviate

Jun 28, 2024

—

Background Information

Vector databases are specialized systems designed to store and query high-dimensional vectors, typically representing complex data like text, images, or audio. They excel at similarity searches, making them useful for applications in natural language processing, recommendation systems, and computer vision.

Weaviate is an open-source vector database that combines vector search capabilities with GraphQL integration. It allows for:

Storage of vectors and associated metadata
Similarity searches
Complex filtering and aggregations
Integration with machine learning frameworks

In this example, we explore managing a large collection of NASA documents by leveraging Weaviate’s vector database capabilities. We’ve previously divided these documents into manageable chunks and created embeddings for each chunk. While this discussion won’t cover the various methods for chunking documents or the numerous models available for creating embeddings, it provides a practical guide for inserting and querying data objects using Weaviate.

Weaviate Account Setup

To get started with Weaviate, first create an account and a “sandbox cluster” to use as the vector store. Weaviate allows two free clusters for prototyping, with the constraint that they are automatically removed after 14 days. The only requirement is to sign up with an email address; no credit card information is needed.

After creating your account and your first cluster, look at the information for the cluster and copy the values for the “REST Endpoint” and the API key. Add them to your environment using the keys “WCD_URL” and “WCD_API_KEY” to be able to work with the code examples below.
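A missing environment variable tends to surface later as a confusing connection error, so it can help to fail early. The following is a minimal sketch; the helper name load_weaviate_settings is our own invention and not part of the Weaviate client.

```python
import os


def load_weaviate_settings(env=None):
    """Return (url, api_key), raising early if either variable is missing.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("WCD_URL", "WCD_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["WCD_URL"], env["WCD_API_KEY"]
```

The returned values can then be passed to the connection call in place of the os.getenv lookups.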

Install the Python client provided by Weaviate:

pip install -U weaviate-client  # For beta versions: `pip install --pre -U "weaviate-client==4.*"`

Inserting Data Objects

After installing the weaviate client, we can import the necessary classes for the client connection, configuration, and collections.

import os

import weaviate
import weaviate.classes as wvc

# Then we use the cluster URL and Weaviate API key to connect the client.
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WCD_URL"),
    auth_credentials=weaviate.auth.AuthApiKey(os.getenv("WCD_API_KEY")),
    additional_config=wvc.init.AdditionalConfig(timeout=(180, 180))
)

# Check to see if the connection is ready
client.is_ready()

While most of this example closely follows the Quickstart guide provided by Weaviate, we ran into timeout errors when inserting objects. These took only a few minutes to resolve: Weaviate’s documentation is top-notch, and it covers the timeout values we pass through AdditionalConfig above.

The Weaviate API doesn’t have a way to get_or_create a collection. Instead, we have to check for an existing collection and then create if we don’t find one.

collection = None
if client.collections.exists("Chunks"):
    collection = client.collections.get("Chunks")
else:
    collection = client.collections.create(
        name="Chunks",
        vectorizer_config=wvc.config.Configure.Vectorizer.none()
    )

We are also going to supply our own vectors instead of having Weaviate compute them, which is why the collection is created with Vectorizer.none() above. Weaviate can instead be configured to call OpenAI or another embedding API (you supply your key) if you prefer that approach.
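The check-then-create pattern can be wrapped in a small helper. This is only a sketch: get_or_create_collection is a name we made up, and it relies solely on the exists, get, and create calls already used above.

```python
def get_or_create_collection(client, name, **create_kwargs):
    """Return the named collection, creating it first if it does not exist.

    `client` only needs to expose collections.exists, collections.get and
    collections.create, as the client used in this post does.
    Extra keyword arguments are forwarded to collections.create.
    """
    if client.collections.exists(name):
        return client.collections.get(name)
    return client.collections.create(name=name, **create_kwargs)
```

For the collection in this post, the call would be get_or_create_collection(client, "Chunks", vectorizer_config=wvc.config.Configure.Vectorizer.none()). Note that the check and the create are two separate calls, so two processes starting at the same moment could both try to create the collection.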

Now we can insert objects as a list, or individually. In our case, we are looping over a list of chunks from the NASA documentation. The properties of the data object can be whatever you want, except for a few restricted keys such as ‘id’.

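chunk_list = []

# `chunks` is assumed here to be an iterable of (doc_id, page, chunk, embedding)
# tuples produced by the earlier chunking and embedding steps.
for doc_id, page, chunk, embedding in chunks:
    chunk_list.append(wvc.data.DataObject(
        properties={
            "key": doc_id,
            "page": page,
            "chunk": chunk,
        },
        vector=embedding.tolist()  # convert the embedding array to a plain list
    ))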

After we have a list of DataObjects, we can insert with a single line:

collection.data.insert_many(chunk_list)    # Sends the whole list in a single batch request
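Because insert_many is handed the whole list in one call, a very large list can run into the timeouts mentioned earlier. One option is to send the objects in smaller slices. The sketch below assumes a slice size of 200, which is an arbitrary starting point and not a Weaviate recommendation; batched and insert_in_batches are hypothetical helper names.

```python
def batched(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def insert_in_batches(collection, objects, size=200):
    """Call collection.data.insert_many once per slice.

    Returns the number of objects handed to insert_many.
    """
    sent = 0
    for batch in batched(objects, size):
        collection.data.insert_many(batch)
        sent += len(batch)
    return sent
```

With the list built above, insert_in_batches(collection, chunk_list) would replace the single insert_many call.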

If you are using Weaviate’s hosted service, you can now view your cluster and collection and see how many objects have been uploaded.

Querying Data Objects

Now for the query side of this walkthrough, we can create the client and connect in the same way as we did before.

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WCD_URL"),
    auth_credentials=weaviate.auth.AuthApiKey(os.getenv("WCD_API_KEY")),
    additional_config=wvc.init.AdditionalConfig(timeout=(180, 180))
)

# Check connection
client.is_ready()

We are going to use our own encoding as well for the query, remembering that the encoding approach of the query has to match the encoding approach of the data.

We’re using the SentenceTransformer library with the bge-small model for these encodings.

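from sentence_transformers import SentenceTransformer

# device='cuda' requires a GPU; use device='cpu' otherwise
model = SentenceTransformer('BAAI/bge-small-en-v1.5', device='cuda')

text = "What is the ratio of fuel to weight necessary for lunar liftoff?"
text_embed = model.encode(text)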

Now we have an embedding of the text we are going to use for a semantic search in our query.
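Since the query and the stored data must come from the same model, one cheap sanity check is the vector length. The sketch below assumes 384 dimensions, which is the output size of bge-small-en-v1.5; check_query_vector is a hypothetical helper, and a matching length does not prove the same model was used, only that an obvious mismatch is absent.

```python
EXPECTED_DIM = 384  # assumption: output size of BAAI/bge-small-en-v1.5


def check_query_vector(vector, expected_dim=EXPECTED_DIM):
    """Return the vector as a list of plain floats, or raise if its length is wrong."""
    values = [float(x) for x in vector]
    if len(values) != expected_dim:
        raise ValueError(
            f"query vector has {len(values)} dimensions, expected {expected_dim}"
        )
    return values
```

The result of check_query_vector(text_embed) can be passed directly as the query vector.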

collection = client.collections.get("Chunks")

response = collection.query.near_vector(
    near_vector=text_embed.tolist(),  # your query vector goes here
    limit=5,
    return_metadata=wvc.query.MetadataQuery(distance=True)
)

Weaviate can return metadata with each query result. Some fields are available by default; others depend on the type of query and on the modules enabled on your Weaviate instance. In our case, we request the distance so we can see how closely each result matches the query.

Now we can print the results and see what type of matches we had:

for o in response.objects:
    print(o.properties)
    print(o.metadata.distance)

In this case, we are querying a large set of NASA documents that we previously chunked, embedded, and inserted into our vector store. Here are the top three results from the query, along with the distance of each. We’re using Weaviate’s default metric, cosine distance (1 minus cosine similarity), so a lower value means a closer match.

{'key': '20130014470', 'page': 253.0, 'chunk': 'Mass: The total mass at launch was 1,964.6 pounds (891 kg), consisting of 1,290 pounds (585 kg) for the spacecraft and 674.6 pounds (306 kg) of hydrazine fuel.'}

0.21913838386535645

{'key': '20220006404', 'page': 16.0, 'chunk': '• Remaining fuel used to lower perigee prior to spacecraft passivation'}

0.237518310546875

{'key': '19650025648', 'page': 3.0, 'chunk': 'The required input power t\nh\na t corresponds t o these specif icaSilver-zinc b\na\nt t e\nr\ni e s have a capacity of approximately On t\nh\ni s bas i s, the b\na\nt t e\nr\ny weight f\no\nr the 1-hour mission i\ns\n7 3 pounds.\nThis b\na\nt t e\nr\ny weight i s compatible with a\nt\no t a l spacecraft weight of 350 pounds, which the assigned launch vehicle - the'}
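To make the distance values above easier to interpret, here is a small pure-Python sketch of cosine distance, defined as 1 minus cosine similarity. It illustrates the metric only and is not Weaviate’s implementation; the function name is our own.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 for the same direction, 1 for orthogonal, 2 for opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

Because only direction matters, scaling a vector does not change its distance to the query. A value around 0.22, as in the first result, indicates vectors that point in broadly similar directions.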

We can also filter the objects before the vector search is applied. This is useful if your vector store holds different categories of items and you want to search only some of them. For example, we can restrict the results to the document with the key “20130014470”, the source of the first result above.


response = collection.query.near_vector(
    near_vector=text_embed.tolist(),  # your query vector goes here
    limit=5,
    filters=wvc.query.Filter.by_property("key").equal("20130014470"),
    return_metadata=wvc.query.MetadataQuery(distance=True)
)

This returns the following results, all from the same document:

{'key': '20130014470', 'page': 253.0, 'chunk': 'Mass: The total mass at launch was 1,964.6 pounds (891 kg), consisting of 1,290 pounds (585 kg) for the spacecraft and 674.6 pounds (306 kg) of hydrazine fuel.'}

0.21913838386535645

{'key': '20130014470', 'page': 258.0, 'chunk': 'low cost (< $80M), less than 1000 kg, kinetic impactor that would excavate water ice from the Moon, if it were present.\nThe program office noted that if the project were to breach $79M, weigh more than 1000 kg, or fail to meet schedule with the launch of LRO, it would be subject to a termination review.'}

0.26032769680023193

{'key': '20130014470', 'page': 18.0, 'chunk': 'If the Moon has water ice in sufficient quantities, it would represent a very compelling rationale for future exploration and lunar outposts could be located in the vicinity of this invaluable resource.\nWater ice, after all, could be converted to consumable water, breathable oxygen, and rocket fuel, and potentially even serve as a means for construction when combined with regolith or as shielding from solar radiation.\nDelivering mass to the moon is incredibly expensive and water is very heavy for a small volume.\nDelivering a ½ liter bottle of water to the Moon is projected to cost a minimum of $15,000 by weight.\nTherefore, having this resource available in situ to future explorers and inhabitants of the Moon is clearly worth investigating.'}

0.28107452392578125
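Conceptually, a filtered vector search narrows the candidates by property first and then ranks what is left by distance. The toy in-memory sketch below shows that order of operations; it is not how Weaviate works internally, and the function names and data layout are assumptions made for illustration.

```python
import math


def _cosine_distance(a, b):
    """1 - cosine similarity for two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def filtered_near_vector(objects, query, prop, value, limit=5):
    """Keep objects whose properties[prop] == value, then rank them by distance.

    `objects` is a list of dicts with "properties" and "vector" keys.
    Returns up to `limit` (distance, properties) pairs, closest first.
    """
    candidates = [o for o in objects if o["properties"].get(prop) == value]
    scored = [(_cosine_distance(o["vector"], query), o["properties"]) for o in candidates]
    scored.sort(key=lambda pair: pair[0])
    return scored[:limit]
```

An object that is very close to the query but fails the filter never appears in the results, which matches what we see in the output above.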

Local Instance

Another good reason to use Weaviate is that it is available as a Docker image containing the same product that Weaviate hosts. You can run it locally for prototyping, or host it privately if you have constraints that keep you from using the hosted version.

To run with Docker:

docker run -p 8080:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:1.25.5

To connect to the local instance, you just change the connection call for the client:

client = weaviate.connect_to_local()

There are a lot of available configurations, modules that can be loaded, and options for securing the instance.

Key Takeaways

Throughout this guide, we’ve explored the practical aspects of using Weaviate, a vector database, for storing and querying document embeddings. Let’s recap the main points:

Connection and Setup: We learned how to connect to a Weaviate instance using the Python client, including setting appropriate timeout values.
Data Insertion: We covered the process of creating a collection and inserting data objects with custom properties and pre-computed embeddings.
Vector Querying: We demonstrated how to perform similarity searches using vector queries, which is at the core of Weaviate’s functionality.
Result Filtering: We explored how to apply filters to narrow down search results based on specific properties.
Metadata Usage: We showed how to leverage metadata, such as distance metrics, to evaluate the quality of search results.

Next Steps

To further your understanding and application of Weaviate, consider exploring these areas:

Advanced Querying: Experiment with more complex queries, including hybrid searches that combine vector similarity with keyword (BM25) search, alongside the property filters shown above.
Performance Optimization: Investigate techniques for optimizing Weaviate’s performance, such as index tuning and query optimization.
Alternative Embedding Models: Experiment with different embedding models to see how they affect search quality for your specific use case. Learn more in our post “How to Pick an Embedding Model” on the CFI Blog (cohesionforce.com).
Weaviate Modules: Explore Weaviate’s additional modules, such as reranking, question-answering, or classification, to extend its capabilities. For more on the power of reranking in RAG workflows, see our post “Understanding the Infer-Retrieve-Rank (IReRa) Framework” on the CFI Blog (cohesionforce.com).
Integration with the Prompt Automation Framework DSPy: Investigate Weaviate’s deep integrations with frameworks such as DSPy, covered in our post “DSPy: Revolutionizing Complex System Development with Language Models” on the CFI Blog (cohesionforce.com).
