Leveraging BM25 and Vector Search in a Local RAG Application

Artificial Intelligence / April 1, 2025 • 9 min read
Tags: rag bm25 vector search sqlite3 projects

As a cybersecurity consultant, I write numerous reports documenting security vulnerabilities discovered in a customer’s application or network. Sometimes there is a considerable number of findings, which forces me to spend more of my hours writing the report instead of discovering vulnerabilities. For example, I once wrote a report that contained 70+ security vulnerabilities. It took considerable time to write and to QA (quality assurance).

To reduce the hours spent on report writing, I turned to AI. Could I use AI to automatically expand my raw notes into well-defined security findings? It turns out I could: with the right prompting and a suitable model, my raw notes were expanded into well-structured findings.

While the text expansion worked well, I had a second problem. My team has developed almost 300 vulnerability templates over the years. When we identify, for example, a Cross-Site Scripting (XSS) vulnerability, we copy our XSS template, which contains a general description of the issue and relevant recommendations, and then add the specific vulnerability description. We have templates for web security, network security, Active Directory, AWS, and more.

The problem then becomes: once I have expanded my raw note, how do I automatically merge it with the correct template?

I experimented with a few solutions using Mistral Small 24B:

Providing templates to the AI

Provide the AI with the expanded note and the set of templates to choose from, and let it pick the most relevant template. While this worked well, it does not scale: it is infeasible to send 300+ templates to the AI, even for models that claim to support 100k-token contexts.

Providing a summary of each template to the AI

My next experiment was to provide the AI with my note and a JSON object containing the filename and keywords for each template. This reduced the number of tokens sent to the AI and also worked well. However, there were several cases where the AI made an incorrect choice. I believe this was due to the keywords: perhaps they overlapped too much with those of a different template.

Final solution: RAG

Instead of sending templates or keywords to the AI, what if I gave the AI the tools to search a database for the relevant templates?

For my final solution, I implemented a Retrieval Augmented Generation (RAG) system - a framework that enhances AI outputs by first retrieving relevant knowledge from a database and then using that information to generate accurate, contextually appropriate responses. In my case, this meant creating a system that could search our extensive template library and find the most relevant security vulnerability templates to pair with my expanded notes.

I created a hybrid approach using BM25 and vector embeddings: all templates are indexed with BM25, and an embedding is created for each template. By default, the solution returns the five most relevant templates. To increase search accuracy, I added a reranking step to ensure that the most relevant templates are ordered first.

This solution works very well and does not require much computation.

Best of all, it’s completely local.

A simple visualization of the system can be seen below:

Figure 1: The RAG pipeline

I chose the hybrid approach because neither embeddings nor BM25 alone provides 100% accuracy, but together they come close enough.

For the remainder of the article, I will describe what BM25, embeddings, and reranking are and how they complement each other. By the end, you should have a few practical tips for building your own solution.

BM25

Best Matching 25 (BM25) is a ranking function used by, for example, search engines to rank documents according to their relevance to a search query. BM25 is an improvement over TF-IDF (Term Frequency-Inverse Document Frequency), which scores documents by how often a search term appears in a document (term frequency) and how rare that term is across all documents (inverse document frequency).

TF-IDF lacks term frequency saturation: the weight of a term keeps growing with its frequency, so repeating a term many times keeps inflating the score. TF-IDF also does not account for document length, which may give a document a higher score simply because the search term appears often, even though the document is irrelevant. BM25 accounts for both shortcomings by saturating term frequency and normalizing for document length.
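For reference, here is a common formulation of the BM25 score for a query Q = q_1, …, q_n against a document D (the rank-bm25 library used below sets reasonable defaults for the parameters):

\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

Here f(q_i, D) is how often term q_i appears in D, |D| is the document length, and avgdl is the average document length in the corpus. The parameter k_1 controls how quickly repeated occurrences of a term stop increasing the score (saturation), and b controls how strongly the score is normalized by document length.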

The following code snippet uses the rank-bm25 library to compute BM25 scores.

from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
# <rank_bm25.BM25Okapi at 0x1047881d0>

query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)
# array([0.        , 0.93729472, 0.        ])

bm25.get_top_n(tokenized_query, corpus, n=1)
# ['It is quite windy in London']

The output above shows that the query “windy London” matches the second document with a score of 0.93729472 (note that BM25 scores are not normalized to a 0–1 range; they are only meaningful relative to each other), and the matching document is printed on the last line.

Embeddings

Embeddings are numerical values stored in a multidimensional vector, which can be thought of as a point in space (often visualized in 2D or 3D after dimensionality reduction). Each point carries semantic meaning, and points that are semantically similar are clustered together. The similarity between two points is typically calculated using cosine similarity, though it is also possible to use Euclidean distance or the dot product. Searching with embeddings is usually called vector search.
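To make the idea of cosine similarity concrete, here is a minimal sketch using NumPy. The three-dimensional vectors for “cat”, “kitten”, and “car” are made up purely for illustration; real embedding models output vectors with hundreds of dimensions.

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||): 1.0 means same direction, 0.0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy three-dimensional vectors, purely illustrative
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.85, 0.15, 0.05])
car = np.array([0.1, 0.05, 0.95])

print(cosine_similarity(cat, kitten))  # close to 1.0 -> semantically similar
print(cosine_similarity(cat, car))     # close to 0.1 -> not very similar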

Embedding vectors can represent objects such as words, sentences, or images. You can use SentenceTransformers to load an embedding model and create vector embeddings. For example:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The weather is so nice!",
    "It's so sunny outside.",
    "He's driving to the movie theater.",
    "She's going to the cinema.",
]

embeddings = model.encode(sentences)

print(embeddings.shape)
# Output: (4, 384)

similarities = model.similarity(embeddings, embeddings)
print(similarities)

# Output:
# tensor([[1.0000, 0.7235, 0.0290, 0.1309],
#         [0.7235, 1.0000, 0.0613, 0.1129],
#         [0.0290, 0.0613, 1.0000, 0.5027],
#         [0.1309, 0.1129, 0.5027, 1.0000]])

The example above returns a tensor, in this case a two-dimensional matrix, where each row corresponds to a sentence and each element is the similarity to the corresponding sentence.

Figure 2 visualizes the similarities between the sentences. The diagonal shows that each sentence is 100% (1.0) similar to itself, sentences one and two are similar to each other, and sentences three and four are similar to each other. Meanwhile, sentences three and four are not similar to sentences one and two.

Figure 2: Table with similarities between sentences

I mentioned that vector embeddings can be visualized as points in 3D space. Below is a 3D representation of points corresponding to various items.

As can be seen, similar items are clustered together, and this is what makes embeddings so powerful: they allow computers to capture semantic similarity.

Vector Databases

Vector databases are purpose-built for storing and retrieving embeddings. They support nearest-neighbor search algorithms such as ANN (Approximate Nearest Neighbor) search, which lets you quickly find similar vectors without exhaustively comparing against every vector in the database.

There are several options to choose from when selecting a vector database. If you are interested in running a local solution, you might consider:

  • PostgreSQL with the pgvector extension
  • Qdrant
  • Milvus
  • ChromaDB

If you are considering a commercial approach, the following products might be of interest:

  • Zilliz
  • Pinecone
  • Weaviate
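To give a feel for how such a database is used, here is a minimal sketch with ChromaDB, one of the local options above. The collection name, documents, and query are placeholders; by default ChromaDB embeds the documents with a built-in model unless you supply your own embedding function.

import chromadb

# In-memory client; use chromadb.PersistentClient(path="./db") to keep the data on disk
client = chromadb.Client()
collection = client.create_collection(name="vulnerability_templates")

# Add documents; ChromaDB embeds them with its built-in model unless you pass an embedding function
collection.add(
    ids=["xss", "sqli"],
    documents=[
        "Cross-Site Scripting (XSS) allows attackers to inject scripts into web pages.",
        "SQL injection occurs when untrusted input reaches a database query.",
    ],
)

# Query with natural language and retrieve the nearest document
results = collection.query(query_texts=["script injected into a search field"], n_results=1)
print(results["documents"])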

Chunking Strategies

It is also important to be aware of various chunking strategies when creating embeddings. I mentioned that embeddings capture semantic meaning, but if you want to create embeddings for an entire book, you most likely want to chunk the text into a number of sentences or paragraphs and embed each chunk. However, if you cut a chunk too early, for example in the middle of a paragraph, you might lose important semantic meaning.

Additionally, too small chunks might split related concepts and lose contextual relationships. Too large chunks can dilute specific topics and make retrieval less precise.

There are several chunking approaches one can choose from, including:

  • Sentence-based chunking: Clean grammatical boundaries but may separate related ideas
  • Paragraph-based chunking: Often preserves complete thoughts but varies greatly in length
  • Fixed-size chunking with overlap: Helps maintain context across chunk boundaries
  • Semantic chunking: Tries to divide by topic shifts rather than arbitrary boundaries

In my project, I went with the simple approach of creating a chunk for every 32 characters. Since the results have been promising, I decided not to spend too much time optimizing the chunking strategy.
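For readers who want a starting point, here is a minimal sketch of fixed-size chunking with overlap. The chunk size and overlap values are arbitrary examples, not the values from my project.

def chunk_text(text, chunk_size=200, overlap=50):
    # Slide a fixed-size window over the text; consecutive chunks share `overlap` characters
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

chunks = chunk_text("A long vulnerability template ... " * 50)
print(len(chunks), repr(chunks[0][:40]))

The overlap is what keeps a sentence that straddles a chunk boundary from being split away from its context entirely.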

Pitfalls to be aware of when working with embeddings

While embeddings can provide immense value, it is important to know how to use them correctly. I previously described chunking strategies for creating embeddings. Some models have a large context window and can in theory handle larger chunks, while others have a smaller context window and therefore require more careful chunking.

Embedding models trained on a specific corpus inherit the biases present in that data, which can perpetuate stereotypes or outdated associations. This also highlights an important fact: if models are not continuously retrained, they eventually experience temporal degradation. Models need to be updated as languages evolve and new concepts are introduced in society.

Furthermore, when training models it is important to understand the underlying dataset. Models trained on one domain, such as medical documents, will often perform poorly when applied to another domain, such as legal documents.

Additionally, if the model encounters rare words that appear infrequently in the dataset, the result can be low-quality embeddings.

Creating embeddings across multiple languages remains a challenge, especially for languages with different structures.

Reranking

Reranking is a step in, for example, a RAG pipeline where a reranking model receives a set of documents from a search and reorders them by relevance to the search query. Many rerankers are cross-encoders, meaning they process the search query and a document together as a pair, which allows for richer interaction between the two. It is also possible to use a large language model as a reranker.

Rerankers are important in a RAG flow because they can greatly enhance the final result. Vector search is good at finding documents that are similar to the search query, but it might not capture nuance or context; this is what a reranker improves upon. Given a list of documents, the reranker ensures the most relevant ones are ordered first. However, this process is computationally heavy and slows down the RAG pipeline.

A popular cross-encoder is ms-marco-MiniLM-L-6-v2, which can be used as follows:

from sentence_transformers import CrossEncoder

# Load a pre-trained CrossEncoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Predict scores for a pair of sentences
scores = model.predict([
    ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."),
    ("How many people live in Berlin?", "Berlin is well known for its museums."),
])
# => array([ 8.607138 , -4.3200774], dtype=float32)

The output shows that the first document has a score of 8.607138 while the second document has a score of -4.3200774, indicating that it is far less relevant to the query.
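To tie the pieces together, here is a minimal sketch of a hybrid search that combines BM25, vector search, and cross-encoder reranking using the libraries shown above. The template texts, the query, and the way candidates from the two retrievers are merged are simplified illustrations, not my exact implementation.

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder

# Placeholder template texts; in practice these would be the full vulnerability templates
templates = [
    "Cross-Site Scripting (XSS) allows attackers to inject scripts into web pages...",
    "SQL injection occurs when untrusted input is concatenated into database queries...",
    "Missing HTTP security headers such as Content-Security-Policy weaken browser protections...",
]

# Index the templates once: BM25 over tokenized text, plus one embedding per template
bm25 = BM25Okapi([doc.lower().split() for doc in templates])
embedder = SentenceTransformer("all-MiniLM-L6-v2")
template_embeddings = embedder.encode(templates)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def search(query, top_k=5):
    # 1. Lexical candidates from BM25
    bm25_scores = bm25.get_scores(query.lower().split())
    # 2. Semantic candidates from vector search (cosine similarity)
    cos_scores = embedder.similarity(embedder.encode([query]), template_embeddings)[0]
    # 3. Take the best candidates from both retrievers (deduplicated, order preserved)
    best_bm25 = sorted(range(len(templates)), key=lambda i: bm25_scores[i], reverse=True)[:top_k]
    best_vec = sorted(range(len(templates)), key=lambda i: float(cos_scores[i]), reverse=True)[:top_k]
    candidates = list(dict.fromkeys(best_bm25 + best_vec))
    # 4. Let the cross-encoder order the candidates by relevance to the query
    rerank_scores = reranker.predict([(query, templates[i]) for i in candidates])
    ranked = sorted(zip(candidates, rerank_scores), key=lambda pair: pair[1], reverse=True)
    return [templates[i] for i, _ in ranked]

print(search("reflected script injection in the search field")[0])

Because the cross-encoder only scores a handful of candidates rather than the whole template library, the reranking step stays fast even though cross-encoders are relatively expensive.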

Conclusion

Building a local search solution using BM25 and embeddings was very fun and highly rewarding. Now my colleagues and I can search for templates using natural language and be confident that we get the most relevant documents. Furthermore, now that a search foundation exists, a next step could be to build agents that utilize this search capability to find vulnerability templates.

Next Steps to Build Your Own RAG Solution:

  1. Start Small: Begin with a focused collection of documents that would benefit from semantic search. Security templates, documentation, or frequently referenced materials are ideal candidates.

  2. Choose Your Tools: Consider whether a simple solution with libraries like rank-bm25 and sentence-transformers meets your needs, or if you need a dedicated vector database like Qdrant or ChromaDB.

  3. Experiment with Models: Test different embedding models based on your domain. If you work in a specialized field, domain-specific models can significantly improve results.

If you’re implementing a similar solution, I’d love to hear about your experience or answer any questions.