02. 10. 2024 Davide Sbetti Log Management, Log-SIEM, Machine Learning, NetEye

Perform KNN Classification Using Elasticsearch

Hey everyone!

We played around a bit last time with our radar data to build a model that we could train outside Elasticsearch, loading it through Eland and then applying it using an ingest pipeline.

But since our data is in the form of vectors, could we actually exploit Elasticsearch's vector database functionality and build a sort of K-Nearest Neighbors classifier?

Of course we can! Let’s dive right into this (as usual, the code shown in the article can be found attached as a Jupyter Notebook).

Connecting to Elasticsearch

We start by creating the connection to Elasticsearch, using the Python Elasticsearch client.

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from collections import Counter
import json

es = Elasticsearch(
    hosts=["https://localhost:9200"],
    basic_auth=(<username>, <password>)
)

In terms of authentication, use whatever’s most suitable for your case, keeping in mind that the client also supports certificate-based authentication.
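For instance, a minimal sketch of a certificate-based setup could look like the following (the file paths are just placeholders for your own CA and client certificate files):

es = Elasticsearch(
    hosts=["https://localhost:9200"],
    ca_certs="/path/to/ca.crt",          # CA bundle used to verify the server certificate
    client_cert="/path/to/client.crt",   # client certificate presented to Elasticsearch
    client_key="/path/to/client.key"     # private key belonging to the client certificate
)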

Loading the Dataset

As we did last time, we can use pandas to load the dataset and inspect it a bit.

radar_data = pd.read_csv("ionosphere.csv")

radar_data.head()
attribute1   attribute2   attribute3   …   attribute34   class
0.99539      -0.05889     0.85243      …   -0.45300      good
1            -0.18829     0.93035      …   -0.02447      bad
1            -0.03365     1            …   -0.38238      good
1            -0.45161     1            …   1             bad
An example of how our data looks. Nice, right?

Okay, as we can see from the structure of the dataset, the numerical attributes are spread across different columns. But to exploit Elasticsearch's vector functionality we need them all in a single field containing the actual feature vector.

So let’s reshape the dataset a bit to bring it into this form, while also extracting features (X) and labels (y) into different variables:

X = radar_data.iloc[:,:34].values.tolist()

y = radar_data["class"].to_list()
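As a quick optional sanity check (assuming the CSV contains exactly 34 attribute columns plus the class column), we can verify that each feature vector has the 34 dimensions we'll later declare in the index mapping:

# Number of samples, features per sample, and labels
print(len(X), len(X[0]), len(y))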

Splitting the Dataset

To be able to classify some new samples based on their neighbors, we need to divide the dataset into two different subsets:

  • Training set: This is the set of data points that we will later use to classify new ones. Note that since in this case we aren’t really using a model but rather applying a distance metric, we don’t have an actual training step: indexing the documents will be enough.
  • Test set: These are the elements that we’ll classify based on the training set and on which we can then compute the accuracy of our classification, since we also know the real label assigned by the scientists.

In our case, we can for example decide to use 80% of the dataset for training purposes and the other 20% for testing:

X_train, X_test, y_train, y_test = train_test_split(
  X, 
  y, 
  test_size=0.2, 
  random_state=0, 
  shuffle=True
)
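As a small optional tweak that we don't use here, train_test_split also accepts a stratify argument, which keeps the proportion of good and bad samples roughly the same in both subsets:

X_train, X_test, y_train, y_test = train_test_split(
  X,
  y,
  test_size=0.2,
  random_state=0,
  shuffle=True,
  stratify=y  # preserve the class distribution in both splits
)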

Indexing the Data in Elasticsearch

Okay, time to move to the Elasticsearch side of the game. We'd now like to index our documents so that we can later run a KNN search over them.

As usual, in order to index the data in the format we'd like, we need to define the correct mappings for our index. In our case, the most important part is the field that will hold our attributes, since it needs to be of type dense_vector.

We will thus upload the following index template, using the Elasticsearch Python client:

{
    "mappings": {
        "properties": {
            "attributes": {
                "type": "dense_vector",
                "dims": 34,
                "index": true,
                "similarity": "l2_norm"
            },
            "class": {
                "type": "keyword"
            }
        }
    }
}
# We read the JSON template
template = None
with open("index_template.json", "r") as template_file:
    template = json.load(template_file)

# and then we load it using the associated function
if (template is not None and
        not es.indices.exists_index_template(name="radar-data")):
    es.indices.put_index_template(
        name="radar-data",
        index_patterns="radar-data",
        priority=500,
        template=template
    )

Now we can finally index our data. Please note that, since we'd like to index more than one document, we can use the bulk API, which expects us to provide a list of actions, one per document.

We can assemble the actual document in the _source field, since it's composed of just two fields, and then use the bulk helper to perform the bulk request.

# Create the list of actions to be sent to Elasticsearch
actions = [
    {
        "_index": "radar-data", 
        "_source": {
            "attributes": attributes,
            "class": c
        }
    } for attributes, c in zip(X_train, y_train)
]

# Perform the bulk operations
bulk(es, actions=actions)
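One small practical note, not strictly part of the original flow: newly indexed documents only become visible to searches after the index is refreshed, so if we query right after the bulk request it can be worth forcing a refresh:

# Make the freshly indexed documents immediately searchable
es.indices.refresh(index="radar-data")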

The Actual KNN Classifier

Okay, now it’s time to move to the classifier itself. First of all, what is a K-Nearest Neighbors classifier?

Well, the idea is quite simple: given a new element that we'd like to classify, we can base our classification on the known elements nearest to it, for example by taking the class expressed by the majority of its neighbors (a simple majority vote works nicely here, since we only have two classes).

In our case, since we're talking about vectors, an appropriate vector distance measure is used, such as the Euclidean distance (the l2_norm similarity we declared in our mapping) or the cosine similarity.

Okay, so why is it actually called K-Nearest Neighbors? The K plays a very important role, since it determines how many classified neighbors we should take into consideration when looking at a certain new data point.
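To make the voting idea concrete, here's a minimal plain-Python sketch (the neighbor labels are made up for illustration, and we reuse the Counter class imported at the top); in the actual implementation below, Elasticsearch will take care of this step for us through an aggregation:

from collections import Counter  # already imported at the top of the notebook

# Hypothetical labels of the k = 5 nearest neighbors of a new sample
neighbor_labels = ["good", "good", "bad", "good", "bad"]

# Majority vote: the most frequent label among the neighbors wins
predicted_class = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted_class)  # good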

So this is the core idea, and the nice part is that we can actually do the whole job using Elasticsearch.

How? Using a KNN query of course! This type of query allows us to obtain the K nearest neighbors of a data point.

Note: to be more precise, with this type of query we're performing an approximate KNN search. Why approximate? Because, in addition to k, the query accepts a num_candidates parameter, the number of candidate neighbors to consider on each shard. To speed up the search, Elasticsearch first collects that many candidates from each shard (based on the HNSW algorithm) and then computes the similarity with the provided vector only for those elements, not for the full set of documents in the shard. This boosts performance but can affect accuracy, since the result may not always contain the exact k nearest neighbors.
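If exactness matters more than speed, Elasticsearch also offers an exact, brute-force alternative based on a script_score query. Just as a rough sketch of what that could look like on our index (it's not used in the rest of the article, and the scoring formula follows the pattern suggested by the Elasticsearch documentation for vectors indexed with l2_norm):

# Exact nearest-neighbor search: every document in the index is scored
# against the query vector, so no candidates can be missed
exact_results = es.search(
    index="radar-data",
    query={
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "1 / (1 + l2norm(params.query_vector, 'attributes'))",
                "params": {"query_vector": X_test[0]}
            }
        }
    },
    size=5  # keep only the 5 best-scoring documents, i.e. the 5 nearest neighbors
)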

Furthermore, we can compute the most frequent class among those neighbors directly with a terms aggregation, which spares us any particular post-processing of the response beyond extracting the result.

y_pred = []

for test_instance in X_test:

    search_results = es.search(
        index="radar-data",
        knn={
            "field": "attributes",
            "query_vector": test_instance,
            "k": 5,
            "num_candidates": 50
        },
        fields=["class"],
        source=False,
        aggregations={
            "top_class": {
                "terms": {
                    "field": "class",
                    "size": 1
                }
            }
        }
    )
    aggs_result = search_results["aggregations"]
    pred_class = aggs_result["top_class"]["buckets"][0]["key"]
    y_pred.append(pred_class)

And now that we’ve accumulated the predicted classes, we can calculate the accuracy using the accuracy_score function:

accuracy_score(y_test, y_pred)

which in our case returns about 83%, not too bad compared with the 89% we obtained from the decision tree in our last experiment.

Conclusions

In this article we saw how it’s possible to use Elasticsearch as a vector database to perform KNN searches, and how this can already be used out-of-the-box as a KNN classifier.

And since our goal was not yet to obtain the best possible classifier, we didn't explore many different values for the neighbor parameters (k and num_candidates), so feel free to play with them and explore further optimizations 😀
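As a starting point for such an exploration, a minimal sketch (assuming the index built above is still in place) could simply repeat the evaluation for a few values of k:

# Try a few values of k and compare the resulting accuracy
for k in [1, 3, 5, 7, 11]:
    y_pred_k = []
    for test_instance in X_test:
        results = es.search(
            index="radar-data",
            knn={
                "field": "attributes",
                "query_vector": test_instance,
                "k": k,
                "num_candidates": 10 * k
            },
            source=False,
            aggregations={
                "top_class": {"terms": {"field": "class", "size": 1}}
            }
        )
        buckets = results["aggregations"]["top_class"]["buckets"]
        y_pred_k.append(buckets[0]["key"])
    print(k, accuracy_score(y_test, y_pred_k))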

Author

Davide Sbetti

Hi! I'm Davide and I'm a Software Developer with the R&D Team in the "IT System & Service Management Solutions" group here at Würth Phoenix. IT has been a passion for me ever since I was a child, and so the direction of my studies was...never in any doubt! Lately, my interests have focused in particular on data science techniques and the training of machine learning models.
