Published on August 18th, 2016 | by Martin Guerrero
Machine Learning Simplified: Classification Using the k-Nearest Neighbor Technique
Don’t let the words Machine Learning intimidate you. It can be learned, even if you aren’t a machine. Machine Learning explores the study and construction of algorithms that can learn from and make predictions on data. These algorithms operate by building a model from example inputs to make data-driven decisions, rather than following static program instructions.
Machine Learning (ML) is particularly effective when it comes to solving breach detection problems, because it allows for smarter classification and enrichment of HDRs, which gives you a lot more visibility into the goings on of your network traffic.
The problem I’ll focus on in this article is an arbitrary Classification Problem, which I’ll solve below using an ML technique called k-Nearest Neighbor (kNN).
The Classification Problem:
Given current weather data and conditions, evaluate whether or not it would be a good idea to go hiking.
A) The Data
Before it can evaluate the problem, the Nearest Neighbor algorithm requires a good set of data that is properly labelled with the true Answer. Either it is good to go hiking or it is not good to go hiking now. This is a ML Category called Supervised Learning because it requires a lot of previous data which is properly labelled with the True State of Nature (the real Answer). Unsupervised Learning in Machine Learning does not require previously labelled data.
B) Example Labelled Data:
In this case, the Good for Hiking? column is the True State of Nature, whose answers are well known based on previous data.
C) What we are evaluating:
Let’s say it is Saturday July 30 at 14:58. The user is Joe and it is not raining outside.
The Classification problem that needs to be solved is to determine, based on that information, whether or not it is a good time to go hiking.
D) How do we evaluate using k-Nearest Neighbor?
The basic idea of 1-Nearest Neighbor is to take the current data vector we are evaluating (July 30, Saturday, 14:58, Joe, Not Raining) and see which of the data in our database is most similar to the current data vector. If the nearest and most similar neighbor is “Yes,”, then we evaluate that vector as “Yes, it is a good time to go hiking.” If the nearest and most similar neighbor is “No,”, then we evaluate the current data vector as “No, it is not a good time to go hiking.”
3-Nearest Neighbor is similar: the algorithm finds the three nearest and most similar neighbors to the current data vector. If two or more of the three closest neighbors are “Yes,” then we evaluate the vector as “Yes, go hiking.” Otherwise, we evaluate the vector as “No, don’t go hiking.”
E) How do we calculate similarity?
We can calculate how similar one vector is from another by calculating the distance between those vectors. The lower the distance value, the more similar those vectors are.
If we consider a two dimensional graph, and if we have a vector (x1,y1) and another vector (x2,y2), the distance between them is based on the Pythagorean Theorem: C^2 = A^2 + B^2. In other words:
If you want to extend this to N-dimensions, you can:
F) How do we calculate this distance if the values in the vector are not pure numbers?
You can still calculate the distance of one vector (such as the vector we are trying to evaluate) with a vector in the test set even if the values in a particular column are not pure numbers. We just have to create special rules for them.
Some of the rules we can create arbitrarily are:
- If ((Day Of Week ) == (Day of Week )) then use value of 0 else, use value of 1.
- If ((Month:) == (Month:)) then use value of 0, else, use value of 1.
- If (Day of Month == Day of Month) then use value of 0, else, use value of 1.
- If (User == User) then use value of 0, else use value of 1.
- If time, first convert time to minutes and find total distance in minutes between the two times, then divide the difference by 60*24, or number of minutes in a day. We do this so that the value of any attribute does not overpower any other attribute.
- If (Rain == Rain) then use value of 0, else use value of 1.
G) Sample distance calculation:
Our evaluation vector is:
(July 30, Saturday, 14:58, “Joe”, “Not Raining”)
So let’s compare this with one of our known test cases:
(December 21, Wednesday, 22:50, “Bob”, “Not Raining”)
The distance between the two will be:
distance = sqrt ( (Saturday – Wednesday)^2 + (July – December)^2 + (30-21)^2 + ((14:58 – 22:50) / 60*24)^2 + (Joe – Bob)^2 + (Not Raining – Not Raining)^2)
distance = sqrt ( (1^2 + 1^2 + 1^2 + (472/(60*24))^2 + 1^2 + 0^2 )
distance = sqrt ( 1 + 1 + 1 + (.328)^2 + 1 + 0)
distance = sqrt ( 1 + 1 + 1 + 0.107584 + 1 + 0)
distance = 2.027
Repeat, calculating the distance from the vector we are evaluating:
(July 30, Saturday, 14:58, “Joe”, “Not Raining”)
And compare it with all elements in the database.
If we choose 1-Nearest Neighbor, then we can find the data vector in the database which is the most similar, or nearest to the evaluation vector. Then the true “State of Nature” will be the State of Nature of the most similar data vector in the database.
I) The Results:
Let us evaluate: (July 30, Saturday, 14:58, “Joe”, “Not Raining”)
The Time 898 is the equivalent of 14:58 converted into minutes: (14*60)+58 = 898.
- If we are to use 1-Nearest Neighbor:
The nearest neighbor is Vector #1 with a distance of 1.414.
And since Vector #1 says “Yes, go hiking”, then our evaluation would be: Yes, go hiking.
- If we are to use 3-Nearest Neighbor:
The three nearest neighbors are:
- Vector #1: 1.414 => Yes
- Vector #6: 1.733 => Yes
- Vector #7: 1.734 => No
So the majority say: Yes, it is a good time to go hiking.
When evaluating classification problems, there are many other things to consider including making sure one attribute does not overpower another, or choosing the proper features and attributes (such as Day of Week or Day of Month), or choosing how to interpret those features. For instance, do we just do binary 0 and 1? Or do we use the Pythagorean theorem to calculate distance between two data values in a particular attribute?
But all that can be answered another time. If you’d like to learn more about how you can use machine learning can make your ongoing breach detection efforts easier, click here.
Martin Guerrero is a senior software engineer at SS8. His background includes intrusion detection, communications analytics, databases, and machine learning.