Published on September 22nd, 2016 | by Martin Guerrero
HDRs Optimize Rule-Based and Machine Learning Classification of Network Breaches
SS8 has developed High-Definition Records (HDRs) as an answer to the lack of total network visibility that most enterprises face today. HDRs provide more detailed information than NetFlow records by adding Application Level Metadata to the record, such as the “To:” or “From:” field of an email, and they even include User Identity information.
High-Definition Records are extremely useful in optimizing the accuracy of classification between breach and non-breach events, whether we use a Rule-Based System or an artificially intelligent Machine Learning system for classification and detection.
How HDRs help breach classification using a Rule-Based System
You can find a scenario using a Rule-Based System in the “NetFlow vs. HDR” white paper:
On the server is an email waiting to be downloaded. This particular email is a spear phishing attack containing a malicious attachment. Because your company is protected by SS8 BreachDetect, an alert has already been raised. The moment the malicious email entered the network and landed on the mail server – hours before the user checked his email – SS8 BreachDetect processed an HDR containing metadata identifying the malicious attachment. The attachment’s MD5 hash value matched an indicator-of-compromise reported by a threat feed, thereby triggering an alert and allowing remediation of the threat.
NetFlow, lacking rich application-level metadata, would not contain the information required to detect this type of attack.
This is an example of a Rule-Based System of Classification: given HDRs, a breach detection system can classify a record or event as a breach if the MD5 hash of the event’s attachment matches an entry in a threat feed. In this case, humans have compiled a list of indicators of compromise in a threat feed, and because HDRs are enriched with application metadata, a breach detection system that uses HDRs can easily classify the event as a breach.
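The rule described above can be sketched in a few lines of Python. This is only an illustration, not SS8 BreachDetect's implementation: the field name `attachment_md5`, the in-memory threat feed, and the sample records are all assumptions made for the example.

```python
import hashlib

# Hypothetical threat feed: MD5 hashes reported as indicators of compromise.
# The sample bytes stand in for a real malicious attachment.
THREAT_FEED_MD5 = {hashlib.md5(b"sample malicious attachment").hexdigest()}


def classify_by_rule(record, threat_feed):
    """Return "breach" if the record's attachment hash is in the threat feed."""
    attachment_md5 = record.get("attachment_md5")  # hypothetical HDR field name
    if attachment_md5 is not None and attachment_md5.lower() in threat_feed:
        return "breach"
    return "non-breach"


# An HDR-like record carries the attachment hash; a NetFlow-like record does not.
hdr_record = {
    "source_ip": "10.0.0.5",
    "email_to": "jdoe@example.com",
    "attachment_md5": hashlib.md5(b"sample malicious attachment").hexdigest(),
}
netflow_record = {"source_ip": "10.0.0.5", "bytes": 48213}

print(classify_by_rule(hdr_record, THREAT_FEED_MD5))
print(classify_by_rule(netflow_record, THREAT_FEED_MD5))
```

The NetFlow-like record can never trigger the rule, because the attribute the rule tests is simply not part of the record.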
Rule-Based System versus Machine Learning System to solve Breach Classification Problems
HDRs also improve the accuracy of classification using Machine Learning. The difference between a Rule-Based System and a Machine Learning System of Classification is that in a Rule-Based System, humans create rules (like if-then) to detect a breach based on their own wisdom and experience. In a Machine Learning System of Classification, humans present data to a computer, which then determines classification rules. As the computer receives more data, it can begin to adjust its own rules without human intervention, and so the machine learns.
Rule-Based Systems can become unwieldy to maintain over time, while Machine Learning Classification Systems can adjust more easily. Machine Learning methods could also classify additional events as breaches without an explicit match, such as a match against a threat feed. This would be beneficial because these methods could potentially discover new attacks without having to rely on known attack vectors found in threat feeds.
How HDRs help breach classification using a Machine Learning System
The Machine Learning Classification System relies on good data to help classify records. In NetFlow records, only network data such as IP addresses, ports, and flow information (such as flow start time and end, as well as the number of bytes transferred) are present. High-Definition Records also include application-level data such as the “To:” or “From:” fields in an email, or the URL in a Web session.
Here’s an example of an SS8 HDR record. The “User-Identity Enrichment” section and the section specifying Server Name, Serial Name, and Common Name are examples of enriched application-level information that is present in an HDR but not in a NetFlow record:
Each field, such as “Source IP:” or “Server Name”, is called a “Feature”. Intuitively, you can see that having extra Features, such as the “To:” field in an email, can help the Machine Learning process classify records correctly.
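To make the idea of Features concrete, here is a small Python sketch that turns a record into a feature vector. The field names are illustrative placeholders based on the fields mentioned in this post; they are not the actual NetFlow or SS8 HDR schema.

```python
# Illustrative field names only; not the real NetFlow or HDR schema.
NETFLOW_FEATURES = (
    "source_ip", "dest_ip", "source_port", "dest_port",
    "start_time", "end_time", "bytes",
)
# An HDR keeps the network features and adds application-level ones.
HDR_FEATURES = NETFLOW_FEATURES + (
    "email_to", "email_from", "url", "server_name", "user_identity",
)


def to_feature_vector(record, feature_names):
    """Pick the named features from a record, using None for missing fields."""
    return tuple(record.get(name) for name in feature_names)


record = {
    "source_ip": "10.0.0.5",
    "dest_ip": "203.0.113.10",
    "source_port": 51544,
    "dest_port": 443,
    "start_time": 1474538400,
    "end_time": 1474538460,
    "bytes": 48213,
    "server_name": "dropbox.com",
    "user_identity": "jdoe",
}

netflow_vector = to_feature_vector(record, NETFLOW_FEATURES)
hdr_vector = to_feature_vector(record, HDR_FEATURES)
print(len(netflow_vector), len(hdr_vector))
```

The same event yields a longer vector when viewed as an HDR, and the extra positions are the ones a classifier can use to tell otherwise similar flows apart.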
In a previous post, I gave an overview of a particular classification problem using the Machine Learning algorithm called k-Nearest Neighbor. Suppose we apply k-Nearest Neighbor and our history contains many breaches that involve “dropbox” and the user “jdoe”. When we evaluate the particular record above, the distance between it and those historical HDR records (which were previously classified as breaches) is small, which makes it more likely that the record we are evaluating is classified as a breach.
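The following Python sketch illustrates that reasoning. It assumes a simple mismatch-count (Hamming) distance over categorical fields and a majority vote among the k closest historical records; the distance function, the field names, and the sample history are assumptions for illustration and may differ from what the earlier post or any SS8 product uses.

```python
from collections import Counter


def hamming_distance(a, b, features):
    """Count how many of the named features differ between two records."""
    return sum(1 for name in features if a.get(name) != b.get(name))


def knn_classify(record, history, features, k=3):
    """Label a record by majority vote among its k nearest historical records."""
    ranked = sorted(
        history, key=lambda item: hamming_distance(record, item[0], features)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled history: (record, label) pairs.
history = [
    ({"source_ip": "10.0.0.5", "application": "dropbox", "user": "jdoe"}, "breach"),
    ({"source_ip": "10.0.0.6", "application": "dropbox", "user": "jdoe"}, "breach"),
    ({"source_ip": "10.0.0.7", "application": "webmail", "user": "asmith"}, "non-breach"),
    ({"source_ip": "10.0.0.8", "application": "webmail", "user": "bjones"}, "non-breach"),
    ({"source_ip": "10.0.0.9", "application": "dropbox", "user": "jdoe"}, "breach"),
]
new_record = {"source_ip": "10.0.0.99", "application": "dropbox", "user": "jdoe"}

HDR_LIKE = ("source_ip", "application", "user")
NETFLOW_LIKE = ("source_ip",)

# With HDR-like features, past "dropbox"/"jdoe" breaches are clearly closer.
hdr_distances = [hamming_distance(new_record, r, HDR_LIKE) for r, _ in history]
# With only NetFlow-like features, every historical record is equally far away.
netflow_distances = [hamming_distance(new_record, r, NETFLOW_LIKE) for r, _ in history]

print(hdr_distances)
print(netflow_distances)
print(knn_classify(new_record, history, HDR_LIKE, k=3))
```

With the HDR-like features, the three nearest neighbors are all historical breaches. With only the NetFlow-like feature, all distances are identical, so the nearest neighbors carry no useful signal.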
On the other hand, if we had only NetFlow records, Machine Learning Classification would be much less accurate because we would only be able to use fields or features such as “Source IP,” “Application” (which would be TCP), and the number of bytes or packets transferred. Machine Learning Classification could still work, but only at a much simpler level. To generate more accurate Breach Classification results and stop more sophisticated attacks using Machine Learning methods, you need to go deeper into communication flows, as HDRs do.
Martin Guerrero is a senior software engineer at SS8. His background includes intrusion detection, communications analytics, databases, and machine learning.