Evaluating content inspection engines for data classification applications

Many organizations recognize the emerging need to discover, classify, and protect their sensitive information stored in cloud applications (SaaS) and infrastructure (IaaS) via a dedicated cloud content inspection process. However, cloud-native detection engines are a relatively new technology, and many corporate Information Security teams and Product Security developers are, understandably, not yet familiar with how to evaluate cloud content detection effectively.

Some cloud DLP vendors capitalize on this lack of standardized evaluation criteria, promoting attention-grabbing statistics on their websites such as “99% accuracy.” While these numbers may make you feel like you’re getting a great solution, they mean little without additional context, and ultimately can distract you from the key questions you should be asking.

Nightfall is here to help you navigate the process of evaluating cloud content detection engines, so we’ve put together some tips provided by our expert team of data scientists and machine learning engineers. Here’s what you should really be looking for when evaluating a detection engine for data loss prevention (DLP) or other content inspection, beyond the splashy numbers.

Detection Scope

The first thing you should understand is the scope of information types that can be detected. Any vendor can claim nearly 100% accuracy by simply restricting the types of information their solution can detect - perhaps they can detect highly structured information such as social security numbers ~100% of the time. But is that the only type of information you are interested in detecting?

Some vendors may also inflate their numbers by counting how often they correctly predict a negative outcome, i.e. the absence of a sensitive finding. But sensitive information typically occurs at very low rates (e.g. 1 out of 100 tokens) to begin with - so an engine that always predicts a negative outcome automatically achieves high accuracy (e.g. 99%) from the base rate alone, without detecting anything. This type of claim cannot be used to differentiate between cloud DLP vendors.
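To make the arithmetic concrete, here is a minimal sketch in Python. It assumes the 1-in-100 rate from the example above and a made-up "detector" that never flags anything; the labels are synthetic, not real customer data.

```python
def accuracy(y_true, y_pred):
    """Share of tokens whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Share of truly sensitive tokens (label 1) that were flagged."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_positives = sum(y_true)
    return true_positives / actual_positives if actual_positives else 0.0


# Synthetic labels: 1 sensitive token (1) per 100 tokens, 1,000 tokens in total.
labels = ([1] + [0] * 99) * 10

# A hypothetical "detector" that always predicts the negative outcome.
always_negative = [0] * len(labels)

baseline_accuracy = accuracy(labels, always_negative)
baseline_recall = recall(labels, always_negative)

print(f"accuracy: {baseline_accuracy:.2f}")  # 0.99
print(f"recall:   {baseline_recall:.2f}")    # 0.00
```

The do-nothing detector scores 99% accuracy while finding none of the sensitive tokens, which is why an accuracy figure alone says little about detection quality.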

What you should really be looking for is an adequate detection scope. The vendor should be able to provide you with a list of sensitive data types that they can detect (or that they used in determining their accuracy numbers). Make sure the types of data you care about are represented on the list.

Detector training and testing conditions

In assessing the quality of detection, it’s also important to know what dataset was used to evaluate the detection engine. There is no standard industry dataset to benchmark against, and vendors can use any data they want in order to get the desired results. For example, very controlled datasets can be used to favor high accuracy numbers. But the accuracy claims will only hold true for you if the training or testing dataset looks like the information the detectors will encounter in the wild. File format and data structure are significant in determining the quality of detection.

Ask what type of data the vendor used to assess their accuracy claims - or even more importantly, ask what data they use to train and maintain the detection engine. The datasets they use should approximate real-world customer data.

Violation rate (recall versus precision)

A detection engine’s quality will also depend on how it performs in terms of recall (the share of truly sensitive items that it flags, a.k.a. sensitivity) versus precision (the share of its alerts that are correct, a.k.a. positive predictive value). The F-score combines the two into a single number, their harmonic mean. It’s important to assess the recall and precision of a detection engine in the context of your organizational needs.
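As a rough illustration, the standard definitions can be computed from three counts: true positives (correct alerts), false positives (incorrect alerts), and false negatives (missed sensitive items). The function name and the counts below are hypothetical, chosen only to show the arithmetic.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw alert counts.

    tp: alerts that were truly sensitive
    fp: alerts that were not sensitive (false alarms)
    fn: sensitive items that triggered no alert (misses)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical evaluation: 100 alerts raised, 80 of them correct,
# and 40 sensitive items missed entirely.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)

print(f"precision: {precision:.3f}")  # 0.800
print(f"recall:    {recall:.3f}")     # 0.667
print(f"F1:        {f1:.3f}")         # 0.727
```

Asking a vendor for these underlying counts, rather than a single headline number, lets you recompute each metric yourself.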

A bias toward recall means you’ll get more positive alerts - so violations are less likely to slip through the cracks, but at the cost of receiving more false positives. Organizations that have very low tolerance for the presence of certain information types may want to skew toward recall so that no potential violations go unnoticed. However, this approach will require greater internal resources to review alerts and weed out false positives.

On the other hand, a bias toward precision means fewer alerts, including fewer false positives. This approach would require fewer internal resources to triage alerts, but also introduces a greater risk that an atypical finding may not trigger a violation alert. Greater precision tends to suit organizations with more moderate risk tolerance, or those who are looking for a fast, low-latency service to meet minimum DLP requirements.

Here, again, it’s important to understand how detectors are trained toward precision versus recall. If detectors are trained on datasets with a higher proportion of sensitive data tokens than data “in the wild,” as is often the case, then the result may be a great-looking recall score - but, because sensitive tokens are much rarer in production, far too many false positives for your day-to-day capacity.
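The effect of the sensitive-token rate can be sketched with a little arithmetic. The snippet below assumes a detector whose recall and false positive rate stay fixed, and varies only how common sensitive tokens are; all three numbers are hypothetical.

```python
def expected_precision(recall, false_positive_rate, prevalence):
    """Expected share of alerts that are correct, given how often
    sensitive tokens occur (prevalence) in the data being scanned."""
    true_alerts = recall * prevalence
    false_alerts = false_positive_rate * (1 - prevalence)
    total_alerts = true_alerts + false_alerts
    return true_alerts / total_alerts if total_alerts else 0.0


# Hypothetical detector: catches 95% of sensitive tokens and
# wrongly flags 2% of non-sensitive tokens.
RECALL = 0.95
FALSE_POSITIVE_RATE = 0.02

# Enriched evaluation set: 20% of tokens are sensitive.
precision_on_test_set = expected_precision(RECALL, FALSE_POSITIVE_RATE, 0.20)

# Data in the wild: 1% of tokens are sensitive.
precision_in_the_wild = expected_precision(RECALL, FALSE_POSITIVE_RATE, 0.01)

print(f"precision on enriched test set: {precision_on_test_set:.2f}")  # about 0.92
print(f"precision in the wild:          {precision_in_the_wild:.2f}")  # about 0.32
```

Under these assumptions the same detector goes from roughly 9 in 10 alerts being correct to roughly 1 in 3, purely because sensitive tokens became rarer.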

When evaluating detection engines, understand how they perform in terms of recall and precision, and assess how that fits in with your organizational needs. Also, ask about the ability to configure or adjust detection limits. Detection engines that rely on rule-based detection or regular expressions (regexes) will have a fixed detection probability that cannot be tweaked to meet your needs - cloud access security brokers (CASBs) typically fall into this camp. On the other hand, Nightfall enables you to adjust the balance between detection recall and precision via configurable confidence and context thresholds - so you can strike the right balance of risk mitigation and internal resource capacity.
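The trade-off behind an adjustable threshold can be sketched as follows. This is a generic illustration, not Nightfall’s actual API: the findings, their confidence scores, and the function name are all made up for the example.

```python
# Hypothetical findings: (confidence score, whether it is truly sensitive).
findings = [
    (0.98, True), (0.91, True), (0.85, False), (0.77, True),
    (0.60, False), (0.55, True), (0.40, False), (0.30, True),
    (0.20, False), (0.10, False),
]


def precision_recall_at(threshold, scored_findings):
    """Alert on every finding at or above the threshold, then measure
    precision and recall against the known ground truth."""
    alerts = [truth for score, truth in scored_findings if score >= threshold]
    true_positives = sum(alerts)
    total_sensitive = sum(truth for _, truth in scored_findings)
    # Precision is undefined with zero alerts; report 0.0 in that case.
    precision = true_positives / len(alerts) if alerts else 0.0
    recall = true_positives / total_sensitive if total_sensitive else 0.0
    return precision, recall


# Sweep from a strict threshold (favors precision) to a loose one (favors recall).
results = {t: precision_recall_at(t, findings) for t in (0.90, 0.50, 0.25)}

for threshold, (precision, recall) in results.items():
    print(f"threshold {threshold:.2f}: precision {precision:.2f}, recall {recall:.2f}")
```

With this sample data, lowering the threshold raises recall from 0.40 to 1.00 while precision falls from 1.00 to about 0.63, which is the balance you would tune against your team’s review capacity.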

Look beyond the numbers

Don’t let misleading accuracy claims distract you while evaluating content detection engines for applications such as cloud DLP. Make sure you understand the factors that really matter for assessing a detection engine: scope, training datasets, and the balance between recall and precision. When you select a DLP solution based on these critical factors, you’re setting yourself up to achieve a quality of detection that meets your organization’s unique needs.

And no matter the quality of the detection engine at the outset, machine learning will always require some tuning and iterative improvements in the wild - so make sure you’ll be supported by a dedicated and experienced team.

Nightfall’s team of experts loves to educate pioneering InfoSec and Product Security teams about data science, machine learning, and cloud-native detection. If you’d like to talk with us further, please schedule a demo.
