How AI & Machine Learning Powers Next-Gen Data Leak Prevention (DLP)

The current wave of digital transformation that has brought more and more businesses online has also introduced an unwelcome side effect: the surface area for attacks has ballooned. As individuals and businesses migrated their sensitive transactions into cloud applications, cloud service providers became responsible for providing high-fidelity data security. In a remote-first world, however, traditional data security solutions – including firewalls, network proxies, device-based agents, and CASBs – lack the precision and coverage to identify the risk vectors that grant unauthorized access to your business data. To better deter attackers, we need to upgrade our arsenal.

Data Loss Prevention (DLP) refers to the process of identifying data that could be considered sensitive, then providing options for remediation prior to a leak. Examples of sensitive data include personally identifiable information (PII), passwords and secrets, protected health information (PHI), and commercially sensitive information.

Many companies choose to make security adjustments in response to incidents, but DLP offers a more proactive solution. By integrating DLP into your workflows, you can significantly reduce the likelihood of a data leak incident, while also mitigating the severity if a leak does occur. As a DLP vendor, Nightfall offers both cloud-native integrations with the most popular office productivity applications and a developer API for those seeking solutions to more customized use cases.

The atomic unit that Nightfall uses to implement DLP is the detector, which refers to a model that is tuned to a particular data type – think "Credit Card" or "Driver’s License Number" – to which people ascribe particular importance. As of today, Nightfall offers pre-tuned detectors for numerous different data types ranging from bank routing numbers to passport numbers to encryption keys. These pre-built detectors can be combined and supplemented with your own customizations to fit several security and compliance-based use cases.

What is unique about the approach Nightfall is taking to solve this problem? And why might it be a better use of time to use our detectors "off the shelf" rather than implementing detectors yourself?

Let’s answer these questions by working through an example scenario.

Building a Credit Card Detector

Say your company has just realized that some internal correspondence between accountants has leaked and is now publicly accessible. You have been charged with scanning the chat data to determine whether any PII was compromised. After some initial manual inspection, you determine that there are a fair number of credit card numbers present in the data, so you decide that you will build a credit card detector to streamline the process and quarantine all leaked card numbers.

Because credit card numbers follow a well-defined format, one of the simplest ways we can start building a credit card detector is by using a regular expression. When you think about a credit card, the first example that you might jump to is something like this:

Based on this example, the simplest regular expression (regex) we could develop is:

\d{16}

Running the number from the screenshot above against this model succeeds! But of course, this solution is ignorant of potential formatting characters. We’ll assume that when typing out credit card numbers, people most commonly use spaces or hyphens to delineate groups of four numbers for readability.

The regex becomes:

\d{4}[- ]{0,1}\d{4}[- ]{0,1}\d{4}[- ]{0,1}\d{4}

This regex can now handle optional formatting characters, but it should be noted that in its current form, the regex would probably match unexpected formats. One such match could be:

12345678-98765432

To more properly address the optionality of formatting characters, we could use three regular expressions: one with no formatting characters, one that always checks for hyphens, and one that always checks for spaces.
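As a rough illustration of the difference, the sketch below compares the single "optional separator" pattern with the three stricter patterns described above. The names `LOOSE`, `STRICT`, and `matches_strict` are made up for this example, and `fullmatch` is used on the assumption that each sample is an isolated candidate string.

```python
import re

# The single pattern from the text: every separator is independently optional.
LOOSE = re.compile(r"\d{4}[- ]{0,1}\d{4}[- ]{0,1}\d{4}[- ]{0,1}\d{4}")

# Three stricter patterns: no separators, hyphens only, spaces only.
STRICT = [
    re.compile(r"\d{16}"),
    re.compile(r"\d{4}-\d{4}-\d{4}-\d{4}"),
    re.compile(r"\d{4} \d{4} \d{4} \d{4}"),
]


def matches_strict(candidate: str) -> bool:
    """True if the whole candidate matches one of the strict patterns."""
    return any(p.fullmatch(candidate) for p in STRICT)


samples = [
    "1234567898765432",     # no separators
    "1234-5678-9876-5432",  # hyphens throughout
    "1234 5678 9876 5432",  # spaces throughout
    "12345678-98765432",    # lone hyphen in the middle
    "1234-5678 98765432",   # mixed separators
]

for s in samples:
    loose = bool(LOOSE.fullmatch(s))
    print(f"{s!r:24} loose={loose} strict={matches_strict(s)}")
```

The loose pattern accepts all five samples, while the strict set accepts only the first three.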

Although these regular expressions work for many test cases, credit card numbers are not guaranteed to be sixteen digits long. For example, American Express card numbers are fifteen digits long and are grouped differently: 3759-876543-21001. In addition, the card numbers mentioned so far are predominantly American; several international card providers support numbers up to nineteen digits in length.

These additional conditions compound quickly, which makes it unwieldy to develop a single regular expression to capture them all. It is probably better for our implementation to separate the matching process into a list of several possible regexes, then run the input data through all of them. Here is an abbreviated list of some possible regexes:

\d{4}-\d{4}-\d{4}-\d{4}
\d{4} \d{4} \d{4} \d{4}
\d{16}
\d{4}-\d{6}-\d{5}
\d{4} \d{6} \d{5}
\d{15}
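One way to run input through the whole list is to join the patterns into a single alternation. The sketch below does that; the names `PATTERNS`, `CANDIDATE`, and `find_candidates` are illustrative, and the digit lookarounds are an added assumption (that a candidate should not be carved out of a longer run of digits), not something the list above specifies.

```python
import re

# Abbreviated pattern list from the text, every group written with \d.
PATTERNS = [
    r"\d{4}-\d{4}-\d{4}-\d{4}",
    r"\d{4} \d{4} \d{4} \d{4}",
    r"\d{16}",
    r"\d{4}-\d{6}-\d{5}",
    r"\d{4} \d{6} \d{5}",
    r"\d{15}",
]

# Assumption: reject matches that are embedded in a longer digit string.
CANDIDATE = re.compile(r"(?<!\d)(?:" + "|".join(PATTERNS) + r")(?!\d)")


def find_candidates(text: str) -> list[str]:
    """Return every substring that matches one of the card-number patterns."""
    return CANDIDATE.findall(text)


sample = "Amex 3759-876543-21001, Visa 4242 4242 4242 4242, junk 12345678-98765432"
print(find_candidates(sample))
```

Keeping the patterns in a plain list makes it easy to add or remove a format without touching the matching code.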

Now that we’ve enumerated a set of regexes for all formats and lengths of all the major card providers that we want to cover, we should be done, right? Maybe not. Rigorous testing of this detector shows that the classification results are extremely noisy. For example, the candidate string 7777-6666-5555-4444 will match our detector, but it should be relatively obvious to a human that this is likely not a valid credit card number. But why is it "obvious", and how can we codify this case?

Take a look at some of the credit cards in your wallet and think about this for a minute. It is probably not a coincidence that the numbers are all of similar lengths across providers, or that so many of them seem to start with the digit 4. If we’re going to improve our basic algorithm, the next step is to do some research on how card numbers are generated.

Researching a New Solution

In 1989, the International Organization for Standardization (ISO) published the standard ISO/IEC 7812 to describe how cards must be numbered. The standard goes into more detail, but to give some highlights:

The first digit is the Major Industry Identifier (MII), which classifies the card as belonging to different industries, such as airlines, banking, and healthcare.
The first six or eight digits (including the MII) are the Issuer Identification Number (IIN), which uniquely identifies a card provider.
Most of the remaining digits are allocated by each card issuer (perhaps pseudorandomly, but these algorithms are likely not public for security reasons).
The final digit of the card (also called the check digit) is calculated by the Luhn algorithm. In other words, given the first n-1 digits of an n-digit card number, the value of the nth digit is deterministic.
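To make the last point concrete, here is a minimal sketch of computing a Luhn check digit from the first n-1 digits. The function name `luhn_check_digit` is chosen for this example, and the input is assumed to be a string of digits with separators already removed.

```python
def luhn_check_digit(payload: str) -> int:
    """Return the Luhn check digit for the first n-1 digits of a number."""
    total = 0
    # Walk the payload from right to left. The rightmost payload digit is
    # doubled, then every second digit after it.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    # The check digit brings the overall sum up to a multiple of 10.
    return (10 - total % 10) % 10


# 4242 4242 4242 4242 is a widely used test number; its first 15 digits
# determine the final digit.
print(luhn_check_digit("424242424242424"))  # 2
```

Because the result is fully determined by the payload, a detector can recompute it and compare it with the digit actually present.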

We can use this information to make further advancements to our naive credit card detector that yield more improvements in accuracy. After our regular expression set identifies a candidate string of digits, we can apply the following checks:

Enumerate a list of IIN numbers that are supported for our use case. If the prefix of the candidate string does not match one of the entries in this list, eliminate it.
Run the Luhn algorithm on the digit string to verify that the check digit matches. This check alone can invalidate the vast majority of the matches that our regex produces, since for any given (n-1)-digit prefix only one of the ten possible final digits is valid.

We can add both of these checks as part of a filtration pipeline that occurs after running the input data through our set of regexes. By implementing these checks, the false positives produced by our credit card detector should be reduced.
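A filtration step along these lines might look like the sketch below. The prefix list is a deliberately tiny, hypothetical allowlist (a leading 4 for Visa, 34 and 37 for American Express); a real deployment would need a much more complete list of IIN ranges. The function names are also invented for this example.

```python
import re

# Hypothetical allowlist for illustration only; real IIN ranges are far
# more granular than these leading digits.
SUPPORTED_PREFIXES = ("4", "34", "37")


def luhn_valid(digits: str) -> bool:
    """True if the full digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def is_plausible_card(candidate: str) -> bool:
    """Apply the prefix check, then the Luhn check, to a regex match."""
    digits = re.sub(r"[- ]", "", candidate)  # drop formatting characters
    if not digits.startswith(SUPPORTED_PREFIXES):
        return False
    return luhn_valid(digits)


for candidate in ["4242-4242-4242-4242", "4242-4242-4242-4243", "7777-6666-5555-4444"]:
    print(candidate, is_plausible_card(candidate))
```

Here the string 7777-6666-5555-4444 is rejected by the prefix check, and it would also fail the Luhn check on its own.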

After incorporating these improvements, it is tempting to say that the detector is complete, but first let’s reflect on our goal state: what sort of false positive error tolerance can we endure in our application? If we are deploying the detector only for our script to scan the leaked accountants’ correspondence, perhaps the current iteration is sufficient. But after we have completed the one-off risk assessment, we might consider integrating the detector into a workflow that regularly scans the company’s internal messaging system. In this use case, we want to regularly scan not only messages but also a variety of files, such as CSVs, PDFs, and JPEGs.

While a false positive rate of 1-2% may be tolerable during our manual inspection of the leaked accountant messages, it is too large to deploy in a production environment that might scan hundreds of thousands of messages and files a day: if even 1% of 200,000 scanned items were flagged in error, that would be 2,000 spurious alerts every day. If we plow ahead with deploying our detector in production, we could inundate our security team with too many false positives. Leaving a shoddy model in production for too long can build a sense of alert fatigue among the security team, at which point the detector no longer produces any value.

If we are going to make our detector robust enough for high-scale, production traffic, we will need to refine our approach further still.

Introducing Machine Learning

A major limitation of using a regular expression to implement a detector is that it produces a binary classification result: either the regex matches your string, or it does not. From the product point of view, you may want not just a list of matches, but a list of matches sorted by the probability that each match is legitimate. This introduces the need for a confidence metric that can be assigned to a candidate. Given that sensitive information frequently occurs within a passage of human language, we can augment a binary classification decision by considering the surrounding context. We can leverage this context to implement a machine learning model that upweights or downweights a classification decision, and finally produces a probability representing the likelihood of a match between the candidate and the detector.

We might start building our model by using keywords to upweight or downweight a regular expression match. For example, we could use the keywords "card", "expense", or "payment" to increase the probability that the digit string is a credit card number, and we could use the presence of a prefix like "my bank account number is" to disqualify or downweight the match probability. However, there are notable limitations that come with this initial implementation.

Simply checking for the presence of keywords or substrings could actually worsen the false positive rate, since this type of matching cannot account for contextual polarity. For example, the phrase "4242-4242-4242-4242 is not a valid credit card number" should probably be downweighted by a robust model, but our naive keyword-matching model would upweight it.
Curating a dictionary of keywords to use to train our model requires a significant amount of experimentation. Some keywords that might intuitively be low-hanging fruit, such as "card", could prove to be more harmful than beneficial when run against real world data; for example, the word "card" probably also frequently appears in the phrases "greeting card", "ID card", "business card", and "playing card".
Using leading and trailing keywords to build context assumes that we are trying to scan for content in a passage of human language, but this may not always be the case. When data leaks occur, the targeted data is often part of a structured format such as CSV or JSON, as these formats are well-suited for storing repetitive information. Ideally, our model could be smart enough to glean contextual information as it pertains to the text structure: for example, by reading CSV column headers or parsing the JSON keys.
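The keyword approach, and the first two limitations above, can be seen in a very small sketch. Everything here is an assumption made for illustration: the base score, the weights, and the function name `keyword_confidence` are arbitrary and are not how any production detector is tuned.

```python
import re

# Keywords and phrase taken from the examples in the text.
UPWEIGHT_WORDS = {"card", "expense", "payment"}
DOWNWEIGHT_PHRASES = ("my bank account number is",)


def keyword_confidence(context: str, base: float = 0.5) -> float:
    """Naive score in [0, 1] for text surrounding a regex match."""
    text = context.lower()
    words = set(re.findall(r"[a-z]+", text))
    score = base
    score += 0.25 * len(words & UPWEIGHT_WORDS)  # arbitrary weight
    if any(phrase in text for phrase in DOWNWEIGHT_PHRASES):
        score -= 0.25  # arbitrary penalty
    return max(0.0, min(1.0, score))


examples = [
    "Please put the payment on card 4242-4242-4242-4242",
    "my bank account number is 4242424242424242",
    "4242-4242-4242-4242 is not a valid credit card number",
    "I wrote 4242 4242 4242 4242 on a greeting card",
]

for context in examples:
    print(f"{keyword_confidence(context):.2f}  {context}")
```

The last two examples are scored above the base value even though a human reader would discount them, which is exactly the polarity and ambiguity problem described in the list above.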

A natural follow-up effort could be to use Natural Language Processing (NLP) to tokenize content and help build the model. This will certainly help us reduce the false positive rate on the detector, but the compute requirements and maintenance costs are worth examining. Implementing accurate NLP models is difficult enough on its own because document formatting varies greatly depending on the data source; significant effort must be devoted to tuning the model parameters over time. Without a team of data scientists that is committed to continually improving these models, it could be more trouble than it is worth to deploy them.

While a credit card detector can be implemented with some of the heuristics and machine learning models we have discussed, detectors for other sensitive data types may be impossible to build by codifying an exhaustive set of rules. For example, API keys come in a wide variety of formats, and human names span many different character sets, spellings, and languages. For many detectors, this makes the selection of training data a much more deliberate, laborious process.

All of the work we have done so far is just for a single one of the detectors supported by Nightfall – one that is very well-codified at that. Other detectors do not always have an easily accessible single source-of-truth: for example, driver’s licenses in the United States are not issued by the federal government, but by state and territorial governments, each with their own unique processes for issuing numbers. This means that to build a detector for a US Driver’s License, we need to build separate models of the issuing formula for all 50 states, Washington DC, and territories with DMVs. Aside from being a large initial implementation effort, the maintenance burden for these detectors can be significant, especially as you consider the fact that these models evolve over time. For example, a driver’s license may have been issued in the 1990s with a different numbering scheme, but we should still support detection of this PII.

Onward

At the beginning of this post, we referred to detectors as our "atomic" unit for implementing DLP. We have continued enhancing the Nightfall product offering by building new abstractions on top of detectors to give users even more control over their detection quality. The engineering and data science challenges discussed in building this credit card detector only scratch the surface of what it takes to build a fully functional DLP system.

While we continue to maintain and expand our library of detectors, we have also:

Added support for logical operators to combine arbitrarily many detectors in a single detection rule
Implemented text extraction from the most common file types, including images and PDFs, office file formats (documents, spreadsheets, presentations), compressed files like gzip, archives like zip and tar, markup formats like HTML, and more
Built and scaled a file scanning platform that leverages our machine learning-based models, which has processed terabytes of data and produced millions of classifications
Introduced automatic remediation actions to our scans, such as data masking and encryption

We’re excited to keep building a first-class DLP platform, so that you can focus on business logic instead of having to maintain your own detector library. Or, if you’re still not convinced, you can get started with our detectors for free by signing up for a developer account here.
