4 Components of Modern Data Classification & Protection Infrastructure

Why Content Inspection?

Data privacy is top of mind for every organization, with individuals wanting reassurance that their data is secure at all times. With the ever-increasing number of cloud applications on the market today, security teams are faced with the challenge of keeping track of compounding volumes of sensitive data that can flow internally, externally, and across systems.

From a security and compliance standpoint, data security and stewardship have increasingly become a headache. Not only can customers send any amount of sensitive information through a plethora of tools – support tickets, chat messengers, emails, proprietary software – but that information often gets transferred further into additional internal systems as teams try to provide support. The tools available to streamline data security have become scattered and disjointed because each tool covers only a limited number of applications, forcing security teams to purchase and manage multiple security solutions to secure multiple applications. This leads to operational overhead, high costs, and, most importantly, inconsistencies in policy enforcement – ultimately, a dramatic deterioration of internal best practices.

We at Nightfall learned early – from our customers and our own experience in the data security space – that the cloud data security problem is really a data content inspection problem. Security teams need to identify and eliminate (or manage) sensitive content across all their applications, easily. If data is accurately detected and classified, security teams can be spared from having to engage in extensive data mapping exercises across thousands of tables, applications, and systems just to figure out where sensitive data lives.

If content inspection is uniformly applied to an organization's cloud environments, security teams can leverage just one tool to detect and remediate sensitive data from all their applications.

However, building high-quality content inspection systems is not only a fascinating engineering problem but also an incredibly difficult one.

Content inspection can be broken into two core problem statements:

The big data problem: how do we train and maintain high accuracy detection models at massive scale?
The distributed systems problem: how do we process data at the time of detection at scale?

In approaching these problems, there are four key areas that need to be considered:

Building high accuracy detectors
Deploying and scaling machine-learning based models
Parsing files and unstructured data
Handling immense data scale

1. Building High Accuracy Detectors

Accuracy in content inspection is key.

Security teams are flooded with alerts, and alert fatigue leads to desensitization and inaction. A detector with low accuracy can introduce a significant amount of security and compliance risk for end customers. A high false-positive rate can lead to alert fatigue, which ultimately reduces the chance of remediating a true positive. In some cases, a rudimentary detector could be more dangerous than having no detector at all, since it may create a false sense of security for the security team.

But, on the other hand, if sensitive data is leaked, it can be an expensive security concern with a serious, negative impact on the company’s finances, resources, and reputation.

So security teams need alerts that are low-noise and highly reliable.
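To make the alert fatigue point concrete, here is a minimal sketch of how the share of alerts that are true positives (precision) depends on how rare sensitive data is. The function name and the sample rates are illustrative assumptions, not measurements from any real detector.

```python
def alert_precision(prevalence: float, recall: float, false_positive_rate: float) -> float:
    """Fraction of raised alerts that point at genuinely sensitive data.

    prevalence:          share of inspected items that really are sensitive
    recall:              share of sensitive items the detector flags
    false_positive_rate: share of non-sensitive items the detector flags
    """
    true_alerts = prevalence * recall
    false_alerts = (1.0 - prevalence) * false_positive_rate
    if true_alerts + false_alerts == 0:
        return 0.0  # the detector never fires
    return true_alerts / (true_alerts + false_alerts)


# Hypothetical numbers: 1% of messages are sensitive, the detector catches
# 99% of them, and it wrongly flags 5% of the harmless messages.
precision = alert_precision(prevalence=0.01, recall=0.99, false_positive_rate=0.05)
print(f"{precision:.1%} of alerts are true positives")
```

Under these assumed rates, roughly five out of six alerts are noise, even though the detector looks strong on paper.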

A substantial amount of research and NLP work is required to bring a detector to market with accuracy high enough to meet customer expectations.

Traditional methods for data detection include regular expressions (“regex”), high-entropy detection, and fingerprinting. Some of these methods are still the subject of extensive research. For example, a language model is only an incomplete statistical approximation of a vast amount of text-based data, which makes it hard to apply entropy methods in every use case or scenario. Additionally, these solutions are brittle: they are not semantic-driven, so they can easily fail or be intentionally circumvented. Semantic-driven solutions are more accurate because they account for the meaning of phrases rather than being purely pattern-based. An example of semantic-driven matching is: `US DL` == `US Driver’s License` == `Driver’s License`. These phrases are equivalent in meaning but are written in different ways, which would cause regex matching to break.
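The brittleness of the traditional methods can be sketched in a few lines. The pattern, the alias table, and the function names below are all hypothetical. The alias lookup is only a hand-written stand-in for what a trained semantic model would learn, and the entropy function is the textbook Shannon formula, not any particular product's implementation.

```python
import math
import re
from collections import Counter

# Hypothetical regex that only knows one spelling of the label.
LABEL_RE = re.compile(r"Driver's License", re.IGNORECASE)

# Hypothetical alias table: a hand-written stand-in for learned semantics.
ALIASES = {
    "us dl": "drivers_license",
    "us driver's license": "drivers_license",
    "driver's license": "drivers_license",
}


def regex_label_match(text: str) -> bool:
    """Pattern-based matching: succeeds only on the spelling in LABEL_RE."""
    return LABEL_RE.search(text) is not None


def normalized_label(text: str):
    """Map different spellings of a label to one canonical name, or None."""
    key = text.replace("\u2019", "'").strip().lower()  # unify curly apostrophes
    return ALIASES.get(key)


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


print(regex_label_match("US DL"))   # the regex misses the abbreviation
print(normalized_label("US DL"))    # the alias lookup resolves it
print(shannon_entropy("password"), shannon_entropy("q8Zr3LkX"))
```

The entropy comparison also hints at why entropy alone is unreliable: short random-looking tokens and ordinary words can land close together, so any fixed threshold will misclassify some of each.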

Machine learning (“ML”) based methods are more modern and can deliver much higher accuracy because they can better capture context and derive semantics from tokens or characters. Unfortunately, they also require the acquisition and annotation of massive datasets of sensitive data. According to Synced Review, the cost of training the most advanced natural language processing (“NLP”) neural networks from scratch is estimated at tens to hundreds of thousands of dollars or more. Even with training, the most advanced NLP models can be limited by how well they perform on certain text formats. Tabular data introduces additional requirements that must be met to reach good levels of performance. More often, the data under inspection may lack the context in which the latest neural networks usually excel.

In addition to needing massive datasets to train machine learning-based detector models, detector accuracy also requires operational support for ongoing maintenance and tuning – interpreting results, handling and debugging detection issues, training the model on false positives and negatives, and more. To use a canonical DLP example, a credit card number detector has multiple validation points, such as ensuring that the issuer identification number conforms to ISO/IEC 7812. Imagine managing geographic and industry-specific detectors globally – staying up to date on government websites and industry/trade memberships, and even reaching out to local agencies to collate format information.
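As an illustration of what such validation points look like, here is a minimal sketch of two checks on a candidate card number: the Luhn check digit used for ISO/IEC 7812 identification numbers, and a lookup of well-known issuer prefixes. The function names, the 12 to 19 digit length bounds, and the deliberately tiny prefix table are assumptions for the example. A production detector would need a complete, maintained prefix list.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 12 <= len(digits) <= 19:  # assumed plausible card-number lengths
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def guess_brand(number: str):
    """Very small, illustrative issuer-prefix lookup (not exhaustive)."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if digits.startswith("4"):
        return "visa"
    if digits[:2] in {"34", "37"}:
        return "amex"
    if digits[:2] in {"51", "52", "53", "54", "55"}:
        return "mastercard"
    return None


# Publicly documented test numbers, not real cards.
for candidate in ["4111 1111 1111 1111", "3782 822463 10005", "4111 1111 1111 1112"]:
    print(candidate, luhn_valid(candidate), guess_brand(candidate))
```

Checks like these cut false positives on arbitrary 16-digit strings, but they validate format only. They say nothing about whether the surrounding text actually refers to a payment card.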

2. Deploying and Scaling ML Models

Building ML-powered data loss prevention (“DLP”) systems that are low latency, fault-tolerant, and redundant is non-trivial. State-of-the-art models have a very large number of parameters, which increases processing time and, in turn, detection request latency. Therefore, it is imperative to find an optimal balance between the number of parameters in a model and the resulting response time. While some of the latest models can be tuned to certain scenarios, they are still limited by the space of scenarios on which they were trained (e.g. the Wikipedia text corpus or other text corpora that most researchers use to run their benchmarks). Most organizations have unique data structures, and the information they produce is often not formatted to align with industry standards; this introduces the problem of data shift, which can render the models irrelevant to the current task. Given these complexities, the models need to be tuned constantly in order to produce the most accurate results while keeping processing time and resources low.

Additionally, deploying CPU-intensive workloads (especially those that are ML-oriented) requires software tooling and hardware experimentation in order to scale, as well as significant financial investment.

3. Parsing Files and Unstructured Data

Businesses deal in hundreds of file types - from emails to documents, spreadsheets to presentations, images to PDFs, or even zip files to proprietary file types.

It is common to extract text from files or other types of unstructured data to parse through content and search for sensitive information. However, even when text can be extracted, context can be lost. For example, cells in a CSV are organized into columns and rows; when extracted into raw text, this structural context is lost, and data corresponding to particular column or row headings is no longer organized in an easily classifiable way.
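The loss of structural context is easy to reproduce. The sketch below uses Python's standard `csv` module on a made-up three-column sample; the sample data and function names are hypothetical. It compares naive text extraction with a parse that keeps each cell attached to its column header.

```python
import csv
import io

# Made-up sample; the SSN-shaped value is not a real identifier.
SAMPLE_CSV = "name,ssn,notes\nJane Roe,000-12-3456,renewal due\n"


def flatten(csv_text: str) -> str:
    """Naive extraction: drop delimiters and line breaks, keep only the words."""
    return " ".join(csv_text.replace(",", " ").split())


def cells_with_headers(csv_text: str):
    """Structure-aware extraction: (row index, column header, cell value)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row_idx, header, value)
        for row_idx, row in enumerate(reader)
        for header, value in row.items()
    ]


print(flatten(SAMPLE_CSV))
for cell in cells_with_headers(SAMPLE_CSV):
    print(cell)
```

In the flattened string, nothing ties `000-12-3456` to the `ssn` column any more, whereas the structured form hands a detector the header as a strong contextual hint.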

Some open source solutions like Apache Tika can help fill some of these gaps. However, our own experimentation has shown that these solutions tend to be insufficient: their performance and accuracy are too low to support high detection accuracy rates, and they can struggle with larger files. Nightfall has written custom file handlers to parse and understand file formats to improve the reliability, accuracy, and scalability of our platform. Many file formats even require a custom text extraction implementation in order for DLP to be successful.

4. Handling Immense Data Scale

Inspecting content across hundreds of types of messages and files, across all types of applications, systems, and databases, can mean processing hundreds of terabytes per day. It is critical to build an engine that can handle that scale.

DLP is a core security practice – uptime and availability must be uninterrupted, and systems must be fault-tolerant; events cannot be dropped because of transient outages.

Reliability: The models performing content inspection need to be well trained and then tested to ensure they remain accurate and performant across different input data.
Availability: The system needs to scale with load and be able to maintain high uptime with automated testing, blue/green roll outs, and automated recovery for hardware failure.
Maintainability: Content inspection systems are living and breathing systems that need to be tuned and tweaked over time: new detectors need to be added, industry research needs to inform accuracy, new regulations need to be incorporated, etc.

At Nightfall, we have built our detection engine to process data in parallel and collate the results to adhere to low latency SLAs. We have focused on building a stable and scalable system that benefits from the shared load of multi-tenancy, rather than sitting idle outside of high usage times.
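The general pattern of scanning in parallel and collating the results can be sketched with Python's standard library. This is not Nightfall's engine: the single regex detector, the function names, and the worker count are assumptions chosen to keep the example short. A thread pool is used for brevity, whereas CPU-bound detection in Python would more likely run in separate processes or services.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical detector: an SSN-shaped pattern standing in for real detectors.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scan_chunk(chunk_id: int, text: str):
    """Scan one chunk and return (chunk id, offset within chunk, matched text)."""
    return [(chunk_id, m.start(), m.group()) for m in SSN_RE.finditer(text)]


def scan_parallel(chunks, max_workers: int = 4):
    """Fan chunks out to a worker pool, then collate findings in chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, even if workers
        # finish out of order, so the collated list is deterministic.
        per_chunk = pool.map(scan_chunk, range(len(chunks)), chunks)
        return [finding for findings in per_chunk for finding in findings]


chunks = ["no data", "ssn 000-12-3456 here", "a 111-22-3333 b 444-55-6666"]
for finding in scan_parallel(chunks):
    print(finding)
```

A real pipeline would also need to handle matches that straddle chunk boundaries, for example by overlapping adjacent chunks, which this sketch leaves out.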

Our detectors have been optimized for efficiency, allowing Nightfall to process more bytes per second at a lower cost than would be possible without this tuning.

This is why we built our Developer Platform

Since Nightfall’s inception, we have learned what it means to build and maintain a high accuracy, scalable DLP product. We have developed expertise in the industry, trained and continue to train NLP-based models for our detectors, and operationalized DLP both internally and for our customers. In the process, we recognized that not every team or company has the time and resources to develop and maintain a highly accurate content inspection engine to integrate into their own products or tooling, so we decided to verticalize DLP end-to-end with one API.

Enter the Nightfall Developer Platform. We handle DLP, so you can focus on building and scaling your product.

Building with Nightfall’s APIs allows you to leverage Nightfall’s proprietary detection engine to inspect content within your product or in any of your applications without needing to build from scratch and maintain complex models.

Common use cases:

Inspect content anywhere, in any data silo or data flow.
Add DLP and data classification capabilities to your applications.
Detect and de-identify PII, PCI, PHI, credentials & secrets, custom data types, and more.
Build compliance workflows for HIPAA, PCI, GDPR, CCPA, FedRAMP, and more.

Features:

Three tiers of service offerings, including a free tier to get started without any commitments.
Easy to use APIs and SDKs in popular programming languages such as Node.js, Python, and Java.
Endpoints for inspecting both text and files, and multiple methods of redaction.
Large library of examples, tutorials, documentation, and support.

Watch the video below to see how The Aaron's Company leverages The Nightfall Developer Platform to maintain data security within Aaron's own custom communication platform.

[youtube:oQNAytmE64I]

If you're looking to get started with our platform, check out our API Docs & Quickstart. We also include guides and tutorials that show how easy it is to hook the Developer Platform up to commonly requested applications such as Airtable, Amazon S3, and Zendesk.

If you have any questions or just want to talk shop, reach out to our Product team! These are interesting challenges that we love to discuss and solve; that’s why we’re here.