Understanding Sensitive Data Discovery: Classification and Tools

by Emily Heaslip Published Oct 16, 2022

In its 2022 Cost of a Data Breach report, IBM notes that for 83% of companies, it’s not if a data breach will happen — but when. The sheer volume of data, as well as the difficulty in monitoring shadow IT and the shift to remote work, means that IT security teams face a persistent and ever-changing risk landscape that makes it extremely difficult to keep information secure. 

Protecting sensitive data starts with data discovery. Sensitive data discovery helps IT teams gain greater insight into what they need to protect, where this valuable information lives, and the best next steps for addressing persistent insider and external threats. 

What is sensitive data?

The general definition of sensitive data is any information that should be kept secure and confidential from access by unauthorized users. In practical terms, there are a few more specific definitions of sensitive data, depending on which regulation or set of best practices you must follow. 

Generally speaking, there are two broad categories that businesses should know when implementing sensitive data discovery tools. 

  • Personal data: any information that can be used to identify, with some degree of accuracy, a living person.
  • Sensitive data: a subset of personal data that is subject to specific processing conditions under GDPR. 

The GDPR defines sensitive data as personal data that reveals someone’s racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade-union membership, as well as genetic data, biometric data, health-related data, and information about someone’s sex life or sexual orientation.

GDPR is just one regulation that governs the protection of sensitive data. Other compliance regimes, such as FERPA and HIPAA, further define the different types of sensitive data. 

[Read more: How To Protect Sensitive Data with Cloud DLP]

Examples of sensitive data 

Sensitive data is further categorized into regulated and unregulated sensitive data. Regulated data is specifically covered under laws such as GDPR. Unregulated data falls outside those laws and may even include publicly available information, yet it can still be highly sensitive. For instance, job applications, customer surveys, and contracts are examples of files that could contain unregulated sensitive data.

Here are some common types of regulated sensitive data that are subject to data protection laws: 

  1. Protected Health Information (PHI): regulated by the Health Insurance Portability and Accountability Act (HIPAA), PHI is defined by 18 identifiers that set the bar for “identifiable” medical information that can be traced back to a specific individual. 
  2. Education records: regulated by the Family Educational Rights and Privacy Act (FERPA), information like grades and transcripts, student schedules, exams and papers, student email, advising records, and any personally identifiable information (PII) is regulated sensitive data. 
  3. Customer financial data: governed by the Gramm-Leach-Bliley Act (GLBA) and the Payment Card Industry Data Security Standard (PCI DSS), this information includes customer credit and debit card data, as well as customer PII. 

At a minimum, businesses that deal with these types of regulated sensitive data need to take concrete steps to make sure information is secure to avoid fines and penalties. Data discovery is a crucial step in ensuring that sensitive data is fully secured.

What is data discovery?

Data discovery is the process of locating and identifying data across your organization’s systems. Within the context of information security, data discovery is generally carried out by auditing tools designed to scan applications, networks, or endpoints for specific types of data. These tools can be anything from data loss prevention (DLP) solutions, like Nightfall, to access brokers and other monitoring or data policy enforcement tools.

Depending on your organization’s use case, it could make sense to use a data discovery tool narrowly, such as within a single application, or broadly, across all of your infrastructure. For example, a hypothetical outpatient clinic might wish to communicate over Slack with affiliated clinics about patient care notes to help triage incoming patients more quickly. In such a case, it might make sense to invest in a HIPAA-compliant data visibility solution specifically for Slack.

The data discovery process

The sensitive data discovery process has three phases: preparation, visualization, and analysis. 

During the preparation phase, data is cleaned and merged to meet a high standard of data quality across the networks, applications, and endpoints being examined. Data discovery tools can automatically remove outliers, unify data formats, detect null values, and standardize data quality during this step. Alternatively, some data discovery tools, like Nightfall, can scan both structured and unstructured data directly, reducing the time it takes to perform sensitive data discovery. 
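
As a rough illustration of the preparation phase, the Python sketch below unifies date formats and counts null values in a batch of records. The field name `signup_date`, the list of null markers, and the accepted date formats are assumptions made for this example; they do not describe how any particular discovery tool works.

```python
from datetime import datetime

# Assumed markers that should be treated as "no value" (illustrative only).
NULL_MARKERS = {"", "n/a", "na", "null", "none", "-"}
# Assumed input date formats; every date is rewritten as ISO 8601 (YYYY-MM-DD).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def normalize_value(value):
    """Trim whitespace and map common null markers to None."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in NULL_MARKERS else text


def normalize_date(value):
    """Return the date in ISO 8601 form, or None if no known format matches."""
    text = normalize_value(value)
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def prepare(records, date_fields=("signup_date",)):
    """Clean each record and count how many nulls each field contains."""
    cleaned, null_counts = [], {}
    for record in records:
        row = {key: normalize_value(val) for key, val in record.items()}
        for field in date_fields:
            if field in row:
                row[field] = normalize_date(row[field])
        for key, val in row.items():
            if val is None:
                null_counts[key] = null_counts.get(key, 0) + 1
        cleaned.append(row)
    return cleaned, null_counts
```

Calling `prepare` on a list of dictionaries returns the cleaned rows together with a per-field null count, which gives a quick signal of data quality before any scanning begins.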

Next, data discovery tools provide the IT team with visual maps of where sensitive data lives, travels, and is stored. Data mapping in the visualization step makes it possible to see which programs, applications, and devices need to be protected to safeguard sensitive information. 
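
Behind any visual map sits a simple aggregation: which kinds of sensitive data were found in which locations. The sketch below shows one way to build that inventory in Python. The location strings and data type labels are hypothetical, and a real tool would feed this structure into a dashboard rather than return it directly.

```python
from collections import Counter, defaultdict


def build_data_map(findings):
    """Group (location, data_type) pairs into {location: Counter of data types}."""
    data_map = defaultdict(Counter)
    for location, data_type in findings:
        data_map[location][data_type] += 1
    return data_map


def riskiest_locations(data_map, top=3):
    """Rank locations by their total number of sensitive findings."""
    totals = Counter({loc: sum(types.values()) for loc, types in data_map.items()})
    return totals.most_common(top)


# Hypothetical scan results, for illustration only.
findings = [
    ("slack:#support", "EMAIL"),
    ("slack:#support", "CARD_NUMBER"),
    ("drive:/hr", "US_SSN"),
    ("slack:#support", "EMAIL"),
]
data_map = build_data_map(findings)
```

Ranking locations by finding count, as `riskiest_locations` does, is only one possible prioritization; weighting by data type would be a reasonable alternative.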

Finally, analysis turns these findings into actionable steps for protecting sensitive data. This involves a range of practices and software, including identity and access management (IAM), endpoint and network security, and cloud DLP. 

Sensitive data discovery best practices

Ultimately, the approach you take is determined by the types of data discovery tools that make sense to use in your organization. However, you should consider the following features in your efforts to increase your organization’s data visibility.

Make classification part of the discovery process

Monitoring your data is central to the concept of data discovery, but the ability to classify your data is arguably even more important. 

In the context of information security, data classification tools let you parse files and/or strings of data to properly categorize data found within structured or unstructured data sources. If this process is conducted with a high degree of accuracy (that is, with few false positives), it lets you determine the content and context of the data your organization uses and stores. 

There are many approaches to data classification. Some data discovery tools with classification features use regular expressions, or regexes, to determine the content of data. Other tools apply heuristics to assess the context of data. 
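
To make the distinction concrete, here is a minimal Python sketch that pairs regex-based detection with one simple check on the matched value: a Luhn checksum that discards digit runs that merely look like card numbers. The patterns, the label names, and the `classify` function are illustrative assumptions, not the detectors of any specific product, and production-grade patterns need far more care.

```python
import re

# Illustrative patterns only; real detectors handle many more formats.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD_NUMBER": re.compile(r"\b(?:\d[ -]?){13,19}\b"),
}


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def classify(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_value) pairs found in `text`."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group().strip()
            # Checksum step: drop card-like numbers that fail Luhn.
            if label == "CARD_NUMBER" and not luhn_valid(value):
                continue
            findings.append((label, value))
    return findings
```

The regex alone would flag any 13 to 19 digit run, such as an order number, as a card number. The checksum step removes most of those false positives, which is the kind of accuracy gain the classification step depends on.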

Nightfall differs from these traditional approaches. Its custom machine learning detectors are specifically trained to identify common types of PII across a variety of SaaS and IaaS environments. This allows the platform to both account for context and improve the accuracy of detection and classification capabilities compared to other solutions. With Nightfall DLP for GitHub, for instance, the API key detectors have significantly fewer false positives than the most popular tools.

Consider platforms that enable workflow implementation or remediation

The key benefit of data discovery and classification tools is that they typically provide teams with the insights needed to create thorough data use and storage policies. Effective programs go one step further by allowing administrators to implement workflows that enforce these policies across the applications or networks where their data lives. Nightfall’s Slack bot, for example, enables teams to automatically detect, quarantine, and delete offending PII across designated channels. 
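
The idea of a policy-driven workflow can be sketched in a few lines of Python. The example below maps detected data types to remediation actions and picks the most severe action that applies to a message. The policy table, the action names, and the `decide` function are hypothetical; the sketch does not call the Slack or Nightfall APIs and only shows the decision logic.

```python
from enum import Enum


class Action(Enum):
    ALERT = "alert"
    QUARANTINE = "quarantine"
    DELETE = "delete"


# Actions ordered from least to most severe.
SEVERITY = [Action.ALERT, Action.QUARANTINE, Action.DELETE]

# Hypothetical policy: which action each detected data type requires.
POLICY = {
    "EMAIL": Action.ALERT,
    "US_SSN": Action.QUARANTINE,
    "CARD_NUMBER": Action.DELETE,
}


def decide(found_types):
    """Return the most severe action the policy requires, or None if none applies."""
    actions = [POLICY[t] for t in found_types if t in POLICY]
    if not actions:
        return None
    return max(actions, key=SEVERITY.index)
```

Keeping the policy in a plain table separate from the decision logic means administrators can tighten or relax it without touching the code that enforces it.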

[Read more: Guide to Data Loss Prevention (DLP) on Slack]

Automate as many processes as possible

The security landscape has grown in complexity with cloud, IoT, bring your own device, shadow IT, and other trends transforming where data lives and how much of it exists. Given the level of risk, automation is a genuine necessity for protecting sensitive data, especially in a remote work environment. Deploy security solutions that successfully leverage AI or are otherwise automated to provide a strong foundation for comprehensive data policies. 

Security teams using automated tools will prove to be more effective, as they won’t need to respond to every potential incident or security misconfiguration. Nightfall’s workflows are designed with this principle in mind, allowing security policies to be seamlessly enforced from the platform’s dashboard.

Want to learn more? Find out more about data discovery and get started with Nightfall by scheduling a demo.
