What Is Data Hygiene And Why Is It Important?
Many organizations are already cashing in on the promise of big data, hailed as the world’s most valuable resource. However, this crude resource requires refining in the form of data hygiene.
Data errors and inconsistencies cost companies millions of dollars a year. Businesses that can’t implement the necessary tools, strategies, and training often find big data to be more of an obstacle than an advantage. Until business leaders invest in strong data hygiene practices, big data’s promise will remain elusive.
What is data hygiene?
As you design your approach, it’s helpful to start with a definition of data hygiene. Data hygiene is the process of cleaning datasets or groups of data to ensure they’re accurate and organized. “Clean” data is error-free, simple to understand, organized, and easy to duplicate.
Data hygiene is a little more complex than simply correcting spelling errors. Data can be outdated, incomplete, duplicated, or inaccurate; as a result, it takes more than using spellcheck to ensure clean data.
Why is clean data important?
Dirty data is an expensive problem. A survey of global businesses by Experian found that “Over three quarters (77%) say that inaccurate data hurt their ability to respond to market changes during the pandemic, while 39% say poor quality data has negative effects on customer experience.” Experian estimates that dirty data can cost the average business 15% – 25% of revenue: a $3 trillion loss to the US economy each year.
Bad data comes from a variety of sources. Human error and poor internal communication are the root causes of most dirty data, but these issues are compounded by the lack of a data strategy in many organizations.
“When different departments are entering related data into separate data silos, even good data strategy isn’t going to prevent fouling downstream data warehouses, marts, and lakes,” wrote one expert. “Records can be duplicated with non-canonical data such as different misspellings of names and addresses. Data silos with poor constraints can lead to dates, account numbers or personal information being shown in different formats, which makes them difficult or impossible to automatically reconcile.”
Improving data hygiene can also be a time-consuming task when there are no data strategies in place. One estimate found that knowledge workers are spending up to 50% of their time manually finding and correcting inaccurate data. Therefore, instituting data hygiene best practices can not only improve financial outcomes but also reduce the amount of time and resources dedicated to correcting dirty data.
Data hygiene best practices
Seeking to improve data hygiene at your organization? Here are some steps to follow to reduce the costs of dirty data and optimize data needed for key business decisions.
Perform a data audit
Before you invest in tools and processes to improve your data hygiene, it’s important to establish a baseline. According to Forbes, “About 27% of business leaders aren’t sure how much of their data is accurate.”
Determine the quality of your data to set achievable, quantifiable data hygiene KPIs. Your audit should examine all the systems that your company uses to collect, use and store data. Within each system, determine which data fields are necessary; for both compliance and efficiency, your business should only collect the data it needs. Note any naming conventions or formatting differences from one system to the next.
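One way to put a number on that baseline is to measure how often each field is actually filled in. The following is a minimal sketch, assuming records are available as Python dictionaries; the `audit_fields` helper and the sample records are hypothetical and only illustrate the idea of a completeness KPI.

```python
def audit_fields(records, fields):
    """Return, per field, the share of records with a non-empty value."""
    total = len(records)
    report = {}
    for field in fields:
        # Treat missing keys, None, and whitespace-only strings as empty.
        filled = sum(1 for r in records if str(r.get(field) or "").strip())
        report[field] = filled / total if total else 0.0
    return report


# Hypothetical sample records pulled from one system.
records = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "phone": "555-0100"},
    {"name": "Alan Turing", "email": "", "phone": None},
    {"name": "", "email": "grace@example.com", "phone": "5550101"},
    {"name": "Linus T", "email": "linus@example.com"},
]

report = audit_fields(records, ["name", "email", "phone"])
for field, share in report.items():
    print(f"{field}: {share:.0%} complete")
```

Running the same check against each system gives a comparable starting point for setting targets, although completeness is only one of several quality measures an audit might track.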
Practice data governance
Data governance is the principled approach to managing data during its life cycle — from the moment you generate or collect data to its disposal. By mapping out how data is used throughout your business processes, you can identify points where entry errors or communications mistakes may occur.
Assess how data moves through the organization: Where is it collected? Where is it being stored? Who is accessing it, and on what device? Not only can this show you where there is room for error, but it can also reveal where security vulnerabilities may exist.
Standardize data input
Create rules for users across the organization who work with datasets. Naming conventions, formatting, and other constraints should be enforced through training. Set rules for things such as:
- Abbreviations (Ave., St. vs avenue and street)
- Salutations (such as Ms. or Mr.)
- Numbers (1,000 or 1000)
- Home vs business address (which will you collect?)
- Phone numbers (123-1342 vs 1231342)
A good rule of thumb is to keep data entry as simple as possible. Avoid unnecessary capitalization and abbreviations, since these can easily introduce inconsistencies into a dataset. Try to eliminate fussy formatting to reduce the potential for human error.
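Rules like the ones above can also be applied in code at the point of entry, in addition to training. The sketch below assumes one possible set of house rules (store phone numbers as digits only, lower-case addresses, expand street abbreviations); the function names and the abbreviation table are hypothetical, and your organization's rules may differ.

```python
import re

# Hypothetical house rules: expand these abbreviations when they appear.
STREET_ABBREVIATIONS = {"ave": "avenue", "st": "street"}


def normalize_phone(raw):
    """Keep only the digits, so 123-1342 and 1231342 compare equal."""
    return re.sub(r"\D", "", raw)


def normalize_street(raw):
    """Lower-case the address and expand known abbreviations."""
    words = raw.lower().split()
    # Strip a trailing period before the lookup so "St." matches "st".
    return " ".join(STREET_ABBREVIATIONS.get(w.rstrip("."), w) for w in words)


print(normalize_phone("123-1342"))      # same value as normalize_phone("1231342")
print(normalize_street("12 Main St."))
print(normalize_street("45 Park Ave"))
```

A word-by-word lookup like this is deliberately naive: it would also rewrite "St. Louis", so real rules usually need more context than a simple table.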
Use data cleansing tools
Data monitoring and cleansing tools can help root out instances of inaccurate or messy data. These tools use natural language searching, data modeling, and machine learning to identify patterns and anomalies.
Data cleansing tools come in a range of different prices and capabilities. Some tools, like DeDupley, specialize in one area of data cleansing, such as removing duplicates. Other options, such as Experian Data Quality, can help you check emails, addresses, and telephone numbers in bulk. As you explore different tools, look for software that can automate some of the time-intensive manual processes that often result in mistakes.
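To make the idea of duplicate removal concrete, here is a minimal sketch of exact-match deduplication on normalized fields. It is not how any of the tools named above work internally; the `dedupe` helper and the sample contacts are hypothetical, and commercial tools typically add fuzzy matching on top of this.

```python
def dedupe(records, key_fields):
    """Keep the first record for each normalized combination of key fields."""
    seen = set()
    unique = []
    for record in records:
        # Normalize case and surrounding whitespace before comparing.
        key = tuple(str(record.get(f, "")).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical contact list with one duplicate that differs only in case and spacing.
contacts = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada lovelace ", "email": "ADA@example.com"},
    {"name": "Alan Turing", "email": "alan@example.com"},
]

unique = dedupe(contacts, ["name", "email"])
print(len(contacts), "records in,", len(unique), "records out")
```

Because the comparison is exact after normalization, misspelled names or differently formatted addresses would still slip through, which is where standardized input and dedicated tooling help.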
A data loss prevention tool like Nightfall adds an important layer to improve data security. Nightfall automatically scans both structured and unstructured data in cloud security programs for instances in which PII, PHI, PCI, credentials, or secrets have been shared insecurely. This can help improve data hygiene, as detectors can send an alert when a formatting error or dirty data has created a vulnerability in your system.
Reduce organizational silos
Finally, a key aspect of data hygiene is sharing data consistently among internal teams. For instance, reducing silos between teams like sales and marketing can significantly improve data hygiene.
“Every year, sales departments lose approximately 550 hours in selling time (the equivalent to 27% of each rep’s total selling time) as a result of poor CRM prospect data,” wrote Forbes. “Marketing departments are similarly crippled by the very real pain associated with dirty data. 60% of marketers don’t trust the health of their data.”
Training, standardization, and the right tools are all key components of improving data hygiene. By implementing a more streamlined, accurate approach to collecting and using company data, organizations can immediately start saving time and money.
Increasingly, data security professionals are using the term data hygiene to refer to their security posture within cloud environments. For example, listen to a clip from our podcast episode with Bent Lassi, the CISO of Bluecore, where he discusses what the term means to him.
Within the context of data security, data hygiene specifically refers to the practice of ensuring that sensitive data is only stored within sanctioned environments and that any inappropriately disclosed sensitive data is removed from environments where it doesn’t belong.
The risk of poor data hygiene within the context of data security is that misplaced information can be discovered by unauthorized parties. In the case of PII or other customer data, such intrusions can constitute data breaches, which can cost organizations tens to hundreds of thousands of dollars, if not much more.
However, even in instances where PII or other customer data isn’t exposed, sensitive data leakage can provide opportunities for lateral movement or privilege escalation into more sensitive areas of an organization’s tech stack. For example, API keys posted to GitHub can, if discovered by threat actors, be used to access third-party accounts and services.
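As a rough illustration of how exposed credentials can be spotted in text, the sketch below scans each line for two patterns: the commonly cited shape of an AWS access key ID (`AKIA` followed by 16 uppercase letters or digits) and a generic `api_key = "..."` assignment. This is an assumption-laden example, not a description of Nightfall's detection, which the article says relies on machine learning; the `find_secrets` helper and pattern names are hypothetical.

```python
import re

# Illustrative patterns only; real detectors cover many more credential types.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(
        r"(?i)\bapi[_-]?key\s*[:=]\s*['\"]([A-Za-z0-9_\-]{16,})['\"]"
    ),
}


def find_secrets(text):
    """Return (line_number, pattern_name) for each line that matches a pattern."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
sample = '''aws_key = "AKIAIOSFODNN7EXAMPLE"
api_key = "abcd1234abcd1234abcd"
greeting = "hello"
'''

findings = find_secrets(sample)
for line_no, name in findings:
    print(f"line {line_no}: possible {name}")
```

Pattern matching of this kind tends to produce both false positives and misses, so it is better treated as a first-pass check than as a complete safeguard.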
This risk remains across all SaaS and cloud environments and requires that organizations adopt a zero trust posture towards data security, generally through tools like Nightfall that enable continuous data security and compliance. Cloud misconfigurations and supply chain attacks are two rising trends that have only increased the urgency of this need.
Data security hygiene is often wrapped together with the concept of cyber hygiene, or more generally, the act of identifying vulnerabilities and lowering the risk of insider threats before they become costly liabilities.
Want to learn more? You can find out more about data security hygiene and get started with Nightfall by scheduling a demo.