GitHub Secrets Detection & Data Loss Prevention Guide

by Michael Osakwe Published Nov 23, 2022

Environments like GitHub carry data exposure risk: secrets and sensitive PII can leak from repositories. Read this free online guide to learn about the problem of secrets exposure in GitHub, as well as how to implement secrets detection and scanning to reduce this risk.

What is secrets leakage, and why is it a problem?

Git is a distributed version-control system built for collaboration, and Git-based repositories can become environments where secrets and credentials are exposed without anyone noticing. Developers can push commits at any time of day, and if no review process is in place, they could push code that contains credentials and other sensitive token types.

This increases the risk of sensitive information falling into the hands of a threat actor:

  • Through the hijacking of an authorized account
  • Through improper repository permissions management (e.g. a repo with sensitive data being set to public, then indexed)

Scanning for secrets (API keys, passwords, tokens, cryptographic keys) and business-critical PII within code repositories at scale is a challenge without the right tools in place.

Real world examples of secrets exposure in GitHub

Read our original reporting to learn about how data can be exposed within environments like GitHub:

Watch the following video for more examples, and a detailed illustration of how secrets leakage occurs in GitHub:

What types of credentials & secrets get exposed in GitHub repos?

According to studies like North Carolina State University’s How Bad Can It Git?, almost 2,000 new unique secrets leak daily on GitHub. These include API keys, access tokens, and encryption keys from some of the most popular services.

What are the consequences & harms of secrets exposure?

  • Data exfiltration  – Secrets can be used to exfiltrate data from the systems they’re associated with.
  • Cryptojacking – If secrets allow access to compute resources, threat actors can hijack services, like AWS, in order to siphon compute resources for cryptomining or other CPU-intensive tasks. While cryptocurrency industry developments (like the 2022 market crash and the move to Ethereum 2.0) have reduced the demand for crypto-related compute, cryptojacking campaigns remain somewhat common.
  • Ransomware – Threat actors can clone and encrypt whatever data is contained in the environments associated with a secret.
  • Extortion/monetary loss, reputational harms, and compliance violations

How can you detect secrets in GitHub code repositories?

Some examples of tools you can use to detect secrets include:

  • GitHub – provides an automatic token scanning service for a limited number of token types for popular services (e.g. AWS, Azure, Alibaba).
  • TruffleHog – written in Python, uses regex and entropy-based flagging.
  • Gitrob – written in Go, uses keywords and tries for a broader detection range than API keys.
  • Git-secrets – limited to searching for AWS keys.

How do most secret detection tools work?

There are three major types of secrets scanning and detection methods that are in use today. These include:

  • Regular expressions – Regular expressions (regexes) are used to search for the characters expected to appear in a secret, but a pattern written for one service captures variation poorly across other services, e.g. AWS or GCP.
  • Entropy – Entropy refers to the amount of complexity or variability in a string of characters (see Shannon entropy). Setting thresholds for entropy can help build an informed determination about the likelihood that a string is a credential/secret, as opposed to any other piece of information.
  • Machine learning – In the context of secrets, machine learning refers to algorithms trained on features extracted from a broad set of API key patterns and their surrounding context in code. ML can determine whether a character string is a credential/secret based on the context of the finding, without relying on indicators like naming conventions, regexes, or entropy thresholds. With techniques like natural language processing (NLP) and deep learning, naming conventions don’t matter; only meaning does.
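
The first two methods above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any tool named in this guide: the function names, the candidate-token pattern, and the 4.0 bits-per-character threshold are all hypothetical choices, and the only service-specific pattern is the widely documented shape of an AWS access key ID.

```python
import math
import re
from collections import Counter

# Widely documented shape of an AWS access key ID: "AKIA" followed by
# 16 uppercase letters or digits. Every other service needs its own pattern.
AWS_ACCESS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")

# Assumption: only long runs of base64-like characters are worth an entropy check.
CANDIDATE_TOKEN = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")

# Assumption: threshold in bits per character; tune it against your own code base.
ENTROPY_THRESHOLD = 4.0


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_candidates(line: str) -> list[tuple[str, str]]:
    """Return (method, token) pairs for strings in line that look like secrets."""
    findings = [("regex", m.group(1)) for m in AWS_ACCESS_KEY_ID.finditer(line)]
    for m in CANDIDATE_TOKEN.finditer(line):
        token = m.group(0)
        if shannon_entropy(token) >= ENTROPY_THRESHOLD:
            findings.append(("entropy", token))
    return findings
```

The sketch also shows why both methods have accuracy limits: the regex only finds the one key format it was written for, while the entropy check flags any sufficiently varied string, such as a hash or a long identifier, whether or not it is a credential.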

Nightfall for GitHub uses the third method, machine learning, to detect secrets:

  • At rest, through historical scans of full repositories
  • In real time, upon a new code push event or in a CI/CD process
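
To make the real-time case concrete, here is a hedged sketch of scanning only the lines that a push adds, given the text of a Git-format unified diff. It is not how Nightfall works: a single regex (the documented shape of an AWS access key ID) stands in for the detector, and `added_lines` and `scan_diff` are hypothetical names chosen for illustration.

```python
import re

# Stand-in detector (assumption): one regex for AWS access key IDs.
SECRET_PATTERN = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

# Hunk header of a unified diff, e.g. "@@ -1,2 +1,3 @@"; group 1 is the
# first line number of the hunk in the new version of the file.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_lines(diff_text):
    """Yield (path, line_number, text) for each line a Git-format diff adds."""
    path = None
    line_no = 0
    in_hunk = False
    for raw in diff_text.splitlines():
        if raw.startswith("diff --git "):
            in_hunk = False  # a new file section starts
            continue
        if not in_hunk and raw.startswith("+++ "):
            target = raw[4:]
            path = target[2:] if target.startswith("b/") else target
            continue
        header = HUNK_HEADER.match(raw)
        if header:
            line_no = int(header.group(1))
            in_hunk = True
            continue
        if not in_hunk:
            continue
        if raw.startswith("+"):
            yield path, line_no, raw[1:]
            line_no += 1
        elif raw.startswith(" "):
            line_no += 1  # context line: present in the new file, not added
        # removed lines ("-") do not exist in the new file, so no increment


def scan_diff(diff_text):
    """Return (path, line_number, match) for secrets found in added lines."""
    return [
        (path, line_no, m.group(0))
        for path, line_no, text in added_lines(diff_text)
        for m in SECRET_PATTERN.finditer(text)
    ]
```

In a CI job, the input could be the output of `git diff` between the previous and new head of a branch. Scanning only added lines keeps each check small, which is why a separate at-rest scan of full history is still needed for secrets committed before scanning was enabled.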

What is Nightfall?

Nightfall is a platform to discover, classify and protect sensitive data across cloud SaaS & cloud infrastructure. 

  • Nightfall supports compliance efforts with a number of industry standards and regulations like PCI DSS, GDPR, HIPAA, CCPA, and more.
  • Nightfall works by continuously monitoring data flowing in and out of data silos and classifying that data with machine learning. Data marked as sensitive can be automatically quarantined, deleted, or redacted with workflows.
  • Nightfall integrates with GitHub via OAuth 2.0, meaning you can get started immediately. Integrate in seconds, then tell Nightfall which GitHub orgs and repos should be scanned in real time for API keys, encryption keys, passwords, and more.

Watch a demo video of Nightfall for GitHub.

How does Nightfall differ from existing tools?

Nightfall DLP is the industry’s first cloud-native data loss prevention solution that can discover, classify, and protect sensitive data in cloud environments. 

  • Designed to address the low accuracy of tools relying on traditional methods like regexes or entropy thresholds, Nightfall is trained on features extracted from a broad set of API key patterns and their surrounding context in code. 
  • Unlike other tools, Nightfall can be used to discover and protect against both PII and credential leakage across your code base.
  • Nightfall integrates with more than just GitHub. Apply the same detection rules across Jira, Confluence, Slack, and more!

Nightfall for GitHub key features & benefits

  • Scan your entire GitHub organization on every push to detect credentials, PII, and other secrets in public or private repositories via high-accuracy machine learning.
  • Choose which repos to scan, and exclude specific tokens, files, and directories from scans via an allow list.
  • Leverage pre-tuned detectors to discover secrets from any service or build custom detectors.
  • View risk from an intuitive dashboard and inform developers of violations via Jira tickets. Once a secret is rotated, resolve all violations involving that secret from the same dashboard.
  • Send violation alerts to Slack and export results into a SIEM or reporting tool with custom webhooks.
  • Drill into each violation to see details on the secret, the code snippet in GitHub, and any other violations with the same secret.
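
As an illustration of the allow list idea from the features above, the sketch below suppresses findings by token, by file pattern, or by directory. It is a generic example and does not reflect Nightfall's configuration format or API: `Finding`, `AllowList`, and `filter_findings` are hypothetical names, and the matching rules (exact token match, shell-style file globs, directory prefixes) are assumptions.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Finding:
    path: str   # repository-relative path using "/" separators
    token: str  # the string the scanner flagged


@dataclass(frozen=True)
class AllowList:
    tokens: frozenset = frozenset()  # exact strings known to be safe
    file_globs: tuple = ()           # shell-style patterns, e.g. "*.md"
    directories: tuple = ()          # directory prefixes, e.g. "tests/"

    def suppresses(self, finding: Finding) -> bool:
        """Return True when the finding matches any allow-list rule."""
        if finding.token in self.tokens:
            return True
        if any(fnmatchcase(finding.path, g) for g in self.file_globs):
            return True
        # Every ancestor directory of the file, e.g. "tests/fixtures" and "tests".
        parents = {str(p) for p in PurePosixPath(finding.path).parents}
        return any(d.rstrip("/") in parents for d in self.directories)


def filter_findings(findings, allow_list):
    """Keep only the findings that the allow list does not suppress."""
    return [f for f in findings if not allow_list.suppresses(f)]
```

A typical use would be to allow-list documented example keys and test fixture directories so that alerts are reserved for findings in production code.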

What can secrets detection & DLP detect in code repositories?

DLP solutions should be equipped to scan a broad set of data types, including personally identifiable information (PII), protected health information (PHI), finance and payment card information (PCI), health and networking data, credentials and secrets (API keys, cryptographic keys), and more.

Nightfall comes with pre-built detectors out of the box that cover a comprehensive set of data types, industries, and geographies.

Nightfall also lets you add custom detectors, rules, keywords, and regexes. Review our list of Detectors and learn more about them in our Help Center.

Does secrets detection & DLP scan files too?

  • Nightfall supports a broad set of file types including but not limited to xls/xlsx, doc/docx, csv, plain text, ppt/pptx, PDF, HTML, and more.

How do I get started?

  • To get started with Nightfall, schedule a call with our sales team or contact us directly at sales@nightfall.ai with any questions.

About Nightfall

Nightfall is the industry’s first cloud-native DLP platform that discovers, classifies, and protects data via machine learning. Nightfall is designed to work with popular SaaS applications like Slack, Google Drive, GitHub, Confluence, Jira, and many more via our Developer Platform. You can schedule a demo with us to see the Nightfall platform in action.