GitHub DLP (Data Loss Prevention): The Essential Guide

Environments like GitHub present data exposure risk in the form of secrets leakage and sensitive PII leaking from repositories. Read this free online guide to learn about the problem of secrets exposure and leakage in GitHub, and how to implement secrets detection and scanning to reduce this risk. You may also download this content here.

What is secrets leakage, and why is it a problem?

Git-based repositories are collaborative and distributed, which can create environments where secrets and credentials are exposed without anyone noticing. Developers can push commits at any time of day, and if no review process is in place, they could push code that contains credentials and other sensitive tokens.

This increases the risk of sensitive information falling into the hands of a threat actor:

Through the hijacking of an authorized account.
Through improper repository permissions management (e.g. a repo with sensitive data being set to public, then indexed).

Scanning for secrets (API keys, passwords, tokens, cryptographic keys) and business-critical PII within code repositories at scale is a challenge without the right tools in place.

Watch the following video for more context about this risk:

[youtube:f_vZHhVzhOY]

Real world examples of secrets exposure in GitHub

Read our original reporting to learn about how data can be exposed within environments like GitHub:

2020 GitHub hack examples. Read about five stories from 2020 that illustrate just how common secrets leakage can be.
Supply-side GitHub attacks in 2022. In 2022 we wrote two stories about how supply-side attacks leveraging OAuth tokens from GitHub-connected apps could have exposed code repositories to threat actors. Read our analysis of these incidents and how we helped a customer address risk before and during one of these attacks.

Watch the following video for more examples, and a detailed illustration of how secrets leakage occurs in GitHub:

[youtube:dmSk8X4pJgE]

What types of credentials & secrets get exposed in GitHub repos?

According to studies like North Carolina State University’s How Bad Can It Git?, almost 2,000 new unique secrets leak daily on GitHub. These include API keys, access tokens, and encryption keys from some of the most popular services.

Secret type	Total	Unique
Google API Key	212,892	85,311
RSA Private Key	158,011	37,781
Google OAuth ID	106,909	47,814
General Private Key	30,286	12,576
Amazon AWS Access Key ID	26,395	4,648
Twitter Access Token	20,760	7,935
EC Private Key	7,838	1,584
Facebook Access Token	6,367	1,715
PGP Private Key	2,091	684
MailGun API Key	1,868	742
MailChimp API Key	871	484
Stripe Standard API Key	542	213
Twilio API Key	320	50
Square Access Token	121	61
Square OAuth Secret	28	19
Amazon MWS Auth Token	28	13
Braintree Access Token	24	8
Picatic API Key	5	4
Total	575,456	201,642

What are the consequences & harms of secrets exposure?

Data exfiltration – Secrets can be used to exfiltrate data from the systems they're associated with.
Cryptojacking – If secrets allow access to compute resources, threat actors can hijack services, like AWS, in order to siphon compute resources for cryptomining or other CPU-intensive tasks. While cryptocurrency industry developments (like the 2022 market crash and the move to Ethereum 2.0) have reduced the demand for crypto-related compute, cryptojacking campaigns remain somewhat common.
Ransomware – Threat actors can clone and encrypt whatever data is contained in the environments associated with a secret.
Extortion/monetary loss, reputational harms, and compliance violations

How can you detect secrets in GitHub code repositories?

Some examples of tools you can use to detect secrets include:

GitHub provides an automatic token scanning service for a limited number of token types for popular services (e.g. AWS, Azure, Alibaba).
TruffleHog – originally written in Python (rewritten in Go as of version 3), uses regex and entropy-based flagging.
Gitrob – written in Go, uses keywords and tries for a broader detection range than API keys.
Git-secrets – limited to searching for AWS keys.

How do most secret detection tools work?

There are three major types of secrets scanning and detection methods that are in use today. These include:

Regular expressions – Regular expressions (regex) are used to search for the characters expected to appear in a secret, but a single regex cannot capture the variation in key formats across different services, e.g. AWS or GCP, so each format needs its own pattern.

Entropy – Entropy refers to the amount of complexity or variability in a string of characters (see Shannon entropy). Setting thresholds for entropy can help build an informed determination about the likelihood that a string is a credential/secret, as opposed to any other piece of information.
Machine learning – In the context of secrets, machine learning refers to algorithms trained on features extracted from a broad set of API key patterns and their surrounding context in code. ML can determine whether a character string is a credential/secret based on the context of the finding, without relying on indicators like naming conventions, regexes, or entropy thresholds. With techniques like natural language processing (NLP) and deep learning, naming conventions don’t matter; only meaning does.
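
To make the first two methods concrete, here is a minimal Python sketch that combines a few regex patterns with a Shannon entropy check. It is an illustration only, not how any of the tools named above are implemented: the simplified patterns, the rule that a candidate token is a run of 20 or more key-like characters, and the 4.0 bits-per-character threshold are all assumptions chosen for this example.

```python
import math
import re
from collections import Counter

# Illustrative, simplified patterns (assumptions for this sketch).
# Real scanners ship many more patterns and tune them per service.
PATTERNS = {
    "AWS Access Key ID": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "Google API Key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "Private Key Header": re.compile(r"-----BEGIN (?:RSA |EC |PGP )?PRIVATE KEY"),
}

# A candidate token is any run of 20+ characters commonly seen in keys.
TOKEN = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")
ENTROPY_THRESHOLD = 4.0  # hypothetical cut-off, in bits per character


def shannon_entropy(text):
    """Return the Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def scan_line(line):
    """Return (finding_type, matched_text) pairs for one line of code."""
    # Pass 1: known formats, matched by regex.
    findings = [
        (name, match.group(0))
        for name, pattern in PATTERNS.items()
        for match in pattern.finditer(line)
    ]
    already_matched = {text for _, text in findings}
    # Pass 2: unknown formats, flagged when the token looks random enough.
    for token in TOKEN.findall(line):
        if token not in already_matched and shannon_entropy(token) >= ENTROPY_THRESHOLD:
            findings.append(("High-entropy string", token))
    return findings
```

For example, calling scan_line on a line containing AWS's documented placeholder key AKIAIOSFODNN7EXAMPLE reports it as an AWS Access Key ID, while a long run of a single repeated character is ignored because its entropy is zero. The sketch also shows the trade-off described above: the regex pass only finds formats someone has written a pattern for, and the entropy pass depends on a threshold that has to be tuned per code base.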

Nightfall for GitHub uses the third method, machine learning, to detect secrets:

At rest, through historical scans of full repositories
In real time, on each new code push event or in the CI/CD process
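
The two modes differ mainly in what text is handed to the detector. As a rough sketch (not Nightfall's implementation), a push-time or CI check might scan only the lines that a unified diff adds, such as the output of git diff, while an at-rest scan would walk the full contents and history of a repository. The function names and the single AWS pattern below are assumptions made for illustration.

```python
import re

# Hunk header of a unified diff, e.g. "@@ -1,2 +1,3 @@".
HUNK = re.compile(r"^@@ -\d+(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")
# Single illustrative detector (an assumption); swap in any detector here.
AWS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def added_lines(diff_text):
    """Yield (path, line_number, text) for each line a unified diff adds."""
    path, new_no, old_left, new_left = None, 0, 0, 0
    for raw in diff_text.splitlines():
        if old_left > 0 or new_left > 0:  # inside a hunk body
            if raw.startswith("+"):
                yield path, new_no, raw[1:]
                new_no += 1
                new_left -= 1
            elif raw.startswith("-"):
                old_left -= 1
            elif raw.startswith("\\"):
                continue  # "\ No newline at end of file" marker
            else:  # unchanged context line
                new_no += 1
                old_left -= 1
                new_left -= 1
            continue
        if raw.startswith("+++ "):
            target = raw[4:].strip()
            path = target[2:] if target.startswith("b/") else target
        else:
            header = HUNK.match(raw)
            if header:
                # Omitted counts default to 1 in the unified diff format.
                old_left = int(header.group(1) or 1)
                new_no = int(header.group(2))
                new_left = int(header.group(3) or 1)


def scan_diff(diff_text):
    """Return (path, line_number, secret) for secrets found in added lines."""
    return [
        (path, line_no, match.group(0))
        for path, line_no, text in added_lines(diff_text)
        for match in AWS_KEY_ID.finditer(text)
    ]
```

Scanning only added lines keeps a push-time check fast, but it cannot see secrets that were committed before the check existed, which is why a historical scan of the full repository is still needed.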

What is Nightfall?

Nightfall is a platform to discover, classify and protect sensitive data across cloud SaaS & cloud infrastructure.

Nightfall supports compliance efforts with a number of industry standards like PCI DSS, GDPR, HIPAA, CCPA, and more.
Nightfall works by continuously monitoring data flowing in and out of data silos and classifying that data with machine learning. Data marked as sensitive can be automatically quarantined, deleted, and redacted with workflows.

Nightfall integrates with GitHub via OAuth 2.0, meaning you can get started immediately. Integrate in seconds, then tell Nightfall which GitHub orgs and repos should be scanned in real time for API keys, encryption keys, passwords, and more.

Watch a demo video of Nightfall for GitHub.

How does Nightfall differ from existing tools?

Nightfall DLP is the industry’s first cloud-native data loss prevention solution that can discover, classify, and protect sensitive data in cloud environments.

Designed to address the low accuracy of tools relying on traditional methods like regexes or entropy thresholds, Nightfall is trained on features extracted from a broad set of API key patterns and their surrounding context in code.

Unlike other tools, Nightfall can be used to discover and protect against both PII and credential leakage across your code base.

Nightfall integrates with more than just GitHub. Apply the same detection rules across Jira, Confluence, Slack, and more!

Nightfall for GitHub key features & benefits

Scan an entire GitHub organization on every push to detect credentials, PII, and other secrets in public or private repositories via high-accuracy machine learning.

Choose which repos to scan as well as exclude specific tokens, files, and directories from scans via an allow list.

Leverage pre-tuned detectors to discover secrets from any service or build custom detectors.

View risk from an intuitive dashboard and inform developers of violations via Jira tickets. Once a secret is rotated, resolve all violations involving that secret from the same dashboard.

Send violation alerts to Slack and export results into a SIEM or reporting tool with custom webhooks.

Drill into each violation to see details on the secret, the code snippet in GitHub, and any other violations with the same secret.

What can secrets detection & DLP detect in code repositories?

DLP solutions should be equipped to scan a broad set of data types, including personally identifiable information (PII), protected health information (PHI), finance and payment card information (PCI), health, networking, credentials and secrets (API keys, cryptographic keys), and more.

Nightfall comes with pre-built detectors out of the box that cover a comprehensive set of data types, industries, and geographies.

Nightfall provides the ability to add in custom detectors, rules, keywords, and regexes as well. Review our list of Detectors and learn more about them in our Help Center.

Does secrets detection & DLP scan files too?

Nightfall supports a broad set of file types including but not limited to xls/xlsx, doc/docx, csv, plain text, ppt/pptx, PDF, HTML, and more.

How do I get started?

To get started with Nightfall, schedule a call with our sales team or contact us directly at sales@nightfall.ai with any questions.