How to Scan GitHub Repositories for Committed Secrets and other Code Snippets
One of the core aspects of any information security program is maintaining the confidentiality and integrity of an organization’s data. Modern cloud environments can make this difficult, as security teams have to maintain visibility and manage controls across a wide variety of SaaS and cloud infrastructure systems. Among these systems, code repositories like GitHub can be a lesser-known source of secrets leakage. GitHub estimated that in 2019 over 44 million repositories were created and over 10 million new developers joined the platform. This comes as no surprise, as GitHub is the world’s largest host of source code, and with that designation comes a substantial volume of committed code.
While cloud-based version control platforms like GitHub are a boon for organizations seeking to productively manage large distributed teams, such environments make it incredibly easy for mistakes, like hard-coded credentials or other types of exposed secrets, to proliferate. As a result, many teams have begun seeking ways to quickly search their repositories for such content. In this post, we’ll go over the scope of the secrets exposure problem and discuss the options you have for finding and removing secrets from GitHub.
What are credentials and secrets?
Credentials and secrets are sensitive pieces of data, like passwords, API keys, encryption keys, tokens, and certificates, that should be encrypted or otherwise secured within a cloud environment but are often found in code. They act as keys that unlock protected information or resources, or that identify a privileged end user or role. For that reason, they should always be kept private and never shared openly within an organization. The reality, though, is that credentials and secrets are in danger of being exposed or shared on cloud systems daily. For example, they may be embedded directly in code repositories, or shared via email or chat among developers and end users.
The video below provides some more context on how and why secret leakage and exfiltration occurs.
Understanding the scope of credentials and secrets leakage
Code repositories, or repos, like many other highly collaborative SaaS environments, create increased opportunities for sensitive data exposure to occur without warning or notice. Take, for instance, a code repo where both internal and external collaborators submit code. They can push commits at any time of day, and if no review process is in place, they can push code that contains credentials and other sensitive tokens. Even if the repo is private, you may still wish to strictly control which types of tokens are allowed in your codebase in order to maintain best practices.
Research, like the 2019 North Carolina State University study titled “How Bad Can It Git? Characterizing Secret Leakage in Public GitHub Repositories,” has quantified just how common credentials and secrets exposure is within GitHub. The researchers who conducted this study found that thousands of keys leak from public repositories on a daily basis, with hardcoded cryptographic keys and API keys being critical sources of leakage. To address this long-running problem, GitHub has offered limited secret scanning for code pushed to public repositories, covering popular token types like those for AWS, Azure, and Alibaba. GitHub notifies the service provider of any credentials leak and lets them decide how they want to address the issue.
Despite this, secrets leaks still occur on the platform. Earlier this year, a story broke about an AWS DevOps Cloud engineer who inadvertently made public nearly a gigabyte of sensitive data after making a commit to a personal repository. Another story from this year involved Canadian telecom company Rogers Communications having passwords and source code exposed on GitHub. Leakage isn’t limited to GitHub, though; for example, German automaker Daimler leaked Mercedes-Benz’s source code for smart car components through an unsecured GitLab server last month.
Watch the video below for a brief history of data breaches leveraging codebases like GitHub as an attack vector.
How can you prevent secrets leaks in GitHub?
Despite the scope of the problem, there are a variety of practices that organizations can adopt to begin reducing the risk of credentials and secrets being exposed within their codebase.
Standardize coding conventions and practices
In collaborative cloud environments with high volumes of activity, it’s very easy for organizations to neglect putting rules that regulate user behavior into place. This digital “housekeeping” is essential: when users have different conceptions of what behaviors are allowed within an environment, things can turn into the wild west, where no one is responsible for cloud and data security. One essential rule is to standardize coding conventions by eliminating practices like hard-coding credentials, and to develop a consistent code review process that evaluates whether or not the designated practices have been followed.
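One common way to apply a "no hard-coded credentials" convention is to read secrets from the environment at runtime. The following is a minimal sketch of that idea, not a prescribed implementation; the helper name `load_secret`, the exception class, and the variable name `EXAMPLE_API_TOKEN` are all hypothetical.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not present in the environment."""


def load_secret(name: str) -> str:
    """Return the secret stored in the environment variable `name`.

    Failing loudly avoids silently falling back to a hard-coded default.
    """
    value = os.environ.get(name)
    if not value:
        raise MissingSecretError(f"Environment variable {name!r} is not set")
    return value


# Avoid:  API_TOKEN = "<literal token committed to the repo>"
# Prefer: API_TOKEN = load_secret("EXAMPLE_API_TOKEN")  # hypothetical name
```

A reviewer can then check for a single pattern (calls to the helper) rather than hunting for string literals that might be secrets.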
Make sure your production environment remains private
When it comes to secrets leakage, one piece of low-hanging fruit to address is permission settings. Your organization needs to maintain visibility into its production environments at all times, ensuring that any associated code repos remain private. Within your GitHub org, make sure that org owners periodically review repo privacy settings so that repos which shouldn’t be public remain private.
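A periodic privacy review can be partly automated. The sketch below assumes the repository objects come from GitHub's REST endpoint for listing organization repositories, where each object carries `full_name` and `private` fields. The function names are hypothetical, pagination is omitted, and the token is assumed to have permission to list the org's repos.

```python
import json
import urllib.request


def public_repos(repos: list) -> list:
    """Return the sorted full names of repositories that are not private.

    A repository with no "private" field is treated as public so that it
    gets reviewed rather than silently skipped.
    """
    return sorted(r["full_name"] for r in repos if not r.get("private", False))


def fetch_org_repos(org: str, token: str) -> list:
    """Fetch the first page (up to 100) of an organization's repositories.

    Assumption: `token` is an access token allowed to list the org's
    repositories. Pagination is left out to keep the sketch short.
    """
    request = urllib.request.Request(
        f"https://api.github.com/orgs/{org}/repos?per_page=100&type=all",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Running `public_repos(fetch_org_repos("example-org", token))` on a schedule would give org owners a short list to compare against the repos that are meant to be public.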
Implement periodic reviews of your codebase
Reviewing code before commits is pretty standard practice, but you may also wish to standardize reviewing code after it’s been committed. Periodic code reviews give you an opportunity to ensure your codebase remains free of leakable secrets.
For a more in-depth discussion of best practices, listen to Nightfall CTO Rohan Sathe discuss some in the clip below, or read our GitHub remediation guide online for free.
Can I use GitHub search to find exposed credentials?
GitHub provides search functionality on its site, and if you know what you’re looking for, it can be an excellent starting point for a quick, on-the-spot manual review. GitHub search uses an Elasticsearch cluster to index projects every time a change is pushed to GitHub, meaning that through GitHub’s search you can find publicly available code across all of GitHub or narrow results to a particular organization or repository. GitHub’s search functionality, at least when searching publicly, does have some limitations, especially around searching code in forks, in repositories exceeding 500,000 files, or in branches other than the default. This option is not intended to be comprehensive; instead, it gives a team a quick way to look for short code snippets that they already have in mind.
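As a rough illustration of narrowing a search to an organization or repository, the sketch below builds a GitHub code search URL using the `org:` and `repo:` search qualifiers. The function name and the example org are hypothetical, and the search term is only an example of something you might look for.

```python
from typing import Optional
from urllib.parse import urlencode


def code_search_url(term: str,
                    org: Optional[str] = None,
                    repo: Optional[str] = None) -> str:
    """Build a GitHub code search URL, optionally scoped to an org or repo."""
    parts = [term]
    if org:
        parts.append(f"org:{org}")    # limit results to one organization
    if repo:
        parts.append(f"repo:{repo}")  # limit results to "owner/name"
    query = urlencode({"q": " ".join(parts), "type": "code"})
    return f"https://github.com/search?{query}"


# Example: search a hypothetical org for a commonly leaked setting name.
print(code_search_url("aws_secret_access_key", org="example-org"))
```

Opening the printed URL in a browser runs the scoped search, subject to the limitations described above.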
Additionally, many users have developed command line interfaces, like gh-search-cli, that let you search code within repositories on GitHub from the terminal. GitHub itself is also developing a CLI, but it’s currently in beta with a limited number of features. It does not yet have full search functionality, though it could in the future. For more robust and comprehensive secrets detection, however, you’ll likely want to use one of the dedicated tools that have become popular in recent years.
What do most dedicated secrets detection tools have in common?
For more comprehensive searching functionality, there are a growing number of tools tuned to finding specific types of credentials and secrets within repositories. These include tools like truffleHog, Auth0’s Repo Supervisor, AWS’s Git Secrets, Yelp’s Detect Secrets, and the UK Home Office’s Repo Security Scanner. We’ve covered some of these tools and many others like them before. For a more in-depth look, you can read our post detailing Radar, our own solution, or read our guide to secrets detection. Below, we’ll briefly cover the scope and limitations of these tools in general.
1. The bulk of these solutions are tuned to detect a limited number of secrets
Given the wide variation of secrets and credentials that can be found within a codebase, developers have gone about addressing the problem of secrets detection from a wide range of angles. The most effective tools are designed to exclusively identify a specific type of secret. For example, Git Secrets, which was developed by AWS labs, is specifically designed to detect committed AWS credentials using prebuilt regexes designed to capture unique patterns typically associated with AWS keys.
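To make the regex approach concrete, here is a minimal sketch of pattern-based detection for AWS access key IDs. The pattern is illustrative only and is an assumption, not the rule set shipped with Git Secrets or any other tool; the sample key is the placeholder value used in AWS documentation.

```python
import re

# Illustrative pattern (an assumption, not any tool's actual rule):
# AWS access key IDs are commonly 20 characters long, starting with a
# prefix such as "AKIA" followed by 16 uppercase letters or digits.
AWS_ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_aws_key_ids(text: str) -> list:
    """Return (line_number, match) pairs for every candidate key ID."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in AWS_ACCESS_KEY_ID.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


sample = 'region = "us-east-1"\naws_access_key_id = "AKIAIOSFODNN7EXAMPLE"\n'
print(find_aws_key_ids(sample))
```

Because the pattern targets one well-defined token format, it produces few false positives, but it will not notice secrets that follow any other format.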
Tools that can detect keys across more domains and types, like certificates and passwords, tend to use Shannon entropy thresholds to indicate whether a particular string is random enough to resemble a secret like a password. For example, Repo Supervisor provides an entropy meter and allows users to set a threshold that determines the sensitivity of the scanner. While this allows for more flexible detection of secrets across different categories, it tends to be a noisy detection method, resulting in more false positives and a large number of results to review. This leads to the second problem that many of the tools widely available today experience.
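The entropy approach can be sketched in a few lines. The example below computes Shannon entropy in bits per character and compares it to a threshold. The threshold of 4.0 and the minimum length of 20 are arbitrary values chosen for illustration, not the defaults of any tool named here.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_secret(token: str,
                      threshold: float = 4.0,
                      min_length: int = 20) -> bool:
    """Flag long strings whose entropy exceeds the threshold.

    Both defaults are illustrative assumptions. Lowering the threshold
    catches more secrets but also flags more ordinary strings.
    """
    return len(token) >= min_length and shannon_entropy(token) > threshold
```

A repeated word such as `password_password_password` uses only a handful of distinct characters and stays below the threshold, while a long string of mostly distinct characters exceeds it. Hashes, UUIDs, and minified code also score high, which is where much of the noise comes from.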
2. High false positive rates lead to inconsistent detection of secrets
Another issue that’s relatively consistent among existing tools is that they tend to have higher than expected false positive rates, even when they’re tuned to detect specific types of secrets. For example, we found that only one of the tools we tested, Gitrob, had an F1 score above 50% out of the box. The F1 score is the harmonic mean of precision and recall, so a score below 50% means that a tool is flagging many false positives, missing many real secrets, or both.
It’s worth noting, though, that once tuned, truffleHog’s entropy detection method received an even higher score of 77%. However, along with that score came a very high number of false positives as the algorithm became noisier.
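For readers unfamiliar with the metric, the sketch below shows how an F1 score is derived from a scanner's results. The counts in the example are hypothetical and are not taken from our tests.

```python
def precision_recall_f1(true_positives: int,
                        false_positives: int,
                        false_negatives: int) -> tuple:
    """Return (precision, recall, f1) for a set of detection results."""
    flagged = true_positives + false_positives
    actual = true_positives + false_negatives
    # Precision: share of flagged strings that really were secrets.
    precision = true_positives / flagged if flagged else 0.0
    # Recall: share of real secrets that the scanner flagged.
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical scan: 40 real secrets flagged, 60 false alarms, 10 missed.
precision, recall, f1 = precision_recall_f1(40, 60, 10)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this hypothetical case the scanner finds most of the real secrets (high recall), but more than half of what it flags is noise (low precision), which drags the F1 score down.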
3. Limited scanning methods may make operationalizing security harder
A final consideration is that many of these tools are rather limited in how they can scan repositories. Some tools, like Git Secrets, are intended to be used before commits, while others, like Gitrob, are intended to look at committed code. A handful of tools, like GitLeaks, might be more comprehensive, allowing for whole-org scans both pre- and post-commit. While users will likely find some options that map to their intended use case, their choices will ultimately be constrained by this limitation. On top of that, the open source tools mentioned require you to set up, manage, and run them within your own infrastructure.
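The difference between pre- and post-commit scanning is largely a matter of which text gets fed to the detector. The sketch below scans only the lines added by a unified diff, which could come from `git diff --cached` before a commit or from `git log -p` afterwards. The function names and the detection pattern are hypothetical and deliberately simple.

```python
import re

# Hypothetical pattern: a quoted literal assigned to a name that contains
# "password", "secret", or "token".
HARDCODED_ASSIGNMENT = re.compile(
    r"(?i)(password|secret|token)\s*=\s*['\"][^'\"]+['\"]"
)


def added_lines(diff_text: str) -> list:
    """Return the lines added by a unified diff, without the leading '+'.

    File headers ("+++ b/path") are skipped. As a simplification, an added
    line whose own content begins with "++" is skipped as well.
    """
    return [
        line[1:]
        for line in diff_text.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]


def flag_added_secrets(diff_text: str) -> list:
    """Return the added lines that match the hard-coded assignment pattern."""
    return [line for line in added_lines(diff_text)
            if HARDCODED_ASSIGNMENT.search(line)]
```

The same two functions serve both use cases: a pre-commit hook would pass in the staged diff and block the commit on any hit, while a scheduled job would walk the history and report hits for remediation.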
These three limitations of existing tools result in organizations adopting solutions that only marginally satisfy their security needs. While some security is better than no security, we believe that organizations shouldn’t have to compromise when it comes to securing their data. That’s why our solution is designed to address the shortcomings we’ve identified in other tools.
Nightfall provides a complete secrets detection solution
1. Machine learning expands both the scope and accuracy of detection for Nightfall
At its heart, the Nightfall platform is a machine learning platform built from the ground up and centered around protecting the secrets, credentials, and data organizations use in their everyday work. To this end, we’ve built over a hundred detectors specifically tuned to the types of business-critical data that can often be left unattended in cloud systems. With our GitHub-specific integrations, this translates to the ability to accurately detect API keys for nearly any service, as well as encryption keys and certificates more generally. Because each of our detectors is trained on a dataset that includes tokens as well as their context, our tests have consistently found our platform to have both a low false positive rate and a low false negative rate. Nightfall also supports an allow list for excluding true positives that, for whatever reason, are meant to remain in your codebase, such as credentials in test repositories.
2. Nightfall can be implemented at any part of your security process
Nightfall provides users with the ability to conduct scans manually as well as schedule scans in advance. Scheduled scans can be recurring, which allows for automatic periodic code reviews within your repos and across your organization. The Nightfall platform can also be used to detect secrets before and after pull requests. In this way, you get coverage of all your historical data as it exists today in GitHub, as well as proactive protection going forward as developers merge in new code.
What does Nightfall detect that’s relevant to protecting credentials and secrets?
Nightfall’s detectors are suited to detect 200+ types of credentials & secrets in both structured & unstructured data, like messages and code files. These include things like API keys, tokens, encryption keys, cookies, UUIDs, and other identifiers for platforms like AWS, GCP, Azure, Slack, Stripe, Twilio, Heroku, and many other popular services. Nightfall’s detectors are trained and tuned on vast amounts of data, so they work well out of the box – you don’t need to specify the exact types of credentials & secrets you are looking for.
Understand the options for secrets detection
As we’ve shown, not all secrets detection tools are created equal. You’ll need to ensure that you really understand your use case to effectively assess which tools will adequately address your security risks within GitHub or other code repositories. If you want to learn more about the problem of secrets or credential leaks, read our Guide to Secrets Detection on GitHub or download our webinar on protecting codebases from secrets exfiltration. You can also watch the video below to learn how Drupal hosting company Acquia protects their codebases with Nightfall.