How to Scan GitHub Repositories for Committed Secrets and Other Code Snippets
GitHub estimated that over 44 million repositories were created in 2019 and that over 10 million new developers joined the platform. This comes as no surprise, as GitHub is the world’s largest host of source code. With that designation comes a substantial volume of committed code. While cloud-based version control platforms like GitHub are a boon for organizations seeking to productively manage large distributed teams, such environments can make it incredibly easy for mistakes, like hard-coded credentials or other types of exposed secrets, to proliferate. As such, many teams have begun seeking ways to quickly search their repositories for such content. In this post, we’re going to give a quick overview of the options you have for conducting this process.
Can I use GitHub search to find exposed credentials?
GitHub provides search functionality on its site, and if you know what you’re looking for, it can be an excellent starting point for a quick, on-the-spot manual review. GitHub search uses an Elasticsearch cluster to index projects every time a change is pushed to GitHub, meaning that through GitHub’s search, you can find publicly available code on all of GitHub or narrow results to a particular organization or repository. GitHub’s search functionality, at least when searching publicly, does have some limitations, especially around searching code in forks, in repositories exceeding 500,000 files, or in branches other than the default. However, this option is not intended to be comprehensive but instead to provide a team with a quick way to scan short code snippets in recent memory.
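As a rough illustration of narrowing a search to an organization or repository, the sketch below builds a query URL for GitHub's REST code search endpoint using the `org:` and `repo:` qualifiers. The helper name `build_code_search_url` and the example organization and repository names are hypothetical, and the sketch only constructs the URL; sending the request (including any authentication and rate-limit handling GitHub requires) is left out.

```python
from typing import Optional
from urllib.parse import urlencode

# GitHub's REST endpoint for code search.
GITHUB_CODE_SEARCH = "https://api.github.com/search/code"


def build_code_search_url(term: str,
                          org: Optional[str] = None,
                          repo: Optional[str] = None) -> str:
    """Build a code search URL, optionally scoped to an org or a repo.

    `term` is the snippet to look for; `org` and `repo` map to the
    `org:` and `repo:` search qualifiers.
    """
    parts = [term]
    if org:
        parts.append(f"org:{org}")
    if repo:
        parts.append(f"repo:{repo}")
    # urlencode percent-encodes the qualifiers and joins words with "+".
    return f"{GITHUB_CODE_SEARCH}?{urlencode({'q': ' '.join(parts)})}"


# Hypothetical organization name, used only for illustration.
print(build_code_search_url("AKIA", org="example-org"))
```

The same limitations described above still apply to results returned this way, so this is best treated as a spot check rather than a full scan.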
Additionally, many users have developed command line interfaces, like gh-search-cli, that let you search code within repositories on GitHub from the command line. GitHub itself is also developing a CLI, but it’s currently in beta with a limited number of features and does not yet have full search functionality, though it could in the future. However, for more robust and comprehensive secrets detection, you’ll likely want to use one of the more dedicated tools that have become popular in recent years.
What do most dedicated secrets detection tools have in common?
For more comprehensive searching functionality, there are a growing number of tools tuned to finding specific types of credentials and secrets within repositories. These include tools like truffleHog, Auth0’s Repo Supervisor, AWS’s Git Secrets, Yelp’s Detect Secrets, or the UK Home Office’s Repo Security Scanner. We’ve covered some of these tools and many others like them before. For a more in-depth look, you can read our post detailing Radar, our own solution, or read our guide to secrets detection. Below, we’ll briefly cover the scope and limitations of these tools as a group.
1. The bulk of these solutions are tuned to detect a limited number of secrets
Given the wide variation of secrets and credentials that can be found within a codebase, developers have approached the problem of secrets detection from a wide range of angles. The most effective tools are designed to identify one specific type of secret exclusively. For example, Git Secrets, which was developed by AWS Labs, is specifically designed to detect committed AWS credentials using prebuilt regexes that capture the unique patterns typically associated with AWS keys.
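To make the regex approach concrete, here is a minimal sketch of pattern-based detection. The pattern is a simplified assumption (a known prefix such as `AKIA` followed by 16 uppercase letters or digits) and is not the exact rule set that Git Secrets ships; the function name `find_aws_key_ids` is likewise hypothetical. The sample value is the placeholder key that appears in AWS documentation.

```python
import re

# Assumed, simplified pattern for AWS access key IDs: a 4-letter prefix
# followed by 16 uppercase letters or digits (20 characters in total).
AWS_ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_aws_key_ids(text: str) -> list[tuple[int, str]]:
    """Return (line_number, match) pairs for each candidate key ID in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in AWS_ACCESS_KEY_ID.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


sample = 'region = "us-east-1"\naws_access_key_id = "AKIAIOSFODNN7EXAMPLE"\n'
print(find_aws_key_ids(sample))  # → [(2, 'AKIAIOSFODNN7EXAMPLE')]
```

A pattern this narrow produces few false alarms for the one key format it targets, which is the trade-off described above: precision for a single secret type at the cost of missing everything else.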
Tools that can detect keys across more domains and types, like certificates and passwords, tend to use Shannon entropy thresholds to indicate whether a particular string is random enough to resemble a secret like a password. For example, Repo Supervisor provides an entropy meter and allows users to set a threshold that determines the sensitivity of the scanner. While this can allow for more flexible detection of secrets across different categories, it tends to be a noisy detection method, resulting in more false positives and an extensive number of results to review. This leads to the second problem that many of the tools widely available today experience.
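The entropy check described above can be sketched in a few lines. This is an illustration of the general technique, not the implementation used by Repo Supervisor or truffleHog; the threshold of 4.0 bits per character and the minimum token length of 20 are assumed values chosen only for the example.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def high_entropy_tokens(line: str,
                        threshold: float = 4.0,   # assumed sensitivity
                        min_length: int = 20) -> list[str]:
    """Return whitespace-separated tokens that look random enough to flag."""
    return [
        token
        for token in line.split()
        if len(token) >= min_length and shannon_entropy(token) >= threshold
    ]


# A repeated character carries no information; a varied string scores high.
print(shannon_entropy("aaaa"))  # → 0.0
print(shannon_entropy("abcd"))  # → 2.0
print(high_entropy_tokens("key = abcdefghijklmnopqrstuvwx"))
```

Note that the last call flags a plain run of alphabet letters, which is clearly not a secret. That is the noise problem in miniature: entropy measures how varied a string is, not whether it grants access to anything.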
2. High false positive rates lead to inconsistent detection of secrets
Another issue that’s relatively consistent among existing tools is that they tend to have higher than expected false positive rates, even when they’re tuned to detect specific types of secrets. For example, we found that only one of the tools we tested, Gitrob, had an F1 score above 50% out of the box. The F1 score is the harmonic mean of precision and recall, so a score below 50% means that a tool misses many real secrets, reports many false positives, or both.
It’s worth noting, though, that once tuned, truffleHog’s entropy detection method received an even higher score of 77%. However, along with that score came a very high number of false positives as the algorithm became noisier.
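For readers unfamiliar with the metric, the sketch below shows how an F1 score is computed from counts of true positives, false positives, and false negatives. The counts in the example are made up for illustration and are not taken from the tests referenced above.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 score: the harmonic mean of precision and recall."""
    flagged = true_pos + false_pos      # everything the tool reported
    actual = true_pos + false_neg       # every real secret in the data
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical scanner: finds 40 of 60 real secrets but also raises
# 60 false alarms, so precision is 0.40 and recall is about 0.67.
print(round(f1_score(true_pos=40, false_pos=60, false_neg=20), 2))  # → 0.5
```

Because the harmonic mean is pulled toward the lower of the two values, a tool cannot reach a high F1 score by excelling at recall alone while flooding reviewers with false positives.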
3. Limited scanning methods may make operationalizing security harder
A final consideration is that many of these tools are rather limited in how they let you scan repositories. Some tools, like Git Secrets, are intended to be used before commits, while others, like Gitrob, are intended to look at committed code. A handful of tools, like GitLeaks, might be more comprehensive, allowing for whole-organization scans both pre- and post-commit. While users will likely find some options that map to their intended use case, their choices will ultimately be constrained by this limitation. On top of that, the open source tools mentioned require you to set up, manage, and run them within your own infrastructure.
These three limitations of existing tools result in organizations adopting solutions that only marginally satisfy their security needs. While some security is better than no security, we believe that organizations shouldn’t have to compromise when it comes to securing their data. That’s why our solution is designed to address the shortcomings we’ve identified in other tools.
Nightfall provides a complete secrets detection solution
1. Machine learning expands both the scope and accuracy of detection for Nightfall
At its heart, the Nightfall platform is a machine learning platform built from the ground up and centered around protecting the secrets, credentials, and data organizations use in their everyday work. To this end, we’ve built over a hundred detectors specifically tuned to the types of business-critical data that can often be left unattended in cloud systems. With our GitHub-specific integrations, this translates to the ability to accurately detect API keys for nearly any service, as well as encryption keys and certificates more generally. Because each of our detectors is trained on a dataset that includes tokens as well as their context, our tests have consistently found our platform to have both a low false positive rate and a low false negative rate. Nightfall also permits the use of an allow list to exclude true positives that, for whatever reason, are meant to remain in your codebase, such as credentials in test repositories.
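The idea of an allow list can be illustrated generically. The sketch below is not Nightfall's API or implementation; the `Finding` type, the `apply_allow_list` function, and the example paths are all hypothetical, and it simply filters scanner results against known-acceptable values and path prefixes.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Finding:
    """A hypothetical scanner result: where a secret was found and its value."""
    path: str
    secret: str


def apply_allow_list(findings: Iterable[Finding],
                     allowed_secrets: frozenset = frozenset(),
                     allowed_path_prefixes: tuple = ()) -> list:
    """Drop findings whose value or location has been explicitly allowed."""
    kept = []
    for finding in findings:
        if finding.secret in allowed_secrets:
            continue  # known placeholder or test credential
        if any(finding.path.startswith(p) for p in allowed_path_prefixes):
            continue  # e.g. fixtures that intentionally contain dummy keys
        kept.append(finding)
    return kept


findings = [
    Finding("src/config.py", "AKIAIOSFODNN7EXAMPLE"),
    Finding("tests/fixtures/creds.json", "dummy-test-token"),
    Finding("deploy/prod.env", "real-looking-value"),
]
remaining = apply_allow_list(
    findings,
    allowed_secrets=frozenset({"AKIAIOSFODNN7EXAMPLE"}),
    allowed_path_prefixes=("tests/",),
)
print([f.path for f in remaining])  # → ['deploy/prod.env']
```

Keeping the allow list explicit and small matters here, since every entry is a true positive that reviewers will no longer see.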
2. Nightfall can be implemented at any part of your security process
Nightfall provides users with the ability to conduct scans manually as well as schedule scans in advance. Scheduled scans can be recurring and allow for the implementation of automatic periodic code reviews within your repos and across your organization. The Nightfall platform can also be used to detect secrets both before and after pull requests. In this way, you get coverage of all your historical data as it exists today in GitHub, as well as proactive protection going forward as developers merge in new code.
Understand the options for secrets detection
As we’ve shown, not all secrets detection tools are created equal. You’ll need to ensure that you really understand your use case to effectively assess which tools will adequately address your security risks within GitHub or other code repositories. If you want to learn more about the problem of secrets or credential leaks, read our Guide to Secrets Detection on GitHub.