3 Critical Lessons from 2020’s Largest GitHub Leaks

2020 has been a very challenging year for teams and organizations across the world. This has been especially true for security teams, who have been responsible for managing the technological risks associated with their organization’s response to the pandemic. With security teams focused on mitigating the pandemic’s seismic impact on their organization’s infrastructure, some of the security problems that emerged before the pandemic have been overlooked. Much of this year’s focus has gone to the adoption of cloud tools like Zoom, Slack, and the many other platforms that were critical for enabling remote work. As a result, problems that have been plaguing organizations for years, like data leakage from cloud platforms such as Git-based code repositories, have flown somewhat under the radar. The problem has been illustrated by a number of high-profile news stories about companies exposing data within code repositories. Let’s look at five important stories and reflect on the lessons that developers and security teams should take away from them.

Major Canadian telecom Rogers Communications leaves source code up, possibly for years

Near the beginning of the year, security researcher Jason Coulls discovered two open accounts with source code, usernames, passwords, and private keys for Rogers Communications. The source code seems to date back to 2015, making it unclear how much of this code is deprecated. Rogers issued statements to The Register and IT World Canada, both of which reported on Coulls’ findings, indicating that no customer data had been compromised and that the exposure posed no security risk. Coulls, however, later found five more open folders with a limited number of device identifiers and phone numbers. Coulls also suggested that the real boon for would-be thieves is that this code could provide insight into Rogers’ system architecture or potential weaknesses in the ISP’s website. This might still be true even if this particular code is no longer in production.

An AWS employee committed secrets to a public personal repository

In late January, Gizmodo broke a story about an AWS DevOps Cloud Engineer who committed nearly a gigabyte’s worth of data to a personal GitHub repository bearing their own name. Although the contents of this repository weren’t completely known, researchers at UpGuard determined that the data was likely a risk to either the employee or their employer, Amazon. Among the data were confidential documents as well as lists of AWS and RSA key pairs. This particular story highlights the risks posed by individual actors who, either intentionally or unintentionally, move their work to environments outside their organization.


One story suggests that leaks still go unnoticed and happen frequently

This summer, the outlet BleepingComputer reported on a developer and cybersecurity enthusiast who manages a public repository containing leaked source code from dozens of companies across multiple industries. The code featured in this repo comes from names like Adobe, Qualcomm, Motorola, GE Appliances, Nintendo, Roblox, and many others. In some cases, this code is years old or open source. However, in other instances, the code might be proprietary and of extreme importance, as several companies have issued copyright notices to request that their code be removed from the repository. Despite the varied nature of the code in this repository, it very likely serves as a microcosm of the broader trends affecting companies adopting cloud-based repositories.

An August story revealed that organizations are still hard-coding secrets within codebases

In August of this year, it was revealed that nine U.S.-based healthcare organizations leaked protected health information (PHI) for at least 150,000 patients. Some estimates put the number closer to 200,000. The leaked data was found by Dutch security researcher Jelle Ursem, who identified nine separate incidents involving improper practices such as hard-coding login credentials within code, using public repositories for production, failing to activate two-factor authentication for sensitive email accounts, and leaving repositories active when they’re no longer being used. Ursem claims he was able to find some exposed data within 10 minutes with variations of simple search terms like “Medicaid password FTP,” suggesting that these orgs might not even have security best practices for developers in place.

A file upload exposed the data of 16 million COVID-19 patients

In late November, an employee of a Brazilian hospital uploaded to their personal GitHub account a spreadsheet with usernames, passwords, and access keys for government databases containing the information of Brazilian COVID-19 patients. The databases whose credentials were leaked contained names, addresses, symptoms, and medical history. Among those who had their data exposed were President Jair Bolsonaro, seven ministers, and 17 provincial governors. The Brazilian government was eventually notified of the exposure, the spreadsheet was removed from GitHub, and the exposed access keys were revoked.

What are the 3 lessons we can learn from the biggest GitHub leaks of 2020?

Collectively, these breaches teach us significant lessons about the causes and consequences of GitHub leaks and about what developers should keep watching for going into 2021 and beyond. These lessons include:

1. Organizations need to know when code is moved to repositories outside their environments.

When an employee moves code or uploads data to a personal repository, this arguably poses a greater risk to an organization than sensitive data exposure within its own environment, because the organization has no visibility into or control over sensitive data in that context. In two of the stories we shared, one involving an AWS engineer and the other involving a Brazilian hospital employee, the affected employers were only notified once the data was discovered by a third party. This highlights just how important it is to have visibility into the activities going on within your GitHub organization and to ensure your team members understand proper practices.

2. Even code no longer in production can be dangerous if exposed.

This lesson, which comes to us specifically from the Rogers story, is another important one to remember. Although obsolete code and revoked credentials pose no direct threat to your systems, Jason Coulls suggests that any code, regardless of its status, provides information that threat actors wouldn’t have otherwise. An attacker could learn, for example, the languages you use to architect your systems or how your web applications interact with your databases. Unless your organization radically changes its system architecture very frequently, it’s likely that this information could be useful in helping bad actors plan a future attack.

3. Organizations must create and enforce best practices for their developers to follow.

Many of the leaks on this list involved developers hard-coding credentials, committing code to repositories that were public facing, or otherwise engaging in poor practices. While developers should internalize better practices, like ensuring they don’t commit code containing secrets, organizations should standardize development practices across their teams and then enforce these standards. Tools like Nightfall DLP can give you the visibility to scan historical commits for secrets and credentials and ensure that future commits don’t contain such content.
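
To make the idea of checking commits for secrets concrete, here is a minimal Python sketch of a pattern-based scan. It is not how Nightfall DLP or any other product works. The pattern set and the function name scan_text are assumptions made for illustration, and real detectors cover far more credential types with fewer false positives.

```python
import re

# Illustrative patterns only (assumed for this sketch, not an exhaustive or
# vendor-supplied detector set).
PATTERNS = {
    # AWS access key IDs: "AKIA" or "ASIA" followed by 16 uppercase letters/digits.
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    # PEM header that marks the start of a private key.
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
    # A password-like name assigned a quoted literal of four or more characters.
    "hardcoded_password": re.compile(
        r"(?i)(?:password|passwd|pwd)\s*[:=]\s*['\"][^'\"]{4,}['\"]"
    ),
}


def scan_text(text):
    """Return a list of (line_number, pattern_name) for each suspicious line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

A check like this could be run against staged changes in a pre-commit hook or against each file in a repository’s history. A line that reads a credential from the environment at runtime, rather than assigning a quoted literal, would not be flagged by the password pattern above.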

About Nightfall

Nightfall DLP is a data loss prevention platform designed to secure SaaS and cloud infrastructure. Nightfall helps developers with tools like the Nightfall DLP GitHub Action, which allows them to scan pull requests for secrets and credentials before committing code. The GitHub Action will post review comments alerting developers to any issues with their code. For those looking for DLP protection for their entire GitHub organization, Nightfall DLP is also available as a direct integration into your GitHub environment. Both services leverage Nightfall’s 200+ machine learning-based detectors, trained to identify common types of personal information, including industry-specific data like credit card numbers, bank accounts, medical IDs, and much more. You can learn more about Nightfall by scheduling a demo with us below.
