How I made $10K in bug bounties from GitHub secret leaks

API keys, passwords, and customer data are accidentally posted to GitHub every day.

Hackers use these keys to log in to servers, steal personal information, and rack up absurd AWS charges. GitHub leaks can cost a company thousands, or even millions, of dollars in damages. Open-source intelligence gathering on GitHub has become a powerful arrow in every security researcher's quiver: researchers from NC State even wrote an academic paper on the subject.

This article, written for both bug bounty hunters and enterprise infosec teams, demonstrates common types of sensitive information (secrets) that users post to public GitHub repositories as well as heuristics for finding them. The techniques in this article can be applied to GitHub Gist snippets, too.

In the last year, I've earned nearly $10,000 from bug bounty programs on HackerOne without even visiting programs' websites thanks to these techniques. I've submitted over 30 Coordinated Disclosure reports to vulnerable corporations, including eight Fortune 500 companies.

I've also released GitHound, an open-source tool designed to automate the process of finding keys across GitHub. GitHound isn't limited to a single user or organization: it sifts through all of GitHub, using Code Search queries as an entrypoint into repositories and then using context, regexes, and some other neat tricks to find secrets.

Before we get into the automated tools and bug bounty strategies, let's talk about Code Search.

GitHub provides rich code search that scans public GitHub repositories (some content, such as forks and non-default branches, is omitted). Queries can be simple, like uberinternal.com, or can contain multi-word strings like "Authorization: Bearer". Searches can even target specific files (filename:vim_settings.xml) or specific languages (language:SQL). Searches can also contain certain boolean qualifiers like NOT and >.
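A few illustrative queries that combine these qualifiers (the domains and file names here are placeholders, not known leaks):

"corp.example.com" filename:.bash_history
"Authorization: Bearer" language:Shell
"jdbc:mysql://" extension:properties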

Knowing the rules of GitHub code search enables us to craft search dorks: queries that are designed to find sensitive information. GitHub dorks can be found online, but the best dorks are the ones that you create yourself.

For example, filename:vim_settings.xml (try it!) targets IntelliJ settings files. Interestingly, the vim_settings.xml file contains recent copy-pasted strings encoded in Base64. I recently made $2,400 from a bug bounty with this dork: SaaS API keys and customer information were exposed in vim_settings.xml.


vim_settings.xml only contains recently copy-pasted strings, but we can exploit the repository's commit history to find the entire copy-paste history. Just clone the repository and run this 14-line script, and the user's activity will be at your fingertips. GitHound also finds and scans base64 encoded strings for secrets, even in commit history.
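A rough sketch of that idea (not the original script; it assumes vim_settings.xml sits at the repository root and treats any long Base64-looking run as a candidate):

# Walk every commit that touched vim_settings.xml and decode Base64-looking strings.
# Decode failures and binary garbage are silently discarded; this is only a sketch.
git log --all --format=%H -- vim_settings.xml | while read -r commit; do
  git show "$commit:vim_settings.xml" 2>/dev/null \
    | grep -oE '[A-Za-z0-9+/]{16,}={0,2}' \
    | while read -r blob; do echo "$blob" | base64 -d 2>/dev/null && echo; done
done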

By the way: with a GitHub commit search dork, we can quickly scan all 500,000 commits that edit vim_settings.xml.


Search Heuristics for Bug Bounty Hunters

GitHub dorks broadly find sensitive information, but what if we want to look for information about a specific company? GitHub has millions of repositories and even more files, so we'll need some heuristics to narrow down the search space.

To start finding sensitive information, identify a target.

I've found that the best way to start is to find domains or subdomains that identify corporate infrastructure.

Searching for company.com probably won't provide useful results: many companies release audited open-source projects that aren't likely to contain secrets. Less-used domains and subdomains are more interesting. This includes specific hosts like jira.company.com as well as more general second-level and lower-level domains. It's more efficient to find a pattern than a single domain: corp.somecompany.com, somecompany.net, or companycorp.com are more likely to appear only in an employee's configuration files.

The usual suspects for open-source intelligence and domain reconnaissance help here:

  • Subbrute - Python tool for brute-forcing subdomains
  • ThreatCrowd - Given a domain, find associated domains through multiple OSINT techniques
  • Censys.io - Given a domain, find SSL certificates using it

GitHound can help with subdomain discovery too: add a custom regex like \.company\.com to a pattern file and run GitHound with the --regex-file flag.
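Concretely, that might look like the following (the pattern file name is arbitrary; the flags are the ones used elsewhere in this article):

echo '\.company\.com' > company-regexes.txt
echo "company.com" | ./git-hound --dig-files --dig-commits --regex-file company-regexes.txt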

After finding a host or pattern to search, play around on GitHub search with it (I always do this before using automated tools). There are a few questions I like to ask myself here:

  1. How many results came up? If there are over 100 pages, I'll likely need to find a better query to start with (GitHub limits code search results to 100 pages).
  2. What kind of results came up? If the results are mostly (intentionally) open-source projects and people using public APIs, then I may be able to refine the search to eliminate those.
  3. What happens if I change the language? language:Shell and language:SQL may have interesting results.
  4. Do these results reveal any other domains or hosts? Results in the first few pages will often include a reference to another domain (e.g. searching for jira.uber.com may reveal the existence of another domain entirely, like uberinternal.com).

I spend most of my time in this step.

It's crucial that the search space is well-defined and accurate. Automated tools and manual searching will be faster and more accurate with the proper query.

Once I find results that seem interesting based on the criteria above, I run the query through GitHound with --dig-files and --dig-commits to look through the entire repository and its history.

echo "uberinternal.com" | ./git-hound --dig-files --dig-commits

echo "uber.com" | ./git-hound --dig-files --language-file languages.txt --dig-commits

echo "uber.box.net" | ./git-hound --dig-files --dig-commits

GitHound also locates interesting files that simply searching won't find, like .zip or .xlsx files. Importantly, I also manually go through results since automated tools often miss customer information, sensitive code, and username/password combinations. Oftentimes, this will reveal more subdomains or other interesting patterns that will give me ideas for more search queries. It's important to remember that open-source intelligence is a recursive process.

This process almost always finds results. Leaks usually fall into one of these categories (ranked from most to least impactful):

  1. SaaS API keys - Companies rarely impose IP restrictions on APIs. AWS, Slack, Google, and other API keys are liquid gold. These are usually found in config files, bash history files, and scripts.
  2. Server/database credentials - These are usually behind a firewall, so they're less impactful. Usually found in config files, bash history files, and scripts.
  3. Customer/employee information - These hide in XLSX, CSV, and XML files and range from emails all the way to billing information and employee performance reviews.
  4. Data science scripts - SQL queries, R scripts, and Jupyter projects can reveal sensitive information. These repos also tend to have "test data" files hanging around.
  5. Hostnames/metadata - The most common result. Most companies don't consider these a vulnerability, but they can help refine future searches.

Workflow for Specific API Providers

Dorks can also be created to target specific API providers and their endpoints. This is especially useful for companies creating automated checks for their users' API keys. With knowledge of an API key's context and syntax, the search space can be significantly reduced.

With knowledge of the specific API provider, we can collect every key that matches the provider's regex and appears in an API-call context, and then check each one for validity using an internal database or an API endpoint.

A workflow for finding secrets for a single API provider

For example, suppose a company (HalCorp) provides an API for users to read and write to their account. By making our own HalCorp account, we discover that API keys are in the form [a-f]{4}-[a-f]{4}-[a-f]{4}.

# Python
import halapi
api = halapi.API()
api.authenticate_by_key('REDACTED')

# REST API with curl
curl -X POST -H "HALCorp-Key: REDACTED" https://api.halcorp.biz/userinfo

Armed with this information, we can compose our own GitHub dorks for HalCorp API responses:

# Python
"authenticate_by_key" "halapi" language:python

# REST API
"HALCorp-Key"

With a tool like GitHound, we can use regex matching to find strings that match the API key's regex and output them to a file:

echo "HALCorp-Key" | git-hound --dig-files --dig-commits --many-results --regex-file halcorp-api-keys.txt --results-only > api_tokens.txt

Now that we have a file containing potential API tokens, we can check these against a database for validity (do not do this unless you have written permission from the API provider).

In the case of HalCorp, we can write a bash script that reads from stdin, checks the api.halcorp.biz/userinfo endpoint, and outputs the response.
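A minimal sketch of such a script (HalCorp, its header, and its endpoint are this article's fictional example):

#!/bin/bash
# checktoken.bash: read candidate tokens from stdin, probe the userinfo
# endpoint with each one, and report the HTTP status code.
while read -r token; do
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "HALCorp-Key: $token" https://api.halcorp.biz/userinfo)
  echo "$token -> HTTP $status"   # 200 suggests a live key
done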

cat api_tokens.txt | bash checktoken.bash

Remediation

Although awareness of secret exposure on GitHub has increased, more and more sensitive data are published each day.

Amazon Web Services has begun notifying users if their API keys are posted online. GitHub has added security features that scan public repositories for common keys. These solutions are merely band-aids, however. To limit secret leaks from source code, we must update API frameworks and DevOps methodologies so that API keys are never stored in Git/SVN repositories in the first place. Software like Vault safely stores production keys, and some API providers, like Google Cloud Platform, have updated their libraries so that API keys are stored in a separate file by default.
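Even a low-tech pattern helps: keep secrets in a git-ignored file and load them into the environment at runtime (a sketch; the variable name is made up):

echo 'HALCORP_API_KEY=REDACTED' >> .env    # secrets live only on the local machine
echo '.env' >> .gitignore                  # ensure the file is never committed
export $(grep -v '^#' .env | xargs)        # load the keys into the environment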

Fully eradicating exposure of sensitive information is a more difficult problem: how can customer information be fully detected? What if it's in a Word, Excel, or compiled file? More research must be conducted in this field to study the extent of the problem and its solution.

Nothing Lasts, but Nothing is Lost: Taoist Ideology in Slaughterhouse Five’s Tralfamadorians

“The wise does not speak. He who speaks is not wise.” – Lao Tzu

The Tralfamadorian worldview of Slaughterhouse Five fame surrenders to time's unrelenting waters, its central axiom being that “the moment simply is.” [1] Tralfamadorian physics leads us to question our perceptions of time.  If past, present, and future all happen in one moment, can anything start or stop existing?  If we die, what happens to our consciousness?  Is essence conserved like energy?

Lao Tzu describes the Tao as “complete and perfect as a wholeness” existing “everywhere and anywhere” as the “eternal law.” [2]  The Tao exists as one indivisible entity and flows like water: it “benefits all things and contends not with them."  Water, vital for life, is uncaring of anything that stands in its way.  The water in Taoism symbolizes the eternal "oneness" that encapsulates space and time.  In any competition held with respect to time, water wins in the end.  Following the Tao symbolizes embrace and submission to the water; it represents the art of doing without doing.  It represents just living your life.  

Tralfamadorians accept the nature of the Tao, the notion that time is one entity, as opposed to the supposed human “illusion that one moment follows another one like beads on a string, and that once a moment is gone it is gone forever.” [1]  For Tralfamadorians, the Tao is embodied by the concept of a moment.  We see this further exemplified when the Tralfamadorians explain death to Billy.  “When a Tralfamadorian sees a corpse,” they explain, “all he thinks is that the dead person is in a bad condition in that particular moment, but that the same person is just fine in plenty of other moments.” [3]  Tralfamadorian thanatology, it appears, aligns with the Tao.  The contrast between the Tralfamadorian metaphor for time, “a bug trapped in amber” [1] and the metaphor of water present throughout the Tao Te Ching presents a powerful juxtaposition. Vonnegut’s Tralfamadorians possess a cynical but rational view of time, portraying the universe as "trapped" within its amber, while Taoists paradoxically experience it as an eternal flow with divine beauty.  Both agree, definitively, that time is inescapable and is unconcerned with the affairs of Creation.

Billy Pilgrim, the “unstuck in time” subject of Slaughterhouse Five, exists in his own purgatory, crossed between Newtonian and Tralfamadorian spacetime.  He has lived his life time and time again, living memories ad hoc.  He remains stoic throughout Five, knowing the outcomes of any situation up to and including his death, so it goes. But Billy understands the Moment, the Tao, better than anyone else.  Whether he is a WWII prisoner of war or an exhibit in a Tralfamadorian zoo, Billy does not try to change the Moment: he lives it.


[1] Chapter 4, Vonnegut's Slaughterhouse Five.

[2] Chapters 2, 4, and 16, respectively, of Lao Tzu's Tao Te Ching.

[3] Chapter 2, Vonnegut's Slaughterhouse Five.

Running an Introductory Level CTF: Insight into Porter-Gaud CTF 2017

On 28 January, we (Charles Truluck, Cameron Hay, and I) ran the inaugural Porter-Gaud CTF competition, an introductory cyber-security red team competition designed to give high schoolers from around South Carolina a hands-on experience with security concepts.

Inception

After competing in the NodeSC 2016 CTF and PCDC here in Charleston, we immediately knew that we wanted to run our own capture-the-flag competition. We went to Doug Bergman, the computer science chair at Porter-Gaud, and began to make the idea a reality.

Designing the Problems

My personal project for Computer Science III (a semester class at Porter-Gaud) was to design and create the competition, which proved to be time-consuming.


In this picture, you can see the Jeopardy board that lived in the Porter-Gaud computer lab from September to January, with all of the problems from the competition, along with a few that did not make it in for various reasons (difficulty, time constraints, too many problems).

After developing a problem, we would test it on each other for difficulty and usability.

Reaching Out

The hardest part of organizing PGCTF was reaching out and getting teams to sign up.
We officially announced the competition in early November, but we did not have enough teams to successfully run it until early January. Teams continued rolling in up until four days before the competition.
Next year, we're planning to improve our outreach.

Tech Week

(warning: technical section)

The final step in putting together the CTF was setting up the infrastructure.
While we initially planned to have multiple servers (scoreboard, web problems, file server), we abandoned this idea due to a lack of resources and used one server running Ubuntu.

Quality Assurance

During the pre-competition week, we made it a goal to work through all of the problems to ensure that they were solvable, appropriately difficult, and fun to solve.
This led us to change the point values of a few problems (ADFGX, for example).

Problem Servers

We used Docker to handle having multiple interfaces, and therefore multiple IP addresses, leading to different websites. Docker also allowed us to isolate problems from each other, which was important given the exploitative nature of the competition (for example, Imgur 2.0 was solvable by obtaining a shell on the host server).
For the web problems and Algorithm 1, we had to host webservers. We didn't want to bloat the ports on one host, so we used a trick with Docker and network interfaces to create the illusion of multiple servers.

In order to create these interfaces, we used this guide, which led to:

$ ip link add icantc type bridge
$ ip addr add 10.0.0.21 dev icantc
$ ip link set icantc up
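With an address on its own bridge, each web problem's container could then be published on that IP alone, which is what made one host look like several servers (a hedged sketch; the image name is hypothetical):

$ docker run -d -p 10.0.0.21:80:80 pgctf/imgur2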

Because of an issue with the /etc/network/interfaces file (most likely caused by a syntax error) discovered the night before the competition, we wrote a script that we would run whenever the master server came up.

Dockerfiles

In order to set up the problem servers, we had to write Dockerfiles for each of the web-based problems. While the Apache/PHP Dockerfiles were pretty easy, we were unable to fully set up the NodeJS ones. We wound up having to manually start the NodeJS instances within the Docker containers on competition day.

Wifi or Ethernet?

During the pre-competition week, we had to build the full network. The biggest disagreement within the PGCTF team was whether to use WiFi or to stick to Ethernet. While WiFi would be more convenient, it was more prone to breaking during the course of the competition. Ethernet would be more stable, but users with newer computers might need an adapter. We eventually settled on using WiFi, with a switch in the room if a team opted to use Ethernet.

The Nightmare of DHCP

One problem that came up after creating these interfaces was that computers connecting to the access point would be assigned the same IP as one of our defined interfaces on the server, and would not be able to connect to anything (and would break the problem). We solved this by modifying the DHCP scope to exclude the 10.1.10.20-30 IP range, which was dedicated to problems.


Here's an MS-Paint rendition of the network diagram.

Documents

On Friday night, we finalized the competition scope and printed the documents distributed to each team with the rules, schedule, and scope of the competition. We also finalized the intro PowerPoint and the itinerary for the day.

Competition Day

Arriving at Porter-Gaud around 8:45 for the 9:00 start, I was impressed to see that teams had already begun to show up and were setting up their team spaces.
As I walked in and noticed the list of IP addresses both printed on each team's table and on the whiteboard, I realized that the web servers were all up and could be exploited before the intended start time. Luckily, I realized this before the teams did and shut them down.
At 9:15, about fifteen minutes before the originally planned start time, all teams were ready to go and we began the competition.

Though there were a few mishaps throughout the day (a file missing from a problem, for example), everything ran extremely smoothly.
Teams solved the Tillson Galloway recon problem much faster than I expected them to, and it was very rewarding watching teams finally solve the last step of the problem.
The D in Detroit was also a favorite of both organizers and participants, and all teams eventually solved it (though the room was filled with static noises for the first two hours).

While I expected the competition to die down around 1:30, as it had in past CTFs I've competed in, teams stayed interested and kept the fight for first and second place alive until the last minute.

After the Competition

Survey

A few days after the competition, I sent out an email with resources for future CTFs and an anonymous survey about PGCTF. The survey will truly help us shape next year's competition, as it was the most direct way for teams to give feedback.

A couple of stats from the survey:
Favorite problems were:
  • ADFGX (Crypto)
  • Tillson Galloway (Recon/Misc)
  • The D in Detroit (Forensics)

Afterthoughts

Lessons Learned

Proofreading and testing the problems before the competition is essential. There were two instances throughout the day where a web problem link was either wrong or the problem was missing entirely.

We also found that one area where we can improve is evaluating problem difficulty. There were instances where problems meant to be easy ended up being much harder than anticipated, which steered teams away from the category at large (E-Corp Internal and ArchLinux, for example). We'll try to provide better resources for training next year, and we may release challenge problems throughout the year.

Next Year

Next year, we also want to have a more interactive experience on the day of the game, expanding upon the MS08-067 Windows XP image released mid-game in order to keep things lively.

What's Next?

The Dangling Pointers will be competing in EasyCTF next week as well as The Palmetto Cyber Defense Competition on 8 April.

Thank You

I'd like to close this writeup with a thank you to the following people for their help with the event:

  • Charles Truluck, for helping build and set up the server, as well as for building and testing problems.
  • Cameron Hay, for work on the cryptography category and for supporting teams on the day of the event.
  • Bryan Luce and Doug Bergman, for all-around support and motivation in running the event, for advising us during its production, and for handling communication with the school administration to legitimize the event.
  • Phil Zaubi, for setting up the final networking component of the event.

Originally published 2 March 2017
