TL;DR
- Hyperscan and re2 are 24x to 40x faster than the default Go regexp library and use up to 7x less memory
- Hyperscan’s and re2’s performance with this specific set of patterns and test data was fairly comparable, with re2 up to 1.4x faster than Hyperscan.
Skip below to the results, or read on to learn more about our specific use case.
Context
At Nightfall, our mission is to detect sensitive data. Today, we detect 50+ types of sensitive data, including social security numbers, API keys, street addresses, and more. Practically, this means building a content inspection pipeline that can:
- Ingest arbitrary file types, whether they are plain text, zip files, or jpegs, through our public API and various SaaS integrations like Slack or GDrive
- Scan terabytes of data to identify sensitive information
- Deliver actionable results to our customers without giving them alert fatigue
But how do we identify sensitive information in arbitrary text data?
The answer is: it’s complicated. At a high level, we use (a) purely ML-based NLP models and (b) a combination of machine learning techniques, validation heuristics, and regular expressions. This post will focus on how we optimized the regular expression use case of approach (b) to detecting sensitive information.
We start off by filtering the input data through a list of hard-coded regexes to pull out the data most likely to be sensitive. This step is critical for scanning data at scale because it limits the data we run through computationally expensive ML models. For instance, if we are scanning a long history of chat messages between a customer support agent and a customer for leaked sensitive information like credit card numbers, we can use regexes to pull out only the chat messages containing 16-digit numbers, limiting the data we feed to our ML models, which make the final decision.
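A minimal sketch of this pre-filtering step, assuming a single illustrative 16-digit pattern (not our production rules), might look like:

```go
package main

import (
	"fmt"
	"regexp"
)

// sixteenDigits is an illustrative pre-filter pattern: any run of
// exactly 16 consecutive digits, a rough proxy for card numbers.
var sixteenDigits = regexp.MustCompile(`\b\d{16}\b`)

// preFilter keeps only the messages that contain a candidate match,
// so downstream ML models see a fraction of the original data.
func preFilter(messages []string) []string {
	var candidates []string
	for _, m := range messages {
		if sixteenDigits.MatchString(m) {
			candidates = append(candidates, m)
		}
	}
	return candidates
}

func main() {
	msgs := []string{
		"hi, my card is 4242424242424242",
		"thanks, have a nice day",
	}
	fmt.Println(len(preFilter(msgs))) // only the first message survives: 1
}
```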
We initially used the Go standard library regexp package for our regular expression needs, but as we scaled we quickly ran into bottlenecks in speed and resource utilization. To better handle this increasing scale, we began investigating alternative regex libraries. We started by benchmarking other regular expression libraries, notably Google’s re2 and Intel’s hyperscan.
Benchmark Setup
Our benchmark used three file sizes that represent typical input: 10 KB, 100 KB, and 500 KB. The files were made up of generated random English dictionary words with numbers up to 16 digits sprinkled in 10% of the time to simulate potential identifiers like credit card numbers. Here’s an example snippet from one of the files:
...surmise internalness svelte multiloquent bel predevelop uncheck transparence intellectual oakenshaw 2259010711233889 provostship loosestrife...
We benchmarked three regex libraries: regexp, intel/hyperscan, and google/re2. The regexp library is native to Go, hyperscan is written in C, and re2 is written in C++. We used an open-source Go binding package for hyperscan, and we wrote our own bindings for re2.
Each benchmark loaded the test input file into memory, ran 47 regex patterns against the input, and stored the resulting matches. Both hyperscan and re2 support pre-compiling all 47 patterns together, so their benchmark functions could scan for every pattern in one pass with a single function call. Because the regexp library does not support matching multiple patterns at once, we ran each pattern individually, with one goroutine per pattern.
The benchmarks were run on a 2019 16-inch MacBook Pro with a 2.6 GHz 6-core Intel Core i7 processor, 16 GB of 2667 MHz DDR4 memory, and an AMD Radeon Pro 5300M 4 GB graphics card, using go test -bench . -benchmem (GOOS darwin, GOARCH amd64). This command reruns each benchmark automatically until its timing converges, so the results are resilient to variations caused by other programs running in the background.
The speed comparison should be an apples-to-apples comparison for the most part. The memory usage comparison, however, isn’t quite the same. For instance, re2 returns all the matches back at once while hyperscan returns them one at a time with a callback function.
Re2 returns a []Matches:

```go
type Range struct {
	Start int
	End   int
}

type Matches struct {
	Index  int
	Ranges []Range
}
```
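Given those types, pulling matched substrings out of the input might look like this sketch; the substrings helper is ours for illustration, the bindings themselves only return the slice:

```go
package main

import "fmt"

// Range and Matches mirror the types returned by our re2 bindings.
type Range struct {
	Start int
	End   int
}

type Matches struct {
	Index  int // which of the compiled patterns matched
	Ranges []Range
}

// substrings pulls the matched text out of the scanned input.
func substrings(input string, ms []Matches) []string {
	var out []string
	for _, m := range ms {
		for _, r := range m.Ranges {
			out = append(out, input[r.Start:r.End])
		}
	}
	return out
}

func main() {
	input := "card 2259010711233889 end"
	ms := []Matches{{Index: 0, Ranges: []Range{{Start: 5, End: 21}}}}
	fmt.Println(substrings(input, ms)) // [2259010711233889]
}
```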
Hyperscan instead reports each match through a callback; in this benchmark, I manually constructed a map[uint][]match to store matches as they were returned. Hyperscan delivers match ranges via this callback function:

```go
func(id uint, from, to uint64, flags uint, context interface{}) error
```
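The collection step can be sketched without the hyperscan binding itself: define a handler with that signature and let it close over a map. The match type and the manual calls in main below are stand-ins for what the real scan call drives:

```go
package main

import "fmt"

// match records one (from, to) byte range reported by the callback.
type match struct {
	from, to uint64
}

// handler matches the callback signature hyperscan invokes per match.
type handler func(id uint, from, to uint64, flags uint, context interface{}) error

// newCollector returns a map of matches keyed by pattern id and a
// handler that appends into it as matches are reported.
func newCollector() (map[uint][]match, handler) {
	results := make(map[uint][]match)
	h := func(id uint, from, to uint64, flags uint, context interface{}) error {
		results[id] = append(results[id], match{from: from, to: to})
		return nil
	}
	return results, h
}

func main() {
	results, onMatch := newCollector()
	// Stand-in for the scan: in the benchmark, hyperscan drives these calls.
	onMatch(3, 5, 21, 0, nil)
	onMatch(3, 40, 56, 0, nil)
	fmt.Println(len(results[3])) // 2
}
```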
Benchmark Results
Conclusions
The default Go regexp library was by far the slowest and most memory-intensive. Hyperscan and re2 were about 30x to 40x faster and used up to 7x less memory.
Hyperscan’s and re2’s speeds were quite similar, with re2 1.12x to 1.4x faster depending on the file size. As file size increased, the speed gap between them narrowed.
Hyperscan typically used around 2.7x to 3.9x the memory of re2. Also note that re2 returns all matches at once while Hyperscan returns each match in a callback function, which may or may not better suit your use case. Hyperscan is also optimized for Intel processors, so running these benchmarks on a different CPU would also affect results.