
Merry HaXmas to you! Each year we mark the 12 Days of HaXmas with 12 blog posts on hacking-related topics and roundups from the year. This year, we're highlighting some of the “gifts” we want to give back to the community. And while these gifts may not come wrapped with a bow, we hope you enjoy them.

“May you have all the data you need to answer your questions – and may half of the values be corrupted!”

- Ancient Yiddish curse

This year, Christmas (and therefore HaXmas) overlaps with the Jewish festival of Chanukah. The festival commemorates the recapture and rededication of the Second Temple. As part of the resulting cleaning and sanctification process, the Temple's lamps had to burn continuously – but there was only enough oil for one day. Thanks to the intervention of God, that one day's worth of oil burned for eight days, and the resulting eight-day festival was proclaimed.

Unfortunately, despite God getting involved in everything from the edibility of owls (it's in Deuteronomy, look it up) to insufficient stocks of oil, there's no record of divine intervention solving for insufficient usable data, and that's what we're here to talk about. So pull up a chair, grab yourself a plate of latkes, and let's talk about how you can make data-driven security solutions easier for everyone.

Data-Driven Security

As security problems have grown more complex and widespread, people have attempted to use the growing field of data science to diagnose issues, both on the macro level (industry-wide trends and patterns found in the DBIR) and the micro (responding to individual breaches). The result is Data-Driven Security, covered expertly by our Chief Data Scientist, Bob Rudis, in the book of the same name.

Here at Rapid7, our Data Science team has been working on everything from systems to detect web shells before they run (see our blog post about webshells) to internal projects that improve our customers' experience. As a result, we've seen a lot of data sources from a lot of places, and have some thoughts on what you can do to make data scientists' problem-solving easier before you even know you have an issue.

Make sure data is available

Chanukah has just kicked off and you want two things: to eat fritters and apply data science to a bug, breach or feature request. Luckily you've got a lot of fritters and a lot of data – but how much data do you actually have?

People tend to assume that data science is all about the number of observations. If you've got a lot of them, you can do a lot; if you've only got a few, you can only do a little. Broadly speaking, that's true, but the span of time data covers and the format it takes are also vitally important. Seasonal effects are a well-studied phenomenon in human behavior and, by extension, in data (which, one way or another, tends to relate to how humans behave). What people do and how much they do it shifts between seasons, between months, even between days of the week. This means that the length of time your data covers can make the difference between a robust answer and an uncertain one – if you've only got a Chanukah's worth, we can draw patterns in broad strokes, but we can't eliminate the impact seasonal changes might have.

The problem with this is that storing a lot of data over a long period of time is hard, potentially expensive and a risk in and of itself – it's not great for user or customer privacy, and in the event of a breach it's another juicy thing the attacker can carry off. As a result, people tend to aggregate their raw data, which is fine if you know the questions you're going to want answering.

If you don't, though, the same thing that protects aggregated data from malicious attackers will stymie data scientists: it's very hard, if you're doing it right, to reverse-engineer aggregation, and so researchers are limited to whatever fields or formats you thought were useful at the time, which may not be the ones they actually need.

One solution to both problems is to keep data over a long span of time, in its raw format, but sample: keep 1 random row out of 1,000, or 1 out of 10,000, or an even higher ratio. That way data scientists can still work with it and avoid seasonality problems, but it becomes a lot harder for attackers to reconstruct the behavior of individual users. It's still not perfect, but it's a nice middle ground.
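
To make that concrete, here's a minimal sketch of what sampling at ingestion time might look like, in Python. Everything here is illustrative – the file names, the 1-in-10,000 rate – but the core idea is just a uniform coin flip per raw row, made before anything is aggregated away:

```python
import random

SAMPLE_RATE = 10_000  # illustrative: keep roughly 1 raw row in every 10,000

def sample_rows(raw_rows, rate=SAMPLE_RATE):
    """Yield a uniform random sample of raw rows at ingestion time.

    Each row is kept with probability 1/rate, independently of its contents,
    so long-term and seasonal trends survive while reconstructing any one
    user's behaviour becomes much harder.
    """
    for row in raw_rows:
        if random.random() < 1.0 / rate:
            yield row

# Example: archive a sampled slice of today's raw log instead of the whole thing.
if __name__ == "__main__":
    with open("requests.log") as src, open("requests.sampled.log", "w") as dst:
        for kept in sample_rows(src):
            dst.write(kept)
```

The important design choice is that the sampling decision is random per row rather than keyed to a user or IP: that's what keeps the archive useful for trend analysis without it doubling as a per-person history.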

Make sure data is clean

It's the fourth day of Chanukah, you've implemented that nice sampled data store, and you even managed to clean up the sufganiyot the dog pulled off the table and joyously trod into the carpet in its excitement to get human food. You're ready to get to work, you call the data scientists in, and they watch as this elegant sampled data store collapses into a pile of mud because three months ago someone put a tab in a field that shouldn't have a tab, and now everything has to be manually reconstructed.

If you want data to be reliable, it has to be complete and it has to be clean. By complete, we mean that if a particular field only has a meaningful value a third of the time, for whatever reason, it's going to be difficult to rely on (particularly in a machine learning context, say). By clean, we mean that there shouldn't be unexpected values, particularly the sort of unexpected value that breaks the formatting or the reading of the data.

In both cases the answer is data validity checks. Just as engineers have tests – tasks that run regularly to ensure changes haven't unexpectedly broken code – data storage systems and their associated code need validity checks, which run against newly written rows every so often and make sure each row has all its values, that those values are properly formatted, and that they're roughly what they should be.
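
What does such a check look like in practice? Here's a hedged sketch in Python – the field names, formats and tab-separated layout are invented for the example – of the kind of per-row validation that would have caught that stray tab months before anyone tried to analyse the data:

```python
from datetime import datetime

def _is_datetime(value):
    """True if the value parses as an (assumed) 'YYYY-MM-DD HH:MM:SS' timestamp."""
    try:
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return True
    except ValueError:
        return False

# Hypothetical schema: field name -> validator that returns True when the value looks sane.
VALIDATORS = {
    "timestamp":  _is_datetime,
    "src_ip":     lambda v: v.count(".") == 3 and all(p.isdigit() and int(p) < 256 for p in v.split(".")),
    "user_agent": lambda v: bool(v.strip()),
}

def validate_row(row):
    """Return a list of problems with one tab-separated row; an empty list means it passes."""
    problems = []
    fields = row.rstrip("\n").split("\t")
    if len(fields) != len(VALIDATORS):
        # An extra tab inside a field shows up here as a surplus column.
        return [f"expected {len(VALIDATORS)} fields, got {len(fields)}"]
    for (name, check), value in zip(VALIDATORS.items(), fields):
        if not check(value):
            problems.append(f"field '{name}' has unexpected value: {value!r}")
    return problems
```

Run something like this against a sample of newly written rows on a schedule, alert on anything that fails, and the tab-in-the-wrong-field problem gets caught the day it happens rather than three months later.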

Make sure data is documented

It's the last day of Chanukah, you've got a sampled data store with decent data, the dreidel has rolled under the couch and you can't get it out, and you just really, really want to take your problem and your data and smush them together into a solution. The data scientists read in the data, nothing breaks this time… and they are promptly stumped by columns with names like “Date Of Mandatory Optional Delivery Return (DO NOT DELETE, IMPORTANT)” or simply “f”. You can't expect anyone's constitution to harbour that kind of thing.

Every time you build a new data store and get those validity checks set up, you should also be setting up documentation. Where it lives will vary from company to company, but it should exist somewhere and set out what each field means (“the date/time the request was sent”), an example of what sort of value it contains (“2016-12-10 13:42:45”) and any caveats (“The collectors were misconfigured from 2016-12-03 to 2016-12-04, so any timestamps then are one hour off”). That way, data scientists can hit the ground running, rather than spending half their time working out what the data even means.
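
Even a small machine-readable data dictionary, kept under version control next to the code that writes the data, is enough. Here's a hypothetical sketch (the table and its caveats are invented, reusing the examples above):

```python
# Hypothetical data dictionary for a request-log table, versioned alongside the collector code.
REQUEST_LOG_FIELDS = {
    "timestamp": {
        "meaning": "the date/time the request was sent, in UTC",
        "example": "2016-12-10 13:42:45",
        "caveats": "collectors were misconfigured from 2016-12-03 to 2016-12-04, "
                   "so timestamps in that window are one hour off",
    },
    "src_ip": {
        "meaning": "IPv4 address the request originated from",
        "example": "203.0.113.42",
        "caveats": "none known",
    },
    "user_agent": {
        "meaning": "raw User-Agent header as sent by the client",
        "example": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "caveats": "may be empty for some automated clients",
    },
}
```

A structure like this can even feed the validity checks above, so the documentation and the checks can't quietly drift apart.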

So, as you prepare for Chanukah and 2017, you should be preparing for data science, too. Make sure your data is (respectfully) collected and available, make sure it's clean, and make sure it's documented. Then you can sit back and eat latkes in peace.