In our modern digital age, organizations routinely provide public-facing resources on the internet. In doing so, they also create opportunities for unintentional exposure that can manifest as cybersecurity vulnerabilities, exploits, and threats. As part of the research for our recently released Industry Cyber-Exposure Report: Fortune 500, and to better inform cyber-risk reduction efforts, we set out to quantify the public exposure of various organizations, aggregate our findings, and segment the results by industry.

We went about assessing exposure by utilizing a combination of Project Sonar and Project Heisenberg. Project Sonar performs active scans across the addressable internet space to identify vulnerabilities, while Project Heisenberg passively monitors for opportunistic or unintentional inbound connections. By design, there is no legitimate reason for organizations to try to connect to the Heisenberg sensor network—attempts to do so can generally be regarded as an indication of some type of malicious exploitation or service misconfiguration. By combining the results of the two systems, we arrive at a pretty rich dataset of internet addresses and vulnerabilities.

Once we have the data, the next task is to relate internet addresses to organizations so we can make assertions about their security postures. For this, we reference the WHOIS databases maintained by internet registries, which contain details on address allocations and the names of their owners.

The data is not static

Attributing addresses to organizations by name is unfortunately confounded by the organic movements and transformations of organizations over time. Companies routinely merge, branch, split, rebrand, and localize, resulting in permutations and distortions of names within the WHOIS database. In many cases, there are dozens, hundreds, or even more variations of names that all relate to a single organization.

For instance, Apple might variably appear within the WHOIS database as Apple, Apple Computer, Apple Inc., Apple Incorporated, apple -081024213650, and so on. While a human eye can easily decipher the names and make a reasonable interpretation that all the variations refer to the same entity, a computer is simply not as adaptive or flexible.
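To see the gap concretely, here is a quick, illustrative sketch using Python's standard difflib module (the names are made up for illustration): an exact comparison treats every variant as a distinct owner, while a fuzzy similarity score surfaces the overlap but still cannot make the final call on its own.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalized character-level similarity between two names (0 to 1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

variants = ["Apple", "Apple Computer", "Apple Inc.", "apple -081024213650"]

# An exact comparison sees four distinct owners...
assert len({v.lower() for v in variants}) == len(variants)

# ...while fuzzy similarity reveals the overlap, though no fixed cutoff
# resolves every borderline case the way a human reader can.
for v in variants[1:]:
    print(f"Apple vs. {v!r}: {similarity('Apple', v):.2f}")
```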

The obvious solution is to apply a manual remedy: assign people to scan the set of names, and group together records from the Sonar and Heisenberg data collection efforts that appear to belong to the same organization. But with millions of records in the WHOIS database and the Sonar and Heisenberg data, this becomes a highly monotonous, repetitive, and, quite frankly, unpleasant task.

Entity resolution

Data science methods do exist that can dynamically incorporate human feedback to quickly and efficiently aid us in the task of grouping organization names. In particular, entity resolution—a method that connects records despite the lack of keys or exact matches—can be used to link names despite significant variations.

For instance, with some human input, an entity resolution model can be trained to understand that a record for Bob J. Smith at 1 Main St. refers to the same real-world entity as a record for Robert Jones Smith at One Main Street. If we apply the same concept to WHOIS names, we can arrive at a system that can equate Apple Computer, Inc. with Apple Inc dynamically, while also differentiating Apple Incorporated from Apple Car Service.

There are many ways to handle entity resolution. One approach is to use a combination of unsupervised methods—such as clustering techniques—to group similar records, and supervised classification methods—such as logistic regressions or random forests—to determine whether two or more records actually refer to the same thing.
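As a rough sketch of that two-step idea, the snippet below blocks toy names on a shared leading token (the unsupervised grouping) and then learns a match/no-match cutoff from a handful of labeled pairs, standing in for the supervised classifier. A production system would fit a logistic regression or random forest over many pair features; all names and labels here are invented for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

def name_similarity(a, b):
    """Character-level similarity between two names, from 0 to 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Unsupervised step: block names on a shared leading token so that only
# plausibly related pairs are ever compared (toy data throughout).
names = ["apple inc", "apple computer inc", "apple foods",
         "acme corp", "acme corporation"]
blocks = {}
for n in names:
    blocks.setdefault(n.split()[0], []).append(n)
candidate_pairs = [p for group in blocks.values()
                   for p in combinations(group, 2)]

# Supervised step: learn a decision cutoff from human-labeled pairs.
# (A single learned threshold keeps this sketch short; a real system
# would train a classifier over many features.)
labeled = [("apple inc", "apple inc.", 1),
           ("acme corporation", "acme corporation ltd", 1),
           ("apple inc", "apple foods", 0),
           ("apple inc", "apple physical therapy", 0)]
threshold = max(name_similarity(a, b) for a, b, y in labeled if y == 0)

matches = [(a, b) for a, b in candidate_pairs
           if name_similarity(a, b) > threshold]
```

With these toy inputs, the learned cutoff keeps "apple inc" with "apple computer inc" and "acme corp" with "acme corporation" while rejecting "apple foods".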

The secret sauce for effective entity resolution is an active-learning component: a human answers a series of questions posed by the model-construction mechanism, and those responses become a training dataset. The model that is ultimately generated extracts the underlying intuition contained in that training data. Once trained, the model can perform the matching task, enriched with human feedback, with great efficiency.

For instance, a clustering method might scan the full range of names and initially group Apple Inc., Apple Computer, Inc., and Apple Foods together, the common thread being the word “Apple”. After incorporating some human training, the classification method would learn that records containing the term “Foods” differ from records containing “Computer” or “Incorporated”, and determine that “Apple Foods” refers to a different organization than the other two records.

The active learning process might present a few dozen questions such as the following, to which a response would be incorporated into a dataset to train the matching model:

name : apple store-061109083922
name : apple computer-061109022525

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious

name : apple inc
name : apple physical therapy

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
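Under the hood, each answer to a prompt like the ones above becomes a labeled example. A minimal, non-interactive stand-in for that labeling loop might look like the following, where the "human" is replaced by a callable so the sketch runs end to end (the pairs and answers are invented):

```python
def label_pairs(candidate_pairs, ask):
    """Collect human judgments on candidate pairs into a training set.

    `ask` stands in for the human: given a pair of names, it returns
    'y' (match), 'n' (distinct), 'u' (unsure), or 'f' (finished).
    """
    training = {"match": [], "distinct": []}
    for pair in candidate_pairs:
        answer = ask(pair)
        if answer == "f":
            break
        elif answer == "y":
            training["match"].append(pair)
        elif answer == "n":
            training["distinct"].append(pair)
        # 'u' (unsure) pairs are simply skipped
    return training

# A scripted "human" for demonstration purposes.
answers = iter(["y", "n", "u", "f"])
pairs = [("apple store-061109083922", "apple computer-061109022525"),
         ("apple inc", "apple physical therapy"),
         ("apple ltd", "apple co"),
         ("acme corp", "acme inc")]
training = label_pairs(pairs, lambda pair: next(answers))
```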

In essence, a mixed method of combining human input and machine learning achieves the best of both worlds: the precision and adaptiveness of the human mind, and the speed and efficiency of machines. Such an approach is highly dependent on the training data, so there might be some misclassifications along the way, but the model’s accuracy improves with more human input. If you’re interested in giving this a shot, check out the excellent Python dedupe package.

Here is a bit of code that demonstrates what a bare-bones implementation of dedupe might look like:

import dedupe
import csv

# Read data. Source contains an 'Id' column and a single 'name' field
whois_data = {}
with open("input_data.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        clean_row = [(k, v) for (k, v) in row.items()]
        row_id = int(row['Id'])
        whois_data[row_id] = dict(clean_row)

# Define the fields dedupe will pay attention to and how
fields = [
    {'field' : 'name', 'type': 'String'},
    {'field' : 'name', 'type': 'Exact'},
    ]

# Create deduper object
deduper = dedupe.Dedupe(fields)

# Sample data for training
deduper.sample(whois_data, 10000)

# Active learning. This brings up an interactive console instance
dedupe.consoleLabel(deduper)

# Using the examples we just labeled, train the deduper and learn
deduper.train()

# When finished, save our training to disk
with open("training_file.json", 'w') as tf:
    deduper.writeTraining(tf)

# Calculate threshold to balance precision and recall
threshold = deduper.threshold(whois_data, recall_weight=1)

# Create matched records object
matched = deduper.match(whois_data, threshold)

# Generate dictionary assigning records to clusters
cluster_membership = {}
for (cluster_id, cluster) in enumerate(matched):
    id_set, scores = cluster
    cluster_d = [whois_data[c] for c in id_set]
    canonical_rep = dedupe.canonicalize(cluster_d)
    for record_id, score in zip(id_set, scores):
        cluster_membership[record_id] = {
            "cluster_id" : cluster_id,
            "canonical_representation" : canonical_rep,
            "confidence": score
        }

The generated dictionary contains the following:

  • A cluster_id that can appear with multiple records in the source data. Records that share common cluster_id values are believed to refer to the same entity.
  • A canonical_representation, which is a standardized name for a set of records. For instance, records for Apple Computer, Inc. and Apple Incorporated might be standardized as Apple.
  • A confidence value that reflects how certain the model is that a particular record belongs with a particular group. The value ranges from 0 to 1, where values close to 1 indicate a high degree of certainty for a match.
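As a hypothetical follow-on, the cluster_membership dictionary can be collapsed into per-organization groups, dropping low-confidence assignments along the way. The records, names, and scores below are made up for illustration:

```python
from collections import defaultdict

# Toy stand-in for the cluster_membership dictionary built above.
cluster_membership = {
    1: {"cluster_id": 0, "canonical_representation": {"name": "apple"}, "confidence": 0.97},
    2: {"cluster_id": 0, "canonical_representation": {"name": "apple"}, "confidence": 0.94},
    3: {"cluster_id": 0, "canonical_representation": {"name": "apple"}, "confidence": 0.41},
    4: {"cluster_id": 1, "canonical_representation": {"name": "acme"}, "confidence": 0.92},
}

# Group record IDs under their canonical name, keeping only
# assignments the model is reasonably sure about.
MIN_CONFIDENCE = 0.8
organizations = defaultdict(list)
for record_id, info in cluster_membership.items():
    if info["confidence"] >= MIN_CONFIDENCE:
        organizations[info["canonical_representation"]["name"]].append(record_id)
```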

Check out a dedupe example for some more detailed sample code.

In our Industry Cyber-Exposure Report: Fortune 500, entity resolution enabled us to repeatedly find, in a trivial amount of time, the dozens or hundreds of records within the original mound of millions that refer to the same organization.

Security has traditionally operated in a fairly manual, task-focused, deterministic manner. While that methodology is a staple that is unlikely to go away anytime soon, security efforts can be significantly bolstered with the judicious incorporation of data science methods. Data science can take a task like WHOIS record-matching that once might have caused great consternation among security practitioners, and transform it into a reasonable, pro forma task.

What does exposure look like for corporate America? Find out in our latest research report.
