Introduction

Over the past year, the Rapid7 Labs team has conducted large scale analysis on the data coming out of the Critical.IO and Internet Census 2012 scanning projects. This revealed a number of widespread security issues and painted a gloomy picture of an internet rife with insecurity. The problem is, this isn't news, and the situation continues to get worse. Rapid7 Labs believes the only way to make meaningful progress is through data sharing and collaboration across the security community as a whole.  As a result, we launched Project Sonar at DerbyCon 3.0 and urged the community to get involved with the research and analysis effort. To make this easier, we highlighted various free tools and shared a huge amount of our own research data for analysis.

Below is a quick introduction to why internet-wide scanning is important, its history and goals, and a discussion of the feasibility and present best practices for those that want to join the party.

Gain visibility and insight

A few years ago internet-wide surveys were still deemed unfeasible, or at least too expensive to be worth the effort. There have been only a few projects that mapped out aspects of the internet - for example the IPv4 Census published by the University of Southern California in 2006. The project sent ICMP echo requests to all IPv4 addresses between 2003 and 2006 to collect statistics and trends about IP allocation. A more recent example of such research is the Internet Census 2012, which was accomplished through illegal means by the "Carna Botnet," which consisted of over 420,000 infected systems.

The EFF SSL Observatory investigated "publicly-visible SSL certificates on the Internet in order to search for vulnerabilities, document the practices of Certificate Authorities, and aid researchers interested the web's encryption infrastructure". Another case of widespread vulnerabilities in Serial Port Servers was published by HD Moore based on data from the Critical.IO internet scanning project.

The EFF Observatory, and even the botnet-powered Internet Census data, helps people understand trends and allows researchers to prioritize research based on the actual usage of devices and software. Raising awareness about widespread vulnerabilities through large-scale scanning efforts yields better insight into the service landscape on the Internet, and hopefully allows both the community and companies to mitigate risks more efficiently.

We believe that scanning efforts will be done by more people in the future and we consider it to be valuable to both researchers and companies. Research about problems / bugs raises awareness and companies gain visibility about their assets. Even though probing/scanning/data collection can be beneficial, there are dangers to it and it should always be conducted with care and using best practices. We provide more information on that below - and we will share all our data with the community to reduce data duplication and bandwidth usage among similar efforts.

Feasibility and costs

As mentioned, this kind of research was once considered to be very costly or even unfeasible. The Census projects either ran over a long time (2 months) or used thousands of devices. With the availability of better hardware and clever software, internet-wide scanning has become much easier and cheaper in recent years. The ZMap open source network scanner was built for this purpose and allows a GbE connected server to reach the entire Internet - all IPv4 - addresses - within 45 minutes. It achieves this by generating over a million packets per second if configured to use the full bandwidth. Of course this requires the hardware to be able to reach that throughput in packets and response processing. Other projects go even further - Masscan generates 25 million packets per second using dual 10 GE links and thus could reach the entire address space in 3 minutes.

So this means that technically one can do Internet-wide scans with a single machine - if the hardware is good enough. Of course this is not the only requirement - a couple others are needed as well.

  • The network equipment at the data center needs to be able to cope with the packet rates
  • The hosting company or ISP needs to allow this kind of action on their network (terms of service)
  • As "port scanning" is considered to be an "attack" by a large amount of IDSs and operators, this activity will generate a lot of abuse complaints to the hoster/ISP and you. Thus one needs to notify the hoster beforehand and agree on boundaries, packet rates, bandwidth limits and their view on abuse complaints.

Especially the network equipment and abuse handling process are things that are difficult to determine beforehand. We went through a couple of options before we were able to establish a good communication channel with our hosters and thus allowed to conduct this kind of project using their resources. We saw network equipment failing at packet rates over 100k/sec, high packet loss in others at rates over 500k/sec and had hosters locking our accounts even after we notified them about the process beforehand.

Settings and resource recommendations

We decided that for most scans it is actually not that important to bring the duration to below a few hours. Of course one might not get a real "snapshot" of the Internet state if the duration is longer - but the trade-off regarding error-rates, packet loss and impact on remote networks outweighs the higher speed by far in our opinion.

  • Always coordinate with the hosting company / ISP - make sure they monitor their equipment health and throughput and define limits for you
  • Don't scan unrouted (special) ranges within the IPv4 address space - the Zmap project compiled a list of these (also to be found on Wikipedia)
  • Benchmark your equipment before doing full scans - we list some recommendations below, but testing is key
  • Don't exceed 500k packets per second on one machine - this rate worked on most of the dedicated servers we tested and keeps scan time around 2 hours (still needs coordination with the hoster)
  • Distribute work across multiple source IPs and multiple source hosts - this reduces abuse complaints and allows you to use lower packet rates to achieve the same scan duration (lower error-rate)
  • When virtual machines are used keep in mind I/O delays due to the virtualization layer - use lower packet rates < 100k/sec (coordinate with hoster beforehand)
  • If possible, randomize target order to reduce loads on individual networks (Zmap provides this feature in a clever way)

Best practices

If one plans to do Internet wide scanning, maybe the most important aspect is to employ best practices and not interfere with availability of resources of others. The ZMap project put together a good list of these - and we summarize it here for the sake of completeness:

  • Coordinate not only with the hosting company but also with abuse reporters and other network maintainers
  • Review any applicable laws in your country/state regarding scanning activity - possibly coordinate with law enforcement
  • Provide possible opt-out for companies / network maintainers and exclude them from further scanning after a request
  • Explain scanning purpose and project goals clearly on a website (or similar) and refer involved people to it
  • Reduce packet rates and scan frequencies as much as possible for your research goal to reduce load and impact on networks and people

Implementation details

After covering the theoretical background and discussing goals and best practices, we want to mention a few of our implementation choices.

For port scanning we make use of the before-mentioned excellent Zmap software. The authors did a great job on the project and the clever IP randomization based on iterating over a cyclic group reduces load and impact on networks while still keeping very little state. Despite Zmap providing almost everything needed, we just use it as a Syn-scanner and do not actually implement probing modules within Zmap. The reachable hosts / ports are collected from Zmap and then processed on other systems using custom clients or possibly even Nmap NSE scripts - depending on the scan goal.

So as an example, for downloading SSL/TLS certificates from https webservers, we do a 443/TCP Zmap scan and feed the output to a couple of other systems that immediately connect to those ports and download the certificates.  This choice allowed us to implement simple custom code that is able to handle SSLv2 and the latest TLS at the same time. As we see slightly below 1% of the Internet having port 443 open, we have to handle around 5000 TCP connections per second when using Zmap with 500k packets per second. These 5000 TCP connections per second are distributed across a few systems to reduce error-rates.

As another example serves our DNS lookup job. When looking up massive amounts of DNS records (for example all common names found in the certificates) we use a relatively high amount of virtual machines across the world to reduce load on the DNS resolvers. In the implementation we use the excellent c-ares library from the cURL project. You can find our mass DNS resolver using c-ares in Python here.

Conclusion

In our opinion visibility into publicly available services and software is lacking severely. Research like the EFF observatory leads to better knowledge about problems and allows us to improve the security of the Internet overall. Datasets from efforts such as the Internet Census 2012, even though obtained through illegal means, provide fertile ground for researchers to find vulnerabilities and misconfigurations. If we were able to obtain datasets like this on a regular basis through legal means and without noticeable impact on equipment, it would allow the security community to research trends, statistics and problems of our current public Internet.

Companies can also use these kinds of datasets to gain visibility into their assets and public facing services. They can reduce risks of misconfiguration and learn about common problems with devices and software.

We intend to coordinate with other research institutions and scanning projects to be able to provide the data to everyone. We want to establish boundaries and limits of rates and frequencies for the best practices in order to reduce the impact of this kind of research on networks and IT staff.

By leveraging the data, together we can hopefully make the Internet a little safer in the future.