This is a technical data analysis of physical distance per network hop done for Hopzero.
Peter Mullarkey, PhD, Hopzero and RedRunGroup
Version 2.0
Motivation
Having a basic understanding of the physical distance between network hops can provide insight into a company’s exposure (or risk) surface. For example, if a company is communicating with Device A that is 3 network hops away and Device B that is 10 network hops away, it stands to reason that Device A is physically closer than Device B. It is less certain that these distances can be equated to political groupings, such as being in the same county, state, or region of a country, but this is also qualitatively true and can be useful in understanding risk exposure.
While there is some literature wrestling with the challenge of providing information on distance per hop (DpH), most of it consists of approaches or surveys, and we were interested in a less theoretical, more practical, statistically characterized metric.
While doing generalized data collection for Hopzero, I noticed that we had over 8600 geo-located public IP addresses. Given that we could communicate with those hosts from four datacenters with known locations, we decided to collect as many of those point-to-point measurements as possible as a basis for studying the character of distance per hop. We were able to get an overall response rate of 39.3 percent from the 8640 devices.
Let’s map all of those destination locations, along with the 4 source locations (shown in green).
Assumptions
One of the most important assumptions to state and comment on is that we used a linear distribution of the distance across the hops involved in a given point-to-point measurement. As an example, if a traceroute was done from the Western US datacenter to a host 2000 miles away and that measurement crossed 5 routers, we would say the distance per hop for that measurement was 400 miles. Given that we did not have GeoIP information for each intervening router, we could not apportion the distance in any other way. We still think this comprehensive dataset has a lot to teach us, even with this simplifying assumption. We discuss possible better approaches in the Future Work section.
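The linear apportionment described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the study: the function names are hypothetical, and the great-circle (haversine) helper is an assumption, since the text does not say how the point-to-point distance was computed.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (assumed constant)


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two lat/long points (assumed method)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


def miles_per_hop(total_miles, hop_count):
    """Spread the end-to-end distance evenly across the hops (linear assumption)."""
    if hop_count <= 0:
        raise ValueError("hop_count must be positive")
    return total_miles / hop_count


# The worked example from the text: 2000 miles across 5 routers.
print(miles_per_hop(2000, 5))  # 400.0
```

Because every hop receives the same share, one long undersea link and several short metro links on the same path would all be reported with the same distance per hop.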
Explore the datasets
Although our dataset of destinations was international, the vast majority of the destinations were in the United States, with the second highest density being Europe, so it makes sense to explore the character of distance per hop on a regional basis. In addition, we will consider the character of the full dataset.
Probing from Northwestern United States
Inspecting the West dataset, we see that there were responses from 36% of the possible hosts. This data was collected using a node in the AWS Oregon datacenter.
## n
## 1 36.07639
## miles_per_hop
## Min. : 4.07
## 1st Qu.: 68.58
## Median :101.11
## Mean :130.46
## 3rd Qu.:206.15
## Max. :456.62
Outlier Analysis
Looking at the West dataset histogram, there are some data points out in the upper tail. Let’s have a look at those data points. Based on this analysis, less than 0.25% of the data is above 400 miles per hop, and looking at the summary stats with and without those points makes very little difference, so there is no need to subset the West data set.
## n
## 1 0.2413339
## miles_per_hop
## Min. : 4.07
## 1st Qu.: 68.55
## Median :100.50
## Mean :129.76
## 3rd Qu.:206.15
## Max. :391.37
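The outlier check used here (what share of the data lies above a cutoff, and how much the mean moves when those points are dropped) can be sketched as follows. The function names and the sample values are hypothetical and are not the study data.

```python
from statistics import mean


def outlier_share(values, cutoff):
    """Percentage of values strictly above the cutoff."""
    above = sum(1 for v in values if v > cutoff)
    return 100.0 * above / len(values)


def mean_with_and_without(values, cutoff):
    """Mean of all values, and mean after dropping values above the cutoff."""
    kept = [v for v in values if v <= cutoff]
    return mean(values), mean(kept)


# Hypothetical miles-per-hop values, for illustration only.
sample = [50.0, 100.0, 150.0, 200.0, 450.0]
print(outlier_share(sample, 400))
print(mean_with_and_without(sample, 400))
```

When the share is small and the two means are close, as reported for each regional dataset, there is little reason to subset the data.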
West data analysis
As we inspect the histogram of the full West dataset, we see two distinct clusters of data points. This is typically referred to as a bi-modal distribution. We will generate a statistical summary for the full dataset, but given its dual-peak nature, that summary may give us limited insight.
Now consider whether the bi-modal nature might be caused by links that cross oceans. This analysis was suggested by Bill Alderson and seems solid.
## miles_per_hop
## Min. : 4.07
## 1st Qu.: 40.22
## Median : 74.78
## Mean : 71.42
## 3rd Qu.: 94.53
## Max. :148.30
## miles_per_hop
## Min. :150.7
## 1st Qu.:198.4
## Median :222.6
## Mean :233.2
## 3rd Qu.:257.4
## Max. :456.6
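One way to produce a pair of summaries like the two above is to split the values at a cutoff and summarize each group. The sketch below is an assumption about the mechanics, not the author's code: the text attributes the two groups to ocean crossings, and the cutoff of 150 used in the example is only inferred from the printed output (the lower group tops out at 148.30 and the upper group starts at 150.7). The `method="inclusive"` quartiles follow the same interpolation rule as R's default quantile type.

```python
from statistics import mean, quantiles


def summarize(values):
    """Min, quartiles, mean, and max, in the spirit of R's summary()."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": mean(values),
        "q3": q3,
        "max": max(values),
    }


def split_at(values, cutoff):
    """Split values into a lower group (< cutoff) and an upper group (>= cutoff)."""
    lower = [v for v in values if v < cutoff]
    upper = [v for v in values if v >= cutoff]
    return lower, upper


# Hypothetical miles-per-hop values, for illustration only.
sample = [40.0, 75.0, 95.0, 200.0, 220.0, 260.0]
lower, upper = split_at(sample, 150)  # 150 is an inferred, hypothetical cutoff
print(summarize(lower))
print(summarize(upper))
```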
As a slight aside, during data collection we used several target ports to increase our likelihood of being able to contact the target host and get a valid hop count. Although no useful conclusions can be drawn from the number of connections per destination port, since that was our synthetic traffic, it does provide one interesting observation. As shown in the following graph, around the same number of hosts had port 3389 (Microsoft Terminal Server – RDP) open to the public as port 22 (SSH).
Probing from Eastern United States
Inspecting the East dataset, we see that there were responses from 35% of the possible hosts. This data was collected using a node in the AWS Ohio datacenter.
## n
## 1 0.3512731
## miles_per_hop
## Min. : 0.3582
## 1st Qu.: 22.1110
## Median : 74.5781
## Mean :109.2136
## 3rd Qu.:192.0921
## Max. :596.3041
Outlier Analysis
Looking at the East dataset histogram, there are some data points out in the upper tail. Let’s have a look at those data points. Based on this analysis, less than 0.7% of the data is above 400 miles per hop, and looking at the summary stats with and without those points makes very little difference (the mean shifts from 109.2 to 106.8), so there is no need to subset the East data set.
## n
## 1 0.6938081
## miles_per_hop
## Min. : 0.3582
## 1st Qu. : 21.7302
## Median : 74.2513
## Mean :106.7863
## 3rd Qu. :190.6372
## Max. :399.4519
Eastern US data analysis
As we inspect the histogram of the full East dataset, we see two distinct clusters of data points. This is typically referred to as a bi-modal distribution. We will generate a statistical summary for the full dataset, but given its dual-peak nature, that summary may give us limited insight.
Since the analysis of the probing from the Northwestern US suggested that the bi-modal nature was caused by trans-water crossings, we try the same analysis on this dataset.
## miles_per_hop
## Min. : 0.3582
## 1st Qu. : 16.0988
## Median : 32.3434
## Mean : 44.3401
## 3rd Qu. : 74.7046
## Max. :129.8424
## miles_per_hop
## Min. :130.0
## 1st Qu. :193.8
## Median :238.5
## Mean :241.4
## 3rd Qu. :272.3
## Max. :596.3
Probing from Europe
Inspecting the European dataset, we see that there were responses from 34% of the possible hosts. This data was collected using a node in the AWS Frankfurt, Germany datacenter.
## n
## 1 0.3353009
## miles_per_hop
## Min. : 0.15
## 1st Qu. : 192.59
## Median : 241.70
## Mean : 235.43
## 3rd Qu. : 289.51
## Max. :1136.46
Outlier Analysis
Looking at the European dataset histogram, there are some data points out in the upper tail. Let’s have a look at those data points. Based on this analysis, less than 0.88% of the data is above 600 miles per hop, and looking at the summary stats with and without those points makes very little difference (the mean shifts from 235 to 231), so there is no need to subset the European data set.
## n
## 1 0.8770478
## miles_per_hop
## Min. : 0.15
## 1st Qu. :192.16
## Median :241.70
## Mean :231.37
## 3rd Qu. :288.73
## Max. :588.93
European data analysis
As we inspect the histogram of the full European dataset, we see two distinct clusters of data points. This is typically referred to as a bi-modal distribution. We will generate a statistical summary for the full dataset, but given its dual-peak nature, that summary may give us limited insight.
Since the analysis of the probing from the northwestern and eastern US suggested that the bi-modal nature was caused by trans-water crossings, we try the same analysis on this dataset.
## miles_per_hop
## Min. : 0.15
## 1st Qu. : 17.54
## Median : 32.47
## Mean : 44.84
## 3rd Qu. : 65.90
## Max. :129.86
## miles_per_hop
## Min. : 130.1
## 1st Qu.: 211.9
## Median : 254.4
## Mean : 267.1
## 3rd Qu.: 298.6
## Max. :1136.5
Probing from India
Inspecting the India dataset, we see that there were responses from 33.5% of the possible hosts. This data was collected using a node in the AWS Mumbai, India datacenter.
## miles_per_hop
## Min. : 0.64
## 1st Qu. :156.51
## Median :306.16
## Mean :286.15
## 3rd Qu. :389.17
## Max. :760.06
Outlier Analysis
## miles_per_hop
## Min. : 0.64
## 1st Qu. :156.33
## Median :305.37
## Mean :285.22
## 3rd Qu. :389.17
## Max. :585.16
Looking at the India dataset histogram, there are some data points out in the upper tail. Let’s have a look at those data points. Based on this analysis, less than 0.2% of the data is above 600 miles per hop, and looking at the summary stats with and without those points makes very little difference (the mean shifts from 286 to 285), so there is no need to subset the India data set.
India data analysis
As we inspect the histogram of the full India dataset, we see two distinct clusters of data points. This is typically referred to as a bi-modal distribution. We will generate a statistical summary for the full dataset, but given its dual-peak nature, that summary may give us limited insight.
Since the analysis of the probing from the northwestern US, eastern US, and Europe suggested that the bi-modal nature was caused by trans-water crossings, we try the same analysis on this dataset.
## miles_per_hop
## Min. : 0.64
## 1st Qu. :108.76
## Median :127.13
## Mean :136.95
## 3rd Qu. :170.51
## Max. :239.40
## miles_per_hop
## Min. :241.0
## 1st Qu.:311.3
## Median :362.9
## Mean :368.8
## 3rd Qu.:419.2
## Max. :760.1
Full Dataset
## miles_per_hop
## Min. : 0.15
## 1st Qu.: 73.71
## Median : 175.45
## Mean : 179.40
## 3rd Qu.: 263.76
## Max. :1136.46
Full data analysis
As we inspect the histogram of the full merged dataset, we see two distinct clusters of data points. This is typically referred to as a bi-modal distribution. We will generate a statistical summary for the full dataset, but given its dual-peak nature, that summary may give us limited insight.
Since that bi-modal nature of the data from the perspective of each probing point seemed to be caused by trans-water crossing, consider that approach for the full dataset. In this case, with all the probing points merged, it is not as clarifying as it was for the region-by-region analyses.
## miles_per_hop
## Min. : 0.15
## 1st Qu.: 36.84
## Median : 94.10
## Mean :107.44
## 3rd Qu.:178.06
## Max. :239.99
## miles_per_hop
## Min. : 240.1
## 1st Qu.: 268.8
## Median : 310.1
## Mean : 330.6
## 3rd Qu.: 375.6
## Max. :1136.5
Possible sources of the Bi-modal Distribution
Considering the hypothesis that the peak at lower miles_per_hop is for network links in urban areas, and the peak at longer miles_per_hop is for more rural areas, I started looking for a way to map the lat/long into some measure of urban/rural.
As I thought harder, I realized that we only had the starting and ending geolocation. And even if we could classify those endpoints, we could say nothing definitive about the locations in between. So if we had a rural source and destination, we might consider that data entry to be classified as “rural-rural” but in fact, the routers in between could all be in urban areas. So until we have a more comprehensive dataset where we can geolocate all the intermediate routers, we will need to leave the issue of what is causing the bi-modal character as a topic for further research. Included in Appendix A is the work I did in this area, since it may be useful in future work.
Based on Bill Alderson’s suggestion, analyzing the data by continent seems to account for the majority of the bi-modal behavior.
Future Work
Responding to the assumption of linear apportionment of distance across hops noted in the Assumptions section, it hit me that traceroute typically reports the network round trip time (NRTT) from the source to each successive hop as it traces the path from a source device to a destination device, so the delay contributed by each hop can be estimated from the difference between successive values. Given that the predominant portion of NRTT is the distance (propagation) delay, it may be an interesting approach to estimate distance per hop along the entire traceroute path using the conversion from NRTT to distance. More work needs to be done to assess the practicality of this approach, but it could provide an even larger and more robust distance per hop dataset.
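A rough sketch of the NRTT-to-distance conversion might look like the following. It rests on assumptions that are not part of the original analysis: the input is the cumulative round trip time from the source to each hop, signals travel at about two thirds of the speed of light in fiber, and all delay is propagation delay. Since queuing and processing delays are ignored, the estimates are better read as upper bounds. The function and constant names are hypothetical.

```python
SPEED_OF_LIGHT_MILES_PER_MS = 186.282  # speed of light in vacuum, miles per millisecond
FIBER_VELOCITY_FACTOR = 2 / 3          # assumed propagation speed in fiber relative to c
MILES_PER_MS = SPEED_OF_LIGHT_MILES_PER_MS * FIBER_VELOCITY_FACTOR


def per_hop_miles(cumulative_rtt_ms):
    """Estimate one-way miles for each hop from cumulative RTTs (source to hop)."""
    estimates = []
    previous = 0.0
    for rtt in cumulative_rtt_ms:
        # Jitter can make a later hop appear faster than an earlier one;
        # treat negative differences as zero rather than negative distance.
        delta = max(rtt - previous, 0.0)
        # Halve the round trip to get one-way delay, then convert to miles.
        estimates.append(delta / 2 * MILES_PER_MS)
        previous = rtt
    return estimates


# Hypothetical cumulative RTTs in milliseconds for a four-hop path.
hops = per_hop_miles([2.0, 6.0, 5.5, 20.0])
print([round(m, 1) for m in hops])
```

In practice one would probably take the minimum of several RTT samples per hop before differencing, since the minimum is least affected by queuing.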
Conclusions
A key observation is that when considering DpH within a continent, the lower-mode statistics (e.g., a median of 94 miles per hop) are a good guideline. When going cross-continent, the upper-mode statistics (e.g., a median of 310 miles per hop) are a good guideline.
Appendix A – Classification of geolocations as Urban/Rural
This paper wrestles with a similar issue: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6466258/ and this resource has a transformation from FIPS County Codes to the Rural-Urban Commuting Area (RUCA) Codes, which would provide a reasonable urban-to-rural range value. I then found an fcc.gov web API, https://geo.fcc.gov/api/census/#!/block/get_block_find, that will transform from lat/long to FIPS county codes. An example of use is https://geo.fcc.gov/api/census/block/find?latitude=24.9896&longitude=121.3187&censusYear=2010&format=json. However, initial testing resulted in mostly empty responses. These responses may be a side effect of the worldwide nature of our datapoints.
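For reference, the example request above can be assembled programmatically. The sketch below only builds the URL from the query parameters shown in the example; it does not call the service, and it makes no assumption about the shape of the JSON response. The function name is hypothetical.

```python
from urllib.parse import urlencode

FCC_BLOCK_FIND_URL = "https://geo.fcc.gov/api/census/block/find"


def block_find_url(latitude, longitude, census_year=2010):
    """Build a block/find query URL using the parameters from the example above."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "censusYear": census_year,
        "format": "json",
    }
    return f"{FCC_BLOCK_FIND_URL}?{urlencode(params)}"


print(block_find_url(24.9896, 121.3187))
```

Filtering the datapoints to United States coordinates before issuing requests would be one way to test whether the empty responses come from non-US locations.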