Distance Per Network Hop Technical Report

 

This is a technical data analysis of physical distance per network hop done for Hopzero.
Peter Mullarkey, PhD, Hopzero and RedRunGroup
Version 2.0

Motivation

Having a basic understanding of the physical distance between network hops can provide insight into a company’s exposure (or risk) surface. That is, if a company is communicating with Device A that is 3 network hops away and Device B that is 10 network hops away, it stands to reason that Device A is physically closer than Device B. It is less certain that these distances can be equated to political groupings, such as being in the same county, state, or region of a country, but this also holds qualitatively and can be useful in understanding risk exposure.

While there is some literature wrestling with the challenge of providing information on distance per hop (DpH), most of it consists of approaches or surveys, and we were interested in a less theoretical, more practical, statistically characterized metric.

While doing generalized data collection for Hopzero, I noticed that we had over 8600 geo-located public IP addresses. Given that we could communicate with those hosts from four datacenters with known locations, we decided to collect as many of those point-to-point measurements as possible as a basis for studying the character of distance per hop. Across the 8640 devices, we achieved an overall response rate of 39.3 percent.

Let’s map all of those destination locations, along with the 4 source locations (shown in green).


Assumptions

One of the most important assumptions to state and comment on is that we used a linear distribution of the distance across the hops involved in a given point-to-point measurement. As an example, if a traceroute was done from the Western US datacenter to a host 2000 miles away and that measurement crossed 5 routers, we would say the distance per hop for that measurement was 400 miles. Given that we did not have GeoIP information for each intervening router, we could not apportion the distance in any other way. We still think this comprehensive dataset has a lot to teach us, even with this simplifying assumption. We discuss possible better approaches in the Future Work section.
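The linear apportionment described above reduces to dividing the end-to-end distance by the hop count. The sketch below illustrates this in Python. It assumes the end-to-end distance is the great-circle (haversine) distance between the two geolocations, which the report does not state explicitly, and the function names and Earth radius constant are illustrative rather than taken from the original analysis.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def miles_per_hop(distance_miles, hop_count):
    """Spread the end-to-end distance evenly across the hops."""
    if hop_count <= 0:
        raise ValueError("hop_count must be positive")
    return distance_miles / hop_count


# The example from the text: 2000 miles across 5 routers.
print(miles_per_hop(2000, 5))  # 400.0
```

Because every hop on a path receives the same share, a single long-haul link and several short metro links on the same path are indistinguishable in this metric.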

Explore the datasets

Although our dataset of destinations was international, the vast majority of the destinations were in the United States, with the second highest density being Europe, so it makes sense to explore the character of distance per hop on a regional basis. In addition, we will consider the character of the full dataset.

Probing from Northwestern United States

Inspecting the West dataset, we see that there were responses from 36% of the possible hosts. This data was collected using a node in the AWS Oregon datacenter.

## n
## 1 36.07639

## miles_per_hop
## Min. : 4.07
## 1st Qu.: 68.58
## Median :101.11
## Mean :130.46
## 3rd Qu.:206.15
## Max. :456.62

Outlier Analysis

Looking at the West dataset histogram, there are some data points out in the upper tail. Let’s have a look at those data points. Based on this analysis, less than 0.25% of the data is above 400 miles per hop, and looking at the summary stats with and without those points makes very little difference, so there is no need to subset the West data set.

## n
## 1 0.2413339

## miles_per_hop
## Min. : 4.07
## 1st Qu.: 68.55
## Median :100.50
## Mean :129.76
## 3rd Qu.:206.15
## Max. :391.37
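The outlier check above can be sketched as follows: compute the share of measurements above a threshold, then compare summary statistics with and without them. This is a minimal Python illustration on synthetic values, not the report's data or tooling. The `##`-prefixed blocks look like R `summary()` output, so the sketch uses `method="inclusive"`, which interpolates the same way as R's default quantile type (an assumption about how the original numbers were produced).

```python
import statistics


def summarize(values):
    """Six-number summary in the same layout as the output blocks here."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.fmean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }


def percent_above(values, threshold):
    """Percentage of measurements strictly above the threshold."""
    return 100 * sum(v > threshold for v in values) / len(values)


# Synthetic miles-per-hop values, for illustration only.
sample = [10.0, 50.0, 70.0, 100.0, 120.0, 200.0, 210.0, 450.0]
trimmed = [v for v in sample if v <= 400]

print(percent_above(sample, 400))  # 12.5
print(summarize(sample)["Median"], summarize(trimmed)["Median"])
```

If the two summaries are close, as they are for the West dataset, keeping the tail points does not materially change the conclusions.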

West data analysis

As we inspect the histogram of the full West dataset, we see two distinct clusters of data points, which is typically referred to as a bi-modal distribution. We will generate a statistical summary for the full dataset, but given its dual-peak nature, that summary may give us limited insight.

Now consider whether the bi-modal nature might be caused by links that cross oceans. This analysis was suggested by Bill Alderson and seems solid.
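One way to test the ocean-crossing idea is to label each measurement by whether the source and destination are on the same continent and compare the groups. The Python sketch below uses hypothetical records; the continent labels, values, and function name are illustrative and do not come from the report's dataset.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (source_continent, destination_continent, miles_per_hop).
measurements = [
    ("North America", "North America", 40.0),
    ("North America", "North America", 80.0),
    ("North America", "North America", 95.0),
    ("North America", "Europe", 210.0),
    ("North America", "Europe", 230.0),
    ("North America", "Asia", 260.0),
]


def median_by_crossing(records):
    """Median miles per hop for same-continent vs. cross-continent paths."""
    groups = defaultdict(list)
    for src, dst, dph in records:
        key = "same continent" if src == dst else "cross continent"
        groups[key].append(dph)
    return {key: median(values) for key, values in groups.items()}


print(median_by_crossing(measurements))
```

If the two groups line up with the two peaks of the histogram, the continent crossing is a plausible explanation for the bi-modal shape.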

Based on inspection of the overall and by-continent histograms, the low point between the peaks appears to be around 150 miles per hop. So let’s split the dataset into two datasets, one for DpH < 150 and one for DpH > 150. Looking at these separate histograms, with the median marked by the green vertical dashed line, we see a lower median of 75 miles per hop and an upper median of 223 miles per hop.
Lower mode (DpH < 150):
## miles_per_hop
## Min. : 4.07
## 1st Qu.: 40.22
## Median : 74.78
## Mean : 71.42
## 3rd Qu.: 94.53
## Max. :148.30
Upper mode (DpH > 150):
## miles_per_hop
## Min. :150.7
## 1st Qu.:198.4
## Median :222.6
## Mean :233.2
## 3rd Qu.:257.4
## Max. :456.6
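The split into a lower and an upper mode can be sketched in Python as below. The cut point is supplied by hand, matching the visual inspection described in the text, and the sample values are synthetic.

```python
from statistics import median


def split_at(values, cut):
    """Partition values into those below the cut and those above it.

    Values exactly equal to the cut are dropped, mirroring the strict
    DpH < cut and DpH > cut split described in the text.
    """
    lower = [v for v in values if v < cut]
    upper = [v for v in values if v > cut]
    return lower, upper


# Synthetic miles-per-hop values with two clusters (illustration only).
sample = [20.0, 60.0, 75.0, 90.0, 140.0, 160.0, 220.0, 225.0, 300.0]
lower, upper = split_at(sample, 150)

print(median(lower), median(upper))  # 75.0 222.5
```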

As a slight aside, during data collection we used several target ports to increase our likelihood of being able to contact the target host and get a valid hop count. Although no useful conclusions can be drawn from the number of connections per destination port, since that was our own synthetic traffic, it does provide one interesting observation. As shown in the following graph, around the same number of hosts had port 3389 (Microsoft Terminal Server – RDP) open to the public as had port 22 (SSH).


Probing from Eastern United States

Inspecting the East dataset, we see that there were responses from 35% of the possible hosts. This data was collected using a node in the AWS Ohio datacenter.

## n
## 1 0.3512731
## miles_per_hop
## Min.   : 0.3582
## 1st Qu.: 22.1110
## Median : 74.5781
## Mean   :109.2136
## 3rd Qu.:192.0921
## Max.   :596.3041

Outlier Analysis

Looking at the East dataset histogram, there are some data points out in the upper tail. Let’s have a look at those data points. Based on this analysis, less than 0.7% of the data is above 400 miles per hop, and looking at the summary stats with and without those points makes very little difference (the mean shifts from 109.2 to 106.8), so there is no need to subset the East data set.

## n
## 1 0.6938081
## miles_per_hop
## Min. : 0.3582
## 1st Qu. : 21.7302
## Median : 74.2513
## Mean :106.7863
## 3rd Qu. :190.6372
## Max. :399.4519

Eastern US data analysis

As we inspect the histogram of the full East dataset, we see two distinct clusters of data points, which is typically referred to as a bi-modal distribution. We will generate a statistical summary for the full dataset, but given its dual-peak nature, that summary may give us limited insight.

Since the analysis of the probing from the northwestern US suggested that the bi-modal nature is caused by links that cross oceans, we try the same by-continent analysis for this dataset.

Based on inspection of the full and by-continent histograms, the low point between the peaks appears to be around 130 miles per hop. So let’s split the dataset into two datasets, one for DpH < 130 and one for DpH > 130. Looking at these separate histograms, with the median marked by the green vertical dashed line, we see a lower median of 32 miles per hop and an upper median of 239 miles per hop.
Lower mode (DpH < 130):
## miles_per_hop
## Min. : 0.3582
## 1st Qu. : 16.0988
## Median : 32.3434
## Mean : 44.3401
## 3rd Qu. : 74.7046
## Max. :129.8424
Upper mode (DpH > 130):
## miles_per_hop
## Min.      :130.0
## 1st Qu. :193.8
## Median :238.5
## Mean     :241.4
## 3rd Qu.  :272.3
## Max.       :596.3

Probing from Europe

Inspecting the European dataset, we see that there were responses from 34% of the possible hosts. This data was collected using a node in the AWS Frankfurt, Germany datacenter.

## n
## 1 0.3353009
## miles_per_hop
## Min.    : 0.15
## 1st Qu. : 192.59
## Median  : 241.70
## Mean    : 235.43
## 3rd Qu. : 289.51
## Max.    :1136.46

Outlier Analysis

Looking at the European dataset histogram, there are some data points out in the upper tail. Let’s have a look at those data points. Based on this analysis, less than 0.88% of the data is above 600 miles per hop, and looking at the summary stats with and without those points makes very little difference (the mean shifts from 235 to 231), so there is no need to subset the European data set.

## n
## 1 0.8770478
## miles_per_hop
## Min. : 0.15
## 1st Qu. :192.16
## Median :241.70
## Mean :231.37
## 3rd Qu. :288.73
## Max. :588.93

European data analysis

As we inspect the histogram of the full European dataset, we see two distinct clusters of data points, which is typically referred to as a bi-modal distribution. We will generate a statistical summary for the full dataset, but given its dual-peak nature, that summary may give us limited insight.

Since the analyses of the probing from the northwestern and eastern US suggested that the bi-modal nature is caused by links that cross oceans, we try the same by-continent analysis for this dataset.

Based on inspection of the full and by-continent histograms, the low point between the peaks appears to be around 130 miles per hop. So let’s split the dataset into two datasets, one for DpH < 130 and one for DpH > 130. Looking at these separate histograms, with the median marked by the green vertical dashed line, we see a lower median of 33 miles per hop and an upper median of 254 miles per hop.
Lower mode (DpH < 130):
## miles_per_hop
## Min. : 0.15
## 1st Qu. : 17.54
## Median : 32.47
## Mean : 44.84
## 3rd Qu. : 65.90
## Max. :129.86
Upper mode (DpH > 130):
## miles_per_hop
## Min. : 130.1
## 1st Qu.: 211.9
## Median : 254.4
## Mean : 267.1
## 3rd Qu.: 298.6
## Max. :1136.5

Probing from India

Inspecting the India dataset, we see that there were responses from 33.5% of the possible hosts. This data was collected using a node in the AWS Mumbai, India datacenter.

## miles_per_hop
## Min. : 0.64
## 1st Qu. :156.51
## Median :306.16
## Mean :286.15
## 3rd Qu. :389.17
## Max. :760.06

Outlier Analysis

Looking at the India dataset histogram, there are some data points out in the upper tail. Let’s have a look at those data points. Based on this analysis, less than 0.2% of the data is above 600 miles per hop, and looking at the summary stats with and without those points makes very little difference (the mean shifts from 286 to 285), so there is no need to subset the India data set.

## miles_per_hop
## Min. : 0.64
## 1st Qu. :156.33
## Median :305.37
## Mean :285.22
## 3rd Qu. :389.17
## Max. :585.16

India data analysis

As we inspect the histogram of the full India dataset, we see two distinct clusters of data points, which is typically referred to as a bi-modal distribution. We will generate a statistical summary for the full dataset, but given its dual-peak nature, that summary may give us limited insight.

Since the analyses of the probing from the northwestern US, eastern US, and Europe suggested that the bi-modal nature is caused by links that cross oceans, we try the same by-continent analysis for this dataset.

Based on inspection of the full and by-continent histograms, the low point between the peaks appears to be around 240 miles per hop. So let’s split the dataset into two datasets, one for DpH < 240 and one for DpH > 240. Looking at these separate histograms, with the median marked by the green vertical dashed line, we see a lower median of 127 miles per hop and an upper median of 363 miles per hop.
Lower mode (DpH < 240):
## miles_per_hop
## Min. : 0.64
## 1st Qu. :108.76
## Median :127.13
## Mean :136.95
## 3rd Qu. :170.51
## Max. :239.40
Upper mode (DpH > 240):
##  miles_per_hop   
##  Min.   :241.0   
##  1st Qu.:311.3   
##  Median :362.9   
##  Mean   :368.8   
##  3rd Qu.:419.2   
##  Max.   :760.1

Full Dataset

##  miles_per_hop     
##  Min.   :   0.15   
##  1st Qu.:  73.71   
##  Median : 175.45   
##  Mean   : 179.40   
##  3rd Qu.: 263.76   
##  Max.   :1136.46

Full data analysis

As we inspect the histogram of the full merged dataset, we see two distinct clusters of data points, which is typically referred to as a bi-modal distribution. We will generate a statistical summary for the full dataset, but given its dual-peak nature, that summary may give us limited insight.

Since that bi-modal nature of the data from the perspective of each probing point seemed to be caused by trans-water crossing, consider that approach for the full dataset. In this case, with all the probing points merged, it is not as clarifying as it was for the region-by-region analyses.

Based on inspection of the full histogram, the low point between the peaks appears to be around 240 miles per hop. So let’s split the dataset into two datasets, one for DpH < 240 and one for DpH > 240. Looking at these separate histograms, with the median marked by the green vertical dashed line, we see a lower median of 94 miles per hop and an upper median of 310 miles per hop.
Lower mode (DpH < 240):
##  miles_per_hop    
##  Min.   :  0.15   
##  1st Qu.: 36.84   
##  Median : 94.10   
##  Mean   :107.44   
##  3rd Qu.:178.06   
##  Max.   :239.99 
Upper mode (DpH > 240):
##  miles_per_hop    
##  Min.   : 240.1   
##  1st Qu.: 268.8   
##  Median : 310.1   
##  Mean   : 330.6   
##  3rd Qu.: 375.6   
##  Max.   :1136.5

Possible sources of the Bi-modal Distribution

Considering the hypothesis that the peak at lower miles_per_hop is for network links in urban areas, and the peak at longer miles_per_hop is for more rural areas, I started looking for a way to map the lat/long into some measure of urban/rural.

As I thought harder, I realized that we only had the starting and ending geolocation. And even if we could classify those endpoints, we could say nothing definitive about the locations in between. So if we had a rural source and destination, we might consider that data entry to be classified as “rural-rural” but in fact, the routers in between could all be in urban areas. So until we have a more comprehensive dataset where we can geolocate all the intermediate routers, we will need to leave the issue of what is causing the bi-modal character as a topic for further research. Included in Appendix A is the work I did in this area, since it may be useful in future work.

Following Bill Alderson’s suggestion, the by-continent analysis appears to account for the majority of the bi-modal behavior.

Future Work

Responding to the assumption of linear apportionment of distance across hops noted in the Assumptions section, it occurred to me that traceroute typically reports the network round trip time (NRTT) from the source to each successive hop as it traces the path from a source device to a destination device. Given that distance (propagation) delay is often the predominant portion of NRTT on long paths, it may be an interesting approach to estimate the distance of each hop along the entire traceroute path by converting the change in NRTT from one hop to the next into distance. More work needs to be done to assess the practicality of this approach, but it could provide an even larger and more robust distance per hop dataset.
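A rough sketch of the NRTT-to-distance idea follows, in Python. It is not part of the original analysis. It assumes signals travel through fiber at about two thirds of the speed of light, that traceroute reports round trip times from the source to each hop, and that all delay is propagation delay, so the results are likely overestimates. The function names and sample RTTs are hypothetical.

```python
SPEED_OF_LIGHT_MILES_PER_SEC = 186_282.397
FIBER_VELOCITY_FACTOR = 2 / 3  # assumed: light in fiber travels at roughly 2/3 c


def rtt_to_miles(rtt_ms):
    """One-way distance implied by a round trip time in milliseconds."""
    one_way_seconds = (rtt_ms / 1000) / 2
    return one_way_seconds * SPEED_OF_LIGHT_MILES_PER_SEC * FIBER_VELOCITY_FACTOR


def per_hop_miles(cumulative_rtts_ms):
    """Estimate distance for each hop from source-to-hop round trip times.

    Each hop's share is the increase in RTT over the largest RTT seen so
    far; a dip (common with noisy measurements) yields zero for that hop.
    """
    estimates = []
    previous = 0.0
    for rtt in cumulative_rtts_ms:
        estimates.append(rtt_to_miles(max(rtt - previous, 0.0)))
        previous = max(previous, rtt)
    return estimates


# Hypothetical RTTs (ms) from the source to hops 1..4 of a single traceroute.
print([round(d) for d in per_hop_miles([2.0, 12.0, 11.0, 30.0])])
```

Queuing and router processing delays inflate individual RTT samples, so in practice one would probably take the minimum RTT over several probes per hop before converting.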

Conclusions

A key observation is that when one is considering DpH within a continent, the lower mode statistics (e.g., 94 miles per hop) would be a good guideline. When going cross-continent, the upper mode statistics (e.g., 310 miles per hop) would be a good guideline.

Miles per hop   Lower Mode       Upper Mode       Both Modes
                Mean   Median    Mean   Median    Mean   Median
West              71       75     233      223     131      101
East              44       32     241      239     109       75
Europe            45       33     267      254     236      241
India            137      127     369      363     286      306
Full Dataset     107       94     331      310     179      176
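As a worked illustration of how these guidelines might be applied, the Python sketch below turns a hop count into a rough distance estimate using the full-dataset medians from the table above. The function name is hypothetical, and the result is only a coarse guide given the spread seen in each mode.

```python
# Full-dataset medians from the table above, in miles per hop.
WITHIN_CONTINENT_DPH = 94
CROSS_CONTINENT_DPH = 310


def estimate_miles(hop_count, cross_continent=False):
    """Rough distance estimate from a hop count using the median DpH."""
    dph = CROSS_CONTINENT_DPH if cross_continent else WITHIN_CONTINENT_DPH
    return hop_count * dph


# Device A (3 hops) and Device B (10 hops) from the Motivation section,
# both assumed here to be on the same continent as the observer.
print(estimate_miles(3), estimate_miles(10))  # 282 940
```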

 

Appendix A – Classification of geolocations as Urban/Rural

This paper wrestles with a similar issue: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6466258/. The same resource has a transformation from FIPS county codes to Rural-Urban Commuting Area (RUCA) codes, which would provide a reasonable urban-to-rural range value. I then found an fcc.gov web API, https://geo.fcc.gov/api/census/#!/block/get_block_find, that transforms lat/long into FIPS county codes. An example of its use is https://geo.fcc.gov/api/census/block/find?latitude=24.9896&longitude=121.3187&censusYear=2010&format=json. However, initial testing resulted in mostly empty responses. These empty responses may be a side effect of the worldwide nature of our data points, since FIPS county codes cover only the United States.
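The lookup described above could be scripted along these lines in Python. The URL and query parameters are taken from the example in the text. The helper names are hypothetical, and the assumption that the JSON response nests the code under `County` and then `FIPS` should be checked against the FCC documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

FCC_BLOCK_FIND_URL = "https://geo.fcc.gov/api/census/block/find"


def block_find_url(latitude, longitude, census_year=2010):
    """Build the lookup URL in the same form as the example in the text."""
    query = urlencode({
        "latitude": latitude,
        "longitude": longitude,
        "censusYear": census_year,
        "format": "json",
    })
    return f"{FCC_BLOCK_FIND_URL}?{query}"


def county_fips(latitude, longitude, timeout=10):
    """Return the county FIPS code, or None when the service has no match.

    Assumes the JSON response nests the code under County -> FIPS; points
    outside the United States are expected to come back empty.
    """
    url = block_find_url(latitude, longitude)
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return (payload.get("County") or {}).get("FIPS")


print(block_find_url(24.9896, 121.3187))
```

Returning `None` for an empty match makes it easy to count how many data points fall outside the service's coverage before attempting the RUCA mapping.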


Peter Mullarkey

Software architect with a Ph.D from Carnegie-Mellon with substantial data analytical experience. Holds 7 patents in areas ranging from anomaly detection to knowledge-based decision-making systems to data visualization to simulation, along with numerous published papers. Enjoys providing technical leadership to software teams and interacting with sales, marketing, and especially customers. A hands-on architect, with enthusiasm for writing code daily.
