LMAP WG D. Goergen Internet-Draft R. State Intended status: Informational University of Luxembourg Expires: January 16, 2014 V. Gurbani Bell Labs, Alcatel-Lucent July 15, 2013 Aggregating large-scale measurements for Application Layer Traffic Optimization (ALTO) Protocol draft-goergen-lmap-fcc-00 Abstract Analyzing and aggregating large-scale broadband measurements is essential to study trends and derive network analytics. These trends and analyses could be made available through well defined protocols such as the Application Layer Traffic Optimization (ALTO) protocol. However, ALTO requires network information to be distilled and abstracted in form of a network map and a cost map. We describe our methodology for analyzing the United States Federal Communication Commission's (FCC) Measuring Broadband America (MBA) dataset to derive required topology and cost maps suitable for consumption by an ALTO server. Goergen, et al. Expires January 16, 2014 [Page 1] Internet-Draft ALTO Maps July 2013 Requirements Language The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119]. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on January 16, 2014. Copyright Notice Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Goergen, et al. Expires January 16, 2014 [Page 2] Internet-Draft ALTO Maps July 2013 Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 2. Challenges in data analysis . . . . . . . . . . . . . . . . . 5 3. Geo-locating the units . . . . . . . . . . . . . . . . . . . . 6 4. Conclusions and future work . . . . . . . . . . . . . . . . . 9 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10 6. Security Considerations . . . . . . . . . . . . . . . . . . . 11 7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 12 7.1. Normative References . . . . . . . . . . . . . . . . . . . 12 7.2. Informative References . . . . . . . . . . . . . . . . . . 12 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 13 Goergen, et al. Expires January 16, 2014 [Page 3] Internet-Draft ALTO Maps July 2013 1. Introduction Measuring broadband performance is increasingly important as communications continue to move towards the Internet. Internet service providers (ISP), national agencies and other entities gather broadband data and may provide some, or all, of the dataset to the public for analysis. As [I-D.seedorf-lmap-alto] notes, there are two extremes prevalent for presenting large-scale data. One is in the form of charts, figures, or summarized reports amenable for easy and quick consumption. The other extreme includes releasing raw data in the form of large files containing tables formatted as values separated by a delimiter. While the former is indispensable to acquire a summary view of the dataset, it does not suffice for additional analysis beyond what is presented. Conversely, the problem with the latter option (raw files) is that the unsuspecting user perusing them is lost in the deluge of data. [I-D.seedorf-lmap-alto] offers the argument that a reasonable medium between the two extremes may be a protocol that allows a constrained set of user-driven ad-hoc queries on the dataset. It further offers that the Application Layer Traffic Optimization (ALTO) protocol [I-D.ietf-alto-protocol] be the protocol of choice that allows such reasoning on the dataset. A necessary prerequisite for using ALTO is abstracting the network information into a form that is suitable for consumption by the protocol. The implication of using ALTO is that data from any large-scale measurement effort must first be distilled in two maps: a topology map and a cost map. Further analysis and ad- hoc queries can be subsequently performed on the normalized dataset. In the United States, the Federal Communication Commission (FCC) has embarked on a nationwide performance study of residential wireline broadband service [fcc]. Our aim is to use the raw datasets from this study for analysis and to create a topology map and a cost map from this dataset. ALTO queries aimed at these maps will enable users and interested parties to fulfill the use cases listed in Section 2 of [I-D.seedorf-lmap-alto]. Goergen, et al. Expires January 16, 2014 [Page 4] Internet-Draft ALTO Maps July 2013 2. Challenges in data analysis The FCC Measuring Broadband America (MBA) study consisted of 7,782 volunteers spread across the United States with adequate geographic diversity. Volunteers opted in for the study, however, each of the volunteers remained anonymous. An opaque integral number (unit_id) represented a subscriber in the raw dataset. This unit_id remains constant during the duration of the study in the dataset and uniquely identifies a volunteer subscriber, even if the subscriber switches the ISP. More detail about the methodology used is described in [fcc]. The dataset consisted of 12 tables, each table corresponding to the data drawn from a certain performance test. For the analysis we present in this document we focus on the "curr_dns" table, which contains the time taken for the ISP's recursive DNS resolver to return a DNS A RR for a popular website domain name. This test was ran approximately every hour in a 24-hour period, and produced about 75-78 million records per month. This resulted in a typical file size in the range of 6-7 GBytes per month. We note that the "curr_dns" table is one of the smaller tables in the dataset. The first challenge, therefore, was to arrive at computing resources comparable in scale with respect to the dataset consisting of millions of records spread across gigabyte-sized files. To analyze the volume of data we used a canonical Map-Reduce computational paradigm on a Hadoop cluster (more details on the methodology are outlined in Section 3). A second, more pressing challenge, was to identify the geographic location of the unit_ids generating the data. In order to derive a topological map and impose costs on the links, it is important to know the physical locations of the unit_ids that contributed the measurements. However, in the MBA dataset, the population is anonymized and the individual subscriber reporting the measurement data is simply referred to by an opaque integral number. Therefore, an important task was to use the information in the public tables to reveal a coarse location of the subscriber. We outline the methodology we used to do so in the next section. We stress that this methodology does not identify the specific location of a subscriber, who still remains anonymous. Instead, it simply locates the subscriber in a larger metropolitan region. This level of granularity suffices for our work. Goergen, et al. Expires January 16, 2014 [Page 5] Internet-Draft ALTO Maps July 2013 3. Geo-locating the units To geo-locate the units, we simply note that broadband subscriber devices are likely to be configured using DHCP by their ISP. Besides imparting an IP address to the subscriber device, DHCP also populates the DNS name servers the subscriber devices uses for DNS queries. In most installations, these DNS name servers are located in close physical proximity of the subscriber device. The FCC technical appendix states that the DNS resolution tests were targeted directly at the ISP's recursive resolvers to circumvent caching and users configuring the subscriber device to circumvent the ISP's DNS resolvers. Therefore, a reasonable approximation of a subscribers geo-location could be the geographic location of the DNS name server serving the subscriber. We use this very heuristic to geo-locate a subscriber. Thus our first, and very simple filter consisted of obtaining a mapping from a unit_id (representing a subscriber) to one or more DNS name servers that the unit_id is sending DNS requests to. It turned out that while this was a necessary condition for advancing, it was not a sufficient one. The raw data would need to be further processed to reduce inconsistencies and remove outliers. A number of interesting artifacts were uncovered during further processing of the data. These artifacts informed the selection of the unit_ids for further analysis. The artifacts are documented below. o A handful of unit_ids were geo-located in areas outside the contiguous United States, such as Ukraine, Poland or the United Kingdom. We theorize that the subscribers corresponding to the unit_ids geo-located outside the contiguous United States had simply configured their devices to use alternate DNS servers, probably located outside the United States. We removed these records before conducting our analysis. o We also observed a reasonable number of non-ISP DNS resolvers, especially Google's 8.8.8.8 and 8.8.4.4 and OpenDNS 208.67.222.222 and 208.67.220.220. These 4 public DNS servers are geo-located in California. We removed these records to ensure that the specific location that these resolvers represented was not oversampled. o We noticed that a large number of unit_ids were being geo-located in Potwin, Kansas. Intrigued as to why there appeared to be a large population of Internet users being located in a small rural community in Kansas, we investigated further. It appears that Potwin, Kansas is the geographical center of the United States and a number of ISPs have chosen to establish data centers in or Goergen, et al. Expires January 16, 2014 [Page 6] Internet-Draft ALTO Maps July 2013 around the Potwin area. These ISPs generally locate their primary or secondary DNS name servers in Potwin-area data centers, thus accounting for the popularity of Potwin as an Internet destination. We continue to further investigate on minimizing the impact of such natural aggregation points that, if not accounted for, will skew our results in an unwarranted direction. o We observed some unit_ids changing ISPs during the observation period. This is a normal occurrence and to the extent that the unit_id is geo-located in the same geographical area after the change in ISP, we do not exclude such unit_ids from further analysis. Subsequent filters extracted the stable unit_ids from our dataset. In order to determine which unit_id are stable, i.e., remain constant with respect to their geographic location over the observation period from January to December 2012, we extracted for each unit_id the IP address of each DNS name server it consulted. This is obtained by applying the map reduce paradigm on the DNS dataset. We extracted for each unit_id the triggered DNS servers and obtained the individual DNS servers accessed by a unit_id. This was repeated for each month of the observation period. The resulting sets were cleaned up of private IP addresses and other artifacts discussed above. The cleaned set consisted of about 8000 distinct unit_id. In order to determine the stability of each unit_id we proceeded to sum up the occurrences of IP addresses over the whole observation period separated in monthly files. If the IP address of a DNS server occurred 12 times this meant that the unit_id always accessed the same DNS server and therefore remained stable over the observation period. The obtained stable unit_ids, around 1500, will be used for further analysis. Assuming a 99% confidence level and +/- 3 point margin of error, we will require a sample of 1494 unit_ids. With our stable unit_id set of 1500 unit_ids, we are now positioned to perform further analysis on the dataset to create the full topology and cost maps. Table 1 presents a sample of the geographic location data that we have uncovered for unit_ids. A complete list of identified units superimposed on the geographical map of the United States is available at http://cdb.io/13UOHgD. Goergen, et al. Expires January 16, 2014 [Page 7] Internet-Draft ALTO Maps July 2013 +---------+-----------------+--------------------------+ | Unit ID | City, State | Latitude/Longitude | +---------+-----------------+--------------------------+ | 872 | Morganville, NJ | 40.35950089,-74.26280212 | | | | | | 885 | Madison, WI | 43.07310104,-89.40119934 | | | | | | 898 | Foley, AL | 30.40660095,-87.68360138 | | | | | | 7969 | Manteca, CA | 37.79740143,-121.2160034 | | | | | | 8024 | Quincy, MA | 42.25289917,-71.00229645 | +---------+-----------------+--------------------------+ Sample unit identification tuples Table 1 Goergen, et al. Expires January 16, 2014 [Page 8] Internet-Draft ALTO Maps July 2013 4. Conclusions and future work Identification of the geographic location of the unit_ids generating the performance data is essential in order to continue the work. We have presented a methodology and some early results in identifying a geographic location. This location, although coarse, suffices for our future work that will consist of further data mining and analysis to create appropriate ALTO network and cost maps. Goergen, et al. Expires January 16, 2014 [Page 9] Internet-Draft ALTO Maps July 2013 5. IANA Considerations This document does not contain any IANA considerations Goergen, et al. Expires January 16, 2014 [Page 10] Internet-Draft ALTO Maps July 2013 6. Security Considerations There are no security artifacts that have been invalidated due to our analysis. All of our analysis was performed on publicly available data. However, we do note that some privacy may have been lost based on our analysis. In the raw dataset, the unit identifiers are opaque strings with no immediate correlation with a geographic location. After our analysis, while the unit identifiers still remain opaque, they are nonetheless correlated to a specific, though coarse, geographic location. Goergen, et al. Expires January 16, 2014 [Page 11] Internet-Draft ALTO Maps July 2013 7. References 7.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. 7.2. Informative References [I-D.ietf-alto-protocol] Alimi, R., Penno, R., and Y. Yang, "ALTO Protocol", draft-ietf-alto-protocol-17 (work in progress), July 2013. [I-D.seedorf-lmap-alto] Seedorf, J., Gurbani, V., and E. Marocco, "ALTO for Querying LMAP Results", draft-seedorf-lmap-alto-01 (work in progress), July 2013. [fcc] United States Federal Communications Commission, "Measuring Broadband America", Accessed July 12, 2013, http://www.fcc.gov/measuring-broadband-america. Goergen, et al. Expires January 16, 2014 [Page 12] Internet-Draft ALTO Maps July 2013 Authors' Addresses David Goergen University of Luxembourg Email: david.goergen@uni.lu Radu State University of Luxembourg Email: radu.state@uni.lu Vijay K. Gurbani Bell Labs, Alcatel-Lucent Email: vijay.gurbani@alcatel-lucent.com Goergen, et al. Expires January 16, 2014 [Page 13]