Behind the Data: The Web Index
As part of our Visualizing the Impact of the World Wide Web challenge, we sat down with Hania Farhan, the Director of Research for the Web Index at the Web Foundation, and asked her to describe the data, explain the process of gathering it, and tell us about the insights that she has already found.
What has been your involvement with the data?
I have been involved in designing and constructing the Index and determining the indicators that constitute the various components of the Index. I have also been involved in deciding on the statistical methodology and ensuring that the process is robust and rigorous.
Describe the data. What does it contain? How many columns and rows?
There are 88 indicators in total, of which 34 are data from secondary sources ("secondary" indicators) and 54 are primary indicators (indicators that were gathered by the Web Foundation using an expert assessment survey for the 81 countries covered by the Web Index). The data cover 81 countries (so, 81 rows), for a period of 6 years (2007-2012) for secondary data, but only one year (2013) for the primary data.
The overall composite Index is computed, essentially, by taking the raw data, imputing missing data points where possible for each indicator, normalising all the indicators (using z-scores), and then aggregating groups of indicators, as per the tree diagram, into 9 components by taking weighted averages of the normalised values. The results for each component are then further aggregated to compute the sub-Indexes, and finally the sub-Indexes are aggregated (using a weighted average) to compute the overall composite Web Index. Equal weights are used throughout.
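As a rough illustration of the steps described above, here is a minimal Python sketch of z-score normalisation followed by equal-weight aggregation up a tree. The indicator names, the tree layout, and the numbers are all hypothetical, the imputation step is left out, and the use of the population standard deviation is an assumption; this is not the Web Foundation's actual code.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Normalise one indicator across countries to mean 0 and unit spread."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; the source does not specify
    if sigma == 0:
        raise ValueError("indicator has no variation across countries")
    return [(v - mu) / sigma for v in values]


def equal_weight_average(columns):
    """Average several equally weighted columns, country by country."""
    return [mean(row) for row in zip(*columns)]


# Hypothetical raw data: three countries (one list position each), four indicators.
raw = {
    "ind_a": [10.0, 20.0, 30.0],
    "ind_b": [1.0, 2.0, 6.0],
    "ind_c": [5.0, 5.5, 9.0],
    "ind_d": [0.2, 0.9, 0.4],
}

# Hypothetical tree: indicators -> components -> sub-Indexes -> composite.
components = {"comp_1": ["ind_a", "ind_b"], "comp_2": ["ind_c"], "comp_3": ["ind_d"]}
sub_indexes = {"sub_x": ["comp_1", "comp_2"], "sub_y": ["comp_3"]}

# Step 1: normalise every indicator.
normalised = {name: z_scores(vals) for name, vals in raw.items()}

# Step 2: aggregate normalised indicators into components.
component_scores = {
    comp: equal_weight_average([normalised[i] for i in inds])
    for comp, inds in components.items()
}

# Step 3: aggregate components into sub-Indexes.
sub_scores = {
    sub: equal_weight_average([component_scores[c] for c in comps])
    for sub, comps in sub_indexes.items()
}

# Step 4: aggregate sub-Indexes into the composite score per country.
composite = equal_weight_average(list(sub_scores.values()))
print(composite)
```

Because every indicator is centred on zero before aggregation, the composite scores in this sketch also average to zero across countries; only the relative positions of countries carry meaning.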
The Web Index dataset contains indicators that capture each of the 9 dimensions that the Index tries to measure. Indicators were selected according to suitability (there are certain criteria that indicators have to fulfill in order to be included in the Index).
How was it decided what indicators to include in the Web Index?
As this is an international study, we searched a very large number of international databases to find indicators that measure or proxy the dimensions under study. However, we constructed some basic minimum criteria that indicators and data sources must fulfill before we could include them in the Index, to try and reduce potential data bias.
Before an indicator is included in the Index, it needs to fulfill five basic criteria:
- Data providers have to be credible and reliable organizations (e.g., theirs is not a one-off dataset being published), and likely to continue to produce these data.
- Data releases should be regular, with new data released at least every 3 years.
- There should be at least two data years for each indicator, so that basic statistical inference could be made.
- The latest data year should be no older than three years back from publication year. For example, if the first Index is published in 2012, data must be available for 2009 or later. Ideally, we would like the data to be available up to 2011, but the worst we would accept is 2009.
- The data source should cover at least two-thirds of the sample of countries, so that possible bias—introduced by having a large number of indicators from one source that systematically does not cover one-third or more of the countries—is reduced.
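The four quantitative criteria above could be expressed as a simple check. The following Python sketch is only an illustration: the function name and its parameters are hypothetical, and the first criterion (credibility of the provider) is a judgment call that is not modelled here.

```python
def meets_basic_criteria(data_years, publication_year, countries_covered,
                         total_countries=81, release_interval_years=1):
    """Return True if an indicator passes the four quantitative criteria.

    All parameter names are hypothetical; credibility of the data provider
    has to be assessed separately.
    """
    if not data_years:
        return False
    # New data should be released at least every 3 years.
    regular = release_interval_years <= 3
    # At least two distinct data years are needed.
    enough_years = len(set(data_years)) >= 2
    # The latest data year is at most three years before publication.
    recent = max(data_years) >= publication_year - 3
    # The source covers at least two-thirds of the countries (integer arithmetic).
    coverage = countries_covered * 3 >= total_countries * 2
    return regular and enough_years and recent and coverage


# An indicator with data for 2009 and 2010, covering 54 of 81 countries,
# just passes for an Index published in 2012.
print(meets_basic_criteria([2009, 2010], 2012, 54))
```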
What do you expect might be some challenges to working with the data?
Some challenges might include:
- navigating the distinction between raw and normalised data (you cannot compare across the two types of data)
- although we publish our methodology and all our own work and data, it will not be possible to obtain all the raw data for the secondary indicators from the Web Foundation's website, as some data providers, such as the ITU and the World Economic Forum (WEF), did not allow us to re-publish their data via the Web Index. They insisted that people should instead go to their respective websites to obtain the data. In the case of the WEF, this is a challenge as they only publish their indicators in PDF format, and the ITU charges a substantial amount of money for some of its data. Other data providers are also commercial, and their raw data can only be obtained through a paywall. For a full list of which indicators this applies to, see www.thewebindex.org.
- correlations are not always easy to interpret and need careful analysis
What is an interesting or fascinating aspect that you can already see in the data?
One indicator - the number of internet users per country - has the potential to tell most of the story, albeit without much depth. It tells us how many people use the Web in every country, without telling us how and why. The value of the Web to a nation may be much bigger than the sum of the value of the Web to individuals in that country. This is a story worth further investigation and research.
Also, if you drastically change the weights applied to the 4 sub-Indexes (e.g. by giving some of them a weight of zero), the resulting change in the rankings tells an interesting (but rather inconclusive) story.
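A small Python sketch of this kind of weight experiment is shown below. The country labels and sub-Index scores are made up purely for illustration, and the helper name is hypothetical; real experiments would use the published normalised sub-Index scores.

```python
def rank_countries(sub_index_scores, weights):
    """Rank countries by the weighted average of their sub-Index scores."""
    total = sum(weights)
    composite = {
        country: sum(w * s for w, s in zip(weights, scores)) / total
        for country, scores in sub_index_scores.items()
    }
    # Highest composite score first.
    return sorted(composite, key=composite.get, reverse=True)


# Hypothetical normalised scores on 4 sub-Indexes for three made-up countries.
scores = {
    "A": [1.0, 0.5, -0.2, 0.1],
    "B": [0.2, 0.9, 0.8, -0.3],
    "C": [-0.5, -0.4, 0.3, 1.2],
}

equal = rank_countries(scores, [1, 1, 1, 1])      # equal weights
reweighted = rank_countries(scores, [1, 1, 0, 0])  # two sub-Indexes zeroed out
print(equal, reweighted)
```

In this toy data the top two countries swap places once the last two sub-Indexes are given zero weight, which is the sort of shift in rankings the experiment is meant to expose.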
How has data visualization already helped to improve our understanding of the World Wide Web?
Huge datasets can be a little 'thick' to see through, so visualisation - if done well - gets the point across very quickly. Understanding the impact of the Web could also seem daunting given its enormity, so the visualisation of 'data bites' to tell the story makes it quicker to see and easier to comprehend.
Participate in our Visualizing the Impact of the World Wide Web challenge by Wednesday, January 29, 2014 for a chance at $4,000 in prizes, including an invitation to attend a Web Foundation event in 2014 to celebrate the Web's 25th anniversary.