Visualizing Ngrams

A new tool for visualizing words presages a transformation in social science. In the era of bits and bytes, will data emerge as its own archetype for creating knowledge?

If you've been web surfing anytime since mid-December, you've likely encountered the "Ngram Viewer," a tool for visualizing words and phrases in Google's vast digital books archive. You also probably noticed the chatter that surged across Twitter, the blogosphere, and mainstream media as thousands of folks, from experts to the merely curious, tried their hand at the tool, and some began to speculate whether these Tuftian little charts presage a radical transformation in the way we formulate questions and share ideas.

Linked to a database of more than 500 billion words, the Ngram Viewer is part of Google's ambitious effort to scan 15 million of the estimated 130 million titles that have been printed since around 1440, when Johannes Gutenberg perfected the mechanical printing press. By simply entering a string of unbroken letters (a 1-gram like "Russian" or a 2-gram like "Russian dressing"), users can trawl through more than 5.2 million of these titles. The near-instantaneous result is a simple line graph that illustrates how usage of that word or phrase (the current max for "n" is five) has changed over time. Most of the represented books are English works published between 1800 and 2000, but the database also includes titles in Spanish, French, Russian, Chinese, and Hebrew.
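To make the "n" in n-gram concrete, here is a minimal Python sketch that splits text into n-grams and computes, for each year, the share of that year's n-grams taken up by a given phrase. It is an illustration only, not a description of Google's pipeline: the toy corpus, the plain whitespace tokenizer, and the function names `ngrams` and `yearly_share` are all assumptions.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in text, as tuples."""
    # Assumption: lowercase and split on whitespace; punctuation is not handled.
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def yearly_share(corpus, phrase):
    """Map each year to the fraction of that year's n-grams equal to phrase.

    corpus: dict mapping a year to a list of text strings (hypothetical data).
    phrase: a string of 1 to 5 words, mirroring the limit mentioned above.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    if not 1 <= n <= 5:
        raise ValueError("phrase must contain between 1 and 5 words")
    shares = {}
    for year, texts in sorted(corpus.items()):
        counts = Counter()
        for text in texts:
            counts.update(ngrams(text, n))
        total = sum(counts.values())
        shares[year] = counts[target] / total if total else 0.0
    return shares


# Made-up toy corpus: two short "books" per year.
TOY_CORPUS = {
    1900: ["pass the russian dressing please", "a russian novel"],
    1950: ["russian dressing on the salad", "russian dressing again"],
}

if __name__ == "__main__":
    # One value per year; plotted against the year, this is the line graph.
    print(yearly_share(TOY_CORPUS, "russian dressing"))
```

Dividing by the total number of n-grams in each year, rather than reporting raw counts, keeps years with many books from dominating the curve simply because more was printed.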

The addictive simplicity of Ngrams has inspired a flurry of user-generated queries. In recent weeks, armchair enthusiasts have mapped everything from the ascendancy of the word "tofu" over "hotdog" to how usage of the word "racism" has ebbed and flowed in different languages. Environmental bloggers noted how usage of "climate change" has surpassed "global warming," while a reporter at the San Jose Mercury News wrote that "'Freud' is more deeply ingrained in our literature than 'Galileo,' 'Darwin,' or 'Einstein,'" and that "'God' got a lot of ink in the early 19th century, but now needs a new publicist; use of the word declined as the 20th century became more secular."

Though the Viewer tool is open to the general public, the intent of the Ngram project is a scholarly one. In a paper published in the December 17 issue of Science, Harvard researchers Erez Lieberman Aiden and Jean-Baptiste Michel describe using Google's vast digital books archive to examine trends in censorship, the spread of innovation, and the effects of youth and profession on fame. Dubbing their approach "culturomics," they argue that a quantitative approach to the study of human culture could herald a new paradigm for social science, one that is more rigorous and analytical.

Not surprisingly, the "culturomics" claim touched off a vibrant discussion of its own. Covering the story for the New York Times, Patricia Cohen suggested that if the 20th century was a chronicle of ideologies, the 21st may be the heyday of process: "Digitally savvy humanists argue it is time to stop looking for inspiration in the next political or philosophical 'ism' and start exploring how technology is changing our understanding of the liberal arts. This latest frontier is about method, they say, using powerful technologies and vast stores of digitized materials that previous humanities scholars did not have."

Whether "data" itself constitutes a new archetype for research remains a hotly contested question. (See George Mason Professor Dan Cohen's blog for an excellent, nuanced analysis.) One thing is clear, however: for the design community, Google's database, available to anyone with an Internet connection, offers an enormous cache of raw creative material. As Peter Gassner puts it, "What I find most exciting about this project is that Google enables everyone (no programming skills necessary) to ask questions and dig into a century old corpus of accumulated wisdom in over 5 million books in 6 languages."

And while the graphs generated by the Ngram Viewer are Spartan in their visual appeal, Gassner points out that the purpose of the tool isn't aesthetic anyway; it's to "give first insights and spawn ideas, which can then lead to a deeper analysis."

Engineer Chris Harrison has already taken n-gram data to that deeper analytical level with the stunning visualization shown here. Harrison, a PhD student in the Human-Computer Interaction Institute at Carnegie Mellon University, gained early access to Google's data in 2006, when the company released a massive set of 3-grams ("Carnegie Mellon University," for instance).

In this visualization, Harrison wanted to compare two sets of 3-grams, each beginning with a different word: "He" or "She." He identified the top 120 3-grams for each starting word. The frequencies of the second word in the 3-gram were then combined, sorted, and displayed in decreasing order of frequency of use. He repeated the process to rank the third (and final) word in the 3-gram. He also sized the words according to the square root of their use frequencies and used color-coded lines to trace paths between all of the 3-grams ("He was married," for example, or "She could never").
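The ranking steps described above can be sketched roughly as follows. This is one reading of the prose, not Harrison's actual code: the input format (a dict mapping 3-gram tuples to counts), the function names, the choice to combine counts across both sets, and the toy counts are all assumptions. Drawing the color-coded paths is left out.

```python
import math
from collections import Counter


def top_trigrams(trigram_counts, first_word, k=120):
    """Return the k most frequent 3-grams that start with first_word."""
    matching = {
        gram: count
        for gram, count in trigram_counts.items()
        if gram[0] == first_word
    }
    return dict(Counter(matching).most_common(k))


def rank_position(selections, position):
    """Combine counts per word at one position and sort them, largest first.

    selections: list of {3-gram: count} dicts (for example the "He" and
    "She" sets). position: 1 for the second word, 2 for the third.
    """
    totals = Counter()
    for selected in selections:
        for gram, count in selected.items():
            totals[gram[position]] += count
    return totals.most_common()


def word_size(count, scale=1.0):
    """Size a word in proportion to the square root of its use frequency."""
    return scale * math.sqrt(count)


# Made-up counts for illustration; real data would have millions of rows.
TOY_TRIGRAMS = {
    ("He", "was", "married"): 900,
    ("He", "argues", "that"): 400,
    ("He", "could", "never"): 100,
    ("She", "was", "married"): 700,
    ("She", "could", "never"): 300,
    ("She", "had", "been"): 200,
    ("They", "were", "here"): 5000,
}

if __name__ == "__main__":
    he = top_trigrams(TOY_TRIGRAMS, "He", k=2)    # k=120 in the article
    she = top_trigrams(TOY_TRIGRAMS, "She", k=2)
    for word, count in rank_position([he, she], position=1):
        print(word, count, word_size(count))
```

Square-root sizing compresses the range: a word used 16 times as often is drawn only 4 times as large, which keeps the most common words from crowding out everything else.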

The result is a striking visual comparison of how the two subject pronouns "He" and "She" are used in the English language. According to Harrison, the commonalities are as interesting as the differences. Among the top 120 3-grams, "He" and "She" have many second words in common but diverge on some intriguing ones. For example, only "He" connects to "argues," while only "She" connects to "love."

Harrison has also rendered 2-grams (or "bi-grams") in these graphics he calls "Word Associations" and "Word Spectrum Visualizations."

Will culturomics emerge as a new discipline for social science in the information era? Simply generating data, of course, doesn't by itself lead to new insights or new knowledge. The Large Hadron Collider and the Human Genome Project have taught us that much already. And yet when the speed and scope of doing research so vastly expands, it's hard to imagine that the questions we ask won't also fundamentally change.

Visualizations, by lending visible organization to data, information, knowledge, and wisdom, will help us to understand whether we're merely text-mining or learning something new.

