I Know What You Did Last Spring

Sunday, 06 August 2006 17:34 Written by xɐɥɹʌ

On Sunday, August 6, 2006, news broke in the blogosphere that AOL had released customer search data.^[1] The (since removed) press release^[2] from their research page^[3] notes in part:

“This collection consists of ~20M web queries collected from ~650k users over three months. The data is sorted by anonymous user ID and sequentially arranged.”

The press release also provided a generic statistical breakdown of the data, along with a request that anyone using it cite the researchers who presented their findings at a conference in Japan.

Basic Collection Statistics

Dates:

01 March, 2006 – 31 May, 2006

Normalized queries:

36,389,567 lines of data
21,011,340 instances of new queries (w/ or w/o click-through)
7,887,022 requests for “next page” of results
19,442,629 user click-through events
16,946,938 queries w/o user click-through
10,154,742 unique (normalized) queries
657,426 unique user IDs

Please reference the following publication when using this collection:

A Picture of Search
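As a quick sanity check on the breakdown quoted above, the figures can be run through a few lines of Python. This is only a sketch: the variable names are my own, and the reading that every line of data is either a click-through event or a query without one is an observation from the arithmetic, not something the release states.

```python
# Figures copied from the statistical breakdown quoted above.
lines_of_data = 36_389_567
new_queries = 21_011_340
click_through_events = 19_442_629
queries_without_click = 16_946_938
unique_user_ids = 657_426

# Observation (an assumption about how the release counted, not a stated fact):
# click-through events plus queries without a click-through account for
# every line of data.
assert click_through_events + queries_without_click == lines_of_data

# Simple averages over the three-month window.
new_queries_per_user = new_queries / unique_user_ids
lines_per_user = lines_of_data / unique_user_ids

print(f"New queries per user ID: {new_queries_per_user:.1f}")
print(f"Lines of data per user ID: {lines_per_user:.1f}")
```

These are averages only; they say nothing about how unevenly the searches were spread across individual user IDs.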

Leave it to AOL to violate consumer privacy. Again.^[4] The MSM caught wind of the rather spectacular snafu^[5] on Monday, August 7, 2006, and AOL yanked the data^[6] while making a public apology the following day. By Thursday, August 10, 2006, the fubar was being characterized as “an accident.”^[7]

Ironically, in the very same PC World article, Google CEO Eric Schmidt bragged, “The release of a database of online search histories that has gotten AOL into so much hot water could never happen at Google. We have very sophisticated security plans for an attack on information.” It seems to have escaped Mr. Schmidt that the privacy breach had not a thing to do with “an attack on information.”^[8]

In the meantime, execs were busy scrambling, backpedalling or both, while attempting to assure the public that the data had been deidentified. And just how deidentified was the data? On Wednesday, August 9, 2006, New York Times journalists Michael Barbaro and Tom Zeller Jr. introduced us to one of the many so-called deidentified individuals in their article,^[9] “A Face Is Exposed for AOL Searcher No. 4417749.” One gentleman, Elliot Bäck, found everything from credit cards to Social Security numbers^[10] among the so-called deidentified records.

Though a few were calling for a boycott^[11] and heads were rolling at AOL Time Warner,^[12] some took advantage of the situation, being what it was, or, as Ellen Nakashima aptly characterized it, “turned the gold mine of actuarial search data into a veritable cottage industry”^[13] of online data mining and analysis. Still others began providing web site and BitTorrent mirrors so people could download their own private copies.

In the meantime, the data is out there and, as with the proverbial cat, there is no way of returning it to its metaphorical bag. Even so, the implications are evident: its presence brings yet another dimension to the information superhighway landscape. After all, those who truly understand how to unleash the power of search will indeed have a means to know what you did last spring. That is, if you were among the 650K individuals who used AOL's search features during that period.
