Home - posts tagged as Datamining
If you buy a baseball bat in the UK, should the police be alerted?
If you buy a baseball bat in the United States, you're also likely to be in the market for balls and, perhaps, a glove. If you are buying a bat in the UK, one of the items most commonly bought with it is a balaclava. Amazon even offers them as a package (though without a discount).
This data comes from Amazon.co.uk, the company's British Web site. The conclusion doesn't come from hooligans or prejudiced merchants. It is just the data talking: People who buy bats in the UK are also likely to buy masks. They also buy garden choppers, which in other cultures are known as machetes.
Based on this evidence, I would bet that the average customer for a baseball bat in the UK wouldn't know much about drag bunting, much less the infield fly rule.
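For readers curious how a "bought together" pattern can fall out of raw purchase records, here is a minimal sketch. It simply counts which items share a basket with a given item. The baskets are invented for illustration, the function name `bought_with` is hypothetical, and this is not a description of how Amazon actually computes its recommendations.

```python
from collections import Counter

# Hypothetical baskets, invented for illustration; this is not Amazon data.
baskets = [
    {"baseball bat", "balaclava"},
    {"baseball bat", "balaclava", "garden chopper"},
    {"baseball bat", "baseball"},
    {"balaclava", "ski goggles"},
]

def bought_with(item, baskets):
    """Count how often each other item shares a basket with `item`."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            # Tally every companion item, leaving out the item itself.
            counts.update(basket - {item})
    return counts

companions = bought_with("baseball bat", baskets)
print(companions.most_common())
```

A count like this shows only that the items travel together. It says nothing about why, which is the gap the correlation questions in this post turn on.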
This raises important questions about correlation. Would police be justified in creating a registry of baseball bat buyers in the UK? If not, how many UK citizens would favor it? Think about it. Ninety-nine percent of us, with many nines after the decimal, would never dream of bombing an airplane, especially one we're riding in. And yet we're all treated as potential terrorists at the airport. And even to utter a joke at airport security is considered a pretty serious offense. And here the Brits could conceivably come up with a list of people who are likely to buy baseball bats, machetes and masks. Amazon has their names. Are they asking for them? Should they?
You could argue, of course, that the situation is more serious in the United States, where millions of people buy killing weapons and have a Constitutional right to do so. But the sample in the US is so large, and it includes a vast majority of people who buy them to defend themselves, to hunt, to go to target ranges. Millions of Americans believe that their guns defend them from violent people (which is one reason it's so hard to pass gun-control legislation). But is the same true of those who buy bats, masks and machetes in the UK?
As a special bonus, I've hunted down the lyrics to Balaclava, by the Arctic Monkeys. (YouTube) They don't mention a baseball bat, but you can picture it as part of the violent mix:
Running off over next doors garden
Before the hour is done
It's more a question of feeling
Than it is a question of fun
The confidence is the balaclava
I'm sure you'll baffle 'em good
With the ending wreak of salty cheeks
And runny makeup alone
Oh, will blood run down the face
Of a boy bewildered and scorned
And you'll find yourself in a skirmish
Where you wish you'd never been born
You tie yourself to the tracks
And there isn't no going back
And it's wrong, wrong, wrong
But we'll do it anyway 'cause we love a bit of trouble
Are you pulling her from a burning building
Or throwing her to the sharks?
Can only hope that the ending is a pleasurable as the start
The confidence is the balaclava
I'm sure you baffle 'em straight
And it's wrong, wrong, wrong
She can hardly wait
That's right, he won't let her out his sight
Now the shaggers perform
And the daggers are drawn
Who's the crooks in this crime?
Datamining discussion with Brown Political Review
I had a Skype chat on Friday with Ben Wofford, co-editor in chief of the Brown Political Review. This one focused more on government datamining. (The Al Jazeera one earlier in the week looked at tech companies.) If I seem a bit familiar with Ben, he's a close friend of the family.
Al Jazeera: Google as Big Brother
Yesterday I learned about Dataium.
It's a company that aggregates Web surfing data and provides it to auto dealers so that they can predict the preferences of customers, and perhaps even set the appropriate price for each one. I heard about this from Ashkan Soltani, an independent data researcher who was discussing data with me on an Al Jazeera show, The Stream.
Maybe I'm jaded, but I'm not too bothered that an auto dealer would seek out information about customers and tailor pitches and prices for them. Dealers have always discriminated. That's what they do. They focus on a customer's clothes and jewelry. They note which people are impressed by sound systems or leather upholstery, and they figure that they'll pay more for such things. Good car dealers read humans. They observe and they deal. And now they use data. Doesn't surprise me too much.
That was the pattern during our half-hour show. Ashkan and Birgitta Jonsdottir, a Pirate Party MP from Iceland, objected to the datamining that big companies like Google and Facebook carry out. I was less bothered.
Birgitta, for example, complains that Google customizes results for what a specific user is most likely to be interested in. A cook who looks for dressing, for example, might see salad dressings among the top results, while a deer hunter might see a link about skinning a carcass. That doesn't bother me. But Birgitta and Ashkan fear that the big data companies are putting blinders on us, giving us little perspective outside of our own spheres.
I agree that we find ourselves in self-reinforcing information ghettos. But most of the responsibility is ours. We choose our Facebook friends, and they tend to link to a bunch of stories we agree with. Many of us avoid blogs and channels that challenge our views. If we wanted to broaden our horizons, we could change. It's not Google's fault.
Ashkan says that instead of tracking our behavior, companies might simply ask us for our preferences. But I'd actually rather have Amazon figure me out than answer a survey. There might be people with similar tastes who enjoy a book in a category I wouldn't know in advance. Amazon, for example, lined me up with John Vaillant's The Tiger, one of the best books I've read in recent years. But I wouldn't have clicked Tigers or Siberia in a list of preferences.
My other point is that information workers, including journalists, need a vibrant advertising industry to support them (us). The datamining advertisers blunder in many ways. They hide what they're doing, they cross lines, they raise suspicions and fears. But I want them to figure out how to do it right, and to succeed.
Incidentally, a tip from Ashkan. For privacy and security online, he suggests eliminating Flash from your computer.
My answers about government datamining
An Italian journalist sent me questions the other day about the government datamining we've been learning about of late. I answered her questions. Since maybe only one of these sentences will turn up in her article, and in Italian, I figured I might as well blog our exchange:
Q: Do you believe that data mining is necessary to keep the US safe? What occurred in Boston was just the last of a series of attacks and I have read that American public opinion is divided right now between the ones who favour safety and those who defend privacy.
A: Some degree of data mining is inevitable for a modern state to protect itself, not only from terrorists, but also from crime, traffic and industrial accidents and catastrophic weather. The question is not whether we sacrifice our privacy for safety, but instead how much the government can see, what the limits are and how they are enforced. Right now, it seems as though the government reserves the right to define all those limits for itself. It asks us to trust its judgment. I think the limits will have to be spelled out, and the citizens will demand and deserve some sort of oversight over these operations, perhaps by a congressional committee (even though confidence in Congress is at all-time lows).
The other important point is what the data can be used for, and what conclusions can be drawn from it. Imagine, for example, that in their hunt for terrorists dataminers find possible evidence of tax fraud, or perhaps a ring of pedophiles. Can we expect them to turn a blind eye to it? I don't think so. In that case, what begins as an invasion of privacy to protect the nation turns into a surveillance state.
Q: The Verizon and Prism scandals have definitely brought to light the fact that American citizens’ privacy cannot be taken for granted and is virtually non-existent. Do you think what is happening will help change the situation? I mean, will this monitoring of people’s private communication diminish or finish after the scandal or will it go on as usual?
A: In my book, The Numerati, I argue that data mining is pervasive, in government and industry, and will only grow. Privacy advocates are sure to put up a fight, as will a number of government regulators. But the trends favor datamining. Consider what most consumers in the world are interested in. Most of them want convenience and economic savings, a cleaner environment, less waste, and more safety. Data mining promises results in all of these areas.
I should add that the data economy is full of hype, and that many of the promises turn out to be exaggerations, or false. In my book, I argue that the most problematic area is in data mining for terrorism. Companies like Amazon and Google, after all, can study the behavior of billions of shoppers, while anti-terrorism data miners have very little behavioral data about terrorists.
Q: International web users are also involved and I myself may be under surveillance after sending these emails to you…. Some people willingly publish personal data on social networks, others do not realize the dangers. What is your advice to internet users? Will people become warier when using the web after this datagate?
A: I think people will grow increasingly sophisticated about their data, and how to protect the secrets that matter to them. That said, it is remarkable how careless people are. A decade ago, hotels in the United States were among the biggest purveyors of pornography. Guests paid $10 or $20 to watch pornographic channels in their rooms. Who knows how many of them stopped to wonder, or to care, whether they were sharing their choices with the management of the hotel. That business has declined sharply, because travelers now bring laptops to their rooms, and look at Web sites. So now, their Web wanderings are available not only to the hotel, which runs the Wi-Fi network, but also to a host of Web sites and their partners. These people may say in surveys that they care about privacy. And perhaps they do. But their appetites and desires lead them to share intimate details about their cravings with a broad range of companies and yes, the government.
My point is that while people claim to care about privacy, they often are not willing to forgo convenience, pleasure, economic savings or the promise of security for it.
Paradox: Are Big Data successes largely anecdotal?
Those in the world of the Numerati, or Big Data, tend to pooh-pooh analysis based on anecdotes. And why not? They can easily be statistical outliers, and often they are. The trouble is that human beings relate to stories. They're so accessible, and perfectly suited to a sales pitch. So it shouldn't be any surprise that much of the hype around Big Data, whether in marketing or medicine, is based on stories.
Paul Barsch, a marketer at Teradata, makes this point in a blog post. He writes:
The truth is that some companies are having wild success reporting, analyzing, and predicting on terabytes and in some cases petabytes of Big Data. But for every eBay, Google, or Amazon or Razorfish there are thousands of companies stumbling, bumbling and fumbling through the process of Big Data analytics with little to show for it.
Sadly, success in Big Data doesn't lend itself, at least at this juncture, to statistical analysis. Customers and vendors keep their failures to themselves. And we usually only hear about them (as anecdotes) after someone gets fired. This leaves it to the successes (and the liars) to trumpet their greatness. The Big Data narrative is built on anecdotes.
Facebook's use of "dirty data"
Here's an interesting experiment (that I don't have time to do). Go into your Facebook page and add up all of your "likes," and then imagine the portrait of yourself that they create. I'm guessing the result won't look much like you. This is because "likes" are about as meaningful as the word "like" in modern American English. I'm, like, not sure they mean much.
An excellent post by Steve Cheney (hat tip: danah boyd) delves into the issues raised by this so-called dirty data. The nut of his argument:
In computer architecture they call an out of date piece of data “dirty”. Accessing dirty data is bad, wasting time and causing more harm than good. And in this context, much of the structured data that makes up Graph Search is just that: totally irrelevant and dirty.
It turns out as much as half of the links between objects and interests contained in FB are dirty—i.e. there is no true affinity between the like and the object or it’s stale. Never mind does the data not really represent user intent... but the user did not even ‘like’ what she was liking.
He continues the post by explaining how this fits--or doesn't--into Facebook's advertising strategy. This documents the point I was making in my Times piece earlier this month: that advertisers struggle to figure out what to count in social media.
When you cut and paste, 33across is watching
I'm browsing my Facebook news feed on a Saturday morning, and I see that one of my old BusinessWeek colleagues has linked to an article claiming that journalism is among the top 10 occupations for attracting psychopaths. Now you might think that I'd immediately "like" this fascinating article, and share it with all of my friends. But perhaps I have mixed feelings about it and only want to show it to a few. So I cut and paste the URL and plunk it into an email and send it.
This behavior is rampant. An advertising data company, 33Across, has found that 82% of sharing online is cut and pasted. Most people bypass the share buttons on publishers' Web sites. They often take too much time. Sometimes they require you to actually type in an email address (which for me is a dealbreaker). And they often don't let you add your own thoughts or annotations.
Now if you think about online marketers, as 33Across does, they're trying every day to figure out what people are interested in and, more specifically, which advertisements they'd be most likely to click. With social networks, they're exploring treasure troves of human behavior and relationships. And yet if they rely on data from share buttons, they miss out on most of it.
So they dig further. They look at the cutting and pasting, and the sections of stories we highlight, and they learn more about us. This type of analysis, according to 33Across CEO Eric Wheeler, helps clients such as eBay and Macy's find the people statistically most likely to be the "next customers." Cutting and pasting is rampant on cooking sites. People love to share recipes. And the recipes they share say something about them. What's more, it helps publishers, e-commerce sites and social networks figure out which features and articles are the most engaging--and how to make more money from their traffic.
It's also very likely that social networks generate more traffic, and influence, than the standard count of "likes" and "retweets" would indicate. On some sites, according to Greg Levitt, general manager for publishing solutions at 33Across, as much as 30% of the visits come from Facebook and Twitter.
IBM: Americans' "Desire Ratio" on the rise
Dataminers at IBM have been burrowing through social media postings to see how Americans are feeling. Heading into Memorial Day weekend, it's sounding like the mood is improving. References to gas prices, which have been dropping of late, register positive by a five to one ratio, compared to an even balance a year ago. IBM has a new measurement--the "Desire Index"--which reflects the ratio of positive to negative comments about shopping and Memorial Day travel. These desires have rocketed up to 6.5/1 this year, from a downbeat 1.3/1 a year ago.
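As a rough illustration of the arithmetic behind a figure like 6.5/1, here is a minimal sketch of a positive-to-negative ratio. The function name `desire_ratio` and the comment counts are hypothetical, chosen only to reproduce the ratios quoted above. IBM's actual method for classifying and counting comments is not described in the announcement.

```python
def desire_ratio(positive, negative):
    """Return the number of positive comments per negative comment."""
    if negative == 0:
        # With no negative comments the ratio is undefined.
        raise ValueError("no negative comments; ratio is undefined")
    return positive / negative

# Hypothetical counts, chosen to reproduce the ratios quoted in the post.
this_year = desire_ratio(650, 100)
last_year = desire_ratio(130, 100)

print(f"this year: {this_year:.1f}/1, last year: {last_year:.1f}/1")
```

The hard part, of course, is deciding which comments count as positive in the first place, and that is where the sarcasm problem comes in.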
Now, as those of you who read The Numerati know, I've spent time talking to people who pull sentiments out of data. (The company I looked at then, Umbria Communications, was later bought by a unit of McGraw-Hill, the company I happened to be working for at the time.) Of course it's not an exact science. The machines miss sarcasm, like "Oh yeah, I'm thrilled to be driving this Memorial Day weekend with four screaming babies and my suicidal father in law...." Nonetheless, I'm sure Twitter and blogs provide a large enough sample to power past such small misunderstandings. The IBM team no doubt gets the big trends right.
Still, wondering what sort of happy talk the computer is interpreting on gas prices, I call up Twitter and search for "gas :)"
Well, it turns out, some are pretty clear.
Inside the NSA matrix: What can they learn?
Anyone concerned about privacy might read James Bamford's cover story in Wired: Inside the Matrix. It puts together the elements of the coming total information awareness state--but with one big missing piece.
The pieces he describes include 1) an enormous new National Security Agency datacenter in Utah that will store virtually all of our communications and digital musings, and 2) ultra-powerful supercomputers designed to crack the cryptography protecting much of that information. So eventually, the thinking goes, the NSA will be able to read our communications, study our pictures, and trace our patterns backwards in time. Theoretically, they'll be able to profile each one of us: What we say, where we go, who we hang around (and even sleep) with.
As one of his chief former NSA sources tells him, holding his thumb and forefinger close together, "We are, like, that far from a turnkey totalitarian state."
Now, the missing piece. Nowhere in the article do we learn how the NSA will actually make sense of all of this data, especially for counter-terrorism. What are the patterns of potential terrorists? This is something I grappled with in The Numerati.
I went to Jeff Jonas for insights. His point was that this type of information is valuable once you are focused on a suspect, but that you cannot hope to find potential suspects by datamining exabytes of phone calls and e-mails. He called that "boiling the ocean."
The trouble is that while the government may have detailed information about hundreds of millions of taxpayers and an equal number of drivers, they have the historical records on only a relative handful of terrorists. What is it about the patterns of those terrorists that they might be able to find in the collected communications of the entire planet earth? That is the unanswered question in the Wired story. And I would suspect that that's because there is no answer.
So then, assuming that the NSA can crack these troublesome codes, what can they do with all of this data? Well, they can trace the people they actually suspect of crimes. But theoretically, they could do that without including the rest of us. Of course, they could (and probably will) continue to hunt for the patterns of potential terrorists, though the prospects don't look too promising.
It might be more fruitful to focus datamining on well established patterns, and on crimes that more of us commit, like tax evasion. If that happens, the small coterie of privacy advocates worried about the turnkey totalitarian state will quickly gain millions of angry supporters.
One of the Numerati goes to work for Obama
I remember talking to Rayid Ghani about "barnacles." Those are the shoppers who go to extraordinary lengths to buy only things on sale. Ghani, a researcher at Accenture Labs, told me that stores could identify likely barnacles in their data, and then perhaps take measures to "fire" them, i.e., send them shopping elsewhere. Barnacles, after all, cost money.
At the time, Ghani and his team were analyzing loads of supermarket data, and trying to figure out how to lead shoppers toward the items they'd most likely buy (and others the supermarket wanted to get rid of). This research would go into the "Shopper" chapter in The Numerati, in which Ghani was the lead character.
Now, as the NYTimes reports, Ghani is chief scientist for the Obama campaign. The mission is to unearth different tribes of voters from the analysis of data, and then figure out the best way to "optimize" them, as donors, organizers, or just plain voters. If Ghani had made this switch earlier, I could have featured him in the Voter chapter. But that's one of the hallmarks of the Numerati. Their skills enable them to switch from one field to the next.