Home - posts tagged as Datamining
Paradox: Are Big Data successes largely anecdotal?
Those in the world of the Numerati, or Big Data, tend to pooh-pooh analysis based on anecdotes. And why not? They can easily be statistical outliers, and often they are. The trouble is that human beings relate to stories. They're so accessible, and perfectly suited to a sales pitch. So it shouldn't be any surprise that much of the hype around Big Data, whether in marketing or medicine, is based on stories.
Paul Barsch, a marketer at Teradata, makes this point in a blog post. He writes:
The truth is that some companies are having wild success reporting, analyzing, and predicting on terabytes and in some cases petabytes of Big Data. But for every eBay, Google, or Amazon or Razorfish there are thousands of companies stumbling, bumbling and fumbling through the process of Big Data analytics with little to show for it.
Sadly, success in Big Data doesn't lend itself, at least at this juncture, to statistical analysis. Customers and vendors keep their failures to themselves. And we usually only hear about them (as anecdotes) after someone gets fired. This leaves it to the successes (and the liars) to trumpet their greatness. The Big Data narrative is built on anecdotes.
Facebook's use of "dirty data"
Here's an interesting experiment (that I don't have time to do). Go into your Facebook page and add up all of your "likes," and then imagine the portrait of yourself that they create. I'm guessing the result won't look much like you. This is because "likes" are about as meaningful as the word "like" in modern American English. I'm, like, not sure they mean much.
An excellent post by Steve Cheney (hat tip: danah boyd) delves into the issues raised by this so-called dirty data. The nut of his argument:
In computer architecture they call an out of date piece of data “dirty”. Accessing dirty data is bad, wasting time and causing more harm than good. And in this context, much of the structured data that makes up Graph Search is just that: totally irrelevant and dirty.
It turns out as much as half of the links between objects and interests contained in FB are dirty—i.e. there is no true affinity between the like and the object or it’s stale. Never mind does the data not really represent user intent... but the user did not even ‘like’ what she was liking.
He continues the post by explaining how this fits--or doesn't--into Facebook's advertising strategy. This documents the point I was making in my Times piece earlier this month: that advertisers struggle to figure out what to count in social media.
When you cut and paste, 33across is watching
I'm browsing my Facebook news feed on a Saturday morning, and I see that one of my old BusinessWeek colleagues has linked to an article claiming that journalism is among the top 10 occupations for attracting psychopaths. Now you might think that I'd immediately "like" this fascinating article, and share it with all of my friends. But perhaps I have mixed feelings about it and only want to show it to a few. So I cut and paste the URL and plunk it into an email and send it.
This behavior is rampant. An advertising data company, 33Across, has found that 82% of sharing online is cut and pasted. Most people bypass the share buttons on publishers' Web sites. They often take too much time. Sometimes they require you to actually type in an email address (which for me is a dealbreaker). And they often don't let you add your own thoughts or annotations.
Now if you think about online marketers, as 33Across does, they're trying every day to figure out what people are interested in and, more specifically, which advertisements they'd be most likely to click. With social networks, they're exploring treasure troves of human behavior and relationships. And yet if they rely on data from share buttons, they miss out on most of it.
So they dig further. They look at the cutting and pasting, and the sections of stories we highlight, and they learn more about us. This type of analysis, according to 33Across CEO Eric Wheeler, helps clients such as eBay and Macy's find the people statistically most likely to be the "next customers." Cutting and pasting is rampant on cooking sites. People love to share recipes. And the recipes they share say something about them. What's more, it helps publishers, e-commerce sites and social networks figure out which features and articles are the most engaging--and how to make more money from their traffic.
It's also very likely that social networks generate more traffic, and influence, than the standard count of "likes" and "retweets" would indicate. On some sites, according to Greg Levitt, general manager for publishing solutions at 33Across, as much as 30% of the visits come from Facebook and Twitter.
IBM: Americans' "Desire Ratio" on the rise
Dataminers at IBM have been burrowing through social media postings to see how Americans are feeling. Heading into Memorial Day weekend, it's sounding like the mood is improving. References to gas prices, which have been dropping of late, register positive by a five to one ratio, compared to an even balance a year ago. IBM has a new measurement--the "Desire Index"--which reflects the ratio of positive to negative comments about shopping and Memorial Day travel. These desires have rocketed up to 6.5/1 this year, from a downbeat 1.3/1 a year ago.
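To make the arithmetic behind a ratio like this concrete, here is a minimal Python sketch. It assumes every post has already been labeled positive or negative by some sentiment classifier. The function name `desire_ratio` and the sample counts are hypothetical, chosen only to reproduce a 6.5/1 figure; this is not IBM's method or data.

```python
from collections import Counter

def desire_ratio(labels):
    """Return the positive-to-negative ratio for an iterable of 'pos'/'neg' labels."""
    counts = Counter(labels)
    if counts["neg"] == 0:
        raise ValueError("no negative comments; the ratio is undefined")
    return counts["pos"] / counts["neg"]

# Hypothetical sample: 65 positive comments and 10 negative ones.
sample = ["pos"] * 65 + ["neg"] * 10
print(f"{desire_ratio(sample):.1f}/1")
```

The hard part, of course, is the labeling step that this sketch takes for granted.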
Now, as those of you who read The Numerati know, I've spent time talking to people who pull sentiments out of data. (The company I looked at then, Umbria Communications, was later bought by a unit of McGraw-Hill, the company I happened to be working for at the time.) Of course it's not an exact science. The machines miss sarcasm, like "Oh yeah, I'm thrilled to be driving this Memorial Day weekend with four screaming babies and my suicidal father in law...." Nonetheless, I'm sure Twitter and blogs provide a large enough sample to power past such small misunderstandings. The IBM team no doubt gets the big trends right.
Still, wondering what sort of happy talk the computer is interpreting on gas prices, I call up Twitter and search for "gas :)"
Well, it turns out, some are pretty clear.
Inside the NSA matrix: What can they learn?
Anyone concerned about privacy might read James Bamford's cover story in Wired: Inside the Matrix. It puts together the elements of the coming total information awareness state--but with one big missing piece.
The pieces he describes include 1) an enormous new National Security Agency
datacenter in Utah that will store virtually all of our communications
and digital musings, and 2) ultra-powerful supercomputers designed to
crack the cryptography protecting much of that information. So
eventually, the thinking goes, the NSA will be able to read our
communications, study our pictures, and trace our patterns backwards in
time. Theoretically, they'll be able to profile each one of us: What we
say, where we go, who we hang around (and even sleep) with.
As one of his chief former NSA sources tells him, holding his thumb and
forefinger close together, "We are, like, that far from a turnkey
totalitarian state."
Now, the missing piece. Nowhere in the
article do we learn how the NSA will actually make sense of all of this
data, especially for counter-terrorism. What are the patterns of
potential terrorists? This is something I grappled with in The Numerati.
I went to Jeff Jonas for insights. His point was that this type of information is valuable once you are focused on a suspect, but that you cannot hope to find potential suspects by datamining exabytes of phone calls and e-mails. He called that "boiling the ocean."
The trouble is that while the government may have detailed information about hundreds of millions of taxpayers and an equal number of drivers, they have the historical records on only a relative handful of terrorists. What is it about the patterns of those terrorists that they might be able to find in the collected communications of the entire planet earth? That is the unanswered question in the Wired story. And I would suspect that that's because there is no answer.
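The base-rate problem behind this argument can be sketched with Bayes' rule. The numbers below are purely hypothetical (one target per million people, a detector that catches 99% of targets and wrongly flags 0.1% of everyone else), and the function name `flagged_precision` is my own. Nothing here describes an actual NSA system.

```python
def flagged_precision(base_rate, sensitivity, false_positive_rate):
    """Bayes' rule: the share of flagged people who are true targets."""
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# Hypothetical inputs, chosen only for illustration.
p = flagged_precision(base_rate=1e-6, sensitivity=0.99, false_positive_rate=0.001)
print(f"{p:.4%} of flagged people would be real targets")
```

Under these assumptions, roughly one flagged person in a thousand is a real target. The rest are false alarms, which is one way to read "boiling the ocean."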
So then, assuming that the NSA can crack these troublesome codes, what can they do with all of this data? Well, they can trace the people they actually suspect of crimes. But theoretically, they could do that without including the rest of us. Of course, they could (and probably will) continue to hunt for the patterns of potential terrorists, though the prospects don't look too promising.
It might be more fruitful to focus datamining on well established patterns, and on crimes that more of us commit, like tax evasion. If that happens, the small coterie of privacy advocates worried about the turnkey totalitarian state will quickly gain millions of angry supporters.
One of the Numerati goes to work for Obama
I remember talking to Rayid Ghani about "barnacles." Those are
the shoppers who go to extraordinary lengths to buy only
things on sale. Ghani, a researcher at Accenture Labs, told me that
stores could identify likely barnacles in their data, and then perhaps
take measures to "fire" them, i.e., send them shopping elsewhere.
Barnacles, after all, cost money.
At the time, Ghani and his team
were analyzing loads of supermarket data, and trying to figure out how
to lead shoppers toward the items they'd most likely buy (and others the
supermarket wanted to get rid of). This research would go into the "Shopper"
chapter in The Numerati, in which Ghani was the lead character.
Now, as the NYTimes reports, Ghani is chief scientist
for the Obama campaign. The mission is to unearth different tribes of
voters from the analysis of data, and then figure out the best way to
"optimize" them, as donors, organizers, or just plain voters. If Ghani
had made this switch earlier, I could have featured him in the Voter
chapter. But that's one of the hallmarks of the Numerati. Their skills
enable them to switch from one field to the next.
Big Data and math
Steve Lohr has a good round-up of Big Data trends in the Times. It has very similar themes to The Numerati and to Ian Ayres' Super Crunchers.
In fact, reading the story reminded me of my BusinessWeek story, Math Will Rock Your World, which led to The Numerati. It argues the same points, but instead of focusing on the subject of the investigation--data--it looks at the tools employed, mathematics and computers (without, you might note, shedding any light on how the mathematicians and computer scientists carry out this work).
As I've mentioned before, Steve Adler, the editor in chief, started the process by telling me to write a cover story on math. So I started interviewing mathematicians. I was learning all sorts of interesting things about encryption and operations research, but I didn't really see the BusinessWeek cover story until I delved into the world of data. In the end, I wrote a story about Big Data--but kept math in the headline.
A few paragraphs from that story:
The world is moving
into a new age of numbers. Partnerships between mathematicians and
computer scientists are bulling into whole new domains of business and
imposing the efficiencies of math. This has happened before. In past
decades, the marriage of higher math and computer modeling transformed
science and engineering. Quants turned finance upside down a generation
ago. And data miners plucked useful nuggets from vast consumer and
business databases. But just look at where the mathematicians are now.
They're helping to map out advertising campaigns, they're changing the
nature of research in newsrooms and in biology labs, and they're
enabling marketers to forge new one-on-one relationships with customers.
As this occurs, more of the economy falls into the realm of numbers.
Says James R. Schatz, chief of the mathematics research group at the
National Security Agency: "There has never been a better time to be a
mathematician."
From fledglings like Inform to tech powerhouses such as IBM (IBM), companies are hitching mathematics to business in ways that would
have seemed fanciful even a few years ago. In the past decade, a sizable
chunk of humanity has moved its work, play, chat, and shopping online.
We feed networks gobs of digital data that once would have languished on
scraps of paper -- or vanished as forgotten conversations. These slices
of our lives now sit in databases, many of them in the public domain.
From a business point of view, they're just begging to be analyzed. But
even with the most powerful computers and abundant, cheap storage,
companies can't sort out their swelling oceans of data, much less build
businesses on them, without enlisting skilled mathematicians and
computer scientists.
The rise of mathematics is heating up the job market for luminary
quants, especially at the Internet powerhouses where new math grads land
with six-figure salaries and rich stock deals. Tom Leighton, an
entrepreneur and applied math professor at Massachusetts Institute of
Technology, says: "All of my students have standing offers at Yahoo! and Google." Top mathematicians are becoming a new global elite. It's a force of
barely 5,000, by some guesstimates, but every bit as powerful as the
armies of Harvard University MBAs who shook up corner suites a
generation ago.
Baseball playoffs begin: Moneyball season over
Phillies' pitcher Roy Halladay: Statistics indicate that he could win... or lose
It's time to forget Moneyball and statistical analysis. The
162-game baseball season, the six-month marathon in which statistics
have the time to work their magic, is over. As play-offs begin, managers
might as well return to their divining rods or the study of patterns on the bottom of their coffee cups. They're
entering a season defined largely by unmeasurables such as confidence,
feel, and most importantly, luck.
The Phillies-Cardinals match-up Saturday pits Roy Halladay, last year's
Cy Young winner, against Kyle Lohse, a mid-rotation starter through his
career. Lohse, through an 11-year career, has won about as many as he has lost. Halladay, in 14 years,
has won two games for every game he's lost. This year he went 19-6.
Still, he lost six games, and could lose another against Lohse. The game
might turn on one pitch and the difference of a quarter inch on Albert
Pujols' bat. That tiny adjustment turns a high fly ball into a
tape-measure homer. Over a long season, Halladay establishes his
superiority. In one game, or even a five-game series, throw the stats out the window.
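A quick calculation shows how much a short series shrinks the better team's edge. This sketch assumes each game is an independent coin flip with a fixed win probability `p`, which is a simplification (it ignores pitching matchups, home field and everything else). The function name and the sample probabilities are hypothetical.

```python
from math import comb

def series_win_prob(p, games=5):
    """Chance that a team with per-game win probability p takes a best-of-`games` series."""
    need = games // 2 + 1
    # Pretend all games are played; winning the series means winning a majority.
    return sum(
        comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(need, games + 1)
    )

for p in (0.55, 0.60, 0.65):
    print(f"per-game {p:.2f} -> best-of-five {series_win_prob(p):.3f}")
```

With a 60 percent chance in each game, the favorite wins a best-of-five only about 68 percent of the time. Nearly one series in three goes to the weaker team.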
And yet, because of the new Moneyball movie, we're sure
to hear every day about the wonders of baseball's quants, led by the
prime number-cruncher, Bill James. In fact, it already surfaced in the recent pennant races. In one article,
summing up the chances of the fading Boston Red Sox, a Harvard data
cruncher named Andrew Mooney surveyed the last ten days of the season
and counseled the Sox not to worry:
You’re in a funk, you say. You’ve lost nine of your last eleven,
while the Rays have won eight of 10. Actually, this could be just as
much a source of comfort as a cause for alarm. Simple probabilities
indicate that neither team is likely to continue at such a rate for the
remainder of the season; that’s just the nature of streaks. Need
evidence? You started the season 2-10.
You’ve got a 10-game homestand coming (winning percentage at Fenway:
.592), while the Rays will be away for their next 11 (winning percentage
on the road: .557).
My question: What do "simple
probabilities" count for when one team is consumed by dread and its
rival buoyed by rising confidence? How do you measure the impact of such
things? And what are such measurements worth in a sample of only 10
games? Nothing, I'd say.
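For a sense of scale, here is a rough sketch of how noisy a winning percentage is over 10 games compared with 162. It uses the textbook standard error of a proportion, sqrt(p(1-p)/n), and treats games as independent with a true win probability of .500. Both are assumptions made for illustration, not claims about any real team.

```python
from math import sqrt

def win_pct_std_error(p, n):
    """Standard error of an observed winning percentage over n independent games."""
    return sqrt(p * (1 - p) / n)

for n in (10, 162):
    print(f"{n:>3} games: +/- {win_pct_std_error(0.5, n):.3f}")
```

Over 10 games the uncertainty is about 16 percentage points. Over a full season it drops to about 4.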
By instituting two series of play-offs,
Major League Baseball essentially created a second season. While the
first is a marathon, in which statistics rule, the second is an
eight-team sprint. The best team can win this second season. But for
this to happen, it must also be hot and lucky.
Just one more
point about the movie Moneyball. The analysis in the movie boiled
baseball down to its essence: The team that scores more runs than its
opponents over 162 games will likely wind up on top. But for some
reason, the film glided
over the biggest factor in this equation: starting pitching. It didn't
mention even once that Billy Beane's Oakland A's had the best trio of
starting pitchers in the Major Leagues. Barry Zito won the Cy Young
Award that year. He was absent from the movie. Mark Mulder and Tim
Hudson were magnificent. We got brief glimpses of their uniforms.
So the movie would lead us to believe that the A's won their division that year because they played a converted catcher
at first base, swung a smart deal for a left-handed reliever, got rid
of a distracting Giambi brother, and urged players to take walks. The A's
could have done all of those things, and without their trio of great
pitchers, they would have ended up with a losing record. That's why, as The Sconz says: Moneyball is a lie.
You will be monitored, step by step
I think I should have been more emphatic in The Numerati. From the 2020s on, practically every senior in the industrial world will be monitored by sensors in his or her home. That includes me and, if you're not already in your golden years, it includes you. My "Patient" chapter in the book focused on Intel's efforts to monitor seniors in the Portland area, research that also spread into Ireland. National economies desperately need to save money on health care, and developing technology that can monitor seniors and intervene before they get sick, and before they fall, is simply too sensible to pass up.
Now I get this news about a similar initiative in Missouri. One interesting wrinkle is the use of technology originally developed for gaming, Microsoft's Kinect. And they use the depth perception of the camera to view the subjects as silhouettes, which protects a bit of their privacy. You can assume that if you don't reach your 70s until, say, the 2030s, the technology will be remarkably effective and, at the same time, discreet. Then again, once it works well on older people, why not extend it to everyone else? That's the future I see.
How much data can asthma inhalers provide?
It seems like the perfect combination for asthma research: inhalers
equipped with GPS, so that each use of the medication comes with a time
and place tag. Teradata's Paul Barsch cites an Economist article about Asthmopolis, the new tool to track asthma.
The idea is that people can map their own patterns, and come to understand,
and hopefully avoid, places and conditions that provoke asthma attacks.
And researchers, studying the data from thousands of users, might learn
That's where other variables come
in. The GPS data may show that 20 users in Youngstown, Ohio, suffer
exacerbations between 9 p.m. and 10 p.m. on a Tuesday night. But how
many of them are just traveling through Youngstown in their car? Should
they count? And how many of them are spending the evening in close
quarters with their cats or dogs? What did people eat? Since each of us is a complex system, the challenge with medical monitoring is to pick up as much detail of the entire life as possible.
The key--at least until English-savvy machines
like Watson are on the case--is to get valuable diary data into formats
machines can process. Then systems like Asthmopolis could really make a
difference, both for individuals and society at large. The other key, as
Barsch notes, is to do this in a way that protects people's privacy.
None of the medical monitoring will work without that.