Stephen Baker


Paradox: Are Big Data successes largely anecdotal?
March 8, 2013 - Datamining

Those in the world of the Numerati, or Big Data, tend to pooh-pooh analysis based on anecdotes. And why not? They can easily be statistical outliers, and often they are. The trouble is that human beings relate to stories. They're so accessible, and perfectly suited to a sales pitch. So it shouldn't be any surprise that much of the hype around Big Data, whether in marketing or medicine, is based on stories.

Paul Barsch, a marketer at Teradata, makes this point in a blog post. He writes:

The truth is that some companies are having wild success reporting, analyzing, and predicting on terabytes and in some cases petabytes of Big Data. But for every eBay, Google, or Amazon or Razorfish there are thousands of companies stumbling, bumbling and fumbling through the process of Big Data analytics with little to show for it.

Sadly, success in Big Data doesn't lend itself, at least at this juncture, to statistical analysis. Customers and vendors keep their failures to themselves. And we usually only hear about them (as anecdotes) after someone gets fired. This leaves it to the successes (and the liars) to trumpet their greatness. The Big Data narrative is built on anecdotes.


Facebook's use of "dirty data"
January 22, 2013 - Datamining

Here's an interesting experiment (that I don't have time to do). Go into your Facebook page and add up all of your "likes," and then imagine the portrait of yourself that they create. I'm guessing the result won't look much like you. This is because "likes" are about as meaningful as the word "like" in modern American English. I'm, like, not sure they mean much.

An excellent post by Steve Cheney (hat tip: danah boyd) delves into the issues raised by this so-called dirty data. The nut of his argument:

In computer architecture they call an out of date piece of data “dirty”. Accessing dirty data is bad, wasting time and causing more harm than good. And in this context, much of the structured data that makes up Graph Search is just that: totally irrelevant and dirty.

It turns out as much as half of the links between objects and interests contained in FB are dirty—i.e. there is no true affinity between the like and the object or it’s stale. Never mind does the data not really represent user intent... but the user did not even ‘like’ what she was liking. 

He continues the post by explaining how this fits--or doesn't--into Facebook's advertising strategy. This documents the point I was making in my Times piece earlier this month: that advertisers struggle to figure out what to count in social media. 

 

 

 

 



When you cut and paste, 33across is watching
January 5, 2013 - Datamining

I'm browsing my Facebook news feed on a Saturday morning, and I see that one of my old BusinessWeek colleagues has linked to an article claiming that journalism is among the top 10 occupations for attracting psychopaths. Now you might think that I'd immediately "like" this fascinating article, and share it with all of my friends. But perhaps I have mixed feelings about it and only want to show it to a few. So I cut and paste the URL and plunk it into an email and send it.

This behavior is rampant. An advertising data company, 33Across, has found that 82% of sharing online is cut and pasted. Most people bypass the share buttons on publishers' Web sites. They often take too much time. Sometimes they require you to actually type in an email address (which for me is a dealbreaker). And they often don't let you add your own thoughts or annotations.

Now if you think about online marketers, as 33Across does, they're trying every day to figure out what people are interested in and, more specifically, which advertisements they'd be most likely to click. With social networks, they're exploring treasure troves of human behavior and relationships. And yet if they rely on data from share buttons, they miss out on most of it.

So they dig further. They look at the cutting and pasting, and the sections of stories we highlight, and they learn more about us. This type of analysis, according to 33Across CEO Eric Wheeler, helps clients such as eBay and Macy's find the people statistically most likely to be the "next customers." Cutting and pasting is rampant on cooking sites. People love to share recipes. And the recipes they share say something about them. What's more, it helps publishers, e-commerce sites and social networks figure out which features and articles are the most engaging--and how to make more money from their traffic.

It's also very likely that social networks generate more traffic, and influence, than the standard count of "likes" and "retweets" would indicate. On some sites, according to Greg Levitt, general manager for publishing solutions at 33Across, as much as 30% of the visits come from Facebook and Twitter.


IBM: Americans' "Desire Ratio" on the rise
May 24, 2012 - Datamining

Dataminers at IBM have been burrowing through social media postings to see how Americans are feeling. Heading into Memorial Day weekend, it's sounding like the mood is improving. References to gas prices, which have been dropping of late, register positive by a five-to-one ratio, compared to an even balance a year ago. IBM has a new measurement--the "Desire Index"--which reflects the ratio of positive to negative comments about shopping and Memorial Day travel. These desires have rocketed up to 6.5/1 this year, from a downbeat 1.3/1 a year ago.
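For readers who want to see the arithmetic, a ratio of this kind can be sketched in a few lines of Python. This is only an illustration: IBM's actual method isn't described here, and the function name `desire_ratio` and the comment counts are made up for the example.

```python
def desire_ratio(positive: int, negative: int) -> float:
    """Return the ratio of positive to negative comments.

    Hypothetical helper; IBM's real index may be computed differently.
    """
    if negative == 0:
        raise ValueError("ratio is undefined with zero negative comments")
    return positive / negative


# Made-up counts, chosen only to reproduce the ratios quoted in the post.
this_year = desire_ratio(positive=650, negative=100)
last_year = desire_ratio(positive=130, negative=100)
print(f"this year: {this_year:.1f}/1, last year: {last_year:.1f}/1")
```

The hard part, of course, is not the division but deciding which comments count as positive in the first place.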

Now, as those of you who read The Numerati know, I've spent time talking to people who pull sentiments out of data. (The company I looked at then, Umbria Communications, was later bought by a unit of McGraw-Hill, the company I happened to be working for at the time.) Of course it's not an exact science. The machines miss sarcasm, like "Oh yeah, I'm thrilled to be driving this Memorial Day weekend with four screaming babies and my suicidal father in law...." Nonetheless, I'm sure Twitter and blogs provide a large enough sample to power past such small misunderstandings. The IBM team no doubt gets the big trends right.

Still, wondering what sort of happy talk the computer is interpreting on gas prices, I call up Twitter and search for "gas :)"

Well, it turns out, some are pretty clear.

:) I like the sound of that! RT : Weekend forecast: Hot(ter) temps, cheap(er) gas.

I think a computer even less sophisticated than Watson could draw a thumbs up from that post. But others are a little less clear:

RT : My boiler has been off for 3 days :) // middle finger up to the gas consumption

Officially has enough gas money to get to and from TX in October, now to save for food. :)

And then some of them require context on the part of the reader. A few years ago, gas near $3 per gallon would have brought forth exclamation points of outrage. Now, things appear to have changed:

RT : is at Exxon again and gas is $2.92 ! I'm omw !!!!! :)
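To make the problem concrete, here is a deliberately crude sketch of an emoticon-counting classifier in Python. It is not how IBM or anyone else necessarily does it; the function `naive_sentiment` is hypothetical, and the sample posts are shortened versions of the ones quoted above plus an invented sarcastic one.

```python
def naive_sentiment(post: str) -> str:
    """Label a post by counting emoticons; a deliberately crude heuristic."""
    happy = post.count(":)")
    sad = post.count(":(")
    if happy > sad:
        return "positive"
    if sad > happy:
        return "negative"
    return "neutral"


posts = [
    "Weekend forecast: Hot(ter) temps, cheap(er) gas. :)",
    "My boiler has been off for 3 days :) // middle finger up to the gas consumption",
    # Invented sarcastic post, for illustration only.
    "Oh yeah, I'm thrilled to be driving this Memorial Day weekend :)",
]

for post in posts:
    print(naive_sentiment(post), "-", post)
```

Every one of these comes out "positive," which is exactly the kind of small misunderstanding a large sample is supposed to wash out.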

 




Inside the NSA matrix: What can they learn?
March 19, 2012 - Datamining

Anyone concerned about privacy might read James Bamford's cover story in Wired: Inside the Matrix. It puts together the elements of the coming total information awareness state--but with one big missing piece.

The pieces he describes include 1) an enormous new National Security Agency datacenter in Utah that will store virtually all of our communications and digital musings, and 2) ultra-powerful supercomputers designed to crack the cryptography protecting much of that information. So eventually, the thinking goes, the NSA will be able to read our communications, study our pictures, and trace our patterns backwards in time. Theoretically, they'll be able to profile each one of us: What we say, where we go, who we hang around (and even sleep) with.

As one of his chief sources, formerly of the NSA, tells him, holding his thumb and forefinger close together, "We are, like, that far from a turnkey totalitarian state."

Now, the missing piece. Nowhere in the article do we learn how the NSA will actually make sense of all of this data, especially for counter-terrorism. What are the patterns of potential terrorists? This is something I grappled with in The Numerati. I went to Jeff Jonas for insights. His point was that this type of information is valuable once you are focused on a suspect, but that you cannot hope to find potential suspects by datamining exabytes of phone calls and e-mails. He called that "boiling the ocean."

The trouble is that while the government may have detailed information about hundreds of millions of taxpayers and an equal number of drivers, they have the historical records on only a relative handful of terrorists. What is it about the patterns of those terrorists that they might be able to find in the collected communications of the entire planet earth? That is the unanswered question in the Wired story. And I would suspect that that's because there is no answer.

So then, assuming that the NSA can crack these troublesome codes, what can they do with all of this data? Well, they can trace the people they actually suspect of crimes. But theoretically, they could do that without including the rest of us. Of course, they could (and probably will) continue to hunt for the patterns of potential terrorists, though the prospects don't look too promising.

It might be more fruitful to focus datamining on well established patterns, and on crimes that more of us commit, like tax evasion. If that happens, the small coterie of privacy advocates worried about the turnkey totalitarian state will quickly gain millions of angry supporters.






One of the Numerati goes to work for Obama
March 9, 2012 - Datamining



I remember talking to Rayid Ghani about "barnacles." Those are the shoppers who go to extraordinary lengths to buy only things on sale. Ghani, a researcher at Accenture Labs, told me that stores could identify likely barnacles in their data, and then perhaps take measures to "fire" them, i.e., send them shopping elsewhere. Barnacles, after all, cost money.

At the time, Ghani and his team were analyzing loads of supermarket data, and trying to figure out how to lead shoppers toward the items they'd most likely buy (and others the supermarket wanted to get rid of). This research would go into the "Shopper" chapter in The Numerati, in which Ghani was the lead character.

Now, as the NYTimes reports, Ghani is chief scientist for the Obama campaign. The mission is to unearth different tribes of voters from the analysis of data, and then figure out the best way to "optimize" them, as donors, organizers, or just plain voters. If Ghani had made this switch earlier, I could have featured him in the Voter chapter. But that's one of the hallmarks of the Numerati. Their skills enable them to switch from one field to the next.


Big Data and math
February 12, 2012 - Datamining

Steve Lohr has a good round-up of Big Data trends in the Times. It has very similar themes to The Numerati and to Ian Ayres' Super Crunchers.

In fact, reading the story reminded me of my BusinessWeek story, Math Will Rock Your World, which led to The Numerati. It argues the same points, but instead of focusing on the subject of the investigation--data--it looks at the tools employed, mathematics and computers (without, you might note, shedding any light on how the mathematicians and computer scientists carry out this work).

As I've mentioned before, Steve Adler, the editor in chief, started the process by telling me to write a cover story on math. So I started interviewing mathematicians. I was learning all sorts of interesting things about encryption and operations research, but I didn't really see the BusinessWeek cover story until I delved into the world of data. In the end, I wrote a story about Big Data--but kept math in the headline.

A few paragraphs from that story:

The world is moving into a new age of numbers. Partnerships between mathematicians and computer scientists are bulling into whole new domains of business and imposing the efficiencies of math. This has happened before. In past decades, the marriage of higher math and computer modeling transformed science and engineering. Quants turned finance upside down a generation ago. And data miners plucked useful nuggets from vast consumer and business databases. But just look at where the mathematicians are now. They're helping to map out advertising campaigns, they're changing the nature of research in newsrooms and in biology labs, and they're enabling marketers to forge new one-on-one relationships with customers. As this occurs, more of the economy falls into the realm of numbers. Says James R. Schatz, chief of the mathematics research group at the National Security Agency: "There has never been a better time to be a mathematician."

From fledglings like Inform to tech powerhouses such as IBM (IBM), companies are hitching mathematics to business in ways that would have seemed fanciful even a few years ago. In the past decade, a sizable chunk of humanity has moved its work, play, chat, and shopping online. We feed networks gobs of digital data that once would have languished on scraps of paper -- or vanished as forgotten conversations. These slices of our lives now sit in databases, many of them in the public domain. From a business point of view, they're just begging to be analyzed. But even with the most powerful computers and abundant, cheap storage, companies can't sort out their swelling oceans of data, much less build businesses on them, without enlisting skilled mathematicians and computer scientists.

The rise of mathematics is heating up the job market for luminary quants, especially at the Internet powerhouses where new math grads land with six-figure salaries and rich stock deals. Tom Leighton, an entrepreneur and applied math professor at Massachusetts Institute of Technology, says: "All of my students have standing offers at Yahoo! and Google." Top mathematicians are becoming a new global elite. It's a force of barely 5,000, by some guesstimates, but every bit as powerful as the armies of Harvard University MBAs who shook up corner suites a generation ago.



Baseball playoffs begin: Moneyball season over
September 30, 2011 - Datamining


Phillies pitcher Roy Halladay: Statistics indicate that he could win... or lose

It's time to forget Moneyball and statistical analysis. The 162-game baseball season, the six-month marathon in which statistics have the time to work their magic, is over. As play-offs begin, managers might as well return to their divining rods or the study of patterns on the bottom of their coffee cups. They're entering a season defined largely by unmeasurables such as confidence, feel, and most importantly, luck.

The Phillies-Cardinals match-up Saturday pits Roy Halladay, last year's Cy Young winner, against Kyle Lohse, a mid-rotation starter through his career. Lohse, through an 11-year career, has won about as many as he has lost. Halladay, in 14 years, has won two games for every game he's lost. This year he went 19-6. Still, he lost six games, and could lose another against Lohse. The game might turn on one pitch and the difference of a quarter inch on Albert Pujols' bat. That tiny adjustment turns a high fly ball into a tape-measure homer. Over a long season, Halladay establishes his superiority. In one game, or even a five-game series, throw the stats out the window.
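A back-of-the-envelope sketch in Python shows how little a real edge buys in a short series. It assumes every game is an independent coin flip with the same win probability, which is a simplification (and leaves out things like confidence and luck); the function name `series_win_probability` and the 60% figure are made up for illustration.

```python
from math import comb


def series_win_probability(p: float, wins_needed: int) -> float:
    """Chance of winning a best-of-(2 * wins_needed - 1) series.

    Assumes each game is independent with the same win probability p.
    """
    q = 1.0 - p
    # The favorite clinches with its last win; k counts the games it
    # drops before getting there (0 up to wins_needed - 1).
    return sum(
        comb(wins_needed - 1 + k, k) * p**wins_needed * q**k
        for k in range(wins_needed)
    )


# A team that wins 60% of its games (roughly a 97-win pace over 162)
# in a best-of-five series, where three wins are needed.
print(f"{series_win_probability(0.60, 3):.2f}")
```

Under those assumptions the stronger team takes a best-of-five only about two times in three. A pitcher who wins two of every three decisions, like Halladay, still loses one start in three.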

And yet, because of the new Moneyball movie, we're sure to hear every day about the wonders of baseball's quants, led by the prime number-cruncher, Bill James. In fact, it already surfaced in the recent pennant races. In one article, summing up the chances of the fading Boston Red Sox, a Harvard data cruncher named Andrew Mooney surveyed the last ten days of the season and counseled the Sox not to worry:

You’re in a funk, you say. You’ve lost nine of your last eleven, while the Rays have won eight of 10. Actually, this could be just as much a source of comfort as a cause for alarm. Simple probabilities indicate that neither team is likely to continue at such a rate for the remainder of the season; that’s just the nature of streaks. Need evidence? You started the season 2-10.

You’ve got a 10-game homestand coming (winning percentage at Fenway: .592), while the Rays will be away for their next 11 (winning percentage on the road: .557).

My question: What do "simple probabilities" count for when one team is consumed by dread and its rival buoyed by rising confidence? How do you measure the impact of such things? And what are such measurements worth in a sample of only 10 games? Nothing, I'd say. 

By instituting two series of play-offs, Major League Baseball essentially created a second season. While the first is a marathon, in which statistics rule, the second is an eight-team sprint. The best team can win this second season. But for this to happen, it must also be hot and lucky.

Just one more point about the movie Moneyball. The analysis in the movie boiled baseball down to its essence: The team that scores more runs than its opponents over 162 games will likely wind up on top. But for some reason, the film glided over the biggest factor in this equation: starting pitching. It didn't mention even once that Billy Beane's Oakland A's had the best trio of starting pitchers in the Major Leagues. Barry Zito won the Cy Young Award that year. He was absent from the movie. Mark Mulder and Tim Hudson were magnificent. We got brief glimpses of their uniforms.

So the movie would lead us to believe that the A's won their division that year because they played a converted catcher at first base, swung a smart deal for a left-handed reliever, got rid of a distracting Giambi brother, and urged players to take walks. The A's could have done all of those things, and without their trio of great pitchers, they would have ended up with a losing record. That's why, as The Sconz says: Moneyball is a lie.



You will be monitored, step by step
September 7, 2011 - Datamining

I think I should have been more emphatic in The Numerati. From the 2020s on, practically every senior in the industrial world will be monitored by sensors in his or her home. That includes me and, if you're not already in your golden years, it includes you. My "Patient" chapter in the book focused on Intel's efforts to monitor seniors in the Portland area, research that also spread into Ireland. National economies desperately need to save money on health care, and developing technology that can monitor seniors and intervene before they get sick, and before they fall, is simply too sensible to pass up.

Now I get this news about a similar initiative in Missouri. One interesting wrinkle is the use of technology originally developed for gaming, Microsoft's Kinect. And they use the depth perception of the camera to view the subjects as silhouettes, which protects a bit of their privacy. You can assume that if you don't reach your 70s until, say, the 2030s, the technology will be remarkably effective and, at the same time, discreet. Then again, once it works well on older people, why not extend it to everyone else? That's the future I see.

MU Researchers Use New Video Gaming Technology to Detect Illness, Prevent Falls in Older Adults from MU News Bureau on Vimeo.



How much data can asthma inhalers provide?
May 11, 2011 - Datamining

It seems like the perfect combination for asthma research: inhalers equipped with GPS, so that each use of the medication comes with a time and place tag. Teradata's Paul Barsch cites an Economist article about Asthmapolis, the new tool to track asthma.

The idea is that people can map their own patterns, and come to understand, and hopefully avoid, places and conditions that provoke asthma attacks. And researchers, studying the data from thousands of users, might learn even more.

That's where other variables come in. The GPS data may show that 20 users in Youngstown, Ohio, suffer exacerbations between 9 p.m. and 10 p.m. on a Tuesday night. But how many of them are just traveling through Youngstown in their car? Should they count? And how many of them are spending the evening in close quarters with their cats or dogs? What did people eat? Since each of us is a complex system, the challenge with medical monitoring is to pick up as much detail of the entire life as possible.

The key--at least until English-savvy machines like Watson are on the case--is to get valuable diary data into formats machines can process. Then systems like Asthmapolis could really make a difference, both for individuals and society at large. The other key, as Barsch notes, is to do this in a way that protects people's privacy. None of the medical monitoring will work without that.





©2013 Stephen Baker Media, All rights reserved.     Site by Infinet Design




























