50 Data Science, Data sets that are more than amusing, Part-1
Some of us are drowning in data, most of us are oblivious, and a lucky few are surfing on it. We can do things that we couldn't in the past, and that got me wondering: just what other interesting data sets are out there? What follows is the result of that research, a sort of guided tour, a curated list if you will.
Here's my attempt, in 50 data sets, at making it all just a bit more manageable.
Ever get a morbid curiosity about what it's like to be on death row? In case you ever have, Texas has graciously placed the last words of every inmate executed since 1984 online. So, sentiment analysis, anyone?
Speaking of prison, there's more data on prisoners, including information about "their current offense and sentence, criminal history, family background and personal characteristics, prior drug and alcohol use and treatment programs, gun possession and use, and prison activities, programs, and services," available here.
How about reading other people's emails? Ever wanted to do that, but can't be bothered to train l33t hacking skills? Well, I've got you covered. Check out the Enron corpus. It contains more than half a million emails from about 150 users, mostly senior management of Enron, organized into folders. Wikipedia calls it "unique in that it is one of the only publicly available mass collections of 'real' emails easily available for study." Business idea: figure out what sort of information gets leaked in the emails that will later harm the execs at trial or whatever, then build a software system to automatically mine those out of real email. Sell it either to law enforcement or to corporate executives as the finest cover-your-ass email system.
Wondering what the internet really cares about? Well, I don't know about that, but you could answer an easier question: What does Reddit care about? Someone has scraped the top 2.5 million Reddit posts and placed them on GitHub. Now you can figure out (with data!) just how much Redditors love cats. Or how about a data-backed equivalent of r/circlejerk? (The original use case was determining which domains are the most popular.)
Speaking of cats, here are 10,000 annotated images of cats. This ought to come in handy whenever I get around to training a robot to exterminate all non-cat lifeforms.
If you're interested in building financial algorithms or, really, just predicting arbitrage opportunities for one of America's largest cash crops, check out this data set, which tracks the price of marijuana from September 2nd, 2010 until about the present.
The earliest recorded chess match dates back to the 10th century, played between a historian from Baghdad and a student. Since then, it's become a tradition for moves to be recorded, especially if a game has some significance, like a showdown between two strong players. As a consequence, students of the game today benefit from one of the richest data sets of any game or sport. Perhaps the best freely available data set of games is known as the "Million Base," boasting some 2.2 million matches. You can download it here. I can imagine an app that calculates your chess fingerprint, letting you know which grandmaster your play is most similar to, or an analysis of how play style has changed over time.
On the topic of games, for soccer fans, I recently came across this freely available data set of soccer games, players, teams, goals, and more. If that's not enough, you can grab even more data via this Soccer metrics API Python wrapper. I imagine that this could come in handy for coaches attempting to get an edge over opposing teams and, more generally, for that cross-section of geeks and gamblers attempting to build analytic models to make better bets.
Google has made all their Google Books n-gram data freely available. An n-gram is an n-word phrase, and the data set includes 1-grams through 5-grams. The data set is "based originally on 5.2 million books published between 1500 and 2008." I can imagine using it to determine the most overused, cliche phrases, and those phrases that are in danger of becoming cliched. (Quick! Someone register the domain clichealert.com!)
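To make the n-gram idea concrete, here is a minimal sketch of pulling n-grams out of a piece of text and counting them. The sample sentence and the `ngrams` helper are my own stand-ins; the real Google Books files come pre-counted and are far larger, so treat this as an illustration of the concept rather than a way to process that data.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in text, as tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Toy sentence standing in for a book (an assumption, not real data).
sample = "at the end of the day it is the end of the line"

# Count 3-grams; phrases that repeat are candidate cliches.
counts = Counter(ngrams(sample, 3))
for phrase, frequency in counts.most_common(2):
    print(" ".join(phrase), frequency)
```

Swapping the 3 for any value from 1 to 5 gives the same range of phrase lengths the data set covers.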
Amazon has a number of freely available data sets (although I think you need to run your analysis on top of their cloud, AWS), including more than 2.8 billion webpages courtesy of Common Crawl. The possibilities are endless, but here's an old business idea I had: analyze the Common Crawl data and find cheap or not-currently-registered domains which are, for whatever reason, linked to by many websites. Buy these up and then resell them to people involved in SEO. (Or you could, you know, try to build the next Google.)
How well do minorities do on the computer science advanced placement exam? You can find out and tell me.
There's the Million Song data set, which contains information about a million different songs, including a "danceability" metric. It might be nice to pair that with a media player specialized for parties: start with "conversation" music, and slowly shift to more danceable stuff as the night drags on. The data could also be used for a clustering algorithm (automatic genre detection, maybe), but I'm not sure how useful that'd be. A number of people have tried to build recommendation algorithms based on the data, including Kagglers and a team from Cornell. One possible use: analyzing music by year. How danceable, fast, etc. were the 70s? 80s? 90s? (Or how about looking for a follow-the-leader effect? If one song goes viral with a unique style, do a bunch of copycats follow?)
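The by-decade question boils down to a group-and-average. Here's a rough sketch, assuming you've already reduced each song to a (year, score) pair. The rows below are invented, and `mean_by_decade` is a hypothetical helper, not part of any Million Song tooling.

```python
from collections import defaultdict

# Hypothetical (release year, danceability score) rows; values are made up.
songs = [(1974, 0.62), (1978, 0.70), (1983, 0.55), (1989, 0.65), (1995, 0.48)]


def mean_by_decade(rows):
    """Average the score of all songs that fall in the same decade."""
    totals = defaultdict(lambda: [0.0, 0])  # decade -> [sum of scores, count]
    for year, score in rows:
        decade = year // 10 * 10  # 1978 -> 1970
        totals[decade][0] += score
        totals[decade][1] += 1
    return {decade: total / count
            for decade, (total, count) in sorted(totals.items())}


for decade, average in mean_by_decade(songs).items():
    print(f"{decade}s: {average:.2f}")
```

The same grouping works for tempo or any other per-song number you'd like to compare across eras.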
Speaking of music data sets, last.fm has music data available. Collected from ~360,000 users, it's in the form of "user, artist, # of plays". This would be good for clustering algorithms that automatically label genres, or for recommender systems. (Even a "this artist is most similar to" thing would be sorta cool.)
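One way the "this artist is most similar to" idea might be sketched: treat each artist as a vector of per-user play counts and compare vectors with cosine similarity. The rows, artist names, and helper functions below are all hypothetical, and cosine similarity is my choice of measure, not something the data set prescribes.

```python
import math
from collections import defaultdict

# Hypothetical rows in the (user, artist, plays) shape described above.
plays = [
    ("u1", "Artist A", 10), ("u1", "Artist B", 5),
    ("u2", "Artist A", 4), ("u2", "Artist B", 2),
    ("u3", "Artist C", 7), ("u3", "Artist A", 1),
]


def artist_vectors(rows):
    """Map each artist to a {user: total plays} dictionary."""
    vectors = defaultdict(dict)
    for user, artist, count in rows:
        vectors[artist][user] = vectors[artist].get(user, 0) + count
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {user: plays} vectors."""
    dot = sum(a[user] * b[user] for user in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(target, vectors):
    """Return the other artist whose listeners overlap most with target's."""
    others = (name for name in vectors if name != target)
    return max(others, key=lambda name: cosine(vectors[target], vectors[name]))


vectors = artist_vectors(plays)
print(most_similar("Artist B", vectors))
```

With hundreds of thousands of users you'd want sparse matrices rather than dictionaries, but the comparison itself stays the same.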
When I think geeks, I think math and computer geeks, but there are many more. Terry Pratchett geeks (dated one!), Whovians, anime geeks, theater geeks and, with some relevance to this next data set, comic book geeks. Cesc Rosselló, Ricardo Alberich, and Joe Miro have put together a "social graph" of the Marvel Universe, and the data is freely available. Ideas for use: Maybe it could be overlaid on Facebook's social graph to produce a new take on the "What superhero are you?" quiz.
Yelp has a freely available subset of their data, including restaurant rankings and reviews. One business idea: use tweets to predict restaurant star ratings. This would enable you to build out a Yelp competitor without requiring an active user base - you could just mine Twitter for data!
If you're interested in data about data (metadata!), Jürgen Schwärzler, a statistician from Google's public data team, has put together a list of the most frequently searched-for data. The top 5 are school comparisons, unemployment, population, sales tax, and salaries. I was surprised that school comparisons were number 1 but, then again, I don't have any brats running around (yet?). This list would be a good first step in researching what sort of data comparisons people actually care about.
Some of my readers are, no doubt, evil geniuses. Others want to save the world. There's a subset of both of these groups who are interested in super intelligent robots. But to build such a robot, you're going to have to teach it facts. All the things we take for granted, like that every person has one father. It would be a pain to insert those 10 million facts by hand (and, at a fact a minute, take more than 19 years). Thankfully, Freebase has done part of the job for you, making more than 1.9 billion facts freely available.
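The "more than 19 years" figure is easy to check, and the same arithmetic shows why you'd rather download Freebase than type it in. This is a back-of-the-envelope sketch that assumes nonstop entry at one fact per minute and 365-day years; `years_to_enter` is just an illustrative helper.

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600, ignoring leap years


def years_to_enter(facts, facts_per_minute=1):
    """Years of nonstop typing needed to enter the given number of facts."""
    return facts / facts_per_minute / MINUTES_PER_YEAR


print(f"10 million facts: {years_to_enter(10_000_000):.1f} years")
print(f"1.9 billion facts: {years_to_enter(1_900_000_000):.0f} years")
```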
Maybe your plans are slightly less ambitious. You don't want to build a super intelligent machine, just one smarter than your run of the mill mathematician. If that's the case, you're going to need to teach your machine a lot about mathematics, probably in the form of proofs and theorems. In that case, check out the Mizar project, which has formalized more than 9400 definitions and 49000 theorems.
And let's say you build this mathematician and, sure, it can help you with proofs, but so what? You long for someone you can connect with on a deeper level. Someone who can summarize any topic imaginable. In that case, you might want to feed your robot on Wikipedia data. While all of Wikipedia is freely available, DBpedia is an attempt to synthesize it into a more structured format.
Now, you get tired of mathematics and Wikipedia. It turns out that proofs don't pay the bills, so instead you decide to become a software engineer. Somehow, though, you've managed to build these machines without even a rudimentary understanding of programming, and you want a machine that will teach it to you. But where to find the data for such a thing? You might start with downloading all 7.3 million StackOverflow questions.
Ever wanted to study true friendship? (C'mon! Free your inner <s>child</s> social scientist.) Ya know, genuine, platonic love, like the kind embodied by dolphins? Well, now you can! All thanks to your humble author and Mark Newman, who's placed a network of "frequent associations between 62 dolphins in a community living off Doubtful Sound, New Zealand." Business idea: Flippr. It's like Facebook, but for dolphins, with plans to expand into emerging whale and sea turtle markets. Most revenue will come from sardine sales.
Do left-leaning blogs more often link to other left-leaning blogs than right-leaning ones? Well, I don't know, but it sounds reasonable. And, thanks to permission from Lada Adamic, you can download her network of hyperlinks between weblogs on US politics, recorded in 2005. (Or you could just read her paper. Spoilers: conservatives more freely link to other conservatives than liberals link to liberals so, if you're interested in link building, maybe you should register Republican.)<a href="#citation-1"><sup>1</sup></a>
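If you'd rather check the linking question yourself, one simple measure is the share of each side's outgoing links that stay on the same side. Below is a minimal sketch with an invented blog list and link list; the real data set has its own file format, which you'd need to parse first, and `same_side_share` is a hypothetical helper.

```python
# Hypothetical blog labels and directed links (source -> target).
leaning = {"a": "left", "b": "left", "c": "right", "d": "right", "e": "right"}
links = [("a", "b"), ("a", "c"), ("b", "a"),
         ("c", "d"), ("d", "e"), ("e", "c"), ("d", "a")]


def same_side_share(links, leaning):
    """For each side, the fraction of its outgoing links that stay on that side."""
    totals = {}
    same = {}
    for source, target in links:
        side = leaning[source]
        totals[side] = totals.get(side, 0) + 1
        if leaning[target] == side:
            same[side] = same.get(side, 0) + 1
    return {side: same.get(side, 0) / totals[side] for side in totals}


for side, share in same_side_share(links, leaning).items():
    print(f"{side}: {share:.0%} of links stay on the same side")
```

A raw share like this ignores how many blogs each side has, so a fairer comparison would also account for group sizes.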
Who's friendlier: the average jazz musician or the average dolphin? You could find out by combining the dolphin data set mentioned earlier with Pablo M. Gleiser and Leon Danon's jazz musicians network data set.
What about 1930s southern women or prisoners? Who's friendlier? How about fraternity members or ham radio operators? All this and more can be figured out with these network data sets.
How about dolphins or Slashdotters?
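One crude way to put a number on "friendlier" is the average number of connections per member, the mean degree of the network. The sketch below assumes each network is an undirected edge list with no duplicate edges, and the two tiny networks are invented placeholders, not the real dolphin or jazz data.

```python
def mean_degree(edges):
    """Average number of connections per member of an undirected network."""
    members = {member for edge in edges for member in edge}
    # Each edge adds one connection to both of its endpoints.
    return 2 * len(edges) / len(members) if members else 0.0


# Hypothetical toy networks standing in for the real edge lists.
dolphins = [("d1", "d2"), ("d2", "d3")]
musicians = [("j1", "j2"), ("j1", "j3"), ("j2", "j3")]

print("dolphins:", round(mean_degree(dolphins), 2))
print("musicians:", round(mean_degree(musicians), 2))
```

Mean degree favors dense, small groups, so comparing networks of very different sizes with it deserves some caution.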
Web 2.0 websites (like Reddit) are sometimes gamed by "voting rings," which are groups of people that intentionally vote up each other's content, regardless of quality. I've often wondered if the same thing happens in academic circles. Like, you know, one night during your first year in grad school, you're kidnapped in the middle of the night and made to swear a blood oath that you'll cite every other member of the club. Or something. Well, Stanford has put online arXiv's High Energy Physics paper citation network, so you could find out.
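A first pass at spotting a citation ring could be to look for pairs that cite each other. The sketch below assumes you've already rolled the paper-level citations up to (citing author, cited author) pairs, which is a preprocessing step the raw network doesn't do for you. The names and data are hypothetical, and mutual citation on its own is weak evidence of anything shady.

```python
def reciprocal_pairs(citations):
    """Return sorted pairs (a, b) where a cites b and b also cites a."""
    seen = set(citations)
    pairs = {tuple(sorted((a, b)))
             for a, b in seen
             if a != b and (b, a) in seen}
    return sorted(pairs)


# Hypothetical (citing, cited) pairs at the author level.
citations = [("p1", "p2"), ("p2", "p1"), ("p2", "p3"), ("p3", "p1")]
print(reciprocal_pairs(citations))
```

Real rings would more likely show up as larger, unusually dense clusters, so this is a starting point rather than a detector.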
You read this blog, so you're pretty smart, right? And maybe you'd like to be rich, you know, so you can found the next Bill and Melinda Gates Foundation and save the world. (Because that's why you want to be rich, right?) Well, then maybe you ought to develop some new-fangled trading algorithm and pick up like a trillion pennies from in front of the metaphorical steam-roller that is the market. (Quantitative finance!) But, in such a case, you'd better at least test your strategy on historical market data. Market data which you can get here.
The Open Product Data website aims to make barcode data available for every brand for free. Business idea: a specialty tattoo parlor that only does barcode tattoos, but lets customers pick whatever product they want. Think about it: "What's your tattoo mean?" "It's a Twinkie barcode, because Twinkies last forever, man, just like my faith."
The European Centre for Medium-Range Weather Forecasts has an impressive-looking collection of weather data. Why, you ask, does the weather matter? The economic incentives for predicting the weather are absurd. When should you plant crops? Plan a big event? Launch a space shuttle? Go deep sea fishing? But I want to talk about the most fun application of weather data I'm aware of: the financial industry. I have a lot of respect for finance, mostly because of the crazy stuff they do. The only practical application of neutrinos I've heard of, for instance, is "because finance." Should your algorithm buy Indonesian sesame seed futures? With weather data, it might know.
For a wordsmith, a good dictionary is indispensable, and when it comes to word data, you could do a lot worse than check out the freely available WordNet. WordNet has significant advantages over your run-of-the-mill dictionary, as it focuses on the structure of language, grouping words into "sets of cognitive synonyms (synsets), each expressing a distinct concept." It also has some information about relationships, such as "a chair has legs."
We've already established that some of you are evil geniuses, in which case, where are you going to build your secret lair? I mean, a volcano is pretty cool, but is it evil and genius enough for competing in today's modern world? You know what the other evil geniuses don't have? A secret base on a planet outside of the solar system. With NASA's list, you can get busy commissioning someone to build you a base on KOI-3284.01.<a href="#citation-2"><sup>2</sup></a>
The Federal Railroad Administration keeps a list of "railroad safety information including accidents and incidents, inventory and highway-rail crossing data." Someone (like the NY Times) could overlay this on a map of the United States and figure out if people in poor regions are more likely to be hit by trains or something.
If you need a database of comprehensive book data, perhaps to build a competitor to Goodreads or an online digital library, the Open Library allows people to freely download their entire database.
Who is the United States killing with drones? If you're content with Pakistan-specific data, there is a list of drone strikes available here.
If you're interested in building a Papers2 competitor with support for automatically importing citation data (please do this), CrossRef metadata search might be a good place to check out.
Mnemosyne is a virtual flash card program that takes advantage of spaced repetition to maximize learning. (As you might recall, I'm a big fan of spaced repetition.) The project has been collecting user data for years, and gwern has graciously agreed to freely host the data for a few months. Perhaps one could run some sort of unsupervised learning algorithm over it and try to discover heretofore unknown information about human memory.
How much would it cost to hire Justin Bieber to play at your wedding? The fine lads at Priceonomics have figured out how much it would cost to hire your favorite band. You could take this data and calculate some sort of popularity-to-price ratio. What's the most fame for your buck?
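The fame-for-your-buck calculation is just a division and a sort. Here's a tiny sketch with made-up acts, popularity scores, and prices. How you measure "popularity" (followers, streams, search volume) is an open choice, and `fame_per_dollar` is a hypothetical helper.

```python
# Hypothetical acts: (name, popularity score, booking price in dollars).
acts = [
    ("Band A", 90, 500_000),
    ("Band B", 60, 100_000),
    ("Band C", 30, 25_000),
]


def fame_per_dollar(acts):
    """Rank acts by popularity per dollar of booking fee, best value first."""
    ratios = [(name, popularity / price) for name, popularity, price in acts]
    return sorted(ratios, key=lambda item: item[1], reverse=True)


for name, ratio in fame_per_dollar(acts):
    print(f"{name}: {ratio * 1000:.2f} popularity points per $1,000")
```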
I've mentioned in a few of the other data sets just how lucrative it is to be able to predict the stock market better than everyone else. In 2011, researchers found that they could use data from Twitter to do just that: they went through tweets, found ones related to publicly traded companies, and then calculated a mood score. With this, they write, "We find an accuracy of 87.6% in predicting the daily up and down changes in the closing values of the DJIA." A number of Twitter data sets are freely available here.
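For a sense of what an accuracy figure on "daily up and down changes" means, here is one plausible way to compute it: the fraction of days on which the predicted direction matches the actual one. This is my own illustration with invented predictions, not the researchers' evaluation code.

```python
def direction_accuracy(predicted, actual):
    """Fraction of days where the predicted direction matched the real one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Hypothetical four-day stretch of predicted vs. actual index moves.
predicted = ["up", "down", "up", "up"]
actual = ["up", "down", "down", "up"]
print(f"{direction_accuracy(predicted, actual):.0%}")
```

Keep in mind that always guessing "up" can score well in a rising market, so a number like this only means something next to a baseline.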
A 2014 paper by Clifford Winston and Fred Mannering reports that vehicle traffic costs the United States 100 billion dollars each year.<a href="#citation-3"><sup>3</sup></a> There's money to be made, then, in routing traffic more efficiently. One way to do this would be to feed an algorithm historical traffic data and then use that to predict hotspots, which you would route people around. Lots of that data is available on data.gov.
On the other hand, if you're building an app to track current traffic data, you'll need a different data source.
If you want to launch a spam-fighting service, or maybe just analyze what type of emails spammers are sending, you'll need data. UC Irvine has you covered.
But maybe you want to extend your spam-fighting service to text messages. Still got you covered.
There is a wealth of data sets available for R and all you have to do is install a package. Ecdat is one of those packages, containing gobs of econometric data. How about an analysis of how math levels correlate with number of cigarettes smoked? I'd read that.
Ever seen a TV show where a government determines that someone is a terrorist based on their social ties? I always figured that data would be locked down tight somewhere, y'know, classified. But it turns out it isn't. You, too, can analyze the social networks of terrorists.
There's been a fair bit of controversy around all the bureaucracy of Wikipedia. But how does one become a bona fide Wikipedia big shot? Who's the ideal Wikipedia administrator? Well, they're voted for, and the data is available for download.
Harvard has opened up its set of "over 12 million bibliographic records for materials held by the Harvard Library, including books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials."
If you need small data sets for students, check out DASL. One at random: does sterilizing dominant males in a wild mustang population reduce the population?
GET-Evidence has put up public genomes for download. I think Steven Pinker's data is in there somewhere. Maybe you could make yourself a clone?
Oh, and speaking of genomes, the 1000 Genomes project has made 260 terabytes of genome data downloadable.