50+ Most Popular Free Open Datasets for Machine Learning
This cheat sheet collects free, high-quality public datasets for machine learning. They range from the vast (looking at you, Kaggle) to the highly specific (data for self-driving cars).
- A dataset shouldn't be messy, because you don't want to spend a lot of time cleaning data.
- A dataset shouldn't have too many rows or columns, so it's easy to work with.
- The cleaner the data, the better: cleaning a large dataset can be very time-consuming.
- There should be an interesting question that can be answered with the data.
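The size and cleanliness criteria above can be checked quickly before committing to a dataset. The following is a minimal sketch, not a definitive method: the function name `dataset_quick_check`, the `max_rows` and `max_cols` thresholds, and the set of "missing" placeholders are all illustrative assumptions, and it expects CSV text with a header row.

```python
import csv
import io


def dataset_quick_check(csv_text, max_rows=100_000, max_cols=50):
    """Summarise size and missing values for CSV text that has a header row.

    The thresholds are arbitrary illustrative defaults, not recommendations.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]

    # Placeholders treated as missing values (an assumption; adjust per dataset).
    missing_tokens = {"", "na", "n/a", "nan", "null"}

    missing = 0
    for row in body:
        # Pad short rows so absent trailing cells count as missing.
        padded = row + [""] * (len(header) - len(row))
        missing += sum(1 for cell in padded if cell.strip().lower() in missing_tokens)

    total_cells = len(body) * len(header)
    return {
        "rows": len(body),
        "columns": len(header),
        "missing_cells": missing,
        "missing_fraction": missing / total_cells if total_cells else 0.0,
        "manageable_size": len(body) <= max_rows and len(header) <= max_cols,
    }


if __name__ == "__main__":
    sample = "city,score\nBoston,3\nAustin,\nDenver,NA\n"
    print(dataset_quick_check(sample))
```

A high `missing_fraction` suggests the dataset will need substantial cleaning before it is useful. Whether the data can answer an interesting question still has to be judged by hand.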
- Data.gov: This site makes it possible to download data from multiple US government agencies. Data can range from government budgets to school performance scores. Be warned though: much of the data requires additional research.
- Food Environment Atlas: Contains data on how local food choices affect diet in the US.
- School system finances: A survey of the finances of school systems in the US.
- Chronic disease data: Data on chronic disease indicators in areas across the US.
- The US National Center for Education Statistics: Data on educational institutions and education demographics from the US and around the world.
- The UK Data Service: The UK's largest collection of social, economic and population data.
- Data USA: A comprehensive visualization of US public data.
- Quandl: A good source for economic and financial data, useful for building models to predict economic indicators or stock prices.
- World Bank Open Data: Datasets covering population demographics and a huge number of economic and development indicators from across the world.
- IMF Data: The International Monetary Fund publishes data on international finances, debt rates, foreign exchange reserves, commodity prices, and investments.
- Financial Times Market Data: Up to date information on financial markets from around the world, including stock price indexes, commodities, and foreign exchange.
- Google Trends: Examine and analyze data on internet search activity and trending news stories around the world.
- American Economic Association (AEA): A good source to find US macroeconomic data.
- Labelme: A large dataset of annotated images.
- ImageNet: The de facto image dataset for new algorithms. It is organized according to the WordNet hierarchy, in which each node of the hierarchy is depicted by hundreds or thousands of images.
- LSUN: Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.)
- MS COCO: Generic image understanding and captioning.
- COIL100: 100 different objects imaged at every angle in a 360-degree rotation.
- Visual Genome: Very detailed visual knowledge base with captioning of ~100K images.
- Google's Open Images: A collection of 9 million URLs to images "that have been annotated with labels spanning over 6,000 categories" under Creative Commons.
- Labeled Faces in the Wild: 13,000 labeled images of human faces, for use in developing applications that involve facial recognition.
- Stanford Dogs Dataset: Contains 20,580 images and 120 different dog breed categories.
- Indoor Scene Recognition: A very specific dataset, useful as most scene recognition models are better 'outside'. Contains 67 indoor categories and a total of 15,620 images.
- Multidomain sentiment analysis dataset: A slightly older dataset that features product reviews from Amazon.
- IMDB reviews: An older, relatively small dataset for binary sentiment classification, features 25,000 movie reviews.
- Stanford Sentiment Treebank: Standard sentiment dataset with sentiment annotations.
- Sentiment140: A popular dataset, which uses 1.6 million tweets with emoticons pre-removed.
- Twitter US Airline Sentiment: Twitter data on US airlines from February 2015, classified as positive, negative, and neutral tweets.
- Enron Dataset: Email data from the senior management of Enron, organized into folders.
- Amazon Reviews: Contains around 35 million reviews from Amazon spanning 18 years. Data include product and user information, ratings, and the plaintext review.
- Google Books Ngrams: A collection of words from Google books.
- Blogger Corpus: A collection of 681,288 blog posts gathered from blogger.com. Each blog contains a minimum of 200 occurrences of commonly used English words.
- Wikipedia Links data: The full text of Wikipedia. The dataset contains almost 1.9 billion words from more than 4 million articles. You can search by word, phrase, or part of a paragraph.
- Gutenberg eBooks List: An annotated list of ebooks from Project Gutenberg.
- Hansards text chunks of Canadian Parliament: 1.3 million pairs of texts from the records of the 36th Canadian Parliament.
- Jeopardy: Archive of more than 200,000 questions from the quiz show Jeopardy.
- SMS Spam Collection in English: A dataset that consists of 5,574 English SMS messages, each labeled as spam or legitimate.
- Yelp Reviews: An open dataset released by Yelp, contains more than 5 million reviews.
- UCI's Spambase: A large spam email dataset, useful for spam filtering.
- Berkeley DeepDrive BDD100k: Currently the largest dataset for self-driving AI. Contains over 100,000 videos covering more than 1,100 hours of driving across different times of the day and weather conditions. The annotated images come from the New York and San Francisco areas.
- Baidu Apolloscapes: A large dataset that defines 26 different semantic items such as cars, bicycles, pedestrians, buildings, street lights, etc.
- Comma.ai: More than 7 hours of highway driving. Details include the car's speed, acceleration, steering angle, and GPS coordinates.
- Oxford's Robotic Car: Over 100 repetitions of the same route through Oxford, UK, captured over a period of a year. The dataset captures different combinations of weather, traffic, and pedestrians, along with long-term changes such as construction and roadworks.
- Cityscape Dataset: A large dataset that records urban street scenes in 50 different cities.
- CSSAD Dataset: This dataset is useful for perception and navigation of autonomous vehicles. The dataset skews heavily toward roads found in the developed world.
- KUL Belgium Traffic Sign Dataset: More than 10,000 traffic sign annotations from thousands of physically distinct traffic signs in the Flanders region of Belgium.
- MIT AGE Lab: A sample of the 1,000+ hours of multi-sensor driving datasets collected at AgeLab.
- LISA: Laboratory for Intelligent & Safe Automobiles, UC San Diego Datasets: This dataset includes traffic signs, vehicle detection, traffic lights, and trajectory patterns.