Free Data Sources for Machine Learning and AI (685)

Curing diseases. Building advanced robotic prosthetic arms. Turning smart cities into safe cities. Revolutionizing analytics. A vast range of problems can now be addressed with Artificial Intelligence (AI) and its subfields, such as Machine Learning, Natural Language Processing, Robotics, Vision, and Deep Learning, largely because the enabling technology is now cheap and available on demand. Besides hardware, AI projects need massive amounts of labelled data, and generating labelled datasets for a particular use case is costly and labor-intensive. Fortunately, there are well-curated, free, domain-specific public dataset repositories available online that can power your machine learning, data science, or deep learning project.

These datasets cover a wide range of topics, from agriculture and computer networks to transportation and neuroscience. We understand that spending your R&D time and effort on organizing data is sub-optimal. Hence, we have put together a list of datasets from myriad sources such as blogs, answers, peer-reviewed academic journals, user responses, and other publications. Most of them are free for public use; some, however, are available for a price.

So, here is the list of open datasets to dig into for high-quality data:

Agriculture

  1. U.S. Department of Agriculture’s PLANTS Database
  2. U.S. Department of Agriculture’s Nutrient Database

Automotive

  1. Knoema
  2. Automobile Dataset
  3. Automobile|Kaggle
  4. Used cars
  5. Car fuel consumptions and emissions 2000-2013
  6. Open Datasets–DIY Robocars

Artificial Datasets

  1. Arcade universe
  2. Baby AI Shape Datasets
  3. BabyAIImageAndQuestionDatasets
  4. MnistVariations
  5. RectanglesData
  6. ConvexNonConvex
  7. BackgroundCorrelation

Biology

  1. 1000 Genomes
  2. American Gut (Microbiome Project)
  3. Broad Bioimage Benchmark Collection (BBBC)
  4. Broad Cancer Cell Line Encyclopedia (CCLE)
  5. Cell Image Library
  6. Complete Genomics Public Data
  7. EBI ArrayExpress
  8. EBI Protein Data Bank in Europe
  9. Electron Microscopy Pilot Image Archive (EMPIAR)
  10. ENCODE project
  11. Ensembl Genomes
  12. Gene Expression Omnibus (GEO)
  13. Gene Ontology (GO)
  14. Global Biotic Interactions (GloBI)
  15. Harvard Medical School (HMS) LINCS Project
  16. Human Genome Diversity Project
  17. Human Microbiome Project (HMP)
  18. ICOS PSP Benchmark
  19. International HapMap Project
  20. Journal of Cell Biology DataViewer
  21. MIT Cancer Genomics Data
  22. NCBI Proteins
  23. NCBI Taxonomy
  24. NCI Genomic Data Commons
  25. NIH Microarray data or FTP (see FTP link on RAW)
  26. OpenSNP genotypes data
  27. Pathguide – Protein-Protein Interactions Catalog
  28. Protein Data Bank
  29. Psychiatric Genomics Consortium
  30. PubChem Project
  31. PubGene (now Coremine Medical)
  32. Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)
  33. Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)
  34. Sequence Read Archive (SRA)
  35. Stanford Microarray Data
  36. Stowers Institute Original Data Repository
  37. Systems Science of Biological Dynamics (SSBD) Database
  38. The Cancer Genome Atlas (TCGA), available via Broad GDAC
  39. The Catalogue of Life
  40. The Personal Genome Project or PGP
  41. UCSC Public Data
  42. UniGene
  43. Universal Protein Resource (UniProt)

Exploratory Analysis

  1. Game of Thrones
  2. World University Ranking
  3. IMDB 5000 Movie Dataset

Cloud Machine Learning

  1. AWS Public Datasets
  2. Google Cloud Public Datasets
  3. Microsoft Azure Public Datasets

Climate & Weather

  1. Actuaries Climate Index
  2. Australian Weather
  3. Aviation Weather Center – Consistent, timely and accurate weather information for the world airspace system
  4. Brazilian Weather – Historical data (In Portuguese)
  5. Canadian Meteorological Centre
  6. Climate Data from UEA (updated monthly)
  7. European Climate Assessment & Dataset
  8. Global Climate Data Since 1929
  9. NASA Global Imagery Browse Services
  10. NOAA Bering Sea Climate
  11. NOAA Climate Datasets
  12. NOAA Realtime Weather Models
  13. NOAA SURFRAD Meteorology and Radiation Datasets
  14. The World Bank Open Data Resources for Climate Change
  15. UEA Climatic Research Unit
  16. US weather history
  17. WorldClim – Global Climate Data
  18. WU Historical Weather Worldwide

Complex Networks

  1. AMiner Citation Network Dataset
  2. CrossRef DOI URLs
  3. DBLP Citation dataset
  4. DIMACS Road Networks Collection
  5. NBER Patent Citations
  6. Network Repository with Interactive Exploratory Analysis Tools
  7. NIST complex networks data collection
  8. Protein-protein interaction network
  9. PyPI and Maven Dependency Network
  10. Scopus Citation Database
  11. Small Network Data
  12. Stanford GraphBase (Steven Skiena)
  13. Stanford Large Network Dataset Collection
  14. Stanford Longitudinal Network Data Sources
  15. The Koblenz Network Collection
  16. The Laboratory for Web Algorithmics (UNIMI)
  17. The Nexus Network Repository
  18. UCI Network Data Repository
  19. UFL sparse matrix collection
  20. WSU Graph Database

Computer Networks

  1. 3.5B Web Pages from CommonCrawl 2012
  2. 53.5B Web clicks of 100K users in Indiana Univ.
  3. CAIDA Internet Datasets
  4. ClueWeb09 – 1B web pages
  5. ClueWeb12 – 733M web pages
  6. CommonCrawl Web Data over 7 years
  7. CRAWDAD Wireless datasets from Dartmouth Univ.
  8. Criteo click-through data
  9. OONI: Open Observatory of Network Interference – Internet censorship data
  10. Open Mobile Data by MobiPerf
  11. Rapid7 Sonar Internet Scans
  12. UCSD Network Telescope, IPv4 /8 net

Data Challenges

  1. ACM KDD CUP
  2. Bruteforce Database
  3. Challenges in Machine Learning
  4. CrowdANALYTIX dataX
  5. D4D Challenge of Orange
  6. Data Repository – Causality Workbench
  7. DrivenData Competitions for Social Good
  8. ICWSM Data Challenge (since 2009)
  9. Kaggle Competition Data
  10. KDD Cup by Tencent 2012
  11. Localytics Data Visualization Challenge
  12. Netflix Prize
  13. Space Apps Challenge
  14. Telecom Italia Big Data Challenge
  15. TravisTorrent Dataset – MSR’2017 Mining Challenge
  16. TunedIT – Data mining & machine learning data sets, algorithms, challenges
  17. Yelp Dataset Challenge

Data Journals & hubs

  1. Data-artikelen
  2. Data journalism and data visualization from the Datablog
  3. Data Publica
  4. Federal Surveillance Planes
  5. Firearm background checks
  6. White House staff salaries
  7. Radiation Analysis
  8. Workplace fatalities by US state
  9. Archive-it
  10. Google Public Data Explorer
  11. Welcome-the data hub
  12. Data Sets- Agg Data
  13. Find and Purchase Data Subscriptions
  14. Factual Home
  15. Socrata
  16. Data Export – Prosper

Deep Learning

  1. MNIST
  2. CIFAR
  3. Caltech 101
  4. Caltech 256
  5. STL-10 dataset
  6. The Street View House Numbers (SVHN)
  7. NORB
  8. Pascal VOC
  9. Labelme
  10. LSUN
  11. MS COCO
  12. COIL-20
  13. COIL-100
  14. Google’s Open Images
  15. ImageNet
  16. Open Source Biometric Recognition Data
  17. Google Audioset
  18. Uber 2B trip data
  19. YouTube 8M

Earth Science

  1. AQUASTAT – Global water resources and uses
  2. BODC – marine data of ~22K vars
  3. Earth Models
  4. EOSDIS – NASA’s earth observing system data
  5. Integrated Marine Observing System (IMOS) – roughly 30TB of ocean measurements or on S3
  6. Marinexplore – Open Oceanographic Data
  7. Smithsonian Institution Global Volcano and Eruption Database
  8. USGS Earthquake Archives

Economics

  1. American Economic Association (AEA)
  2. EconData from UMD
  3. Economic Freedom of the World Data
  4. Historical MacroEconomic Statistics
  5. International Economics Database and various data tools
  6. International Trade Statistics
  7. Internet Product Code Database
  8. Joint External Debt Data Hub
  9. Jon Haveman International Trade Data Links
  10. OpenCorporates Database of Companies in the World
  11. Our World in Data
  12. SciencesPo World Trade Gravity Datasets
  13. The Atlas of Economic Complexity
  14. The Center for International Data
  15. The Observatory of Economic Complexity
  16. UN Commodity Trade Statistics
  17. UN Human Development Reports

Education

  1. Educational Statistics
  2. College Scorecard Data
  3. Student Data from Free Code Camp
  4. Enron emails
  5. Student learning factors
  6. News articles

Energy

  1. AMPds
  2. BLUED
  3. COMBED
  4. Dataport
  5. DRED
  6. ECO
  7. EIA
  8. HES – Household Electricity Study, UK
  9. HFED
  10. iAWE
  11. PLAID – the Plug Load Appliance Identification Dataset
  12. REDD
  13. Tracebase
  14. UK-DALE – UK Domestic Appliance-Level Electricity
  15. WHITED

Finance

  1. CBOE Futures Exchange
  2. Google Finance
  3. Google Trends
  4. NASDAQ
  5. NYSE Market Data (see FTP link on RAW)
  6. Federal Agency Participation | Data.gov
  7. FRB: Data Download Program (DDP)
  8. OANDA
  9. Lending Club Statistics – Lending Club
  10. OSU Financial data
  11. Quandl
  12. St Louis Federal
  13. Yahoo Finance
  14. World Development Indicators
  15. World Bank project costs

Facial Datasets

  1. Labeled Faces in the Wild
  2. UMD Faces
  3. CASIA WebFace
  4. MS-Celeb-1M
  5. Olivetti
  6. Multi-Pie
  7. Face-in-Action
  8. JACFEE
  9. FERET
  10. mmifacedb
  11. IndianFaceDatabase
  12. The Yale Face Database
  13. The Yale Face Database B

GIS

  1. ArcGIS Open Data portal
  2. Cambridge, MA, US, GIS data on GitHub
  3. Factual Global Location Data
  4. Geo Spatial Data from ASU
  5. Geo Wiki Project – Citizen-driven Environmental Monitoring
  6. GeoFabrik – OSM data extracted to a variety of formats and areas
  7. GeoNames Worldwide
  8. Global Administrative Areas Database (GADM)
  9. Homeland Infrastructure Foundation-Level Data
  10. Landsat 8 on AWS
  11. List of all countries in all languages
  12. National Weather Service GIS Data Portal
  13. Natural Earth – vectors and rasters of the world
  14. NEXRAD
  15. OpenAddresses
  16. OpenStreetMap (OSM)
  17. Pleiades – Gazetteer and graph of ancient places
  18. Reverse Geocoder using OSM data & additional high-resolution data files
  19. TIGER/Line – U.S. boundaries and roads
  20. TwoFishes – Foursquare’s coarse geocoder
  21. TZ Timezones shapefiles
  22. UN Environmental Data
  23. World boundaries from the U.S. Department of State
  24. World countries in multiple formats

Government & Statistics Data

  1. A list of cities and countries contributed by community
  2. Data USA: The most comprehensive visualization of US public data
  3. EU Gender statistics database
  4. The Netherlands’ Nationaal Georegister (Dutch)
  5. United Nations Development Programme Projects
  6. Open Data for Africa
  7. OpenDataSoft’s list of 1,600 open data
  8. Portal de Obligaciones de Transparencia
  9. Junta de Andalucía – Datos abiertos
  10. Reutilización de la Información del Sector Público | Reutilización de la Información de los
  11. Servicios Públicos
  12. Portal de Datos Abiertos de JCCM
  13. Ayuntamiento de Zaragoza. Datos de Zaragoza Reutilización
  14. Dades obertes Lleida – Ajuntament de Lleida
  15. ISTAC | El ISTAC
  16. Dades Obertes. Generalitat de Catalunya
  17. Dades Obertes CAIB
  18. Reutilización de la Información del Sector Público en Gijón
  19. Open Data Euskadi ataria, Eusko Jaurlaritzaren datu publikoen irekitzea
  20. Data for Hawaii | data.hawaii.gov
  21. Florida Has A Right To Know
  22. Georgia.gov
  23. Commonwealth Data Point
  24. Open Data | data.maryland.gov
  25. Connecticut Transparency Website
  26. gov: Open Data
  27. NYS Data Center
  28. gov DataShare
  29. State of Alabama – Open.alabama.gov
  30. Open Government for the State of Tennessee
  31. gov | Government | State Facts and History
  32. OpenDoor – Kentucky
  33. Illinois.gov | Open Illinois
  34. SOM – Michigan Data Store
  35. mo.gov | State of Missouri Data Portal
  36. DATAshare | data.iowa.gov
  37. Minnesota open data // your portal for Minnesota data transparency
  38. Open Data Texas
  39. Welcome to Oklahoma’s Official Web Site
  40. KanView: Kansas Transparency Taxpayer Act – Kansas Revenues and Expenditures Search
  41. OPEN SD :: South Dakota Government Information
  42. North Dakota GIS (Geographic Information Systems)
  43. State Government Data New Mexico
  44. gov: The Official State Web Portal
  45. Arizona OpenBooks | – Arizona Transparency Finances in Detail
  46. Utah Data – Utah.gov
  47. CA.gov | Data Transparency for the State of California
  48. Oregon Data | Opening Oregon’s Data
  49. Washington | Washington State’s Data Site
  50. Home | Data.gov
  51. Portal de Datos Públicos – Inicio
  52. gub.uy | Portal del Estado Uruguayo
  53. Bem vindo – Portal Brasileiro de Dados Abertos
  54. Directorio de Empresas, Marcas registradas, Normas legales y Teléfonos en Perú
  55. ie – The Portal to Ireland’s Official Statistics
  56. gov.be | The Belgian open data initiative
  57. overheid.nl: het open dataportaal van de Nederlandse overheid
  58. Statistical database
  59. Offene Daten Österreich | data.gv.at
  60. Vitajte – data.gov.sk
  61. gov.it | I dati aperti della PA
  62. Δημοσια, Ανοικτά Δεδομένα
  63. SAUDI | National e-Government Portal – Home
  64. govt.nz – New Zealand government data online » Data.govt.nz
  65. gov.au
  66. 국가공유자원포털
  67. Open Data Canada
  68. ru
  69. OpenAid – Start
  70. norge.no | Åpne offentlige data i Norge – Difi
  71. Portada | datos.gob.es
  72. Open Data Colombia
  73. home | data.gov.uk

Healthcare

  1. EHDP Large Health Data Sets
  2. EU Surveillance Atlas of Infectious Diseases
  3. Gapminder World demographic databases
  4. GDC supports several cancer genome programs for CCG, TCGA, TARGET etc.
  5. PhysioBank Databases – a large and growing archive of physiological data
  6. Medicare Coverage Database (MCD), U.S.
  7. Medicare Data Engine of medicare.gov Data
  8. Medicare Data File
  9. Merck Molecular Activity Challenge
  10. MeSH, the vocabulary thesaurus used for indexing articles for PubMed
  11. Musk dataset
  12. Number of Ebola Cases and Deaths in Affected Countries (2014)
  13. Open-ODS (structure of the UK NHS)
  14. OpenPaymentsData, Healthcare financial relationship data
  15. The Cancer Genome Atlas project (TCGA) (refer to GDC and BigQuery table)
  16. Study Drugs
  17. World Health Organization Global Health Observatory
  18. Zika Virus

Image Processing

  1. 10k US Adult Faces Database
  2. 2GB of Photos of Cats or Archive version
  3. Adience Unfiltered faces for gender and age classification
  4. Affective Image Classification
  5. Animals with attributes
  6. Caltech Pedestrian Detection Benchmark
  7. Chars74K dataset, Character Recognition in Natural Images (both English and Kannada are available)
  8. Face Recognition Benchmark
  9. Flickr: 32 Class Brand Logos
  10. GDXray: X-ray images for X-ray testing and Computer Vision
  11. ImageNet (in WordNet hierarchy)
  12. Indoor Scene Recognition
  13. International Affective Picture System, UFL
  14. Massive Visual Memory Stimuli, MIT
  15. MNIST database of handwritten digits, near 1 million examples
  16. Several Shape-from-Silhouette Datasets
  17. Stanford Dogs Dataset
  18. SUN database, MIT
  19. The Action Similarity Labeling (ASLAN) Challenge
  20. The Oxford-IIIT Pet Dataset
  21. Violent-Flows – Crowd Violence Non-violence Database and benchmark
  22. Visual genome
  23. YouTube Faces Database

Machine Learning

  1. Book-Crossing dataset
  2. Context-aware data sets from five domains
  3. Delve Datasets for classification and regression (Univ. of Toronto)
  4. Discogs Monthly Data
  5. eBay Online Auctions (2012)
  6. IMDb Database
  7. Jester
  8. Keel Repository for classification, regression and time series
  9. Labeled Faces in the Wild (LFW)
  10. Lending Club Loan Data
  11. fm
  12. Machine Learning Data Set Repository
  13. Free Music Archive
  14. Million Song Dataset
  15. More Song Datasets
  16. MovieLens Data Sets
  17. New Yorker caption contest ratings
  18. RDataMining – “R and Data Mining” ebook data
  19. Registered Meteorites on Earth
  20. Restaurants Health Score Data in San Francisco
  21. UCI Machine Learning Repository
  22. Yahoo! Ratings and Classification Data
  23. YouTube 8M
  24. Wine Quality (Regression)
  25. Credit Card Default (Classification)
  26. US Census Data (Clustering)

Museums

  1. Canada Science and Technology Museums Corporation’s Open Data
  2. Cooper-Hewitt’s Collection Database
  3. Minneapolis Institute of Arts metadata
  4. Natural History Museum (London) Data Portal
  5. Rijksmuseum Historical Art Collection
  6. Tate Collection metadata
  7. The Getty vocabularies

Natural Language

  1. Amazon Reviews
  2. Automatic Keyphrase Extraction
  3. Blogger Corpus
  4. CLiPS Stylometry Investigation Corpus
  5. ClueWeb09 FACC
  6. ClueWeb12 FACC
  7. DBpedia – 4.58M things with 583M facts
  8. Enron Dataset
  9. Flickr Personal Taxonomies
  10. com of people, places, and things
  11. Google Books Ngrams (2.2TB)
  12. Google MC-AFP, generated based on the public available Gigaword dataset using Paragraph Vectors
  13. Google Web 5gram (1TB, 2006)
  14. Gutenberg eBooks List
  15. Hansards text chunks of Canadian Parliament
  16. Machine Comprehension Test (MCTest) of text from Microsoft Research
  17. Machine Translation of European languages
  18. Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)
  19. Multi-Domain Sentiment Dataset (version 2.0)
  20. Newsgroup Classification
  21. Open Multilingual Wordnet
  22. Personae Corpus
  23. SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)
  24. SMS Spam Collection in English
  25. Universal Dependencies
  26. USENET postings corpus of 2005~2011
  27. Webhose – News/Blogs in multiple languages
  28. Wikidata – Wikipedia databases
  29. Wikipedia Links data – 40 Million Entities in Context
  30. WordNet databases and tools

Networks and Graphs

  1. Amazon Co-Purchasing
  2. Friendster Social Network Dataset

Physics

  1. CERN Open Data Portal
  2. Crystallography Open Database
  3. NASA Exoplanet Archive
  4. NSSDC (NASA) data of 550 spacecraft
  5. Sloan Digital Sky Survey (SDSS) – Mapping the Universe

Public domains

  1. Amazon
  2. Archive-it from Internet Archive
  3. org Datasets
  4. CMU JASA data archive
  5. CMU StatLab collections
  6. World
  7. Data360
  8. Google
  9. Infochimps
  10. KDNuggets Data Collections
  11. Microsoft Azure Data Market Free DataSets
  12. Microsoft Data Science for Research
  13. Numbray
  14. Open Library Data Dumps
  15. Reddit Datasets
  16. RevolutionAnalytics Collection
  17. Sample R data sets
  18. Stats4Stem R data sets
  19. org
  20. The Washington Post List
  21. UCLA SOCR data collection
  22. UFO Reports
  23. Wikileaks 911 pager intercepts
  24. Yahoo Webscope
  25. IMF Data and Statistics
  26. Data | The World Bank
  27. Stat
  28. UNdata
  29. Data and maps — European Environment Agency (EEA)
  30. Eurostat Home

Psychology/Cognition

  1. OSU Cognitive Modeling Repository Datasets

Recommender Systems

  1. MovieLens
  2. Jester
  3. Million Song Dataset

Question Answering

  1. Maluuba News QA Dataset
  2. Quora Question Pairs
  3. CMU Q/A Dataset
  4. Maluuba goal-oriented dialogue
  5. bAbI
  6. The Children’s Book Test

Neuroscience

  1. Allen Institute Datasets
  2. Brain Catalogue
  3. Brainomics
  4. CodeNeuro Datasets
  5. Collaborative Research in Computational Neuroscience (CRCNS)
  6. FCP-INDI
  7. Human Connectome Project
  8. NDAR
  9. NeuroData
  10. Neuroelectro
  11. NIMH Data Archive
  12. OASIS
  13. OpenfMRI
  14. Study Forrest

Search Engines

  1. Academic Torrents of data sharing from UMB
  2. io
  3. DataMarket (Qlik)
  4. Harvard Dataverse Network of scientific data
  5. ICPSR (UMICH)
  6. Institute of Education Sciences
  7. National Technical Reports Library
  8. Open Data Certificates (beta)
  9. OpenDataNetwork – A search engine of all Socrata powered data portals
  10. com – statistics and Studies
  11. Zenodo – An open dependable home for the long-tail of science
  12. Zanran Numerical Data Search
  13. Quandl – Intelligent Search for Numerical Data

Sentiment

  1. Multidomain sentiment analysis dataset
  2. IMDB
  3. Stanford Sentiment Treebank

Social Sciences

  1. ACLED (Armed Conflict Location & Event Data Project)
  2. Canadian Legal Information Institute
  3. Center for Systemic Peace Datasets – Conflict Trends, Polities, State Fragility, etc
  4. Correlates of War Project
  5. Cryptome Conspiracy Theory Items
  6. Datacards
  7. European Social Survey
  8. FBI Hate Crime 2013 – aggregated data
  9. Fragile States Index
  10. GDELT Global Events Database
  11. General Social Survey (GSS) since 1972
  12. German Social Survey
  13. Global Religious Futures Project
  14. Humanitarian Data Exchange
  15. INFORM Index for Risk Management
  16. Institute for Demographic Studies
  17. International Networks Archive
  18. International Social Survey Program ISSP
  19. International Studies Compendium Project
  20. James McGuire Cross National Data
  21. MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste
  22. Minnesota Population Center
  23. MIT Reality Mining Dataset
  24. Notre Dame Global Adaptation Index (ND-GAIN)
  25. Open Crime and Policing Data in England, Wales and Northern Ireland
  26. Paul Hensel General International Data Page
  27. PewResearch Internet Survey Project
  28. PewResearch Society Data Collection
  29. Political Polarity Data
  30. StackExchange Data Explorer
  31. Terrorism Research and Analysis Consortium
  32. Texas Inmates Executed Since 1984
  33. Titanic Survival Data Set or on Kaggle
  34. UCB’s Archive of Social Science Data (D-Lab)
  35. UCLA Social Sciences Data Archive
  36. UN Civil Society Database
  37. Universities Worldwide
  38. UPJOHN for Labor Employment Research
  39. Uppsala Conflict Data Program
  40. World Bank Open Data
  41. WorldPop project – Worldwide human population distributions

Social Networks

  1. 72 hours #gamergate Twitter Scrape
  2. Ancestry.com Forum Dataset over 10 years
  3. Cheng-Caverlee-Lee September 2009 – January 2010 Twitter Scrape
  4. CMU Enron Email of 150 users
  5. EDRM Enron EMail of 151 users, hosted on S3
  6. Facebook Data Scrape (2005)
  7. Facebook Social Networks from LAW (since 2007)
  8. Foursquare from UMN/Sarwat (2013)
  9. GitHub Collaboration Archive
  10. Google Scholar citation relations
  11. High-Resolution Contact Networks from Wearable Sensors
  12. Indie Map: social graph and crawl of top IndieWeb sites
  13. Mobile Social Networks from UMASS
  14. Network Twitter Data
  15. Reddit Comments
  16. Skytrax’ Air Travel Reviews Dataset
  17. Social Twitter Data
  18. SourceForge.net Research Data
  19. Twitter Data for Online Reputation Management
  20. Twitter Data for Sentiment Analysis
  21. Twitter Graph of entire Twitter site
  22. Twitter Scrape Calufa May 2011
  23. UNIMI/LAW Social Network Datasets
  24. Yahoo! Graph and Social Data
  25. YouTube Video Social Graph in 2007, 2008

Software

  1. FLOSSmole data about free, libre, and open source software development

Speech Datasets

  1. 2000 HUB5 English
  2. LibriSpeech
  3. VoxForge
  4. TIMIT
  5. CHIME
  6. TED-LIUM

Sports

  1. Basketball (NBA/NCAA/Euro) Player Database and Statistics
  2. Betfair Historical Exchange Data
  3. Cricsheet Matches (cricket)
  4. Ergast Formula 1, from 1950 up to date (API)
  5. Football/Soccer resources (data and APIs)
  6. Lahman’s Baseball Database
  7. Pinhooker: Thoroughbred Bloodstock Sale Data
  8. Retrosheet Baseball Statistics
  9. Tennis database of rankings, results, and stats for ATP, WTA, Grand Slams and Match Charting Project

Streaming

  1. Twitter API
  2. StockTwits API
  3. Weather Underground

Symbolic Music Datasets

  1. Piano-midi.de: classical piano pieces
  2. Nottingham : over 1000 folk tunes
  3. MuseData: electronic library of classical music scores
  4. JSB Chorales: set of four-part harmonized chorales

Text Datasets

  1. 20 newsgroups
  2. Reuters News dataset
  3. Penn Treebank
  4. UCI’s Spambase
  5. Broadcast News
  6. Text Classification Datasets
  7. WikiText
  8. SQuAD
  9. Billion Words dataset
  10. Common Crawl
  11. Google Books Ngrams

Time Series

  1. EOD Stock Prices
  2. Databanks International Cross National Time Series Data Archive
  3. Global Education Statistics
  4. Hard Drive Failure Rates
  5. Heart Rate Time Series from MIT
  6. Time Series Data Library (TSDL) from MU
  7. UC Riverside Time Series Dataset
  8. Zillow Real Estate Research

Transportation

  1. Airlines OD Data 1987-2008
  2. Airline Safety
  3. Bay Area Bike Share Data
  4. Bike Share Systems (BSS) collection
  5. GeoLife GPS Trajectory from Microsoft Research
  6. German train system by Deutsche Bahn
  7. Hubway Million Rides in MA
  8. Marine Traffic – ship tracks, port calls and more
  9. Montreal BIXI Bike Share
  10. NYC Taxi Trip Data 2009-
  11. NYC Taxi Trip Data 2013 (FOIA/FOILed)
  12. NYC Uber trip data April 2014 to September 2014
  13. Open Traffic collection
  14. OpenFlights – airport, airline and route data
  15. Oregon Climate Data
  16. Philadelphia Bike Share Stations (JSON)
  17. Plane Crash Database, since 1920
  18. RITA Airline On-Time Performance data
  19. RITA/BTS transport data collection (TranStat)
  20. Toronto Bike Share Stations (XML file)
  21. Transport for London (TFL)
  22. Travel Tracker Survey (TTS) for Chicago
  23. U.S. Bureau of Transportation Statistics (BTS)
  24. U.S. Domestic Flights 1990 to 2009
  25. U.S. Freight Analysis Framework since 2007

Web Scraping

  1. ToScrape.com

Other assorted collection

  1. Advanced NFL Stats: Play-by-Play Data
  2. Brodatz dataset: texture modeling
  3. 300 terabytes of high-quality data from the Large Hadron Collider (LHC) at CERN
  4. CMU Motion Capture Database
  5. Criteo click stream dataset
  6. Cancer Program Data Sets
  7. Cosm–Explore
  8. Data Packaged Core Datasets
  9. Database of Scientific Code Contributions
  10. Doing Research in New York City Public Schools and Requesting Data – NYC Data – New York City Department of Education
  11. A growing collection of public datasets:
  12. Awesome Public Datasets
  13. gov
  14. Data Tools – Locators
  15. Data Sets | Pew Research Center’s Internet & American Life Project
  16. DataWrangling: Some Datasets Available on the Web
  17. Data | GeoDa Center
  18. Europeana Professional – Linked Open Data
  19. Frequent Itemset Mining Dataset Repository
  20. Google Ngram Viewer
  21. Inside-r: Finding Data on the Internet
  22. Inforum – EconData
  23. net Research Data
  24. Summary of Data Sets by Application Area
  25. NYC Taxi dataset
  26. Online Data – Robert Shiller
  27. OpenDataMonitor: An overview of available open data resources in Europe
  28. Quora: Where can I find large datasets open to the public?
  29. io: 100+ Interesting Data Sets for Statistics
  30. StaTrek: Leveraging open data to understand urban lives
  31. Uber FOIL dataset
  32. awesome-awesomeness
  33. sindresorhus’s awesome

I hope this curated list of public datasets, spanning a wide range of industry segments, adds value to your mission-critical projects. It would be great to hear your thoughts and ideas on how you plan to use these public data sources in your own discoveries, upcoming projects, or niche. You can share your comments just below this post.
