Curing diseases. Building advanced robotic prosthetic arms. Turning smart cities into safe cities. Revolutionizing analytics. A vast number of problems can now be addressed with Artificial Intelligence (AI) and its subfields, such as Machine Learning, Natural Language Processing, Robotics, Computer Vision, and Deep Learning, largely because the enabling technology has become inexpensive and available on demand. Besides hardware, AI projects need large amounts of labelled data. Generating a labelled dataset for a particular use case is costly and labor-intensive. Fortunately, there are well-curated, domain-specific public dataset repositories available online, many of them free, that can help power your machine learning, data science, or deep learning project.
These datasets cover a wide range of topics, from agriculture and computer networks to transportation and neuroscience. Spending your R&D time and effort on organizing data is sub-optimal, so we have put together a list of datasets drawn from many sources, including blogs, answers, peer-reviewed academic journals, user responses, and other publications. Most of them are free for public use; some are available for a price.
So, here is the list of open datasets where you can dig for high-quality data:
Agriculture
Automotive
- Knoema
- Automobile Dataset
- Automobile | Kaggle
- Used cars
- Car fuel consumptions and emissions 2000-2013
- Open Datasets – DIY Robocars
Artificial Datasets
- Arcade universe
- Baby AI Shape Datasets
- BabyAIImageAndQuestionDatasets
- MnistVariations
- RectanglesData
- ConvexNonConvex
- BackgroundCorrelation
Biology
- 1000 Genomes
- American Gut (Microbiome Project)
- Broad Bioimage Benchmark Collection (BBBC)
- Broad Cancer Cell Line Encyclopedia (CCLE)
- Cell Image Library
- Complete Genomics Public Data
- EBI ArrayExpress
- EBI Protein Data Bank in Europe
- Electron Microscopy Pilot Image Archive (EMPIAR)
- ENCODE project
- Ensembl Genomes
- Gene Expression Omnibus (GEO)
- Gene Ontology (GO)
- Global Biotic Interactions (GloBI)
- Harvard Medical School (HMS) LINCS Project
- Human Genome Diversity Project
- Human Microbiome Project (HMP)
- ICOS PSP Benchmark
- International HapMap Project
- Journal of Cell Biology DataViewer
- MIT Cancer Genomics Data
- NCBI Proteins
- NCBI Taxonomy
- NCI Genomic Data Commons
- NIH Microarray data or FTP (see FTP link on RAW)
- OpenSNP genotypes data
- Pathguide – Protein-Protein Interactions Catalog
- Protein Data Bank
- Psychiatric Genomics Consortium
- PubChem Project
- PubGene (now Coremine Medical)
- Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)
- Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)
- Sequence Read Archive (SRA)
- Stanford Microarray Data
- Stowers Institute Original Data Repository
- Systems Science of Biological Dynamics (SSBD) Database
- The Cancer Genome Atlas (TCGA), available via Broad GDAC
- The Catalogue of Life
- The Personal Genome Project or PGP
- UCSC Public Data
- UniGene
- Universal Protein Resource (UniProt)
Exploratory Analysis
Cloud Machine Learning
Climate Weather
- Actuaries Climate Index
- Australian Weather
- Aviation Weather Center – Consistent, timely and accurate weather information for the world airspace system
- Brazilian Weather – Historical data (In Portuguese)
- Canadian Meteorological Centre
- Climate Data from UEA (updated monthly)
- European Climate Assessment & Dataset
- Global Climate Data Since 1929
- NASA Global Imagery Browse Services
- NOAA Bering Sea Climate
- NOAA Climate Datasets
- NOAA Realtime Weather Models
- NOAA SURFRAD Meteorology and Radiation Datasets
- The World Bank Open Data Resources for Climate Change
- UEA Climatic Research Unit
- US weather history
- WorldClim – Global Climate Data
- WU Historical Weather Worldwide
Complex Networks
- AMiner Citation Network Dataset
- CrossRef DOI URLs
- DBLP Citation dataset
- DIMACS Road Networks Collection
- NBER Patent Citations
- Network Repository with Interactive Exploratory Analysis Tools
- NIST complex networks data collection
- Protein-protein interaction network
- PyPI and Maven Dependency Network
- Scopus Citation Database
- Small Network Data
- Stanford GraphBase (Steven Skiena)
- Stanford Large Network Dataset Collection
- Stanford Longitudinal Network Data Sources
- The Koblenz Network Collection
- The Laboratory for Web Algorithmics (UNIMI)
- The Nexus Network Repository
- UCI Network Data Repository
- UFL sparse matrix collection
- WSU Graph Database
Computer Networks
- 3.5B Web Pages from CommonCrawl 2012
- 53.5B Web clicks of 100K users in Indiana Univ.
- CAIDA Internet Datasets
- ClueWeb09 – 1B web pages
- ClueWeb12 – 733M web pages
- CommonCrawl Web Data over 7 years
- CRAWDAD Wireless datasets from Dartmouth Univ.
- Criteo click-through data
- OONI: Open Observatory of Network Interference – Internet censorship data
- Open Mobile Data by MobiPerf
- Rapid7 Sonar Internet Scans
- UCSD Network Telescope, IPv4 /8 net
Data Challenges
- ACM KDD CUP
- Bruteforce Database
- Challenges in Machine Learning
- CrowdANALYTIX dataX
- D4D Challenge of Orange
- Data Repository – Causality Workbench
- DrivenData Competitions for Social Good
- ICWSM Data Challenge (since 2009)
- Kaggle Competition Data
- KDD Cup by Tencent 2012
- Localytics Data Visualization Challenge
- Netflix Prize
- Space Apps Challenge
- Telecom Italia Big Data Challenge
- TravisTorrent Dataset – MSR’2017 Mining Challenge
- TunedIT – Data mining & machine learning data sets, algorithms, challenges
- Yelp Dataset Challenge
Data Journals & hubs
- Data-artikelen
- Data journalism and data visualization from the Datablog
- Data Publica
- Federal Surveillance Planes
- Firearm background checks
- White House staff salaries
- Radiation Analysis
- Workplace fatalities by US state
- Archive-it
- Google Public Data Explorer
- Welcome – the Data Hub
- Data Sets – AggData
- Find and Purchase Data Subscriptions
- Factual Home
- Socrata
- Data Export – Prosper
Deep Learning
- MNIST
- CIFAR
- Caltech 101
- Caltech 256
- STL-10 dataset
- The Street View House Numbers (SVHN)
- NORB
- Pascal VOC
- Labelme
- LSUN
- MS COCO
- COIL-20
- COIL-100
- Google’s Open Images
- ImageNet
- Open Source Biometric Recognition Data
- Google Audioset
- Uber 2B trip data
- YouTube 8M
Earth Science
- AQUASTAT – Global water resources and uses
- BODC – marine data of ~22K vars
- Earth Models
- EOSDIS – NASA’s earth observing system data
- Integrated Marine Observing System (IMOS) – roughly 30TB of ocean measurements or on S3
- Marinexplore – Open Oceanographic Data
- Smithsonian Institution Global Volcano and Eruption Database
- USGS Earthquake Archives
Economics
- American Economic Association (AEA)
- EconData from UMD
- Economic Freedom of the World Data
- Historical MacroEconomic Statistics
- International Economics Database and various data tools
- International Trade Statistics
- Internet Product Code Database
- Joint External Debt Data Hub
- Jon Haveman International Trade Data Links
- OpenCorporates Database of Companies in the World
- Our World in Data
- SciencesPo World Trade Gravity Datasets
- The Atlas of Economic Complexity
- The Center for International Data
- The Observatory of Economic Complexity
- UN Commodity Trade Statistics
- UN Human Development Reports
Education
- Educational Statistics
- College Scorecard Data
- Student Data from Free Code Camp
- Enron emails
- Student learning factors
- News articles
Energy
- AMPds
- BLUEd
- COMBED
- Dataport
- DRED
- ECO
- EIA
- HES – Household Electricity Study, UK
- HFED
- iAWE
- PLAID – the Plug Load Appliance Identification Dataset
- REDD
- Tracebase
- UK-DALE – UK Domestic Appliance-Level Electricity
- WHITED
Finance
- CBOE Futures Exchange
- Google Finance
- Google Trends
- NASDAQ
- NYSE Market Data (see FTP link on RAW)
- Federal Agency Participation | Data.gov
- FRB: Data Download Program (DDP)
- OANDA
- Lending Club Statistics – Lending Club
- OSU Financial data
- Quandl
- St Louis Federal
- Yahoo Finance
- World Development Indicators
- World Bank project costs
Facial Datasets
- Labeled Faces in the Wild
- UMD Faces
- CASIA WebFace
- MS-Celeb-1M
- Olivetti
- Multi-Pie
- Face-in-Action
- JACFEE
- FERET
- mmifacedb
- IndianFaceDatabase
- The Yale Face Database
- The Yale Face Database B
GIS
- ArcGIS Open Data portal
- Cambridge, MA, US, GIS data on GitHub
- Factual Global Location Data
- Geo Spatial Data from ASU
- Geo Wiki Project – Citizen-driven Environmental Monitoring
- GeoFabrik – OSM data extracted to a variety of formats and areas
- GeoNames Worldwide
- Global Administrative Areas Database (GADM)
- Homeland Infrastructure Foundation-Level Data
- Landsat 8 on AWS
- List of all countries in all languages
- National Weather Service GIS Data Portal
- Natural Earth – vectors and rasters of the world
- NEXRAD
- OpenAddresses
- OpenStreetMap (OSM)
- Pleiades – Gazetteer and graph of ancient places
- Reverse Geocoder using OSM data & additional high-resolution data files
- TIGER/Line – U.S. boundaries and roads
- TwoFishes – Foursquare’s coarse geocoder
- TZ Timezones shapefiles
- UN Environmental Data
- World boundaries from the U.S. Department of State
- World countries in multiple formats
Government & Statistics Data
- A list of cities and countries contributed by community
- Data USA: The most comprehensive visualization of US public data
- EU Gender statistics database
- The Netherlands’ Nationaal Georegister (Dutch)
- United Nations Development Programme Projects
- Open Data for Africa
- OpenDataSoft’s list of 1,600 open data
- Portal de Obligaciones de Transparencia
- Junta de Andalucía – Datos abiertos
- Reutilización de la Información del Sector Público | Reutilización de la Información de los Servicios Públicos
- Portal de Datos Abiertos de JCCM
- Ayuntamiento de Zaragoza. Datos de Zaragoza Reutilización
- Dades obertes Lleida – Ajuntament de Lleida
- ISTAC | El ISTAC
- Dades Obertes. Generalitat de Catalunya
- Dades Obertes CAIB
- Reutilización de la Información del Sector Público en Gijón
- Open Data Euskadi ataria, Eusko Jaurlaritzaren datu publikoen irekitzea
- Data for Hawaii | data.hawaii.gov
- Florida Has A Right To Know
- Georgia.gov
- Commonwealth Data Point
- Open Data | data.maryland.gov
- Connecticut Transparency Website
- gov: Open Data
- NYS Data Center
- gov DataShare
- State of Alabama – Open.alabama.gov
- Open Government for the State of Tennessee
- gov | Government | State Facts and History
- OpenDoor – Kentucky
- Illinois.gov | Open Illinois
- SOM – Michigan Data Store
- mo.gov | State of Missouri Data Portal
- DATAshare | data.iowa.gov
- Minnesota open data // your portal for Minnesota data transparency
- Open Data Texas
- Welcome to Oklahoma’s Official Web Site
- KanView: Kansas Transparency Taxpayer Act – Kansas Revenues and Expenditures Search
- OPEN SD :: South Dakota Government Information
- North Dakota GIS (Geographic Information Systems)
- State Government Data New Mexico
- gov: The Official State Web Portal
- Arizona OpenBooks | – Arizona Transparency Finances in Detail
- Utah Data – Utah.gov
- CA.gov | Data Transparency for the State of California
- Oregon Data | Opening Oregon’s Data
- Washington | Washington State’s Data Site
- Home | Data.gov
- Portal de Datos Públicos – Inicio
- gub.uy | Portal del Estado Uruguayo
- Bem vindo – Portal Brasileiro de Dados Abertos
- Directorio de Empresas, Marcas registradas, Normas legales y Teléfonos en Perú
- ie – The Portal to Ireland’s Official Statistics
- gov.be | The Belgian open data initiative
- overheid.nl: het open dataportaal van de Nederlandse overheid
- Statistical database
- Offene Daten Österreich | data.gv.at
- Vitajte – data.gov.sk
- gov.it | I dati aperti della PA
- Δημοσια, Ανοικτά Δεδομένα
- SAUDI | National e-Government Portal – Home
- govt.nz – New Zealand government data online » Data.govt.nz
- gov.au
- 국가공유자원포털
- Open Data Canada
- ru
- OpenAid – Start
- norge.no | Åpne offentlige data i Norge – Difi
- Portada | datos.gob.es
- Open Data Colombia
- home | data.gov.uk
Healthcare
- EHDP Large Health Data Sets
- EU Surveillance Atlas of Infectious Diseases
- Gapminder World demographic databases
- GDC supports several cancer genome programs for CCG, TCGA, TARGET etc.
- PhysioBank Databases – a large and growing archive of physiological data
- Medicare Coverage Database (MCD), U.S.
- Medicare Data Engine of medicare.gov Data
- Medicare Data File
- Merck Molecular Activity Challenge
- MeSH, the vocabulary thesaurus used for indexing articles for PubMed
- Musk dataset
- Number of Ebola Cases and Deaths in Affected Countries (2014)
- Open-ODS (structure of the UK NHS)
- OpenPaymentsData, Healthcare financial relationship data
- The Cancer Genome Atlas project (TCGA) (refer to GDC and BigQuery table)
- Study Drugs
- World Health Organization Global Health Observatory
- Zika Virus
Image Processing
- 10k US Adult Faces Database
- 2GB of Photos of Cats or Archive version
- Adience Unfiltered faces for gender and age classification
- Affective Image Classification
- Animals with attributes
- Caltech Pedestrian Detection Benchmark
- Chars74K dataset, Character Recognition in Natural Images (both English and Kannada are available)
- Face Recognition Benchmark
- Flickr: 32 Class Brand Logos
- GDXray: X-ray images for X-ray testing and Computer Vision
- ImageNet (in WordNet hierarchy)
- Indoor Scene Recognition
- International Affective Picture System, UFL
- Massive Visual Memory Stimuli, MIT
- MNIST database of handwritten digits, near 1 million examples
- Several Shape-from-Silhouette Datasets
- Stanford Dogs Dataset
- SUN database, MIT
- The Action Similarity Labeling (ASLAN) Challenge
- The Oxford-IIIT Pet Dataset
- Violent-Flows – Crowd Violence Non-violence Database and benchmark
- Visual genome
- YouTube Faces Database
Machine Learning
- Book-Crossing dataset
- Context-aware data sets from five domains
- Delve Datasets for classification and regression (Univ. of Toronto)
- Discogs Monthly Data
- eBay Online Auctions (2012)
- IMDb Database
- Jester
- Keel Repository for classification, regression and time series
- Labeled Faces in the Wild (LFW)
- Lending Club Loan Data
- fm
- Machine Learning Data Set Repository
- Free Music Archive
- Million Song Dataset
- More Song Datasets
- MovieLens Data Sets
- New Yorker caption contest ratings
- RDataMining – “R and Data Mining” ebook data
- Registered Meteorites on Earth
- Restaurants Health Score Data in San Francisco
- UCI Machine Learning Repository
- Yahoo! Ratings and Classification Data
- YouTube 8M
- Wine Quality (Regression)
- Credit Card Default (Classification)
- US Census Data (Clustering)
Museums
- Canada Science and Technology Museums Corporation’s Open Data
- Cooper-Hewitt’s Collection Database
- Minneapolis Institute of Arts metadata
- Natural History Museum (London) Data Portal
- Rijksmuseum Historical Art Collection
- Tate Collection metadata
- The Getty vocabularies
Natural Language
- Amazon Reviews
- Automatic Keyphrase Extraction
- Blogger Corpus
- CLiPS Stylometry Investigation Corpus
- ClueWeb09 FACC
- ClueWeb12 FACC
- DBpedia – 4.58M things with 583M facts
- Enron Dataset
- Flickr Personal Taxonomies
- com of people, places, and things
- Google Books Ngrams (2.2TB)
- Google MC-AFP, generated based on the public available Gigaword dataset using Paragraph Vectors
- Google Web 5gram (1TB, 2006)
- Gutenberg eBooks List
- Hansards text chunks of Canadian Parliament
- Machine Comprehension Test (MCTest) of text from Microsoft Research
- Machine Translation of European languages
- Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)
- Multi-Domain Sentiment Dataset (version 2.0)
- Newsgroup Classification
- Open Multilingual Wordnet
- Personae Corpus
- SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)
- SMS Spam Collection in English
- Universal Dependencies
- USENET postings corpus of 2005~2011
- Webhose – News/Blogs in multiple languages
- Wikidata – Wikipedia databases
- Wikipedia Links data – 40 Million Entities in Context
- WordNet databases and tools
Networks and Graphs
Physics
- CERN Open Data Portal
- Crystallography Open Database
- NASA Exoplanet Archive
- NSSDC (NASA) data of 550 spacecraft
- Sloan Digital Sky Survey (SDSS) – Mapping the Universe
Public domains
- Amazon
- Archive-it from Internet Archive
- org Datasets
- CMU JASA data archive
- CMU StatLab collections
- World
- Data360
- Infochimps
- KDNuggets Data Collections
- Microsoft Azure Data Market Free DataSets
- Microsoft Data Science for Research
- Numbrary
- Open Library Data Dumps
- Reddit Datasets
- RevolutionAnalytics Collection
- Sample R data sets
- Stats4Stem R data sets
- org
- The Washington Post List
- UCLA SOCR data collection
- UFO Reports
- Wikileaks 911 pager intercepts
- Yahoo Webscope
- IMF Data and Statistics
- Data | The World Bank
- Stat
- UNdata
- Data and maps — European Environment Agency (EEA)
- Eurostat Home
Psychology/Cognition
Recommender Systems
Question Answering
- Maluuba News QA Dataset
- Quora Question Pairs
- CMU Q/A Dataset
- Maluuba goal-oriented dialogue
- bAbI
- The Children’s Book Test
Neuroscience
- Allen Institute Datasets
- Brain Catalogue
- Brainomics
- CodeNeuro Datasets
- Collaborative Research in Computational Neuroscience (CRCNS)
- FCP-INDI
- Human Connectome Project
- NDAR
- NeuroData
- Neuroelectro
- NIMH Data Archive
- OASIS
- OpenfMRI
- Study Forrest
Search Engines
- Academic Torrents of data sharing from UMB
- io
- DataMarket (Qlik)
- Harvard Dataverse Network of scientific data
- ICPSR (UMICH)
- Institute of Education Sciences
- National Technical Reports Library
- Open Data Certificates (beta)
- OpenDataNetwork – A search engine of all Socrata powered data portals
- com – statistics and Studies
- Zenodo – An open dependable home for the long-tail of science
- Zanran Numerical Data Search
- Quandl – Intelligent Search for Numerical Data
Sentiment
Social Sciences
- ACLED (Armed Conflict Location & Event Data Project)
- Canadian Legal Information Institute
- Center for Systemic Peace Datasets – Conflict Trends, Polities, State Fragility, etc
- Correlates of War Project
- Cryptome Conspiracy Theory Items
- Datacards
- European Social Survey
- FBI Hate Crime 2013 – aggregated data
- Fragile States Index
- GDELT Global Events Database
- General Social Survey (GSS) since 1972
- German Social Survey
- Global Religious Futures Project
- Humanitarian Data Exchange
- INFORM Index for Risk Management
- Institute for Demographic Studies
- International Networks Archive
- International Social Survey Program ISSP
- International Studies Compendium Project
- James McGuire Cross National Data
- MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste
- Minnesota Population Center
- MIT Reality Mining Dataset
- Notre Dame Global Adaptation Index (ND-GAIN)
- Open Crime and Policing Data in England, Wales and Northern Ireland
- Paul Hensel General International Data Page
- PewResearch Internet Survey Project
- PewResearch Society Data Collection
- Political Polarity Data
- StackExchange Data Explorer
- Terrorism Research and Analysis Consortium
- Texas Inmates Executed Since 1984
- Titanic Survival Data Set or on Kaggle
- UCB’s Archive of Social Science Data (D-Lab)
- UCLA Social Sciences Data Archive
- UN Civil Society Database
- Universities Worldwide
- UPJOHN for Labor Employment Research
- Uppsala Conflict Data Program
- World Bank Open Data
- WorldPop project – Worldwide human population distributions
Social Networks
- 72 hours #gamergate Twitter Scrape
- Ancestry.com Forum Dataset over 10 years
- Cheng-Caverlee-Lee September 2009 – January 2010 Twitter Scrape
- CMU Enron Email of 150 users
- EDRM Enron EMail of 151 users, hosted on S3
- Facebook Data Scrape (2005)
- Facebook Social Networks from LAW (since 2007)
- Foursquare from UMN/Sarwat (2013)
- GitHub Collaboration Archive
- Google Scholar citation relations
- High-Resolution Contact Networks from Wearable Sensors
- Indie Map: social graph and crawl of top IndieWeb sites
- Mobile Social Networks from UMASS
- Network Twitter Data
- Reddit Comments
- Skytrax’ Air Travel Reviews Dataset
- Social Twitter Data
- SourceForge.net Research Data
- Twitter Data for Online Reputation Management
- Twitter Data for Sentiment Analysis
- Twitter Graph of entire Twitter site
- Twitter Scrape Calufa May 2011
- UNIMI/LAW Social Network Datasets
- Yahoo! Graph and Social Data
- YouTube Video Social Graph in 2007, 2008
Software
Speech Datasets
Sports
- Basketball (NBA/NCAA/Euro) Player Database and Statistics
- Betfair Historical Exchange Data
- Cricsheet Matches (cricket)
- Ergast Formula 1, from 1950 up to date (API)
- Football/Soccer resources (data and APIs)
- Lahman’s Baseball Database
- Pinhooker: Thoroughbred Bloodstock Sale Data
- Retrosheet Baseball Statistics
- Tennis database of rankings, results, and stats for ATP, WTA, Grand Slams and Match Charting Project
Streaming
Symbolic Music Datasets
- Piano-midi.de: classical piano pieces
- Nottingham : over 1000 folk tunes
- MuseData: electronic library of classical music scores
- JSB Chorales: set of four-part harmonized chorales
Text Datasets
- 20 newsgroups
- Reuters News dataset
- Penn Treebank
- UCI’s Spambase
- Broadcast News
- Text Classification Datasets
- WikiText
- SQuAD
- Billion Words dataset
- Common Crawl
- Google Books Ngrams
Time Series
- EOD Stock Prices
- Databanks International Cross National Time Series Data Archive
- Global Education Statistics
- Hard Drive Failure Rates
- Heart Rate Time Series from MIT
- Time Series Data Library (TSDL) from MU
- UC Riverside Time Series Dataset
- Zillow Real Estate Research
Transportation
- Airlines OD Data 1987-2008
- Airline Safety
- Bay Area Bike Share Data
- Bike Share Systems (BSS) collection
- GeoLife GPS Trajectory from Microsoft Research
- German train system by Deutsche Bahn
- Hubway Million Rides in MA
- Marine Traffic – ship tracks, port calls and more
- Montreal BIXI Bike Share
- NYC Taxi Trip Data 2009-
- NYC Taxi Trip Data 2013 (FOIA/FOILed)
- NYC Uber trip data April 2014 to September 2014
- Open Traffic collection
- OpenFlights – airport, airline and route data
- Oregon Climate Data
- Philadelphia Bike Share Stations (JSON)
- Plane Crash Database, since 1920
- RITA Airline On-Time Performance data
- RITA/BTS transport data collection (TranStat)
- Toronto Bike Share Stations (XML file)
- Transport for London (TFL)
- Travel Tracker Survey (TTS) for Chicago
- U.S. Bureau of Transportation Statistics (BTS)
- U.S. Domestic Flights 1990 to 2009
- U.S. Freight Analysis Framework since 2007
Web Scraping
Other assorted collections
- Advanced NFL Stats: Play-by-Play Data
- Brodatz dataset: texture modeling
- 300 terabytes of high-quality data from the Large Hadron Collider (LHC) at CERN
- CMU Motion Capture Database
- Criteo click stream dataset
- Cancer Program Data Sets
- Cosm–Explore
- Data Packaged Core Datasets
- Database of Scientific Code Contributions
- Doing Research in New York City Public Schools and Requesting Data – NYC Data – New York City Department of Education
- A growing collection of public datasets:
- Awesome Public Datasets
- gov
- Data Tools – Locators
- Data Sets | Pew Research Center’s Internet & American Life Project
- DataWrangling: Some Datasets Available on the Web
- Data | GeoDa Center
- Europeana Professional – Linked Open Data
- Frequent Itemset Mining Dataset Repository
- Google Ngram Viewer
- Inside-r: Finding Data on the Internet
- Inforum – EconData
- net Research Data
- Summary of Data Sets by Application Area
- NYC Taxi dataset
- Online Data – Robert Shiller
- OpenDataMonitor: An overview of available open data resources in Europe
- Quora: Where can I find large datasets open to the public?
- io: 100+ Interesting Data Sets for Statistics
- StaTrek: Leveraging open data to understand urban lives
- Uber FOIL dataset
- awesome-awesomeness
- sindresorhus’s awesome
I hope this curated list of public datasets, spanning a wide range of industry segments, adds real value to your projects. It would be great to hear your thoughts on how you plan to use these public data sources in your own research, upcoming projects, or niche. You can share your comments just below this post.
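As one possible starting point with any tabular dataset from the list above, the sketch below parses CSV text and holds out a test split using only the Python standard library. The sample rows, the column names, and the helper names load_rows and train_test_split are made up for illustration; a real dataset would be read from its downloaded file instead.

```python
import csv
import io
import random

# Hypothetical sample rows standing in for a downloaded CSV file;
# the column names and values are invented for illustration only.
SAMPLE_CSV = """alcohol,acidity,quality
9.4,0.70,5
9.8,0.88,5
10.0,0.28,6
9.5,0.66,5
11.2,0.30,7
10.5,0.58,6
"""


def load_rows(text):
    """Parse CSV text into a list of dicts with float values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: float(value) for key, value in row.items()} for row in reader]


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split off a held-out test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = load_rows(SAMPLE_CSV)
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

To use a file from one of the repositories, the same load_rows logic could be pointed at an opened file object instead of the in-memory string, assuming every column is numeric.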
Reference Guide:
- github.com
- Kaggle Datasets
- r/datasets
- UCI Machine Learning Repository
- Deeplearning.net
- DeepLearning4J.org
- Quora Answer
- nlp-datasets (GitHub)
- AWS Public Datasets
- Google Cloud Public Datasets
- Microsoft Azure Public Datasets
- The World Bank
- Quandl
- entaroadun (GitHub)
- Satori
- FiveThirtyEight
- BuzzFeedNews
- Sargasso
- The Guardian