As someone who’s trying to build skills around data visualization and data analysis, I’ve gotten somewhat frustrated at not being able to find data to work with. There’s a lot of publicly available data out there, in formats that are intended to be useful, but it’s sometimes hard to find.
Here’s a small list of publicly available data sources, of variable quality. DM me on Twitter or open a pull request if you have any sources you’d like to contribute!
This list is currently loosely sorted by ‘type’ of data.
- US Government-based sources
- World Development
- Crime and Justice
- Health
- Other
- Large Collections of Data
- Fun things
- Cheap Data
- Other People’s Lists
US Government-based sources
- Data.gov - A repository for all data the government collects that is not private or restricted. A literal gold mine.
- “… increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government.”
- US Census - Obviously the first place you go to for demographic data about the US. Also contains useful little bits, such as geographic data.
- Geospatial Platform - Super useful if you’re interested in national level geospatial stuff.
- The GeoPlatform provides shared and trusted geospatial data, services, and applications for use by the public and by government agencies and partners to meet their mission needs.
- USDA Economic Research Service - All sorts of data related to agriculture and the agricultural trade in the US.
- The mission of USDA’s Economic Research Service is to anticipate trends and emerging issues in agriculture, food, the environment, and rural America and to conduct high-quality, objective economic research to inform and enhance public and private decision making
- National Agricultural Statistics Service -
- Timely, accurate, and useful statistics in service to U.S. agriculture
- NASA Prognostics Data Repository -
- A collection of data sets that have been donated by various universities, agencies, or companies. The data repository focuses exclusively on prognostic data sets, i.e., data sets that can be used for development of prognostic algorithms. Mostly these are time series of data from some nominal state to a failed state. The collection of data in this repository is an ongoing process.
- NCEI Land-based Station Data - A whole lot of weather data.
- Land-based observations are collected from instruments sited at locations on every continent. They include temperature, dew point, relative humidity, precipitation, wind speed and direction, visibility, atmospheric pressure, and types of weather occurrences such as hail, fog, and thunder.
- Rural-Urban Commuting Area Codes - Super interesting to combine with other place-based data.
- The rural-urban commuting area (RUCA) codes classify U.S. census tracts using measures of population density, urbanization, and daily commuting. The most recent RUCA codes are based on data from the 2010 decennial census and the 2006-10 American Community Survey. The classification contains two levels. Whole numbers (1-10) delineate metropolitan, micropolitan, small town, and rural commuting areas based on the size and direction of the primary (largest) commuting flows.
World Development
- World Bank Open Data - Free and open access to world development data
- OECD Data - OECD’s publication site. A myriad of databases of all sorts of development-related things.
- Humanitarian Data Exchange - “Humanitarian data”, meaning data relevant to humanitarian crises. This could include baseline data, demographic information, data about aid organizations.
- Yemen Data Project - A project aimed at collecting data about the conduct of war in Yemen. Data about airstrike frequency, targets, times.
- overall goal of contributing independent and neutral data to increase transparency over the conduct of the war and to inform humanitarian response, human rights advocacy, media coverage and policy discussion.
- PRIO - Peace Research Institute Oslo - PRIO makes their datasets publicly available for replication and such. Lots of really good stuff in there.
- The Peace Research Institute Oslo (PRIO) conducts research on the conditions for peaceful relations between states, groups and people.
- Armed Conflict Location & Event Data Project - Extensive, regularly updated dataset on armed conflict across the world.
- Political violence and protest includes events that occur within civil wars and periods of instability, public protest and regime breakdown. ACLED’s aim is to capture the forms, actors, dates and locations of political violence and protest as it occurs across states. The ACLED team conducts analysis to describe, explore and test conflict scenarios, and makes both data and analysis open to freely use by the public.
- The DHS Program - nationally representative survey data for a bunch of countries. Really amazing source. HT to Tom Fish.
- World Bank - Health Equity and Financial Protection - This dataset contains information on health service coverage, health outcomes and financial protection from excess out-of-pocket medical spending at country level.
Crime and Justice
- National Archive of Criminal Justice Data - Great source for data on crime and justice.
- Uniform Crime Reporting Program Data: Offenses Known and Clearances by Arrest, 1960-2016 - Reported crime data from Jacob Kaplan of the University of Pennsylvania.
- a compilation of offenses reported to law enforcement agencies in the United States. Crimes included are criminal homicide, forcible rape, robbery, aggravated assault, burglary, larceny-theft, and motor vehicle theft.
- Police, Prostitution and Politics - State by state breakdown of prostitution arrests and child sex trafficking cases from 2000 to 2015. A thousand apologies, for this one is a PDF.
- American Violence - Contains data (that you can filter and download) on violence statistics in the US
- Public resource that will make data on violence accessible … , allowing users to visualize and analyze trends in violence at multiple geographic levels (neighborhoods and cities) and over different timeframes (month to month, year to year, decade to decade).
Health
- SAMHSA Data Collections - a variety of data related to mental health, including information about population, infrastructure, clients. (Government source, but felt like it belonged here)
- Data helps SAMHSA and the nation assess the impact of the changes to US health care systems and identify and address behavioral health disparities.
- CDC NHANES survey - cross-sectional survey of health & nutritional status of the US population. Make sure you understand the survey methods! (Government source, but felt like it belonged here)
- R package “outbreaks” - An R package compiling some publicly available disease outbreak data. Useful for testing models and algorithms.
- Vanderbilt Biostats Datasets - a goldmine collection of datasets, conveniently formatted too
- CDC WONDER - A bunch of good datasets. “Ad hoc query system”. This is the only place I found yearly STD data going back to 1995.
Other
- Homeland Infrastructure Foundation-Level Data (HIFLD) - Contains mostly infrastructure-based geospatial data (such as schools, hospitals, airplane runways, etc). Includes shapefiles.
- This site provides National foundation-level geospatial data within the open public domain that can be useful to support community preparedness, resiliency, research, and more.
- OCC Oil and Gas Data Files - Data related to mining/drilling activities
- International Institute of Tropical Agriculture (IITA) data repository - open access ag data for tropical agriculture
- UN Comtrade - International trade data. It is a repository of official international trade statistics and relevant analytical tables. Easiest to get your data through an API. It’s got some neat visualizations as well.
- Data SF - Open city data. Lots of other cities have this; you should check to see if your favorite city has a policy like this. I’m linking SF’s here because I’m a little bit biased.
Large Collections of Data
- EU Open Data Portal - Open data published by EU institutions and bodies. All sorts of fun things in here.
- Harvard Dataverse - Open source repository for research data. Take individual data sources here with a grain of salt, and make sure you really understand how the data was collected…
Fun things
- New York Public Library - What’s on the menu? - A giant collection of 100 years’ worth of restaurant menus.
Cheap Data
- Bureau of Safety and Environmental Enforcement Premade Data Sets - Relatively cheap data sets (most under $35) relating to seismic data and offshore drilling.
Other People’s Lists
- Maarten van Smeden’s Open Data Repos - related to health, medicine, and epidemiology