Dataset list - A list of datasets and annotation tools

2021

A multitask benchmarking framework comprising complementary data modalities at city scale, registered across different representations, and enriched with human and machine generated annotations. It includes 27,745 high-resolution 360° images with human-curated annotations and 3D point clouds from aerial and street-level LiDAR and from Structure-from-Motion and Multiview-Stereo reconstructions, geo-anchored using high-precision, survey-grade ground control points. Full aerial image coverage at 7.5 cm/px resolution. Manually labeled 2D / 3D object annotations for up to 39 semantic categories.

2021

A dataset of building footprints to support social good applications. The dataset contains 516M building detections across an area of 19.4M km² (64% of the African continent).

2021

Facebook AI and Matterport have collaborated on the release of the largest-ever 3D dataset of indoor spaces made up of accurately-scaled residential and commercial spaces. The dataset consists of 3D Meshes and Textures of 1,000 Matterport spaces.

2021

The Unsplash Dataset is created by 250,000+ contributing photographers and billions of searches across thousands of applications, uses, and contexts. The Lite version has 25,000 images; the Full version has 3,000,000+ images.

2021

A large-scale dataset of 3D building models containing 513K annotated mesh primitives, grouped into 292K semantic part components across 2K building models.

2021

A photorealistic synthetic dataset for holistic indoor scene understanding. 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.

2021

An ImageNet replacement for self-supervised pretraining without humans. PASS contains 1.4 million distinct images.

2021

A dataset of Amazon products with metadata, catalog images, and 3D models. 147,702 products and 398,212 unique catalog images in high resolution.

2021

Unlimited Road-scene Synthetic Annotation (URSA) Dataset, a synthetic dataset containing upwards of 1,000,000 images.

2021

https://github.com/HDCVLab/EDFace-Celeb-1M

2021

The Casual Conversations dataset is designed to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of ages, genders, apparent skin tones and ambient lighting conditions. Casual Conversations is composed of over 45,000 videos (3,011 participants) and is intended for assessing the performance of already trained models.

2021

A large dataset aimed at teaching AI to code. It consists of some 14M code samples and about 500M lines of code in more than 55 different programming languages, from modern ones like C++, Java, Python, and Go to legacy languages like COBOL, Pascal, and FORTRAN.

2021

The Mapillary Vistas Dataset is the most diverse publicly available dataset of manually annotated training data for semantic segmentation of street scenes. 25,000 images pixel-accurately labeled into 152 object categories, 100 of those instance-specific.

2021

The podcast dataset contains about 100k podcasts, filtered to include only those that the creator tagged as English and that also pass a language filter applied to the creator-provided title and description.

2021

With object trajectories and corresponding 3D maps for over 100,000 segments, each 20 seconds long and mined for interesting interactions, our new motion dataset contains more than 570 hours of unique data.

2021

TextOCR provides ~1M high quality word annotations on TextVQA images allowing application of end-to-end reasoning on downstream tasks such as visual question answering or image captioning.

2021

Contains spoken English commands for setting timers, setting alarms, unit conversions, and simple math. The dataset contains around 2,200 spoken audio commands from 95 speakers, representing 2.5 hours of continuous audio.

2021

CaseHOLD contains 53,000 multiple-choice questions, each pairing a prompt from a judicial decision with several potential holdings that could be cited, only one of which is correct.

2021

Contract Understanding Atticus Dataset (CUAD) v1 is a corpus of 13,000+ labels in 510 commercial legal contracts that have been manually labeled under the supervision of experienced lawyers to identify 41 types of legal clauses that are considered important in contract review in connection with a corporate transaction, including mergers & acquisitions, etc.

2021

WebFace260M is a new million-scale face benchmark, constructed to help the research community close the data gap with industry.

2021

A billion-word corpus of Danish text, freely distributed with attribution.

2021

The ONCE dataset is a large-scale autonomous driving dataset with 2D and 3D object annotations. It includes 1 million LiDAR frames and 7 million camera images.

2021

Adverse Conditions Dataset with Correspondences for training and testing semantic segmentation methods on adverse visual conditions. It comprises a large set of 4006 images which are evenly distributed between fog, nighttime, rain, and snow.

2021

A dataset of sky images and their irradiance values. The SkyCam dataset is a collection of sky images from a variety of locations with diverse topological characteristics (Swiss Jura, Plateau and Pre-Alps regions), from both single and stereo camera settings, coupled with high-accuracy pyranometers. The dataset was collected at high frequency, with one data sample every 10 seconds.

2021

A dataset for automatic mapping of buildings, woodlands, water and roads from aerial images.

2021

A dataset of “in the wild” portrait videos. The videos are diverse real-world samples in terms of the source generative model, resolution, compression, illumination, aspect ratio, frame rate, motion, pose, cosmetics, occlusion, content, and context. They originate from various sources such as news articles, forums, apps, and research presentations, totaling 142 videos, 32 minutes, and 17 GB.

2020

A novel dataset covering seasonal and challenging perceptual conditions for autonomous driving.

2020

This dataset contains 11,842,186 computer generated building footprints in all Canadian provinces and territories.

2020

MedMNIST is a collection of 10 pre-processed medical open datasets. MedMNIST is standardized to perform classification tasks on lightweight 28 × 28 images, which requires no background knowledge.

2020

Cube++ is a novel dataset collected for computational color constancy. It has 4,890 raw 18-megapixel images, each containing a SpyderCube color target in the scene, along with manually labelled categories and ground truth illumination chromaticities.

2020

Large-scale Person Re-ID Dataset. SYSU-30k contains 29,606,918 images.

2020

Smithsonian Open Access, where you can download, share, and reuse millions of the Smithsonian’s images—right now, without asking. With new platforms and tools, you have easier access to more than 3 million 2D and 3D digital items.

2020

The Objectron dataset is a collection of short, object-centric video clips, which are accompanied by AR session metadata that includes camera poses, sparse point clouds and characterization of the planar surfaces in the surrounding environment. Includes 15,000 annotated videos and 4M annotated images.

2020

MedICaT is a dataset of medical images, captions, subfigure-subcaption annotations, and inline textual references. It consists of 217,060 figures from 131,410 open access papers, 7,507 subcaption and subfigure annotations for 2,069 compound figures, and inline references for ~25K figures in the ROCO dataset.

2020

CLUE: A Chinese Language Understanding Evaluation Benchmark. CLUE is an open-ended, community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text.

2020

Ruralscapes Dataset for Semantic Segmentation in UAV Videos. Ruralscapes is a dataset with 20 high quality (4K) videos portraying rural areas.

2020

Fashionpedia is a dataset which consists of two parts: (1) an ontology built by fashion experts containing 27 main apparel categories, 19 apparel parts, 294 fine-grained attributes and their relationships; (2) a dataset with 48k everyday and celebrity event fashion images annotated with segmentation masks and their associated per-mask fine-grained attributes, built upon the Fashionpedia ontology.

2020

Social Bias Inference Corpus (SBIC) contains 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups.

2020

COVID-19 severity score assessment project and database. 4,703 chest X-rays (CXR) of COVID-19 patients.

2020

MaskedFace-Net is a dataset of human faces with a correctly or incorrectly worn mask (137,016 images) based on the dataset Flickr-Faces-HQ (FFHQ).

2020

A holistic dataset for movie understanding. 1.1K Movies, 60K trailers.

2020

ETH-XGaze consists of over one million high-resolution images of varying gaze under extreme head poses.

2020

The largest product recognition dataset, containing 10,000 products frequently bought by online customers on JD.com.

2020

HAA500 is a manually annotated human-centric atomic action dataset for action recognition on 500 classes with over 591k labeled frames.

2020

The dataset contains over 16.5k (16557) fully pixel-level labeled segmentation images.

2020

Human-centric Video Analysis in Complex Events. The HiEve dataset currently includes the largest number of poses (>1M), the largest number of complex-event action labels (>56k), and one of the largest numbers of long-term trajectories (average trajectory length >480).

2020

AViD is a large-scale video dataset with 467k videos and 887 action classes. The collected videos have Creative Commons licenses.