MOSE is a large-scale dataset for video object segmentation in complex scenes. It contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725 high-quality object segmentation masks.
OVIS is a large-scale dataset for occluded video instance segmentation. It consists of 296k high-quality instance masks across 25 semantic categories, in scenes where heavy object occlusion is common.
1st Occluded Video Instance Segmentation Challenge in ICCV 2021
2nd Occluded Video Instance Segmentation Challenge in ECCV 2022
DanceTrack is a multi-human tracking dataset that emphasizes 1) uniform appearance: people look highly similar and are almost indistinguishable from one another, and 2) diverse motion: people move in complicated patterns and frequently exchange relative positions.
1st Multiple People Tracking in Group Dance Challenge in ECCV 2022
MUSES is a large-scale video dataset designed to spur research on a new task called multi-shot temporal event localization. MUSES has 31,477 event instances over a total of 716 video hours. Its defining characteristic is frequent shot cuts, averaging 19 shots per instance and 176 shots per video, which induce large intra-instance variations.
YouMVOS is a dataset for multi-shot video object segmentation, consisting of 431K segmentation masks across 200 YouTube videos.
WarpDoc is a warped document image dataset for document restoration. It consists of 1,020 camera images of documents collected from scientific papers, magazines, envelopes, etc., which vary in paper material, page layout, and content.
SCUT-CTW-Context and ReCTS-Context
Both datasets are additionally annotated for a new task called Contextual Text Block Detection. The task aims to detect contextual text blocks, each consisting of one or more integral text units (e.g., characters, words, or phrases) arranged in natural reading order.