
OVIS (short for Occluded Video Instance Segmentation) is a new large-scale benchmark dataset for the video instance segmentation task. It is designed with the philosophy of perceiving object occlusions in videos, which reveal the complexity and diversity of real-world scenes.

Occluded Video Instance Segmentation
Jiyang Qi1,2, Yan Gao2, Yao Hu2, Xinggang Wang1, Xiaoyu Liu2, Xiang Bai1, Serge Belongie3, Alan Yuille4, Philip Torr5, Song Bai2,5
1Huazhong University of Science and Technology 2Alibaba Group 3Cornell University
4Johns Hopkins University 5University of Oxford



Can our video understanding systems perceive objects when a heavy occlusion exists in a scene?

To answer this question, we collect a large-scale dataset called OVIS for occluded video instance segmentation, that is, to simultaneously detect, segment, and track instances in occluded scenes. OVIS consists of 296k high-quality instance masks from 25 semantic categories, where object occlusions usually occur. While the human vision system can understand occluded instances through contextual reasoning and association, our experiments suggest that current video understanding systems are far from satisfactory. On the OVIS dataset, the highest AP achieved by state-of-the-art algorithms is only 14.4, which reveals that we are still at a nascent stage of understanding objects, instances, and videos in real-world scenarios.

Explore OVIS

OVIS Consists of:

Given a video, all the objects belonging to the pre-defined category set are exhaustively annotated. All the videos are annotated every 5 frames.

Distinctive Properties


The 25 semantic categories in OVIS are Person, Bird, Cat, Dog, Horse, Sheep, Cow, Elephant, Bear, Zebra, Giraffe, Poultry, Giant panda, Lizard, Parrot, Monkey, Rabbit, Tiger, Fish, Turtle, Bicycle, Motorcycle, Airplane, Boat, and Vehicle.

For a detailed description of OVIS, please refer to our paper.


Visualization of the annotations.


Dataset Download

We provide the frames and annotations.

Dataset Download
New Evaluation Server for Workshop
Old Evaluation Server

The annotations are COCO-style, just like YouTube-VIS, so adapting your YouTube-VIS code for OVIS is nearly cost-free. (Please refer to this repo for the code for loading annotations.)
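As a rough illustration of what "COCO-style, like YouTube-VIS" means, the sketch below loads such an annotation file with only the standard library and indexes annotations by video. The field names (`videos`, `annotations`, `video_id`, `categories`) follow the YouTube-VIS convention; check the repo linked above for the exact schema used by OVIS.

```python
import json
from collections import defaultdict

def load_vis_annotations(path):
    """Load a YouTube-VIS/OVIS-style COCO-format annotation file.

    Returns the list of video records, a dict mapping video id to its
    annotations, and a dict mapping category id to category name.
    Field names are assumed to follow the YouTube-VIS convention.
    """
    with open(path) as f:
        data = json.load(f)
    anns_by_video = defaultdict(list)
    for ann in data.get("annotations", []):
        anns_by_video[ann["video_id"]].append(ann)
    categories = {c["id"]: c["name"] for c in data.get("categories", [])}
    return data.get("videos", []), anns_by_video, categories
```

In practice you would typically use the dataset loader from the linked repo (or a `pycocotools`-style API) rather than raw JSON, but the structure is the same.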


The code and models of the baseline method will be released at [coming soon].

The evaluation metric is the same as YouTube-VIS's, so you can use the evaluation code provided by them [link].
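At the core of that metric, matching a predicted instance track to a ground-truth track uses a spatio-temporal IoU: intersection and union are accumulated over all frames of the video before dividing. A minimal NumPy sketch of that quantity (assuming dense boolean masks of shape `(T, H, W)`; the official code operates on RLE-encoded masks instead):

```python
import numpy as np

def video_mask_iou(masks_a, masks_b):
    """Spatio-temporal IoU between two instance mask sequences.

    masks_a, masks_b: boolean arrays of shape (T, H, W), one mask per
    frame.  Pixel counts are summed over all frames before dividing,
    following the YouTube-VIS evaluation convention.
    """
    inter = np.logical_and(masks_a, masks_b).sum()
    union = np.logical_or(masks_a, masks_b).sum()
    return float(inter) / union if union > 0 else 0.0
```

Note that a track absent from some frames simply contributes empty masks there, so missed frames lower the IoU of an otherwise accurate prediction.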


If the dataset helps your research, please cite this paper.

@article{qi2021occluded,
  title={Occluded Video Instance Segmentation},
  author={Jiyang Qi and Yan Gao and Yao Hu and Xinggang Wang and Xiaoyu Liu and Xiang Bai and Serge Belongie and Alan Yuille and Philip Torr and Song Bai},
  journal={arXiv preprint arXiv:2102.01558},
  year={2021}
}


For questions and suggestions, please contact Jiyang Qi (jiyangqi at hust dot edu dot cn).