VQD: Visual Query Detection for Natural Scenes

We introduce the Visual Query Detection (VQD) task: given an image and a query in natural language, the system must produce 0 - N boxes that satisfy that query. VQD is related to several other tasks in computer vision, but it captures abilities those tasks ignore. Unlike object detection, VQD can deal with attributes of and relations among objects in the scene. In VQA, algorithms often produce the right answers due to dataset bias without 'looking' at the relevant image regions. Referring Expression Recognition (RER) datasets have short and often ambiguous prompts, and by having only a single box as an output, they make it easier to exploit dataset biases. VQD requires goal-directed object detection and outputting a variable number of boxes that answer a query.
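The task contract can be sketched in a few lines; the function name, the toy query, and the (x, y, width, height) box convention below are illustrative assumptions, not the paper's API:

```python
from typing import List, Tuple

# A box as (x, y, width, height) in pixels -- an assumed convention.
Box = Tuple[int, int, int, int]

def vqd(image, query: str) -> List[Box]:
    """A VQD system maps (image, natural-language query) to 0-N boxes.

    This stub only illustrates the contract: unlike referring expression
    recognition, the output list may be empty or contain many boxes.
    """
    # A real model would ground the query in the image; here we return
    # a fixed answer for one toy query to show the variable-length output.
    if query == "Show me the dog.":
        return [(10, 20, 50, 40)]
    return []  # queries with no matching region yield zero boxes

boxes = vqd(image=None, query="Show me the dog.")
print(len(boxes))  # a query may be satisfied by 0, 1, or many boxes
```

The key difference from RER is that the return type is a list, so "no matching object" and "several matching objects" are both valid, checkable answers.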

Download VQD from our GitHub repo.

We created VQDv1, the first dataset for VQD. VQDv1 contains three distinct query categories. Some example images from our dataset are shown below.

VQDv1 Stats

Compared to other datasets, VQDv1 has the largest number of queries, and the number of bounding boxes per query ranges from 0 to 15.
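The 0-15 box range means that answer-size statistics are part of the dataset's signature. A minimal sketch of computing that distribution is below; the JSON field names are hypothetical, not the released VQDv1 schema:

```python
import json
from collections import Counter

# Hypothetical annotation layout -- the field names ("image_id",
# "query", "boxes") are assumptions for illustration only.
annotations = json.loads("""
[
  {"image_id": 1, "query": "Show me the dog.",     "boxes": [[10, 20, 50, 40]]},
  {"image_id": 1, "query": "Show me the zebra.",   "boxes": []},
  {"image_id": 2, "query": "Which chairs are red?", "boxes": [[0, 0, 30, 30], [40, 0, 30, 30]]}
]
""")

# Count how many queries have 0 boxes, 1 box, 2 boxes, and so on.
# In VQDv1 this ranges from 0 to 15 boxes per query.
counts = Counter(len(a["boxes"]) for a in annotations)
for n_boxes in sorted(counts):
    print(f"{n_boxes} box(es): {counts[n_boxes]} queries")
```

An evaluation script would use the same per-query box lists, comparing predicted and ground-truth sets rather than a single box.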

Click to read about VQD in our NAACL-2019 paper.

Manoj Acharya

Karan Jariwala

Prof. Christopher Kanan