Link to the paper

Link to Workshop

Joint work with Ameen Ali, Idan Schwartz, and Lior Wolf

Abstract

Conversation grounded in visual cues is challenging because such cues are often ambiguous. Many questions are open to interpretation and admit multiple correct answers, so understanding why a particular answer was chosen is important. Attention plays a key role in making visual dialog models interpretable; however, attention alone cannot explain the factors that lead to a certain answer. In this work, we examine why a model chooses a specific answer by computing a relevance score for each element of the input using the Deep Taylor Decomposition method. We qualitatively demonstrate the benefit of our approach over attention. Furthermore, we study the predictive power of each modality by calculating intra-entropy and inter-distance.
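As a rough illustration of the pipeline sketched in the abstract, the snippet below propagates relevance through a single linear layer with the Deep Taylor / LRP z+ rule and then summarizes the resulting scores. The `intra_entropy` and `inter_distance` definitions here are illustrative assumptions (Shannon entropy of normalized relevance within a modality, and total-variation distance between two modalities' relevance distributions); the paper's exact formulations may differ.

```python
# Minimal NumPy sketch; names and metric definitions are assumptions, not the paper's code.
import numpy as np

def lrp_zplus(x, W, relevance_out, eps=1e-9):
    """Deep Taylor / LRP z+ rule for one linear layer y = W @ x.

    Redistributes the relevance of the outputs back onto the inputs
    in proportion to their positive contributions.
    """
    Wp = np.maximum(W, 0.0)        # keep positive weights only (z+ rule)
    z = Wp @ x + eps               # positive pre-activations
    s = relevance_out / z          # relevance per unit of pre-activation
    return x * (Wp.T @ s)          # redistribute onto the inputs

def intra_entropy(relevance, eps=1e-12):
    """Shannon entropy of normalized relevance within one modality (assumed reading)."""
    p = np.maximum(relevance, 0.0)
    p = p / (p.sum() + eps)
    return -np.sum(p * np.log(p + eps))

def inter_distance(relevance_a, relevance_b, eps=1e-12):
    """Total-variation distance between two modalities' relevance distributions
    of equal length (an illustrative choice of distance)."""
    pa = np.maximum(relevance_a, 0.0); pa = pa / (pa.sum() + eps)
    pb = np.maximum(relevance_b, 0.0); pb = pb / (pb.sum() + eps)
    return 0.5 * np.abs(pa - pb).sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random(8)                    # toy input features for one modality
    W = rng.standard_normal((4, 8))      # toy linear layer
    R = lrp_zplus(x, W, relevance_out=np.ones(4))
    print("relevance scores:", R)
    print("intra-entropy:", intra_entropy(R))
    print("inter-distance:", inter_distance(R, rng.random(8)))
```

A flat relevance distribution (high intra-entropy) would suggest a modality contributes little discriminative signal, while a large inter-distance would indicate the two modalities highlight different evidence.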

Short Video