Link to the paper

Link to Workshop

Joint work with Ameen Ali, Idan Schwartz, and Lior Wolf

Abstract

Conversation grounded in visual cues is challenging because such cues are often ambiguous. Many questions are open to interpretation and admit multiple correct answers, so understanding why a particular answer was chosen is important. Attention plays a key role in making visual dialog models interpretable; however, attention alone cannot explain the factors that lead to a certain answer. In this work, we examine why a model chooses a specific answer by computing a relevance score for each element of the input using the Deep Taylor Decomposition method. We qualitatively demonstrate the benefit of our approach over attention. Furthermore, we study the predictive power of each modality by calculating intra-entropy and inter-distance.
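As a rough illustration of the pipeline sketched in the abstract, the snippet below propagates relevance through a single linear layer with the Deep Taylor / LRP z+ rule and then summarizes the resulting scores. The `intra_entropy` and `inter_distance` definitions here are illustrative assumptions (Shannon entropy of normalized relevance within a modality, and total-variation distance between two modalities' relevance distributions); the paper's exact formulations may differ.

```python
# Minimal NumPy sketch; names and metric definitions are assumptions, not the paper's code.
import numpy as np

def lrp_zplus(x, W, relevance_out, eps=1e-9):
    """Deep Taylor / LRP z+ rule for one linear layer y = W @ x.

    Redistributes the relevance of the outputs back onto the inputs
    in proportion to their positive contributions.
    """
    Wp = np.maximum(W, 0.0)        # keep positive weights only (z+ rule)
    z = Wp @ x + eps               # positive pre-activations
    s = relevance_out / z          # relevance per unit of pre-activation
    return x * (Wp.T @ s)          # redistribute onto the inputs

def intra_entropy(relevance, eps=1e-12):
    """Shannon entropy of normalized relevance within one modality (assumed reading)."""
    p = np.maximum(relevance, 0.0)
    p = p / (p.sum() + eps)
    return -np.sum(p * np.log(p + eps))

def inter_distance(relevance_a, relevance_b, eps=1e-12):
    """Total-variation distance between two modalities' relevance distributions
    of equal length (an illustrative choice of distance)."""
    pa = np.maximum(relevance_a, 0.0); pa = pa / (pa.sum() + eps)
    pb = np.maximum(relevance_b, 0.0); pb = pb / (pb.sum() + eps)
    return 0.5 * np.abs(pa - pb).sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random(8)                    # toy input features for one modality
    W = rng.standard_normal((4, 8))      # toy linear layer
    R = lrp_zplus(x, W, relevance_out=np.ones(4))
    print("relevance scores:", R)
    print("intra-entropy:", intra_entropy(R))
    print("inter-distance:", inter_distance(R, rng.random(8)))
```

A flat relevance distribution (high intra-entropy) would suggest a modality contributes little discriminative signal, while a large inter-distance would indicate the two modalities highlight different evidence.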

Short Video