Indoor Scene Knowledge Acquisition using Natural Language Descriptions

eblackwood — Fri, 12 Jul 2013 18:58:53 +0000

Saranya Kesavan. Unpublished Master’s Thesis, May 2013, 91 pp. (N.A. Giudice: thesis advisor)

Existing research on non-visual indoor navigation is limited to route guidance between locations (i.e., the corridor network). This focus ignores many critical regions within indoor spaces (e.g., rooms and lobbies), which are often as challenging to learn and navigate without vision as the routes connecting them. To address this challenge, this thesis investigates the use of natural language (NL) descriptions as a non-visual medium for providing access to indoor scenes, including room structure, furniture placement, and the location of salient landmarks. The work is part of a larger project to develop a system, called the Describer for Indoor Scenes (DISc), that uses automatically generated NL descriptions to represent indoor scenes based on photos taken in real time from mobile devices. To develop cognitively comprehensible NL descriptions of indoor scenes, it is critical to first understand how humans describe and interpret a scene in order to support spatial behavior. To this end, six behavioral experiments were conducted to characterize scene descriptions generated by human observers and to optimize these descriptions, based on cognitive constraints and the structure of the linguistic information included, to best support non-visual learning, representation, and navigation.

The visual information that can be captured about a scene from photographs is potentially limited, in both quality and quantity, compared to the information apprehended from real-time scene perception. Importantly for the DISc system, results from experiments 1, 2, and 3 converge to demonstrate that photographic observations are functionally equivalent to real-time observations of indoor scenes in supporting spatial behavior, and show that photographs can be used as an information source in DISc. The data collected in these experiments also showed that humans adopted different scene description strategies. To understand how the description strategy (i.e., the order in which objects are described) affected scene learning and reconstruction, a fourth behavioral experiment was conducted. Results from this experiment suggest that following a cyclic path while describing an indoor scene (called a “Round-About strategy”) was the most efficient approach for acquiring and representing spatial knowledge.
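One way to picture a cyclic description order is to sort the objects in a scene by their bearing around the observer. The sketch below is illustrative only and is not taken from the thesis: it assumes 2D floor coordinates, an observer facing the +y direction, and a clockwise sweep starting straight ahead. The function name `round_about_order` and the sample objects are hypothetical.

```python
import math


def round_about_order(observer_xy, objects):
    """Return object names sorted clockwise around the observer.

    Assumptions (not from the source): positions are 2D (x, y) floor
    coordinates, the observer faces +y, and the sweep starts straight
    ahead and proceeds clockwise.
    """
    def bearing(xy):
        dx = xy[0] - observer_xy[0]
        dy = xy[1] - observer_xy[1]
        # atan2(dx, dy) measures the angle clockwise from the +y axis.
        return math.degrees(math.atan2(dx, dy)) % 360.0

    return sorted(objects, key=lambda name: bearing(objects[name]))


# Hypothetical scene layout, for illustration only.
scene = {"sofa": (-1, 0), "door": (0, 2), "desk": (2, 0), "lamp": (0, -1)}
order = round_about_order((0, 0), scene)
```

With this layout the sweep visits the door (ahead), then the desk (right), the lamp (behind), and the sofa (left).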

The results from the first four experiments showed that people used two different angular units (clock face and degree measurements) to describe directional information. However, it was not clear from the extant literature how angular units affect the listener's spatial apprehension or which measure yields the most accurate performance. As directional information is critical for specifying the location of objects in a scene, this question was addressed in a fifth experiment. Results demonstrated that performance was most accurate when angular directions were given in clock face units rather than degree measurements (e.g., 1:00 rather than 30 degrees). Results also demonstrated that participants were equally accurate at producing angular values at 15-degree (half-hour) increments (e.g., 1:30), which is meaningful as this is a 100% increase in precision over the standard clock face units employed in previous studies.
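The relationship between the two angular units can be made concrete with a small conversion sketch. It relies only on the arithmetic stated above (one clock hour spans 30 degrees, so a half hour spans 15 degrees). The function names and the convention that 12:00 means straight ahead with angles increasing clockwise are assumptions for illustration, not part of DISc.

```python
def degrees_to_clock(angle_deg):
    """Snap an angle to the nearest half-hour clock face label.

    Assumption: 0 degrees is straight ahead (12:00) and angles
    increase clockwise. Each half hour spans 15 degrees.
    """
    half_hours = round((angle_deg % 360) / 15) % 24
    hour = half_hours // 2
    minutes = 30 if half_hours % 2 else 0
    if hour == 0:
        hour = 12
    return f"{hour}:{minutes:02d}"


def clock_to_degrees(label):
    """Convert a clock face label such as '1:30' back to degrees."""
    hour_text, minute_text = label.split(":")
    hour = int(hour_text) % 12          # 12:00 maps to 0 degrees
    minutes = int(minute_text)
    return hour * 30 + minutes * 0.5    # 30 degrees per hour


one_oclock = degrees_to_clock(30)
half_past_one = degrees_to_clock(45)
```

Here 30 degrees maps to 1:00 and 45 degrees maps to 1:30, matching the examples in the text.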

The sixth and final behavioral experiment investigated whether the optimized NL scene descriptions support non-visual navigation of indoor scenes and whether performance differs between static and updated descriptions. Static descriptions were given from a fixed user perspective in the scene (as in the earlier experiments), whereas updated descriptions changed with the user’s position and orientation. Results showed a clear advantage for updated NL descriptions on navigation accuracy, indicating that to be maximally effective, DISc should implement descriptions based on the user’s real-time position and orientation as they move. Taken together, the results of the six human experiments extend earlier research on route navigation by showing that optimizing NL indoor scene descriptions based on perceptual and cognitive factors led to efficient spatial learning, representation, and navigation. These empirical results provide the much-needed proof of concept for the efficacy of future development of DISc as a fully automated NL scene description system.
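The difference between static and updated descriptions can be sketched as a change in the reference frame used to compute an object's direction. The example below is a minimal illustration under stated assumptions, not the DISc implementation: positions are 2D floor coordinates, heading is measured in degrees clockwise from the +y axis, and the function name `relative_bearing` is hypothetical.

```python
import math


def relative_bearing(user_xy, heading_deg, object_xy):
    """Direction of an object relative to where the user is facing.

    Assumptions (not from the source): 2D (x, y) coordinates, heading
    in degrees clockwise from the +y axis. The result is in degrees
    clockwise from straight ahead, in the range [0, 360).
    """
    dx = object_xy[0] - user_xy[0]
    dy = object_xy[1] - user_xy[1]
    absolute = math.degrees(math.atan2(dx, dy))  # clockwise from +y
    return (absolute - heading_deg) % 360.0


table = (1.0, 1.0)  # hypothetical object position

# Static description: computed once from the initial pose.
static_bearing = relative_bearing((0.0, 0.0), 0.0, table)

# Updated description: recomputed after the user turns 90 degrees right.
updated_bearing = relative_bearing((0.0, 0.0), 90.0, table)
```

After the turn, the static value still places the table 45 degrees to the right, while the updated value places it at 315 degrees, that is, 45 degrees to the left of the user's new heading.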

Indoor scene knowledge acquisition using a natural language interface

Thu, 02 Aug 2012 15:44:34 +0000

Abstract: This paper proposes an interface that uses automatically generated Natural Language (NL) descriptions to describe indoor scenes based on photos of the scene taken with smartphones or other portable camera-equipped mobile devices. The goal is to develop a non-visual interface based on spatio-linguistic descriptions that could assist blind people in knowing the contents of an indoor scene (e.g., room structure, furniture, landmarks, etc.) and support efficient navigation of the space based on these descriptions. In this paper, we use human behavioral experiments to concentrate on three questions: understanding the most salient content of a stereotypic indoor scene as described by an observer, categorizing the description strategies employed in this process, and evaluating the best presentation of directional information in NL descriptions to support the most accurate spatial behaviors and mental representations of these scenes. This knowledge will then be used to develop a domain-specific indoor scene ontology, which in turn will be used to generate automated NL descriptions of indoor scenes from their photographs. These descriptions will finally be integrated into a real-time non-visual scene description system.

Citation: Kesavan, S. & Giudice, N.A. (2012). Indoor scene knowledge acquisition using a natural language interface. In C. Graf, N.A. Giudice, & F. Schmid (Eds.), Proceedings of the International Workshop on Spatial Knowledge Acquisition with Limited Information Displays (SKALID’12), pp. 1-6. August, Monastery Seeon, Germany.


