Our system architecture is integrated using the Robot Operating System (ROS). The robot can provide unique services based on spatial concepts formed from the multimodal information it obtains. The robot software consists of the following functions:
Our main research focuses on spatial concept formation using autonomous robots through unsupervised machine learning [1-4]. In particular, we use a nonparametric Bayesian generative model. The spatial concept is a place category constructed on the basis of the robot's experiences.
The robot learns spatial concepts from multimodal information such as self-positions, human speech signals, and image features, as shown in the figure on the right.
In the speech recognition task, the robot is required to hear, recognize, and respond to the human voice. We use rospeex* for speech recognition. It is a cloud-based multilingual communication package for ROS. It has high recognition accuracy and can be easily integrated through its Python or C++ APIs. Rospeex receives a speech signal from the accompanying waveform monitor application. After noise reduction and speech segment detection, the speech signal is converted into text by the speech recognition engine in the cloud.
However, rospeex does not work without a network connection, and if the connection (Wi-Fi) is unstable, its recognition accuracy degrades. Therefore, we also use another speech recognition system, Julius**.
Julius is high-performance open-source software for large vocabulary continuous speech recognition (LVCSR), aimed at speech-related researchers and developers. By preparing an acoustic model, a word dictionary, and a language model, we can perform speech recognition with Julius without a network connection. We choose between these two speech recognition systems on a case-by-case basis.
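The case-by-case choice between the two recognizers can be sketched as a simple fallback rule. This is only an illustrative sketch: the function names, the failure counter, and the thresholds are assumptions made for the example and are not part of rospeex or Julius.

```python
import socket


def network_reachable(host, port=443, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    The host to probe is an assumption of this sketch; any server that the
    cloud recognizer depends on could be used.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_recognizer(network_ok, recent_failures=0, max_failures=2):
    """Pick the cloud recognizer when the network looks usable, else the local one.

    recent_failures is a hypothetical count of recent cloud requests that
    timed out; it stands in for "the connection is not stable".
    """
    if network_ok and recent_failures <= max_failures:
        return "rospeex"
    return "julius"
```

For instance, `choose_recognizer(network_reachable("example.com"))` would select the cloud recognizer only while the probe succeeds, and fall back to the local recognizer otherwise.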
To perform the manipulation task, the robot must recognize objects using a camera attached to it. Our team uses You Only Look Once (YOLO***) as the object recognition algorithm. YOLO is a CNN-based object recognition method. It simultaneously estimates the bounding boxes and classes of objects from captured images, as shown in the figure.
In previous methods such as R-CNN, object classification was performed after region proposals had been detected, so R-CNN needs to evaluate several thousand region proposals on a single image. In contrast, YOLO obtains the classes and bounding boxes of objects by dividing the entire image into grid regions. Because YOLO sees the entire image at test time, a single pass over each image is sufficient. As a result, although the detection accuracy of YOLO is slightly inferior to that of Faster R-CNN, it achieves a detection speed that is practical for robots.
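The grid-based prediction can be illustrated with a small decoding step. The sketch below assumes the parameterization of the original YOLO paper, in which each grid cell predicts a box center relative to the cell and a box size relative to the whole image; the function name and argument layout are chosen for this example and may differ from the YOLO version used on the robot.

```python
def decode_cell_box(row, col, x, y, w, h, grid_size, img_w, img_h):
    """Convert one grid-cell prediction into pixel box corners.

    (row, col): index of the grid cell that predicted the box.
    (x, y): box center as an offset in [0, 1] inside that cell.
    (w, h): box width and height as fractions of the full image.
    Returns (x_min, y_min, x_max, y_max) in pixels.
    """
    center_x = (col + x) / grid_size * img_w
    center_y = (row + y) / grid_size * img_h
    box_w = w * img_w
    box_h = h * img_h
    return (
        center_x - box_w / 2,
        center_y - box_h / 2,
        center_x + box_w / 2,
        center_y + box_h / 2,
    )
```

Because every cell is decoded from the same network output, all boxes in the image are obtained from one forward pass rather than from thousands of separately evaluated proposals.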
In the manipulation task, the robot has to detect and recognize objects placed on shelves and tables, and to manipulate these objects. Our robot measures the three-dimensional position of an object based on the distance information obtained by the depth sensor and the object recognition result obtained by YOLO. Then, the robot grasps the target object using inverse kinematics.
In addition, the robot determines from the depth image where the grasped object should be placed, and moves the object there.
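The three-dimensional position measurement described above can be sketched with the standard pinhole camera model. This is a minimal sketch under stated assumptions: the depth image is registered to the color image and stores depth in meters, the intrinsics `fx`, `fy`, `cx`, `cy` are known from calibration, and the depth at the center of the bounding box is taken as the object depth. The function names are hypothetical.

```python
def pixel_to_point(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth in meters to camera coordinates."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def bbox_center_to_point(bbox, depth_image, fx, fy, cx, cy):
    """Estimate the 3D position of a detected object.

    bbox: (x_min, y_min, x_max, y_max) in integer pixel coordinates.
    depth_image: 2D array-like indexed as depth_image[row][column].
    """
    x_min, y_min, x_max, y_max = bbox
    u = (x_min + x_max) // 2
    v = (y_min + y_max) // 2
    depth_m = depth_image[v][u]
    return pixel_to_point(u, v, depth_m, fx, fy, cx, cy)
```

In practice a median over several depth pixels inside the box is often more robust than a single center pixel, since depth sensors return missing values on reflective or thin objects.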
Many technologies, such as map generation, localization, path planning, and obstacle avoidance, are required for the robot to move autonomously in the environment. Our team uses Hector SLAM for map generation. Hector SLAM is a SLAM method that does not require odometry information. Adaptive MCL (AMCL) is used for localization. Dijkstra's algorithm is used for global planning, and the Dynamic Window Approach (DWA) is used for local planning, such as dynamic obstacle avoidance.
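As an illustration of the global planning step, the following is a minimal sketch of Dijkstra's algorithm on a small occupancy grid. It assumes a 4-connected grid with unit step cost, where 0 marks a free cell and 1 an occupied cell; the planner actually running on the robot works on a costmap and is not reproduced here.

```python
import heapq


def dijkstra_grid(grid, start, goal):
    """Return the shortest path from start to goal as a list of (row, col) cells.

    Returns None when the goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    prev = {}
    queue = [(0, start)]
    while queue:
        d, cell = heapq.heappop(queue)
        if cell == goal:
            break
        if d > dist[cell]:
            continue  # stale queue entry, a shorter route was already found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(queue, (nd, (nr, nc)))
    if goal not in dist:
        return None
    # Walk back from the goal to the start using the predecessor links.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path
```

The global path produced this way is then followed by the local planner, which handles obstacles that were not in the map.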
Ishibushi et al. proposed a computational model that estimates the spatial region of a place based on spatial distance and the distance between visual features. The study showed that the accuracy of self-localization is improved by jointly classifying the object recognition results obtained with a Convolutional Neural Network (CNN) and the self-position information obtained by Monte Carlo localization (MCL). The experimental results showed that the method was able to make the particles converge during self-localization and to reduce estimation errors in global self-localization.
Taniguchi et al. proposed a method for simultaneously estimating self-position and words from noisy sensory information and utterances. Their method integrated ambiguous speech recognition results with the self-localization method for learning spatial concepts. Furthermore, Taniguchi et al. proposed a nonparametric Bayesian spatial concept acquisition method (SpCoA) based on an unsupervised word-segmentation method known as latticelm****. This method takes phoneme errors in speech recognition into account and performs word segmentation more efficiently than the nested Pitman-Yor language model (NPYLM).
Hagiwara et al. proposed a model that infers the hierarchical structure of places in a bottom-up manner from multimodal information, such as position and visual information, using hierarchical multimodal latent Dirichlet allocation (hMLDA). The experimental results demonstrated the formation of hierarchical place concepts by hMLDA.
As a result of our research on @Home tasks, we succeeded in having the robot acquire spatial concepts in the environment of RoboCup Japan Open 2016. The figure on the right shows an example of the formed spatial concepts, which consist of images, occurrence probabilities of objects, and word probabilities. Based on these spatial concepts, we demonstrated navigation in which the robot moves to the "Living room" in response to a voice command.
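The voice-commanded navigation can be illustrated with a highly simplified sketch. It assumes that each learned spatial concept is summarized by a dictionary of word probabilities and a representative position, and it selects the concept under which the recognized word is most probable. The data layout, the function name, and the numbers in the usage example are made up for illustration; the actual model infers the destination through the full generative model rather than this simple rule.

```python
def select_goal(word, concepts):
    """Return the representative position of the concept that best explains word.

    concepts: list of dicts with keys "word_probs" (word -> probability)
    and "position" ((x, y) in map coordinates). Returns None if no concept
    assigns the word a nonzero probability.
    """
    best = max(concepts, key=lambda c: c["word_probs"].get(word, 0.0))
    if best["word_probs"].get(word, 0.0) == 0.0:
        return None
    return best["position"]


# Illustrative values only, not learned results.
example_concepts = [
    {"word_probs": {"living room": 0.8, "kitchen": 0.1}, "position": (1.5, 2.0)},
    {"word_probs": {"kitchen": 0.9}, "position": (4.0, -1.0)},
]
goal = select_goal("living room", example_concepts)
```

The selected position would then be passed to the navigation stack as the goal.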
*Rospeex: a cloud-based speech communication toolkit [LINK].
**Julius: Open-Source Large Vocabulary Continuous Speech Recognition Engine [LINK].
***YOLO: a state-of-the-art, real-time object detection system [LINK].
****latticelm: an unsupervised word-segmentation tool [LINK].