This paper presents an incremental learning framework for mobile robots localizing a human sound source using a microphone array in a complex indoor environment consisting of multiple rooms. In contrast to conventional approaches that leverage direction-of-arrival (DOA) estimation, the framework allows a robot to accumulate training data and improve the performance of the prediction model over time using an incremental learning scheme. Specifically, we use implicit acoustic features obtained from an auto-encoder together with the geometry features from the map for training. A self-supervision process is developed such that the model ranks the priority of rooms to explore and assigns the ground truth label to the collected data, updating the learned model on-the-fly. The framework does not require pre-collected data and can be directly applied to real-world scenarios without any human supervision or intervention. In experiments, we demonstrate that the prediction accuracy reaches 67% using about 20 training samples and eventually achieves 90% accuracy within 120 samples, surpassing prior classification-based methods with explicit GCC-PHAT features.
This paper presents a design that jointly provides hand pose sensing, hand localization, and haptic feedback to facilitate real-time stable grasps in Virtual Reality (VR). The design is based on an easy-to-replicate glove-based system that can reliably perform (i) high-fidelity hand pose sensing in real time through a network of 15 IMUs, and (ii) hand localization using a Vive Tracker. The supported physics-based simulation in VR is capable of detecting collisions and contact points for virtual object manipulation; each collision event triggers the physical vibration motors on the glove to signal the user, providing better realism inside virtual environments. A caging-based approach using collision geometry is integrated to determine whether a grasp is stable. In the experiment, we showcase successful grasps of virtual objects with large geometry variations. Compared with the popular LeapMotion sensor, we demonstrate that the proposed glove-based design yields a higher success rate in various tasks in VR. We hope such a glove-based system can simplify the data collection of human manipulations with VR.
This paper presents a mirroring approach, inspired by the neuroscience discovery of mirror neurons, to transfer demonstrated manipulation actions to robots. Designed to address the different embodiments between a human (demonstrator) and a robot, this approach extends the classic robot Learning from Demonstration (LfD) in the following aspects: i) It incorporates fine-grained hand forces collected by a tactile glove in demonstration to learn a robot's fine manipulative actions; ii) Through model-free reinforcement learning and grammar induction, the demonstration is represented by a goal-oriented grammar consisting of goal states and the corresponding forces to reach the states, independent of robot embodiments; iii) A physics-based simulation engine is applied to emulate various robot actions and mirror the actions that are functionally equivalent to the human's in the sense of causing the same state changes by exerting similar forces. Through this approach, a robot reasons about which forces to exert and what goals to achieve to generate actions (i.e., mirroring), rather than strictly mimicking the demonstration (i.e., overimitation). Thus the embodiment difference between a human and a robot is naturally overcome. In the experiment, we demonstrate the proposed approach by teaching a real Baxter robot a complex manipulation task involving haptic feedback: opening medicine bottles.
We present a novel Augmented Reality (AR) approach, through Microsoft HoloLens, to address the challenging problems of diagnosing, teaching, and patching interpretable knowledge of a robot. A Temporal And-Or graph (T-AOG) of opening bottles is learned from human demonstration and programmed to the robot. This representation yields a hierarchical structure that captures the compositional nature of the given task, which is highly interpretable for the users. By visualizing the knowledge structure represented by the T-AOG and the decision making process by parsing a T-AOG, the user can intuitively understand what the robot knows, supervise the robot's action planner, and monitor visually latent robot states (e.g., the force exerted during interactions). Given a new task, through such comprehensive visualizations of the robot's inner functioning, users can quickly identify the reasons for failures, interactively teach the robot a new action, and patch it to the knowledge structure represented by the T-AOG. In this way, the robot is capable of solving similar but new tasks only through minor modifications provided by the users interactively. This process demonstrates the interpretability of our knowledge representation and the effectiveness of the AR interface.
Contact forces of the hand are visually unobservable, but play a crucial role in understanding hand-object interactions. In this paper, we propose an unsupervised learning approach for manipulation event segmentation and manipulation event parsing. The proposed framework incorporates hand pose kinematics and contact forces using a low-cost easy-to-replicate tactile glove. We use a temporal grammar model to capture the hierarchical structure of events, integrating extracted force vectors from the raw sensory input of poses and forces. The temporal grammar is represented as a temporal And-Or graph (T-AOG), which can be induced in an unsupervised manner. We obtain the event labeling sequences by measuring the similarity between segments using the Dynamic Time Alignment Kernel (DTAK). Experimental results show that our method achieves high accuracy in manipulation event segmentation, recognition and parsing by utilizing both pose and force data.
We present a design of an easy-to-replicate glove-based system that can reliably perform simultaneous hand pose and force sensing in real time, for the purpose of collecting human hand data during fine manipulative actions. The design consists of a sensory glove that is capable of jointly collecting data of finger poses, hand poses, as well as forces on palm and each phalanx. Specifically, the sensory glove employs a network of 15 IMUs to measure the rotations between individual phalanxes. Hand pose is then reconstructed using forward kinematics. Contact forces on the palm and each phalanx are measured by 6 customized force sensors made from Velostat, a piezoresistive material whose force-voltage relation is investigated. We further develop an open-source software pipeline consisting of drivers and processing code and a system for visualizing hand actions that is compatible with the popular Raspberry Pi architecture. In our experiment, we conduct a series of evaluations that quantitatively characterize both individual sensors and the overall system, proving the effectiveness of the proposed design.
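The forward-kinematics step mentioned above can be illustrated with a minimal sketch. It assumes a simplified planar finger in which each phalanx rotates only in flexion relative to the previous one; the function name, the phalanx lengths, and the joint angles are hypothetical values chosen for illustration and are not taken from the glove design itself.

```python
import math


def finger_forward_kinematics(joint_angles, phalanx_lengths):
    """Return the 2D end position of each phalanx in a planar finger chain.

    joint_angles: relative flexion angle of each joint in radians
                  (assumed to come from rotations between adjacent IMUs).
    phalanx_lengths: length of each phalanx in meters (hypothetical values).
    """
    x, y, heading = 0.0, 0.0, 0.0
    positions = []
    for angle, length in zip(joint_angles, phalanx_lengths):
        heading += angle  # relative rotations accumulate along the chain
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        positions.append((x, y))
    return positions


# Hypothetical phalanx lengths (proximal, middle, distal) in meters.
lengths = [0.045, 0.025, 0.020]

# A straight finger and a finger bent 90 degrees at the first joint.
straight = finger_forward_kinematics([0.0, 0.0, 0.0], lengths)
bent = finger_forward_kinematics([math.pi / 2, 0.0, 0.0], lengths)
print(straight[-1], bent[-1])
```

A full hand model would chain 3D rotations rather than planar angles, but the accumulation of relative rotations from the base of the finger to the fingertip follows the same pattern.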
Learning complex robot manipulation policies for real-world objects is challenging, often requiring significant tuning within controlled environments. In this paper, we learn a manipulation model to execute tasks with multiple stages and variable structure, which typically are not suitable for most robot manipulation approaches. The model is learned from human demonstration using a tactile glove that measures both hand pose and contact forces. The tactile glove enables observation of visually latent changes in the scene, specifically the forces imposed to unlock the child-safety mechanisms of medicine bottles. From these observations, we learn an action planner through both a top-down stochastic grammar model (And-Or graph) to represent the compositional nature of the task sequence and a bottom-up discriminative model from the observed poses and forces. These two terms are combined during planning to select the next optimal action. We present a method for transferring this human-specific knowledge onto a robot platform and demonstrate that the robot can perform successful manipulations of unseen objects with similar task structure.
This paper presents a novel infrastructural traffic monitoring approach that estimates traffic information by combining two sensing techniques. The traffic information that can be obtained with the presented approach includes passing vehicle counts, the corresponding speed estimates, and vehicle classification based on size. The approach uses measurements from an array of Lidars and video frames from a camera, and derives traffic information using two techniques. The first technique detects passing vehicles by using the Lidars to constantly measure the distance from the laser transmitter to the target road surface. When a vehicle or other object passes by, the measured distance to the road surface decreases at each targeted spot and triggers a detection event. The second technique applies a background subtraction algorithm to the camera's video frames in each selected Region of Interest (ROI), which also triggers a detection when a vehicle travels through the ROI. Based on the detection events, the vehicle location is estimated separately by each technique. The final location estimate is derived by fusing the two estimates in the framework of Recursive Bayesian Estimation (RBE). Vehicle counting, speed estimation, and classification are then performed using the vehicle location estimate at each time step. The approach achieves high reliability by combining the strengths of both sensors. A sensor prototype has been built and multiple field experiments have been completed. High reliability is demonstrated in the experiments, with more than 95% accuracy in both vehicle counting and classification.
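One simple way to picture the fusion of the two location estimates is a Bayesian update with Gaussian uncertainties. The sketch below assumes that the Lidar-based and camera-based estimates are independent, one-dimensional, and Gaussian; the abstract does not state the distributions used, so this is an illustrative assumption, and the function name and numeric values are hypothetical.

```python
def fuse_gaussian(mean_a, var_a, mean_b, var_b):
    """Fuse two independent Gaussian estimates of the same quantity.

    This is the product of two Gaussians, written in the familiar
    gain form of a one-dimensional Bayesian (Kalman-style) update.
    """
    gain = var_a / (var_a + var_b)
    mean = mean_a + gain * (mean_b - mean_a)
    var = (1.0 - gain) * var_a
    return mean, var


# Hypothetical vehicle position along the road in meters.
lidar_mean, lidar_var = 10.0, 0.04    # assumed: Lidar estimate is more precise
camera_mean, camera_var = 10.6, 0.36  # assumed: camera estimate is noisier

fused_mean, fused_var = fuse_gaussian(lidar_mean, lidar_var,
                                      camera_mean, camera_var)
print(fused_mean, fused_var)
```

Under these assumptions the fused estimate lies closer to the more precise sensor, and its variance is smaller than that of either input, which is one way to read the claim that combining both sensors improves reliability.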
This paper describes a non-field-of-view (NFOV) localization approach for a mobile robot in an unknown environment based on an acoustic signal combined with the geometrical information from an optical sensor. The approach estimates the location of a target through the mobile robot’s sensor observation frame, which consists of a combination of diffraction and reflection acoustic signals and a 3-D environment geometrical description. This fusion of audio-visual sensor observation likelihoods allows the robot to estimate the location of the NFOV target. The diffraction and reflection observations from the microphone array generate the acoustic joint observation likelihood. The observed geometry also determines far-field or near-field acoustic conditions to improve the estimation of the sound direction of arrival. A mobile robot equipped with a microphone array and an RGB-D sensor was tested in a controlled environment, an anechoic chamber, to demonstrate the NFOV localization capabilities. The resulting errors were within +/- 18 degrees in angle estimation and less than 0.75 m in distance estimation.
This paper presents a novel design of infrastructural traffic monitoring that performs vehicle counting, speed estimation, and vehicle classification by deploying three different approaches using two types of sensors, infrared (IR) cameras and laser range finders (LRFs). The first approach identifies passing vehicles by using LRFs and measuring the time-of-flight to the ground, which changes when vehicles pass. In the second approach, LRFs are used only to project a dotted line onto the ground, and an IR camera identifies passing vehicles by recognizing the change of location of these laser dots in its images. The third approach utilizes an IR camera only and recognizes passing vehicles in each frame using background subtraction and edge detection algorithms. The design achieves high reliability because each approach has different strengths. A prototype system has been built, and field tests on a public road show promising results, with 95% accuracy in traffic counting and speed estimation.
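The first approach, detection from a change in the measured distance to the ground, can be sketched as follows. The example assumes the time-of-flight readings have already been converted to ranges in meters and that two beams are spaced a known distance apart along the road; the function names, threshold, sample period, and readings are all hypothetical and are not taken from the prototype.

```python
def detect_crossings(ranges, ground_range, drop_threshold):
    """Return sample indices where a reading first drops below the ground range.

    A detection fires on the rising edge only, so a vehicle that stays
    under the beam for several samples is counted once.
    """
    crossings = []
    occupied = False
    for i, r in enumerate(ranges):
        below = (ground_range - r) > drop_threshold
        if below and not occupied:
            crossings.append(i)
        occupied = below
    return crossings


def speed_from_beams(first_idx, second_idx, sample_period_s, beam_spacing_m):
    """Estimate speed from the delay between detections at two beams."""
    dt = (second_idx - first_idx) * sample_period_s
    if dt <= 0:
        raise ValueError("second beam must trigger after the first")
    return beam_spacing_m / dt


# Hypothetical readings: the ground is 6.0 m away, a vehicle roof is 4.5 m away.
beam_a = [6.0, 6.0, 4.5, 4.5, 4.5, 6.0, 6.0, 6.0]
beam_b = [6.0, 6.0, 6.0, 6.0, 4.5, 4.5, 4.5, 6.0]

hits_a = detect_crossings(beam_a, ground_range=6.0, drop_threshold=0.5)
hits_b = detect_crossings(beam_b, ground_range=6.0, drop_threshold=0.5)
speed = speed_from_beams(hits_a[0], hits_b[0],
                         sample_period_s=0.05, beam_spacing_m=1.0)
print(len(hits_a), speed)
```

Counting the detections at one beam gives the vehicle count, and the delay between the two beams gives a speed estimate; the size of the range drop could similarly feed a height-based classification, although that step is not shown here.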
This paper presents an approach to the recursive Bayesian estimation of non-field-of-view (NFOV) sound source tracking based on reflection and diffraction signals with an incorporation of optical sensors. The approach uses multi-modal sensory fusion on a mobile robot, combining an optical 3D environment geometrical description with a microphone array acoustic signal to estimate the target location. The robot estimates the target location either in the field-of-view (FOV) or in the NFOV by fusion of sensor observation likelihoods. For the NFOV case, the microphone array provides reflection and diffraction observations to generate a joint acoustic observation likelihood. With the data fusion between the 3D description and the acoustic observation, the target estimation is performed in an unknown environment. Finally, the sensor observation combined with the motion model of the target iteratively performs tracking within a recursive Bayesian estimation framework. The proposed approach was tested with a microphone array and an RGB-D sensor in a controlled anechoic chamber to demonstrate the NFOV tracking capabilities for a moving target.
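The predict-and-update cycle of recursive Bayesian estimation described above can be sketched with a small grid-based (histogram) filter. The sketch assumes a one-dimensional grid of candidate target positions, a simple discrete motion model, and a joint acoustic likelihood formed as the product of independent reflection and diffraction likelihoods; these simplifications, the function names, and all numeric values are assumptions made for illustration.

```python
def predict(belief, motion_kernel):
    """Propagate the belief through a motion model on a 1D grid.

    motion_kernel maps a cell offset to its probability; probability
    mass that would leave the grid is clamped to the edge cells.
    """
    n = len(belief)
    predicted = [0.0] * n
    for i, p in enumerate(belief):
        for offset, w in motion_kernel.items():
            j = min(max(i + offset, 0), n - 1)
            predicted[j] += p * w
    return predicted


def update(belief, likelihood):
    """Apply Bayes' rule: multiply by the likelihood and renormalize."""
    posterior = [b * l for b, l in zip(belief, likelihood)]
    total = sum(posterior)
    if total == 0.0:
        raise ValueError("observation is inconsistent with the prior belief")
    return [p / total for p in posterior]


# Uniform prior over five candidate cells.
belief = [0.2] * 5

# Hypothetical motion model: the target stays or moves one cell to the right.
motion_kernel = {0: 0.5, 1: 0.5}

# Hypothetical per-cell likelihoods from reflection and diffraction cues.
reflection = [0.1, 0.2, 0.6, 0.2, 0.1]
diffraction = [0.1, 0.3, 0.5, 0.3, 0.1]
joint = [r * d for r, d in zip(reflection, diffraction)]  # assumes independence

posterior = update(predict(belief, motion_kernel), joint)
best_cell = posterior.index(max(posterior))
print(best_cell, posterior)
```

Repeating the two calls at each time step, with the posterior fed back in as the next prior, gives the iterative tracking loop; a real system would use a 2D or 3D grid constrained by the observed geometry.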