Kismet is a robot head made in the late 1990s at the Massachusetts Institute of Technology by Dr. Cynthia Breazeal as an experiment in affective computing: a machine that can recognize and simulate emotions. The name Kismet comes from a Turkish word meaning "fate" or, sometimes, "luck".
To interact properly with human beings, Kismet is equipped with input devices that give it auditory, visual, and proprioceptive abilities. It simulates emotion through various facial expressions, vocalizations, and movement. Facial expressions are created through movements of the ears, eyebrows, eyelids, lips, jaw, and head. The cost of its physical materials was an estimated US$25,000.
Kismet's computing hardware consists of four Motorola 68332 microcontrollers, nine 400 MHz PCs, and one 500 MHz PC.
Kismet's social intelligence software system, or synthetic nervous system (SNS), was designed with human models of intelligent behavior in mind. It contains six subsystems as follows.
The low-level feature extraction system processes raw visual and auditory information from cameras and microphones. Kismet's vision system can perform eye detection, motion detection, and, more controversially, skin-color detection. Whenever Kismet moves its head, it momentarily disables its motion detection system so that it does not mistake its own movement for motion in the scene. It also uses its stereo cameras to estimate the distance of an object in its visual field, for example to detect threats: large, close objects with a lot of movement.
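These two heuristics can be illustrated with a minimal sketch. This is not Kismet's actual code, which ran on its own hardware; the data structure, function names, and all thresholds below are hypothetical.

```python
# Sketch of two perceptual heuristics described above: gating motion
# detection during self-motion, and flagging large, close, fast-moving
# objects as threats. All names and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class PerceivedObject:
    apparent_size: float   # fraction of the visual field occupied (0..1)
    distance_m: float      # distance estimated from stereo disparity
    motion_energy: float   # frame-to-frame movement measure (0..1)

def motion_detection_enabled(head_is_moving: bool) -> bool:
    """Gate the motion detector: while the head turns, the whole scene
    appears to move, so detection pauses to avoid detecting self-motion."""
    return not head_is_moving

def is_threat(obj: PerceivedObject,
              size_thresh: float = 0.3,
              dist_thresh_m: float = 0.5,
              motion_thresh: float = 0.6) -> bool:
    """A threat is large, close, and moving a lot (thresholds made up)."""
    return (obj.apparent_size > size_thresh
            and obj.distance_m < dist_thresh_m
            and obj.motion_energy > motion_thresh)

if __name__ == "__main__":
    looming_hand = PerceivedObject(apparent_size=0.4, distance_m=0.3,
                                   motion_energy=0.8)
    print(is_threat(looming_hand))  # True: large, close, fast-moving
```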
Kismet's audio system is mainly tuned to identifying affect in infant-directed speech. In particular, it can detect five types of affective speech: approval, prohibition, attention, comfort, and neutral. The affective intent classifier was built as follows. Low-level features such as pitch mean and energy (volume) variance were extracted from samples of recorded speech. The classes of affective intent were then modeled as a Gaussian mixture model and trained on these samples using the expectation-maximization algorithm. Classification is performed in multiple stages: an utterance is first classified into one of two general groups (e.g. soothing/neutral vs. prohibition/attention/approval) and then classified in more detail. This architecture significantly improved performance for hard-to-distinguish classes, such as approval ("You're a clever robot") versus attention ("Hey Kismet, over here").
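A minimal sketch of this staged scheme follows, using scikit-learn's GaussianMixture (whose fit method runs expectation-maximization) in place of Kismet's original implementation. The two-group structure mirrors the description above, while the synthetic feature data, component count, and function names are illustrative assumptions.

```python
# Two-stage affective-intent classification sketch: stage 1 picks a
# coarse group, stage 2 picks the most likely class within that group.
import numpy as np
from sklearn.mixture import GaussianMixture

GROUPS = {
    "soothing/neutral": ["comfort", "neutral"],
    "prohibition/attention/approval": ["prohibition", "attention", "approval"],
}

def fit_stage_models(features_by_class, n_components=2):
    """Fit one GMM per broad group (stage 1) and one per class (stage 2)
    on low-level prosodic features such as pitch mean and energy variance.
    GaussianMixture.fit() runs expectation-maximization internally."""
    class_gmms = {c: GaussianMixture(n_components=n_components,
                                     random_state=0).fit(X)
                  for c, X in features_by_class.items()}
    group_gmms = {g: GaussianMixture(n_components=n_components,
                                     random_state=0).fit(
                      np.vstack([features_by_class[c] for c in members]))
                  for g, members in GROUPS.items()}
    return group_gmms, class_gmms

def classify(x, group_gmms, class_gmms):
    x = np.asarray(x).reshape(1, -1)
    # Stage 1: choose the coarse group with the highest log-likelihood.
    group = max(group_gmms, key=lambda g: group_gmms[g].score(x))
    # Stage 2: choose the most likely class within that group.
    return max(GROUPS[group], key=lambda c: class_gmms[c].score(x))

if __name__ == "__main__":
    # Synthetic stand-in features: each class clustered around its index.
    rng = np.random.default_rng(0)
    labels = ["approval", "prohibition", "attention", "comfort", "neutral"]
    features_by_class = {c: rng.normal(loc=i, size=(50, 2))
                         for i, c in enumerate(labels)}
    group_gmms, class_gmms = fit_stage_models(features_by_class)
    print(classify([3.0, 3.0], group_gmms, class_gmms))  # likely "comfort"
```

The point of the staged design, as the source describes, is that confusable classes such as approval and attention compete only within their own group during the fine-grained pass, rather than against every class at once.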