CS 224S/LINGUIST 281   -     Winter 2007
Previous Year Final Project
Final Projects, Winter 2006
William Morgan
Abstract: Identification of action items in meeting recordings can provide
immediate access to salient information in a medium notoriously
difficult to search and summarize. To this end, we use
a maximum entropy model to automatically detect action item
utterances from multi-party audio meeting recordings. We compare
the effect of lexical, temporal, syntactic, semantic, and
prosodic features on system performance. We show that on a
corpus of action item annotations on the ICSI meeting recordings,
characterized by high imbalance and low inter-annotator
agreement, the system performs at an F measure of 31.92%.
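For readers unfamiliar with the metric, the following is a minimal sketch of how an F measure is computed from detection counts, and why it is more informative than accuracy on a highly imbalanced corpus. It assumes the balanced F1 variant (the abstract does not say which weighting was used), and the counts are hypothetical, chosen only for illustration; they are not results from the project.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F_beta) from raw detection counts.

    beta=1.0 gives the balanced F1 score (an assumption here; the
    abstract only says "F measure").
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical counts: 1,000 utterances, of which 50 are action items.
tp, fp, fn, tn = 20, 50, 30, 900

precision, recall, f1 = f_measure(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"F1={f1:.3f} accuracy={accuracy:.3f}")
```

With these made-up counts the accuracy is high because most utterances are not action items, while the F measure stays low. This is why F measure is the more meaningful figure for a rare-class detection task.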
Brad Moore
Abstract: This paper describes the design, implementation, and
evaluation of CellScribe, a multi-component system that
provides an automatic dictation service for cell phone
users through their own personal computers. CellScribe
was created as a final project for the Speech Recognition
and Synthesis class at Stanford University during the
Winter quarter of 2006.
Filip Krsmanovic and Curtis Spencer
Abstract: In this paper we introduce and motivate a novel method of
solving the open-set speaker identification problem in a
natural fashion. Our approach mirrors the human speaker
identification process and does not impose any unconventional
conversational limits. As such, we are able to dynamically
add new speaker profiles as appropriate, and continually
update known profiles. The identification is done during
natural conversation, and is text and length-independent. We
achieve these goals by combining a traditional statistical
cluster-based speaker identification system with an MDP-based
dialogue agent that uses reinforcement learning to set its
transition probabilities. The system is designed as part of the
Stanford AI Robot (STAIR) platform, which brings together
numerous AI fields to create a home/office robotic assistant.
We evaluate our combined system, obtaining good
identification accuracy (on the order of 90%), and valuable
observations in our context of home/office speaker
identification that will aid in future considerations and design.
We also find compelling preliminary results that suggest a
natural and conversational speaker identification system like
ours is a crucial part of a successful robotic assistant.
Brian Salomaki
Abstract: The SONIC speech recognition system from the University of
Colorado's Center for Spoken Language Research was used to
train a recognizer specifically on a corpus of voicemail data
available from the Linguistic Data Consortium. The resulting
recognition quality was far from perfect, but shows promising
results that may lead to usable transcriptions for a visual interface
to voicemail on cellphones.
Surabhi Gupta, Jason M. Brenier, and Wissam Kazan
Abstract: In this paper, we examine how humans
carry out a dialog with a computer that varies
its synthesized voice from very humanlike
to unnatural (like a diphone synthesized voice).
We want to study the influence that speech synthesis
has on speech production. In particular, we want
to see whether humans respond differently depending on the
synthesized voice chosen by the computer. We present
some preliminary results on a study carried out with
9 subjects, and also present some areas for improvement
of the experimental methodology and changes
in the dialog used for the experiment.
Ari Greenberg, Daniel Holbert, and Kari Lee
Abstract: We propose an algorithm for improving text-to-speech
performance on newspaper headlines. After performing an
initial error analysis on headline text-to-speech in Festival, we
found that approximately 20% of errors in headlines were
fixed when the headline was rewritten as a full sentence. Our
algorithm attempts to capitalize on this fact by dynamically
expanding headlines into full sentences, from which we copy
the parts of speech and break patterns back into the original
headline. We evaluated this system against a standard
installation of Festival, and we obtained mixed results. We
conclude that to improve headline text-to-speech, a specific
headline-trained speech model is necessary.
Christopher Thad Hughes
Abstract: The creation of a new diphone voice based on the author's
own voice is described. An analysis of the quality of the
voice is presented, using both subjective methods and the
Diagnostic Rhyme Test. The voice is found to be very
understandable but very synthetic and mechanical sounding.
The knowledge gained by the author is discussed.
Alyssa Liang
Abstract: This paper presents a diphone-based text-to-speech (TTS)
system for general American English. The system is based on
the Festival Speech Synthesis System [1]. The diphone
database was developed using a combination of nonsense
carrier words and natural words. Tools for automatically
determining phone segmentation and pitch contours were
used, but ultimately, the diphones were labeled by hand.
Lee-Ming Zen
Abstract: Automatic speech recognition of broadcast speech is difficult.
Sports radio broadcasts are even harder because of noise
introduced by the transmission, along with unpredictable
sounds such as intermittent crowd noise or
buzzers. Recognizing
the events of a sports broadcast is useful for applications
such as automatic transcription of a play-by-play log,
which currently must be performed manually. This paper reports
our studies on attempting to classify basketball broadcast
clips into plays. We focus on prosodic features in an attempt
to avoid the problems of speech recognition on noisy broadcast
speech and to hopefully exploit the way in which a broadcaster
describes certain types of plays. We find that while we are able
to do better than randomly guessing given the distribution, we
are unable to build a highly accurate classifier.
Steve Goldman and Konstantin Davydov
Abstract: As hardware and theory advance, automatic speech
recognition software is becoming more prevalent, particularly
with companies using it in phone services for customers.
As such, these systems have the burden of needing to process
speech regardless of the speaker. In general, these systems
achieve good results. However, speakers with non-native
accents pose a problem for ASRs because their speech is not
accurately recognized by acoustic models trained for
American English. Prior knowledge of accents lets ASRs use
custom acoustic models, improving performance. See
Figure 1. Therefore, there is value in an accent detector
that can improve the ASR experience for accented speakers.