STANFORD CS 224S/LINGUIST 281 - Winter 2007
Previous Year Final Project
Final Projects, Winter 2006

  • Automatically Detecting Action Items in Audio Meeting Recordings
  • William Morgan
    Abstract: Identification of action items in meeting recordings can provide immediate access to salient information in a medium notoriously difficult to search and summarize. To this end, we use a maximum entropy model to automatically detect action item utterances from multi-party audio meeting recordings. We compare the effect of lexical, temporal, syntactic, semantic, and prosodic features on system performance. We show that on a corpus of action item annotations on the ICSI meeting recordings, characterized by high imbalance and low inter-annotator agreement, the system performs at an F measure of 31.92%.

  • CellScribe: Remote Dictation Using ASR
  • Brad Moore
    Abstract: This paper describes the design, implementation, and evaluation of CellScribe, a multi-component system that provides an automatic dictation service for cell phone users through their own personal computers. CellScribe was created as a final project for the Speech Recognition and Synthesis class at Stanford University during the Winter quarter of 2006.

  • Natural Open-Set Speaker Identification
  • Filip Krsmanovic and Curtis Spencer
    Abstract: In this paper we introduce and motivate a novel method of solving the open-set speaker identification problem in a natural fashion. Our approach mirrors the human speaker identification process and does not impose any unconventional conversational limits. As such, we are able to dynamically add new speaker profiles as appropriate, and continually update known profiles. The identification is done during natural conversation, and is text- and length-independent. We achieve these goals by combining a traditional statistical cluster-based speaker identification system with an MDP-based dialogue agent that uses reinforcement learning to set its transition probabilities. The system is designed as part of the Stanford AI Robot (STAIR) platform, which brings together numerous AI fields to create a home/office robotic assistant. We evaluate our combined system, obtaining good identification accuracy (on the order of 90%), and valuable observations in our context of home/office speaker identification that will aid in future considerations and design. We also find compelling preliminary results that suggest a natural and conversational speaker identification system like ours is a crucial part of a successful robotic assistant.

  • Automatic Speech Recognition on Voicemail Data
  • Brian Salomaki
    Abstract: The SONIC speech recognition system from the University of Colorado's Center for Spoken Language Research was used to train a recognizer specifically on a corpus of voicemail data available from the Linguistic Data Consortium. The resulting recognition quality was far from perfect, but the results are promising and may lead to usable transcriptions for a visual interface to voicemail on cellphones.

  • Effects of Natural vs. Synthesized Speech on Human Speech Production
  • Surabhi Gupta, Jason M. Brenier, and Wissam Kazan
    Abstract: In this paper, we examine how humans carry out a dialog with a computer whose synthesized voice varies from very humanlike to unnatural (like a diphone-synthesized voice). We want to study the influence that speech synthesis has on speech production. In particular, we want to see whether humans respond differently depending on the synthesized voice chosen by the computer. We present some preliminary results from a study carried out with 9 subjects, and also present some areas for improvement of the experimental methodology and changes in the dialog used for the experiment.

  • Improving Text-to-Speech Quality for Newspaper Headlines
  • Ari Greenberg, Daniel Holbert, and Kari Lee
    Abstract: We propose an algorithm for improving text-to-speech performance on newspaper headlines. After performing an initial error analysis on headline text-to-speech in Festival, we found that approximately 20% of errors in headlines were fixed when the headline was rewritten as a full sentence. Our algorithm attempts to capitalize on this fact by dynamically expanding headlines into full sentences, from which we copy the parts of speech and break patterns back into the original headline. We evaluated this system against a standard installation of Festival, and we obtained mixed results. We conclude that to improve headline text-to-speech, a specific headline-trained speech model is necessary.

  • Creating a New US English Voice for s Diphone Text-to-Speech Synthesizer Festival
  • Christopher Thad Hughes
    Abstract: The creation of a new diphone voice based on the author's own voice is described. An analysis of the quality of the voice is presented, using both subjective methods and the Diagnostic Rhyme Test. The voice is found to be very understandable but very synthetic and mechanical sounding. The knowledge gained by the author is discussed.

  • A Diphone Speech Synthesizer in Festival
  • Alyssa Liang
    Abstract: This paper presents a diphone-based text-to-speech (TTS) system for general American English. The system is based on the Festival Speech Synthesis System [1]. The diphone database was developed using a combination of nonsense carrier words and natural words. Tools for automatically determining phone segmentation and pitch contours were used, but ultimately, the diphones were labeled by hand.

  • Classification of Sports Broadcast Speech Using Prosodic Features
  • Lee-Ming Zen
    Abstract: Automatic speech recognition of broadcast speech is difficult. Sports radio broadcasts are even harder due to the additional factors of noise introduced by the transmission along with unpredictable factors such as intermittent crowd noise or other environmental factors like buzzers. Attempting to recognize the on-goings of a sports broadcast is quite useful for applications such as automatic transcription of a play-by-play log which currently must be performed manually. This paper reports our studies on attempting to classify basketball broadcast clips into plays. We focus on prosodic features in an attempt to avoid the problems of speech recognition on noisy broadcast speech and to hopefully exploit the way in which a broadcaster describes certain types of plays. We find that while we are able to do better than randomly guessing given the distribution, we are unable to build a highly accurate classifier.

  • Accent Detection using Acoustic Features
  • Steve Goldman and Konstantin Davydov
    Abstract: As hardware and theory advance, automatic speech recognition (ASR) software is becoming more prevalent, particularly as companies deploy it in phone services for customers. As such, these systems have the burden of needing to process speech regardless of the speaker. In general, these systems achieve good results. However, speakers with non-native accents pose a problem for ASR systems because their speech is not accurately recognized by acoustic models trained for American English. Prior knowledge of accents lets ASR systems use custom acoustic models, improving performance. See Figure 1. Therefore, there is value in an accent detector that can improve the ASR experience for accented speakers.