Speech Recognition 2003 Project Page

Voice authentication

(Last updated: 29-11-2003)

 


Project Members

Roderick de Jong

Harmen van der Spek

Goals
Uniquely identify a known person (that is, a person enrolled in our database) by his or her voice.

Abstract

Every person in the world has unique properties of his or her own. A well-known one is the fingerprint. We want to do something similar, but using a person's voice to establish that person's identity. First, a user needs to give an example utterance. Then, the user can be verified by comparing a new utterance to all the stored example utterances.

Introduction

Speech processing applications can roughly be divided into 3 separate groups:

  1. Speech recognition
  2. Language recognition
  3. Speaker recognition


We will focus on speaker recognition. This subject is about establishing or verifying the speaker's identity.
There are two main problems involved in speaker recognition:

  • Who is the speaker?
  • Is the speaker who he/she claims to be?


We have implemented two different approaches. The first compares an input signal to stored templates. The template with the best score is selected as a match, and a threshold is applied to that score to prevent incorrect recognitions. This process is text dependent. The algorithm used here is “Dynamic Time Warping”.
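
The selection step of this first approach can be sketched as follows. This is an illustration in Python, not the project's C++ source. It assumes that the "score" is a distance (lower is better) and that the threshold is a fixed number; the function name `best_match` and the example values are hypothetical.

```python
from typing import Dict, Optional


def best_match(distances: Dict[str, float], threshold: float) -> Optional[str]:
    """Pick the enrolled user whose template is closest to the input.

    `distances` maps a user name to the distance between the input and
    that user's template (assumption: lower means more similar).
    Returns None when even the best candidate is farther away than
    `threshold`, so that an unknown speaker is rejected instead of
    being matched to the least-bad template.
    """
    if not distances:
        return None
    name = min(distances, key=distances.get)
    return name if distances[name] <= threshold else None
```

With distances `{"anna": 3.2, "bob": 1.5}` and a threshold of 2.0 this selects "bob"; if the best distance were 2.5, the input would be rejected.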

 

The second approach is based on Vector Quantization. This algorithm is discussed in detail in our project paper. This method is text independent.
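
As a rough illustration of text independent matching with Vector Quantization, the Python sketch below scores an utterance against one codebook per user and picks the user with the lowest average distortion. This is the common textbook formulation and is only assumed here; the exact scoring used in the project is described in the project paper. The names `avg_distortion` and `identify` are hypothetical.

```python
import math
from typing import Dict, Sequence

Frame = Sequence[float]


def avg_distortion(frames: Sequence[Frame], codebook: Sequence[Frame]) -> float:
    """Average distance from each frame to its nearest codeword."""
    if not frames or not codebook:
        raise ValueError("frames and codebook must be non-empty")
    total = 0.0
    for frame in frames:
        # Quantize the frame: find the closest codeword (Euclidean distance).
        total += min(math.dist(frame, codeword) for codeword in codebook)
    return total / len(frames)


def identify(frames: Sequence[Frame], codebooks: Dict[str, Sequence[Frame]]) -> str:
    """Return the user whose codebook represents the frames best."""
    return min(codebooks, key=lambda name: avg_distortion(frames, codebooks[name]))
```

Because every frame is scored on its own, the order of the frames does not matter, which is why this kind of method does not depend on the spoken text.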


Work done in this area


This section is too long to include on this page; it is available as a separate text document.

Design

We divide the problem into three sections:

  1. Recorder
  2. Trainer
  3. Recognizer


The recorder is an application that captures sound input from a microphone, computes the cepstral coefficients and stores them to a file. The filename is equal to the provided name of the user.

The trainer takes all files with the extension “.vqcep” and generates a codebook for each user. We will explain this in more depth in our project paper.
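
To give an idea of what generating a codebook involves, here is a minimal Python sketch that clusters a user's feature vectors with plain k-means (Lloyd's algorithm). The page does not say which training algorithm the project uses, so k-means is an assumption, and `train_codebook`, the codebook size and the iteration count are hypothetical.

```python
import math
from typing import List, Sequence

Frame = Sequence[float]


def train_codebook(frames: Sequence[Frame], size: int, iterations: int = 20) -> List[List[float]]:
    """Cluster `frames` into `size` codewords with plain k-means."""
    if not 0 < size <= len(frames):
        raise ValueError("size must be between 1 and the number of frames")
    # Deterministic start: take evenly spaced frames as initial codewords.
    step = len(frames) / size
    codebook = [list(frames[int(i * step)]) for i in range(size)]
    for _ in range(iterations):
        clusters: List[List[Frame]] = [[] for _ in range(size)]
        for frame in frames:
            # Assign the frame to its nearest codeword.
            nearest = min(range(size), key=lambda k: math.dist(frame, codebook[k]))
            clusters[nearest].append(frame)
        for k, members in enumerate(clusters):
            if members:  # keep the old codeword if a cluster is empty
                dim = len(members[0])
                codebook[k] = [sum(m[d] for m in members) / len(members) for d in range(dim)]
    return codebook
```

The resulting codebook is a small set of representative vectors for one user.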

The recognizer records a spoken utterance and uses two algorithms to determine the user’s identity. Those algorithms are:

  • Dynamic Time Warping
  • Vector Quantization

Implementation

All applications make use of a library we created. This library supports all functionality needed by the three programs we created, namely: recordcep, gencodebook and speakerrec.


Recorder (recordcep)

The recorder uses code from Sphinx to capture speech from a microphone. We cut off leading and trailing silence by using a threshold: when a certain minimum amplitude occurs in the signal, we treat this as the start of the signal. We include some extra data to the left and right of the boundaries, to be sure that we don't leave out important information. The signal is then normalized such that the peak level is 98% of the maximum amplitude. The user hears the recorded input and can decide whether this is the utterance that should be used. Two samples are recorded: a short one for text dependent recognition and one for text independent recognition.
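
The trimming and normalization described above can be sketched in Python as follows. This is an illustration, not the project's code. It assumes signed 16-bit samples (full scale 32767), and the threshold and margin values are placeholders, since the page does not state the actual values; the function names are hypothetical.

```python
from typing import List, Sequence


def trim_silence(samples: Sequence[int], threshold: int, margin: int) -> List[int]:
    """Drop leading and trailing samples whose amplitude stays below `threshold`.

    `margin` extra samples are kept on both sides of the detected
    boundaries so that soft onsets and endings are not cut away.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # nothing but silence
    start = max(loud[0] - margin, 0)
    end = min(loud[-1] + margin + 1, len(samples))
    return list(samples[start:end])


def normalize_peak(samples: Sequence[int], target: float = 0.98,
                   full_scale: int = 32767) -> List[int]:
    """Scale the signal so its peak is `target` times `full_scale`."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)
    gain = target * full_scale / peak
    return [int(round(s * gain)) for s in samples]
```

For example, trimming `[0, 1, 0, 500, -1000, 200, 0, 0, 0]` with a threshold of 100 and a margin of 1 keeps `[0, 500, -1000, 200, 0]`.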

Then we extract feature vectors using functions from Sphinx3. These vectors consist of the cepstrum only; we left out the delta and delta-delta cepstrum, because using them didn’t improve our recognition. We wrote our own function to write these vectors to a file. The file is written in ASCII and stored in a subdirectory data. These files are called name.cep, where name is the name provided by the user.
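
The exact layout of the ASCII file is not described on this page. The Python sketch below shows one possible layout, with one feature vector per line and values separated by spaces. The layout, the six-decimal precision and the names `cep_path`, `format_cep` and `parse_cep` are all assumptions made for illustration.

```python
import os
from typing import List, Sequence


def cep_path(name: str) -> str:
    """Path of a user's feature file: data/<name>.cep."""
    return os.path.join("data", name + ".cep")


def format_cep(frames: Sequence[Sequence[float]]) -> str:
    """Render feature vectors as ASCII text, one vector per line (assumed layout)."""
    return "".join(" ".join(f"{v:.6f}" for v in frame) + "\n" for frame in frames)


def parse_cep(text: str) -> List[List[float]]:
    """Read the vectors back from text produced by `format_cep`."""
    return [[float(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]
```

Writing `format_cep(frames)` to `cep_path("anna")` would then produce the file data/anna.cep for a user named anna.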

 

Trainer (gencodebook)

The trainer uses code from Sphinx to capture speech from a microphone. We cut off leading and trailing silence by using a threshold: when a certain minimum amplitude occurs in the signal, we treat this as the start of the signal. We include some extra data to the left and right of the boundaries, to be sure that we don't leave out important information. The signal is then normalized such that the peak level is 98% of the maximum amplitude. The user hears the recorded input and can decide whether this is the utterance that should be used.

Then we extract the feature vectors using functions from Sphinx3. We wrote our own function to write these vectors to a file. The file is written in ASCII and stored in a subdirectory data. These files are called name.cep, where name is the name provided by the user.


Recognizer (speakerrec)

The recognizer records an utterance from a user. It then performs the same signal preprocessing as the trainer, to create an input signal that is as close as possible to the signal that was recorded earlier. Then it extracts the feature vectors and compares them to each entry in the database. This comparison is done with the dynamic time warping (DTW) algorithm. As a distance measure between two frames, we use the Euclidean distance. Some links about DTW can be found among the project links on this page. We will go into more depth about the DTW algorithm in our technical paper.
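
For readers who do not know DTW, the following Python sketch shows the basic form of the algorithm with the Euclidean distance between frames. It is a textbook version and not the project's implementation: the allowed steps (horizontal, vertical, diagonal) and the absence of path constraints or length normalization are assumptions, and the name `dtw_distance` is hypothetical.

```python
import math
from typing import Sequence

Frame = Sequence[float]


def dtw_distance(s: Sequence[Frame], t: Sequence[Frame]) -> float:
    """Cost of the cheapest alignment between two frame sequences."""
    if not s or not t:
        raise ValueError("both sequences must be non-empty")
    inf = math.inf
    # prev[j] holds the best cost of aligning the frames of s seen so far
    # with the first j frames of t; only two rows are kept in memory.
    prev = [0.0] + [inf] * len(t)
    for frame_s in s:
        cur = [inf] * (len(t) + 1)
        for j, frame_t in enumerate(t, start=1):
            cost = math.dist(frame_s, frame_t)  # Euclidean frame distance
            # Extend the cheapest of the three neighbouring alignments.
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(t)]
```

Two utterances that contain the same frames at a different speaking rate, for example `[[0], [1], [2]]` and `[[0], [1], [1], [2]]`, get a distance of 0, which is the reason for using time warping instead of a frame-by-frame comparison.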

Experimentation

We created a database with some speech samples using recordcep. Then, with a few people, we tried ten recognitions. The results of these tests can be found here. These tests are not very extensive, but they give a good indication of what we can expect from the combination of the DTW and VQ algorithms. False rejections occur, but we didn't find any false acceptances. Most of the time our results are pretty good (about 80%), except when passphrases that are too difficult are used or when a user's voice is different due to a cold. Obviously, we also couldn't test what happens to a person's voice after a few years.

Software Requirements

We use Microsoft Visual C++ 6.0 as our development platform. Our software should run on any computer with Microsoft Windows 95 and upwards, but we only tested on Windows XP machines.


Hardware Requirements

Any machine that runs Windows should do. We tested on a Pentium II-400, which worked fine. You only need a sound card with a microphone jack, and as far as we know every self-respecting sound card has one.

 

Workplan

  1. 24-09-03: Project Group and Preliminary Proposal
  2. 08-10-03: Project Proposal
  3. 05-11-03: Preliminary Project Presentation and Discussion
  4. 26-11-03: Final Presentations and Demos of the Project


Deliverables
Below you can find the documentation and the program source itself. Please note that the source is quite well documented: the project paper describes the concept, whereas the in-source documentation goes into more detail about the implementation.



References
General:

  • Huang, Acero, Hon. Spoken Language Processing. ISBN 0-13-022616-5
  • Links in text document about "Work done in this area"

Dynamic Time Warping:


Vector Quantization: