
Thursday, 25 August 2011

Voice decoder for vowels




Introduction


In our final project, we created a smart voice decoder system capable of recognizing vowels in human speech. The audio input is sampled through a microphone/amplifier circuit and analyzed in real time by the Mega644 MCU. The user can record and analyze his/her speech using both hardware buttons and custom commands through PuTTY. In addition, the final product supports a simple voice password system where the user can set a sequence of vowels as a password to protect a message via PuTTY. The message can be decoded by repeating the same sequence into the microphone.

Some of the topics explored in this project are: the Fast Walsh Transform, sampling theorems, and human speech analysis.



Here is a video summary of our project.
High Level Design
Design Rationale


The idea for our project stemmed from one of the previous ECE 4760 final projects, Musical Water Fountain. In their project, they used the Fast Walsh Transform to analyze the audio signal generated by an MP3 player (shown in the table below).


LED        0      1        2        3        4        5        6        7
Freq (Hz)  0-170  170-310  310-420  420-560  560-680  680-820  820-930  930-10000



Then they would turn on the LED corresponding to the most energetic frequency division in the input spectrum. This made us wonder whether identifying speech is possible by a similar method.

In fact, with today's technology, speech recognition is fully realizable and speech can even be fully synthesized. However, most software that deals with speech recognition requires extensive computation and is very expensive. With the limited computation power of the Mega644 and a $75 project budget, we wanted to make a simple, smart voice recognition system capable of recognizing simple vowels.

After careful research and several discussions with Bruce, we found that vowels can be characterized by 3 distinct peaks in their frequency spectrum. This means that if we transform the input speech signal, the frequency spectrum will contain characteristic peaks corresponding to the most energetic frequency components. If the 3 peaks in the input fall within the ranges we defined for a specific vowel, we can deduce whether that vowel was present in the user's speech.
Logical Structure


The main structure of our decoder system centers on the Mega644 MCU. Our program allows the MCU to coordinate commands placed by the user via PuTTY and the button panel while analyzing the user's audio input in real time. On the lowest design level (hardware), we have a microphone and a button panel to convert the user's physical inputs into analog and digital signals the MCU can react to. On the highest level, PuTTY displays the operation status of the MCU and informs the MCU of user commands placed at the command line. PuTTY also offers the user the freedom to test the accuracy of our recognition, and simulates a security system where the user must say a specific sequence of vowels to see a secret message.

Mathematical Theory
Vocal Formants


Basically, the first three formant frequencies (formants refer to peaks in the harmonic spectrum of a complex sound) account for the distinct character of different vowel sounds.


Therefore, if we can pick out the formants by their intensities in different frequency ranges, we can identify a vowel sound and use a sequence of vowels to generate an audio passcode.

Frequency Transform Algorithms


The biggest difference between our analysis and the musical-intensity analysis above is that we need to adjust the frequency ranges stated above to better tell apart several peaks, and to combine other information including amplitude. We needed to decide which frequency transform algorithm is best suited for real-time audio processing in both accuracy and computation speed. Among the fixed-point DSP functions available in GCC, the DCT, FFT and FWT are commonly used algorithms. In our case, we chose the Fast Walsh Transform over the rest simply because of its speed and because its output is roughly linearly proportional to that of the Fast Fourier Transform.

The Fast Walsh Transform converts the input waveform into orthogonal square waves of different frequencies. Since we are only working with voice ranges here, we set the sample frequency to 7.8 kHz, which (ideally) allows us to detect up to 3.8 kHz. We also knew that the lowest fundamental frequency of the human voice is about 80-125 Hz, so we chose a sample size of 64 points. This generates 32 frequency elements equally spaced from 0 Hz to 3.8 kHz (not including the DC component). The individual frequency division width is 3.8k/32 = 118.75 Hz, which maximizes our frequency division usage: we can have useful information in every division, unlike, say, a 50 Hz division width, where the first division would carry no useful voice information. Furthermore, this choice also minimizes our computation time, since the more samples we have to compute, the more time it takes the MCU to process the input audio data.
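As a concrete illustration of this bookkeeping (our own sketch, not taken from the project code), a small helper mapping a frequency to its 118.75 Hz sequency division:

/* Illustrative only: maps a frequency to its sequency division,
   given the 64-point FWT at a 7.8 kHz sample rate described above. */
#define NUM_DIVISIONS 32
#define MAX_FREQ_HZ   3800.0

int divisionOf(double freqHz)
{
    int d = (int)(freqHz / (MAX_FREQ_HZ / NUM_DIVISIONS));  /* 118.75 Hz per division */
    return (d < NUM_DIVISIONS) ? d : NUM_DIVISIONS - 1;
}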
MATLAB Simulation Results


In this part, most of our research was based on the common vowel sounds 'A', 'E', 'I', 'O', 'U', and it demonstrated that the method we set out to develop was achievable. In practice, however, we found that the differences between these five sounds are not obvious enough to be distinguished by a simple comparison of their frequency/sequency content.


We first used Adobe Audition to observe the raw input waveforms, taken directly from the microphone and from AT&T text2speech, as shown in the picture. Although waveforms corresponding to the same vowel have similar shapes, differences remain that are easier to see in the frequency domain.

The first program in MATLAB is based on Prof. Land's code: it computes the FFT and FWT outputs as spectrograms, then takes the maximum of each time-sliced transform and compares the spectrograms. The top row is the FFT power spectrum; the FWT sequency spectrum is on the bottom. The maximum-intensity coefficients of each spectrogram time slice in the FFT and FWT have almost the same shape. We'll take one spectrum as an example.


Another program directly implements the FFT and shows a frequency series. In this figure we can clearly see the resonance peaks of a vowel. This transform is 256 points. Also notice that, because of noise interference, it can be hard to tell apart the second peak of [EE], and this is not the only such case.

Hardware/Software Tradeoffs


Due to the limited precision of our Fast Walsh Transform, frequencies that differ by less than the width of one frequency division are often not distinguished. When dealing with boundary frequencies, this was a problem for us, since the peak frequency did not always reside in the same frequency division. To improve upon this, we used multiple divisions, but we still had errors, since we cannot consider every possible boundary case. We improved upon this further by boosting the gain of our op-amp from x10 to x100. This boost gave us a much better summary result and reduced our error. Occasionally, however, we still see errors that stem from the precision of our analysis tool.
Relations to IEEE Standards


The only standard applicable to our design is the RS232 serial communication protocol. We used a MAX233 level shifter and a custom RS232 PCB designed by Bruce Land.
Relevant Copyrights Trademarks and Patents


The mathematical theories for frequency analysis of audio signals were obtained from both discussions with Bruce Land as well as R. Nave's webpage from Georgia State University.
Hardware Design
Overview


Our hardware setup consists of a prototype board that hosts the MCU and three other individual function panels. Audio input is processed through the amplifier panel and sampled by the MCU. The RS232 board allows the state of the MCU and user input to be sent to PuTTY. The last function panel allows the user to change the current operating mode of the MCU and determines whether the audio input is sampled or discarded in the main program.
ECE 4760 Custom PCB


The prototype board we used is the Mega644 prototype board designed by Bruce Land. The printed PCB layout is shown below. The only port pins used (soldered) are C0, C1, C7 and A0. We also had to solder the RX and TX pins to enable RS232 serial communication.






Audio Amplifier Circuit Panel
The amplifier circuit used is shown below. The microphone used in this project uses a charged capacitor to detect vibrations in the air. As a result, R2 provides the DC bias necessary to maintain the charge across the internal capacitor of the microphone, allowing sound to be converted to electrical energy, and R1 provides the DC bias the microphone needs to remain functional. The LM358 op-amp operates as a 2-stage signal filter. The negative voltage input of the op-amp is first passed through a low-pass filter with a time constant of approximately 42 µs. This sets the upper bound of our audio signal filter, since normal human speech lies between 85 Hz and 3 kHz. The values of the parallel resistor/capacitor feedback circuit are determined by the gain we wanted to achieve. In our case, we chose an amplifier gain of 100, so the value of R4 we chose is 1 MΩ (100 times R3), and C2 is chosen to complement the operating frequency of less than 3 kHz. The output of the op-amp is fed to the analog compare input of the MCU (PINA.0) and sampled at a rate of 7.8 kHz.
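To make the filter and gain claims concrete (R3 = 10 kΩ is inferred from the stated 100:1 ratio, not given explicitly):

f_c = 1/(2\pi\tau) = 1/(2\pi \times 42\,\mu s) \approx 3.8\ \text{kHz}, \qquad A \approx R_4/R_3 = 1\,\text{M}\Omega / 10\,\text{k}\Omega = 100

so the filter's cutoff sits just above the sampled band, matching the 7.8 kHz sample rate's 3.8 kHz analysis range.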





RS232 Serial Communication Panel


To improve the user experience with our design, we decided to use serial communication via PuTTY to inform the user of the current operating state of the program. We also felt it would be a lot more convenient if the user could control the program via both hardware and software. To enable serial communication between the MCU and a PC, we used the MAX233CPP level shifter chip and the RS232 connector PCB Bruce designed. The transmit and receive pins on the MCU prototype board are connected to the TR and RC pins on the RS232 board to enable communication.






Hardware User Interface (Control Buttons)
Our hardware user interface consists of 3 buttons and 3 indicator lights. When a button is held down, the corresponding LED turns on. In our setup, pressing the green LED/button combination displays the results summary on the PuTTY screen. Pressing the yellow LED/button combination begins sampling on the MCU; analysis of the input data is performed when this button is released (see code below for details). Each button is signaled by a pin in PORTC. The red LED/button combination allows the user to change the operating mode of the program between testing mode and decoding mode. The switch/LED circuitry for 1 button is shown to the right. When the button is pulled low (i.e. pressed), the LED is connected to ground via a 1k resistor and lights up.



Software Design
Overview


Overall, the program is broken down as follows:
Serial communication: take user input & display results
Button state machine: the test/demo mode user interface
FWT transform analysis: provide the frequency content of the input waveform
Vowel recognition: analyze vowel spectra and identify the vowel
Decode: check whether the input vowel sequence matches the stored passcode
Initialize


There are 7 header files and 1 source file included. "uart.h" & "uart.c" provide the UART driver; the standard C headers specify basic functions for the GCC compiler; avr/interrupt.h provides context for handling interrupts; and avr/pgmspace.h provides interfaces for a program to access data stored in the program space of the device.

The variables initialized can be divided into four modules. One for the FWT algorithm contains fixed-point constants, the size of the FWT, and the frequency range specification. One for the button state machines defines the timers and button states. The third and fourth modules consist of the characteristic peak specifications, the vowel counters, the passcode setting, and the passcode comparison. In the void initialize() function, the UART is initialized and timer0 is set up to sample the A/D converter. We also enable the ADC, set its prescaler, clear its interrupt enable, and prepare to start a conversion.

The sampling frequency we chose at the analog input is 7.8 kHz. Since the highest voice frequency is about 3.5 kHz, a 7.8 kHz sampling rate gives us a reasonable upper bound.
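A minimal sketch of the timer and ADC setup this implies, assuming a 16 MHz clock (the actual configuration is in the commented code of Appendix A):

#include <avr/io.h>
#include <avr/interrupt.h>

void initialize(void)
{
    /* Timer0: clk/8 prescaler -> one overflow every 256*8 = 2048 cycles,
       i.e. 16 MHz / 2048 = 7812.5 Hz, the ~7.8 kHz sample rate above.   */
    TCCR0B = (1 << CS01);
    TIMSK0 = (1 << TOIE0);           /* enable the overflow interrupt    */

    /* ADC: AVcc reference, channel 0 (PINA.0), left-adjusted so the top
       8 bits can be read from ADCH; ADC clock prescaler of 64.          */
    ADMUX  = (1 << REFS0) | (1 << ADLAR);
    ADCSRA = (1 << ADEN) | (1 << ADSC) | (6 << ADPS0);

    sei();                           /* global interrupt enable          */
}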
Interrupt Service Routines (ISR)
ISR(TIMER0_OVF_vect)


There is only 1 interrupt service routine used in this project: the timer0 overflow ISR. Timer0 is used to read the A/D converter and update the appropriate buffer. We also use timer0 to maintain a 1 ms tick for all other timing in this program, including the button state machines and monitors. Each button monitor task is executed every 30 ms, which was purposely chosen to make user interaction run noticeably smoothly.
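Plausibly, the ISR then looks something like the sketch below; the /8 tick divider and variable names are our assumptions (the overflow rate of 7812.5 Hz is divided down to roughly 1 kHz for the task timers):

#include <avr/io.h>
#include <avr/interrupt.h>

volatile char Ain;                          /* latest 8-bit ADC sample          */
volatile unsigned char tickDiv;             /* divides 7.8 kHz down to ~1 kHz   */
volatile unsigned int time1, time2, time3;  /* task timers for button monitors  */

ISR(TIMER0_OVF_vect)
{
    Ain = ADCH;                             /* read the A/D converter           */
    ADCSRA |= (1 << ADSC);                  /* start the next conversion        */
    /* ... store Ain into the sample buffer while recording ...                 */

    if (++tickDiv >= 8) {                   /* 8 overflows at 7812.5 Hz ~= 1 ms */
        tickDiv = 0;
        if (time1 > 0) --time1;             /* the 1 ms tick for the monitors   */
        if (time2 > 0) --time2;
        if (time3 > 0) --time3;
    }
}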
Task Breakdown
FWT transform to sequency order


This part is based on Prof. Land's code, which lights up the LEDs corresponding to the sequency bands with the most energy. The program takes a 64-point FWT at a sample rate of 7.8 kHz, throws away the DC component, and adds the absolute values of the sal and cal components. This gives us 32 frequency divisions equally spaced from 0 Hz to 3.8 kHz. The value stored in each frequency division is proportional to its energy level in the sampled data. This is used later to identify the characteristic peaks (the most energetic frequency components) of vowels.

void FWTfix(int x[])


This function does the forward transform only, and uses FWTreorder() to put the transform in sequency order.
void FWTreorder(int x[], const prog_uint8_t r[])


This function converts from dyadic order to sequency order.
void transform(void)


This function can be divided into two parts. The first part computes the FWT, updates it, generates the sequency order, combines sal & cal, and omits the DC component in ain[0]. We discuss the second part later.
Serial Communication
void serialComm(void)


This part takes user input and defines all the system behaviors of the user interface, including: s = set passcode; r = reset current entry; p = print stored passcode; t = enter testing mode. The whole system has two modes (test/demo) and three processes (passcode setting -> vowel recognition -> decoding). The function first checks whether the system is currently waiting for an entry. For the "s" command, it calls int stringtoint(char) to store the corresponding number into the array passcode[], in which ah: 1; oo: 2; oh: 3; ae: 4; ee: 5. When "r" is entered as a command, the system clears the current entry (result[]) and lets the user decode again. The command "p" prints out the current passcode for testing. We also added a testing mode for verifying the precision of the vowel recognition.
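A condensed sketch of this command dispatch (the helper names and signatures here are hypothetical stand-ins for the real UART parsing; the real routine also tracks the mode and process state):

int  passcode[5], result[5];
char testMode;

char  getCommand(void);             /* hypothetical: read one command char */
char *readVowelToken(void);         /* hypothetical: read "ah", "oo", ...  */
int   vowelToInt(const char *s);    /* ah:1 oo:2 oh:3 ae:4 ee:5            */
void  printPasscode(const int *p);  /* hypothetical UART print helper      */

void serialComm(void)
{
    switch (getCommand()) {
    case 's':                                   /* set a 5-vowel passcode    */
        for (int i = 0; i < 5; i++)
            passcode[i] = vowelToInt(readVowelToken());
        break;
    case 'r':                                   /* reset the current entry   */
        for (int i = 0; i < 5; i++) result[i] = 0;
        break;
    case 'p':                                   /* print the stored passcode */
        printPasscode(passcode);
        break;
    case 't':                                   /* enter testing mode        */
        testMode = 1;
        break;
    }
}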
Button State Machine


This part describes all the states a selected button may go through (NoPush, MaybePush, Pushed) and runs once every 30 ms. Each time it is called by main, it resets its task timer, detects the push state, and updates the push flag.


Notice that we have three buttons, marked by green, yellow, & red LEDs, and a separate button state machine for each.
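Each of the three machines follows the same debounce pattern. A sketch of one, assuming the button pulls its PORTC pin low when pressed (the bit mask matches the yellow button's PINC = 0xfe mentioned below; state and flag names mirror the description):

#include <avr/io.h>

#define NoPush    1
#define MaybePush 2
#define Pushed    3

volatile unsigned int time1;        /* decremented every 1 ms in the timer0 ISR */
char PushState = NoPush;
char PushFlag  = 0;

void buttonMon(void)
{
    time1 = 30;                                 /* run again in ~30 ms          */
    char pressed = ((PINC & 0x01) == 0);        /* yellow button: PINC == 0xfe  */
    switch (PushState) {
    case NoPush:
        PushState = pressed ? MaybePush : NoPush;
        break;
    case MaybePush:                             /* low twice in a row: debounced */
        if (pressed) { PushState = Pushed; PushFlag = 1; }
        else PushState = NoPush;
        break;
    case Pushed:                                /* analysis happens on release   */
        if (!pressed) { PushState = NoPush; PushFlag = 0; }
        break;
    }
}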
void buttonMon(void)


This state machine corresponds to PINC = 0xfe and time1. When the yellow button is pushed, the system is informed that a new audio extraction process has started. The function clears all previous entries and sets PushFlag to 1.
void playBackMon(void)


This state machine corresponds to PINC = 0x7f and time2. When the green button is pushed, the system is informed that the audio extraction process is complete. The function analyzes all previous entries and sets playBackFlag to 1.
void stateMon(void)


This state machine monitors the current device operation state. When the red button is pushed, the system is informed that the user wants to switch to demo mode and begin passcode decoding.
Vowel Recognition
void transform(void)


The second part of this function uses 3 predefined characteristic peak ranges for each vowel and calculates three characteristics of the input waveform. The following algorithm is based on experimental research. We first chose "ah", "oo", "oh", "ae", "ee" as passcode elements. This method is similar to, but not the same as, purely identifying vowels by their ideal frequency peaks, because some vowels, such as "ee" and "oh", have very similar transform results; if only the ideal frequency peaks were compared and analyzed, we could not effectively identify what the user has said. Instead, we experimentally determined the frequency divisions that occur most uniquely in each particular vowel (showing up many times in the transformed signal but rarely in other vowels' signals). Then, for each element, we compare the analyzed FWT results with the peak ranges we have defined, and we increment the corresponding vowel counter when that particular vowel has been detected. For example, the characteristic peaks we chose for "ae" lie in the 3rd, 5th-12th, and 12th-28th divisions of the sequency order, while for "oh" the ranges are the 4th-7th, 8th-11th, and 21st-31st.

In addition to determining the frequency ranges experimentally, we also used a threshold value that is compared with the amplitude of the first peak in the FWT analysis. For any transform whose maximum peak amplitude is below this threshold, we discard the transformed result. This is because if the amplitude of the first peak is not high enough, we will not be able to detect the second or third peak, since

amplitude of first peak > amplitude of second peak > amplitude of third peak.

if (max==3 && second>5 && second<12 && third<28 && third>12) aehCounter++;
if (max>4 && max<7 && second>8 && second<11 && third>21 && third<31) ohCounter++;
if (max==1 && second==5 && third>10 && third<14) {
    if (compare[max]>30) ooCounter++;
    else eeCounter++;
}
if ((max==1 && (second>6 && second<10) && (third>=11 || third<=14)) ||
    (max==8 && second==24 && third>27 && third<31)) ahCounter++;

This function also lights an LED while the transform is running.
int findMax(int x[], int i)


This routine returns the maximum value in the given sequence.
int findMin(int x[], int i)


This routine returns the minimum value in the given sequence.
int recognize(void)


We initialize 5 vowel counters and calculate the likelihood of each vowel during recognition. The vowel with the maximum count is considered the correct answer and is returned by this function.
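Conceptually, recognize() reduces to an argmax over the five counters. A sketch, using the counter names from the code excerpt above and the 1-5 vowel mapping from serialComm():

int ahCounter, ooCounter, ohCounter, aehCounter, eeCounter;

int recognize(void)
{
    /* counts accumulated by transform() over the whole recording */
    int counts[5] = { ahCounter, ooCounter, ohCounter, aehCounter, eeCounter };
    int best = 0;
    for (int i = 1; i < 5; i++)
        if (counts[i] > counts[best]) best = i;
    return best + 1;    /* 1=ah, 2=oo, 3=oh, 4=ae, 5=ee */
}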
Decoding


A comparison between passcode[] and result[] is performed to carry out the decoding.
void display(int t)


This is an additional function called by main in test mode, once vowel recognition is complete and the result needs to be displayed on screen.
Delegating Task Using Main Function


First, initialize() is called to establish the register and port configurations on the MCU and to start the timers and ISRs. Immediately after, main calls the serial communication routine to take our first command. Then we enter an endless loop that controls which tasks are executed based on the timer values and resets those timers before each task is called. Note that on each pass of the loop, we detect whether a button is pushed and which mode the system is working under. In demo mode, a message is displayed on screen indicating that audio input can be started. Each time audio is extracted, the system returns the recognition result for the vowel. After 5 extractions, which is exactly the length of a passcode, the system checks whether the sequence is correct, displays the result, and prints either "Congratulations" or "Decoding failed". In test mode, the system just runs a single extraction and recognition process.
Things we tried but did not work :(
Narrow the Search for Peaks Using MATLAB


We tried using MATLAB's fft function to identify the first 3 characteristic peaks of the vowels. We were hoping that by simulating the same voice waveform in MATLAB, we could be sure of the peaks we were looking for in our MCU program. However, the results were not satisfactory, and the peaks produced could not be easily distinguished. Furthermore, the analysis results from MATLAB differed greatly depending on the person speaking. We later switched our algorithm to finding the most frequent element in the vowel's FWT output. Even though these 2 methods are very similar, in the case of boundary frequencies the second method produced much more reliable results.
Results
Execution


Despite the fact that we process and analyze data in real time, the FWT analysis and summary are produced almost instantaneously after the release of the yellow button. There are no known errors associated with controlling the MCU operation via both PuTTY and the physical button panel. Even when 2 buttons are pressed at the same time, the system sequentially executes the valid commands.

Here are some screen shots of our system during operation:

When the system is turned on, our MCU automatically enters the default testing mode. At the same time, PuTTY will display a welcome screen informing the user that the system is ready to take inputs.


In the testing mode, the user is able to see the summary results of only saying 1 vowel. Audio input is processed when the user holds down the yellow button while speaking into the microphone. When the yellow button is released, the program will automatically compute the FWT and display the prediction in PuTTY.


To leave testing mode, the user can just press and release the red button once. As shown below, PuTTY shows that the program has exited testing mode and entered the decoding mode.

In the decoding mode, the user can set a sequence of 5 vowels as the system password and repeat the same sequence via the microphone while holding down the yellow record button.


If the user accidentally enters the reset command before setting a password, the system informs the user that there is no valid password stored at the time. The user should set the password first by entering 's' at the command line.


A new password is entered by typing the vowel sequence with commas separating each vowel. Once entered, the system displays the result and automatically enters recording mode, where the MCU simply waits for the user's audio input.


If the input audio sequence agrees with the stored password, then the congratulations screen will appear along with the secret message.


Anytime there is a command prompt at PuTTY, the user can choose to reset his/her current audio input by entering 'r'. This erases all of the audio inputs stored so far in the system and allows the user to re-record the password again.


Anytime there is a command prompt at PuTTY, the user can see the stored system password by entering 'p' for print. The system will display the entered vowel sequence.


The user can reenter testing mode from decoding mode by entering 't' at the PuTTY command line.


Here are 2 videos demonstrating our system at work
Performance


We originally designed our program to decode female voices. However, when we tested our system, we discovered that it decodes male voices (of much lower fundamental frequency) just as accurately as female voices. However, due to the limited precision of the FWT we implemented, errors occasionally occur in cases where the frequency peaks are near our predefined characteristic peak boundaries for a vowel.

We tested our program with a couple of our friends; for a male voice, the program accurately predicted the spoken vowel 49/50 times, and for female voices, 45/50 times. Furthermore, the program only recognizes vowels accurately if the user speaks consistently (no accents or instability during recording).

We also found that the MCU tends to confuse "OO" with "EE", and "OH" with "AE". In the case of "OO" and "EE", the waveforms are very similar: in the FWT output, both vowels have peaks that often overlap. In our program, "OO" and "EE" are distinguished by the maximum amplitude obtained in the transform; in normal speech, "OO" is louder than "EE" (see below for a waveform comparison). This explains why the MCU mistakes one for the other.


In the case of "OH" and "AE", the FWT of the input waveforms produces almost the same first and second peaks; the two vowels are distinguished mainly by the location of the third peak. However, the amplitude of the third peak is relatively low and can easily be confused with noise. Thus, predictions about "AE" and "OH" can differ greatly depending on how the speech was formed.


Here are some test results we got using our system:
Expected vowel             ah    oh    oo    ae    ee
Confused with (female)     --    --    --    oh    oo
Confused with (male)       oh    ae    --    --    --
MCU's prediction accuracy  95%   90%   95%   94%   90%

Safety and Usability


The system that we have designed can be used as a basis for implementing speech recognition, since speech consists of vowels and consonants that can be identified using frequency analysis. An example of a possible real-world implementation would be using speech recognition in security systems, which could be more convenient than entering passwords on a keypad for people with impaired vision.

Furthermore, our system is simple and easy to handle. The only precaution in using our prototype system is that the user must be careful when touching the PCB and port pins, to minimize ESD hits.
Conclusion
Expectation


Overall, our final project achieved all of the goals we defined in our project proposal. Our speech recognition system is able to accurately identify the vowel the user has said. We extended this implementation to simulate a security system where the user must say the vowels in a particular sequence to decode a secret message.
Future Improvements


There are two improvements that could be made in the future. The first concerns accuracy. Although our system recognizes the five vowels reliably for most users, we still need more experimental simulation to help identify characteristics and narrow the search for peaks in vowels. Furthermore, we could develop characteristic ranges for other vowels and even non-vowel sounds; this would require more research on vowel classification.

Another improvement would be more complicated word recognition built on top of the vowel identification. We are already able to tell apart five basic vowels, and interestingly, many words can be classified by their sequence of basic vowels. For instance, the waveform of "yes" is similar to [ae] followed by [ee], while "no" can be told apart simply by [oh]. Likewise, in a word like "starbucks", the [ah] sound is prominent. The difficulty with word recognition based on a sequence of vowels is that our current method relies on accumulating counts that are not strictly ordered in time, so the algorithm we implemented for decoding no longer applies. We would instead need to accumulate likelihoods for all words that sound alike and mark the most probable one as the result, which poses a big challenge for accuracy.
Ethical Considerations


We have done our best to conform to the IEEE Code of Ethics in the design and execution of our project. The FWT algorithm we used in our design was written by Bruce Land. The button state machines we used were modified from previous lab exercises. There are no known systems using a similar MCU to implement a vowel recognition system. In fact, most of the speech recognition systems available today require far more computation power than the Mega644 and can analyze much more complex voice inputs.

We are honest in reporting the results of our system, and our summary results are as accurate as the precision and real-time computation capability of the MCU allow.

This system can be used to implement a security system with speech recognition. This can be more convenient than a keypad security system for people with impaired vision. Furthermore, with more computation power, our system could recognize individual voices and much more complex voice inputs.
Legal Considerations


Our project is a simple audio input and processing device. It will not cause any interference with other devices and does not violate any regulations.
Appendix
Appendix A: Commented Code


Download the commented code here.
Appendix B: Schematics

Download Prototype Board Schematic for Mega644 here.
Download Audio Amplifier Schematic here.
Download Serial Communication Panel Schematic here.
Download Switch Circuit here.
Appendix C: Parts & Costs

Parts Name Quantity Source Cost/each Cost
Mega644 1 ECE 4760 Lab $8 $8
Solder Board 2 ECE 4760 Lab $1 $2
RS232 Connector 1 ECE 4760 Lab $1 $1
MAX233 1 ECE 4760 Lab $7 $7
Microphone 1 ECE 4760 Lab Free Free
LM358 Amplifier 1 ECE 4760 Lab Free Free
Push Button 3 ECE 4760 Lab Free Free
LED 3 ECE 4760 Lab Free Free
Jumper Cables 10 ECE 4760 Lab $1 $10
Header Pins 14 ECE 4760 Lab $0.05 $0.70
Resistors Several ECE 4760 Lab Free Free
Capacitors Several ECE 4760 Lab Free Free
Power Supply 1 ECE 4760 Lab $5 $5
Custom PCB 1 ECE 4760 Lab $4 $4
Dip Sockets 2 ECE 4760 Lab $0.5 $1
Total Cost


$38.70

Appendix D: Work Division

Average workload: 20 hours per week
Theoretical analysis: Annie Dai, Youchun Zhang
Hardware setup: Annie Dai
MATLAB simulation: Youchun Zhang
Code debugging: Annie Dai, Youchun Zhang
Report: Annie Dai, Youchun Zhang
Appendix E: References

Download Mega644 datasheet here
Professor Land's FWT test code, which finds the maximum component of each sequency band and displays it on the corresponding LED of the STK500
Previous ECE Project: Musical Water Fountain
R. Nave's page on vowel formants here
Appendix F: The Authors


(top)Youchun Zhang, (bottom) Annie (Wei) Dai

FaceAccess -- A Portable Face Recognition System







"A standalone face recognition access control system"

project soundbyte

We created a standalone face recognition system for access control. Users enroll in the system with the push of a button and can then log in with a different button. Face recognition uses an eigenface method. Initial testing indicates an 88% successful login rate with no false positives.

There are currently commercially available systems for face recognition, but they are bulky, expensive, and proprietary. Our goal was to create a portable low-cost system. Our design consists of an Atmel ATmega644 8-bit microcontroller, a C3088 camera module with an OmniVision OV6620 CMOS image sensor, Atmel's AT45DB321D Serial Dataflash, a Varitronix MDLS16264 LCD module for output, a 9-volt battery, and a small wooden structure for chin support.

High Level Design

Our design is split into three different processes: training, enrolling, and logging in.
Training

The training process is the only time we use a computer; once this step is complete, the system is completely standalone. If we were to sell this system as a consumer product, we would ship the system pre-trained. The training process consists of teaching the system to key in on the most important features of a face. To do this, a large number of facial images are taken and sent to Matlab to help the system determine the distinguishing features of a face. We use Matlab to create the eigenfaces, which are the principal components of the training set (see Principal Component Analysis). Because the images were too large to be held on the microcontroller, we captured an image a line at a time, sending each line to flash before the camera sends the next line. We then send the image through the microcontroller to Matlab over the serial port. Once all of the training images are in Matlab, the eigenfaces and average face are created (see Background Math for more information). Finally, the eigenfaces and average face are sent to flash memory through the microcontroller. Once the eigenfaces and average face are in flash, the system is completely standalone.

Enrolling

Before a user can login to the system, he first needs to register his face and enroll it into the database. To enroll into the system, the user presses the enroll button, which will capture the image. Before the image is captured, the current number of system users is checked; if the maximum number of users is met the new user cannot be enrolled. We set our maximum to 20 users. If the maximum hasn't been met, the user's face image is captured and is once again sent to flash memory 176 bytes at a time (same as in the training process) and is stored there temporarily for calculation. The image and eigenfaces are all pulled back to the microcontroller 528 bytes at a time to calculate the new user's "template", which is a short vector describing the user's correlation with the eigenfaces. (see Background Math for more information). The template is then compared with the previously stored templates; if the new template is too close to a previous template, the user cannot be enrolled. We defined the "closeness" between two templates as the cosine of the angle between them. If there are no matches, the new template is added to the database in flash memory to save it in case of a system reset.

Logging in

The Logging In process is initially very similar to enrolling. The user presses the login button to take his picture, store it in flash memory and begin the logging in process. Again, the newly captured image and eigenfaces are pulled from flash back to the microcontroller 528 bytes at a time to calculate the user's template. This template is compared with all of the previous templates. For the user to be logged in, their template needs to "match" (be close enough to) only one saved template; otherwise they will be "denied access". Again, the cosine of the angle between the two templates is used to determine a template match. Whether or not the user was logged in, the top three matches are displayed on the LCD.

Erasing Users

We currently only have the system set up to support up to 20 enrolled users. However, for demo purposes we wanted the ability to enroll new users and erase the old ones, so we added a final button on the back of our protoboard for this purpose. The button needs to be held down for an entire second before the function to erase templates is called. A message is displayed on the LCD to let the user know that the enrollments have been erased.

Standards Used

For communicating with flash memory we used the Serial Peripheral Interface (SPI). For programming the camera we used the Inter-Integrated Circuit (I2C) interface. The video signal was digital, so it was not NTSC, but understanding the NTSC standard helped us understand the camera output.
Intellectual Property

The eigenface method is in the literature, so we have no intellectual property concerns. We plan to try to publish our results, so we are fully disclosing our design.
Background Math

When one thinks of face recognition, one immediately thinks of finding features of a face: eyes, nose, ears, cheek bones. But who's to say that these are the most distinguishing characteristics of a face, and that they are the best features by which a face should be described? And what if these features are correlated? Instead of hard-coding features for detection, we decided to find the orthogonal features that most optimally describe our large training set by using Principal Component Analysis. This creates an orthogonal basis of principal components for our training set. The basis vectors are known as eigenfaces, and can be thought of as characteristic features of a face. All new faces will be described as a linear combination of these eigenfaces. This is equivalent to projecting the new face onto the subspace spanned by our eigenfaces. By using only the eigenfaces with the highest eigenvalues, we ensure that our projection maintains the most face-like energy from the image. This method was proposed by M. Turk and A. Pentland in 1991.

The system is divided into three different processes: Training, Enrolling, Recognizing.
Training

The training portion of the system consists of creating the eigenfaces based on a set of training faces. The goal was to create as many eigenfaces as possible and keep the M eigenfaces that had the highest eigenvalues.

Given N training images, create a matrix where each column is a face vector of length 176 * 143 = 25168.


We used a total of N=40 training faces to create the eigenfaces. Below are the raw images captured (before any processing) of some of our volunteers; not all of these faces were used to create the eigenfaces.


To create the eigenfaces we use the following algorithm:
Normalize each face:
Calculate the mean and standard deviation of each face. This is like brightness and contrast.

Normalize each image so that each face is closer to some desired mean and standard deviation.

Calculate the mean image based on the N normalized images.

Calculate the difference matrix D using the new mean face.

Find the eigenvectors of D * D'. However, this matrix is too large, so it cannot be done directly:
Find the eigenvectors v of D' * D.

Multiply the resulting eigenvectors by D to get the eigenvectors of D * D'.

Take the eigenfaces that correspond to the M eigenvectors with the highest eigenvalues.
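In symbols, this is the standard Turk-Pentland trick: if v_i is an eigenvector of D' * D with eigenvalue \lambda_i, then D v_i is an eigenvector of the full covariance D * D' with the same eigenvalue:

D' D v_i = \lambda_i v_i  \Rightarrow  (D D')(D v_i) = D (D' D v_i) = \lambda_i (D v_i)

This reduces the eigenproblem from a 25168 x 25168 matrix to an N x N one (40 x 40 here).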

Below are 28 of the 30 eigenfaces we created. For enrolling purposes we use only the top 25 eigenfaces.

Enrolling

The enrollment process consists of creating a user "template" for comparison when he tries to log in again later. The template is an M-length vector (M is the number of eigenfaces used to create the face space) that represents the correlation of the user's face with each eigenface. We use M = 25.

To create a user template from the 176x143 face image, we use the following algorithm:
Follow the same steps as above to normalize the new face image to the desired mean and standard deviation used in the eigenface calculation.
Create the "difference" face by subtracting the average face from the normalized new face
Dot the "difference" face with each eigenface. The value of the ith dot product is the ith element of the template. T is the template, F is the difference face, and E is the eigenface.
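Written out with these names, the template entries are

T_i = F \cdot E_i, \qquad i = 1, \dots, M  (M = 25).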

Logging In

When a user attempts to log in, a new face template is created from the newly captured image as described above. This template is then compared with every other saved template. The measure of correlation we use is the cosine of the angle between the different templates. If the correlation between two templates is above the desired threshold then a "match" has been found. After testing, we decided to use 0.85 as a threshold. This was small enough to reduce false negatives while high enough to eliminate false positives.

To log in a user with their new 176x143 image, we used the following algorithm:
Follow the same steps as the enrollment process to create a new template, T_new
Calculate the correlation of T_new with all of the stored templates to find the number of matches.
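For concreteness, here is a sketch of the template comparison in C. The float math and function names are our assumptions (the Speed section below notes the real system uses integer multiplies):

#include <math.h>

#define M 25                 /* eigenface coefficients per template */
#define THRESHOLD 0.85f      /* match threshold chosen after testing */

/* Correlation between two templates: cosine of the angle between them. */
float cosineSim(const float a[M], const float b[M])
{
    float dot = 0, na = 0, nb = 0;
    for (int i = 0; i < M; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (sqrtf(na) * sqrtf(nb));
}

int isMatch(const float a[M], const float b[M])
{
    return cosineSim(a, b) > THRESHOLD;
}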


Below is a visual example of the projection of a face onto our eigenface space. The image on the left is the image captured by the C3088 camera, and the image on the right is the reconstructed linear combination of eigenfaces defined by the user's template. This person was not used to create the eigenfaces. We never actually create this reconstructed image on the microcontroller. We only store the template.


Hardware and Software



Circuit Board attached to wooden structure


Atmel At45DB321D Flash Memory

We fastened all of our hardware together for portability. A wooden structure with a chin plate is used to hold the camera in place and to normalize the position of a user's face during image capture. An LCD is fixed to the top of the wooden structure to provide system feedback to the user. A printed circuit board connects all of the electronic components including the microcontroller, flash memory, and all three buttons. The board layout was designed by Bruce Land for ECE 4760. We reused the board from a previous group. The board is attached to the side of the wooden structure so the user can easily find and press all of the buttons.
Flash Memory and the Serial Peripheral Interface (SPI)

The ATmega644 only has 4 kB of RAM. This is not enough to hold even one 176x143 image. We need to store a possible 50 images for the eigenfaces. For external memory, we use the AT45DB321D serial dataflash from Atmel. It stores 2 MB in 4096 pages of 528 bytes per page. This chip requires a Vcc of 2.5 - 3.6V. We chose 3.5V so that logic high from the flash would be read as logic high on the microcontroller, which has a Vcc of 5V. We used the voltage regulator circuit to the right to supply 3.5V to flash.


In addition to main memory, the chip has two 528-byte buffers. You can read from buffers or main memory. To write main memory, you must first write to a buffer, and then copy the buffer to main memory. The latter is an internally timed operation which takes about 3 ms. The speed of the former depends on how much data you clock into the buffer and at what rate.

Communication with flash memory uses the Serial Peripheral Interface (SPI), which is a full duplex serial communication protocol. The ATmega644 has dedicated SPI hardware which makes coding easy. We configured the microcontroller as the master and flash as the slave. The microcontroller outputs a clock on pin B.7 to flash through a 1k resistor. The microcontroller outputs data on pin B.5 to flash through a 1k resistor. Flash outputs data to the microcontroller on pin B.6 through a 330 ohm resistor. Chip select is pin B.3. Brian wrote a general purpose library called flashmem.c for communicating with flash. A description of its functions is below.
void spi_init(): Configures the microcontroller SPI parameters.
unsigned char readStatusRegister(): Returns the contents of the flash’s status register.
unsigned char isBusy(): Returns the most significant bit of the flash’s status register. The assertion of this bit indicates the flash is busy.
void writeBuffer(unsigned char data[], unsigned int n, unsigned int byte, char bufferNum): Writes the first n bytes in the data array to buffer 1 or 2, depending on bufferNum. The data will be copied to locations [byte : byte+n-1] in the buffer.
void readBuffer(unsigned char* data, unsigned int n, unsigned int byte, char bufferNum): Reads bytes [byte : byte+n-1] in the buffer and copies it into the data array.
void bufferToMemory(unsigned int page, char bufferNum, char erase): Copies the data in the buffer to a page in main memory. If erase==0, this page must have been previously erased for this operation to work. After this command is sent, the flash will be busy for about 2.5 ms. If erase==1, this function will first erase the page, and then copy the data. This will take about 25 ms.
void readFlash(unsigned char* data, unsigned int n, unsigned int page, unsigned int byte): Reads bytes [byte:byte+n-1] at the page specified in main memory. The result is copied into the specified data buffer.
void eraseBlock(unsigned int block): Erases pages [8*block : 8*block+7] in main memory.
void eraseFace(unsigned char k): Erases blocks [8*k : 8*k+7] in main memory.
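As a usage sketch of this API (our example, following the image-capture scheme described below): filling buffer 1 with three 176-byte camera lines (3 * 176 = 528 bytes) and committing it to a page:

#include "flashmem.h"    /* the library described above */

void storeThreeLines(unsigned char line[3][176], unsigned int page)
{
    for (unsigned char i = 0; i < 3; i++)
        writeBuffer(line[i], 176, i * 176u, 1);  /* fill buffer 1 at offsets 0/176/352 */
    bufferToMemory(page, 1, 1);                  /* erase the page, then copy (~25 ms) */
    while (isBusy())
        ;                                        /* wait out the internally timed op   */
}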

Memory Layout

For indexing ease, we thought of main memory in terms of "faces" of 64 pages each. We were having some issues with pages and blocks of pages being randomly corrupted and erased, so we needed to move the eigenfaces from their original pages in memory. Currently, Face 51 is used as the temporary space for the newly captured image. This face is erased immediately after the user's template has been created and analyzed. Faces 7 through 36 hold the eigenfaces, though not in order of importance. There is an index array in the final FaceRecSystem.c to index correctly into these eigenfaces. Finally, the mean face is stored in Face 37.
Camera Module and the Inter-Integrated Circuit (I2C) Interface

We used the C3088 camera module for image capture. This module uses OmniVision’s OV6620 CMOS image sensor. Camera settings are programmable through a standard Inter-Integrated Circuit (I2C) interface, which is a two-wire, half duplex interface. The camera outputs VSYNC, HREF, and PCLK control signals and an 8-bit parallel digital video output Y7-Y0.

I2C Interface

As with SPI, the ATmega644 has dedicated hardware for I2C. We used an I2C library written by Peter Fleury to create two general purpose functions for the C3088:
unsigned char camera_read(unsigned char regNum): Returns the contents of the camera register regNum
void camera_write(unsigned char regNum, unsigned char data): Writes data to the camera register regNum

We mostly used default settings for the camera. Apart from the default settings, we chose to:
Reduce frame size from 352x288 pixels to 176x144 pixels.
Reduce frame rate from 60 frames per second (fps) down to about 5 fps.

Video Signal

The OV6620 image sensor outputs an 8-bit parallel gray-scale digital video signal on pins Y7-Y0. It also outputs PCLK, VSYNC, and HREF for timing. The format is similar to the analog NTSC standard. See the timing diagram below. PCLK is a 300 kHz clock. A 4 millisecond (ms) pulse on VSYNC indicates the start of a new frame. After this, data is output row by row. Every row, HREF goes high and data is subsequently sent out on Y7-Y0, one pixel per rising edge of PCLK. HREF is high for about 0.6 ms, during which time 176 pixels are clocked out. HREF is then low for about 1.4 ms before starting the next row.


Image Capture

One of the biggest technical hurdles at the beginning of the project was image capture. The ATmega644 only has 4kB of RAM, and a 176x143 pixel gray-scale image is over 25 kB. Thus we were forced to read the video signal and write it to flash simultaneously. To achieve this, we ingest one line (176 pixels) from the camera when HREF is high, and then write it to the flash buffer when HREF is low. Writing a line to the buffer takes about 0.9 ms. Every 3 lines, the flash buffer fills up and we command flash to copy the buffer to main memory. This is an internally timed operation which takes about 3 ms. By alternating buffers every 3 lines, we are able to keep up with the video signal. See the timing diagram below.
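Putting the flashmem.c API together with the timing above, the capture loop plausibly looks like this sketch (the pin-reading macros are hypothetical stand-ins for the real port reads):

#include <avr/io.h>
#include "flashmem.h"

char pclkRose(void);                     /* hypothetical PCLK edge-detect helper */
#define HREF_HIGH()   (PIND & 0x04)      /* assumed HREF wiring                  */
#define PIXEL_IN()    (PINB)             /* assumed Y7-Y0 port                   */

void captureFrame(unsigned int firstPage)
{
    unsigned char line[176];
    char buf = 1;
    unsigned int page = firstPage;

    for (int row = 0; row < 143; row++) {
        while (!HREF_HIGH()) ;                        /* wait for the line start   */
        for (int i = 0; i < 176; i++) {
            while (!pclkRose()) ;                     /* one pixel per PCLK edge   */
            line[i] = PIXEL_IN();
        }
        writeBuffer(line, 176, (row % 3) * 176, buf); /* done while HREF is low    */
        if (row % 3 == 2) {                           /* buffer full: 528 bytes    */
            bufferToMemory(page++, buf, 0);           /* pre-erased page, ~3 ms    */
            buf = (buf == 1) ? 2 : 1;                 /* alternate buffers to keep up */
        }
    }
}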

Interfacing and Testing with Matlab

We decided that the best way to view the images captured by the camera was to send the captured image from flash memory to Matlab. The easiest way to communicate with Matlab is to use the serial port and the uart as defined by Joerg Wunsch in uart.c and uart.h.

Interfacing with Matlab proved to be a little more complicated than we originally expected because of the amount of data we were trying to transfer at once. The SerialConnect.m Matlab file was split into three separate sections: setting up the connection, receiving data, and sending data. The serial connection was set up based on the COM port the microcontroller was connected to as well as the desired baud rate (which must match the definition in uart.c). The receive portion of the script receives data in chunks of 66 elements, or 528 bytes, at a time; we found that this is the maximum amount of data that could be sent to Matlab over the uart in one buffer. The send portion of the script prints to the uart the number of "faces" it is sending and where in flash memory the data is supposed to go, and then the data is sent one byte at a time. We found that if we sent more than one byte at a time and didn't pause between print statements, the microcontroller wasn't able to read it in. This unfortunately slowed down flash programming significantly: it took about 3 hours to load all the eigenfaces onto flash.

A MatlabLib.c file was also written to set up the data transfer on the microcontroller side. It is also split into three different functions. The first is a set of print statements that sets up Matlab's read program to expect the correct amount of data, depending on the number of faces and the size of each face being sent over the uart. Another function, dumpFrame(), splits up the data and sends it over the uart in blocks of 66. The final function, readFrame(), scans for the amount of data being sent over the uart and then does individual scans until it has read all of the data.

These two uart interfaces were used in many of the different processes of the project to move data back and forth between the camera, flash memory, and Matlab. In addition to being necessary to calculate the eigenfaces and send them to flash, this was a very useful debugging tool. There are a number of functions throughout the different C files that use the interface to send data.
sendFacetoMatlab(unsigned char face): reads face 'face' from flash a page at a time and sends it over to Matlab. We specifically used this to send test images from the camera, and previous writes, into Matlab to check their correctness.
pageMatlabFlash(int pagNum): reads one page from Matlab and sends it to page pagNum in flash Memory.
faceMatlabFlash(char face): sets up the page numbers for a specific face so that pageMatlabFlash can send and receive the correct data to flash memory.
Problems

Our biggest problem was that every so often, our eigenfaces would get corrupted in flash memory. At first we suspected ESD, so we covered flash in an ESD-free bag when transporting our device. We also suspected faulty wiring, so we re-soldered flash, simplifying the wiring and eliminating loopy wires. We also slowed down SPI by a factor of 2. Perhaps some of these things helped since we haven't seen any more corruption, but we have also begun to handle flash with more care, so it is not clear what the problem was.

Results
Accuracy

We tested our face recognition system on 12 people. We conducted 25 normal login attempts, 22 of which were successful. The three unsuccessful login attempts (i.e. false rejections) were barely under the threshold for acceptance. There were zero false acceptances. This was a strict design constraint. A physical access system can have a 10% false rejection rate but should never recognize user A as user B. False rejections are resolved by trying again, but false acceptances usually mean a security breach. Selected test results are below.

We also conducted 10 login attempts with added variables to test the limitations of our design. Such variables included hair modification, glasses, smiling, sticking out tongue, head tilting, etc. We found that our system was sensitive to hair, head tilt, and glasses, but not as sensitive to smiling or sticking out tongue. Selected test results are discussed below.
User 7 Normal Login

The plot below shows 11 login attempts by User 7. In all 11 trials, User 7 correlates most closely with herself. Nine of these logins were correlated enough to be successful. Two were just under the threshold. In her last login attempt User 7 was sticking out her tongue, but her correlation was still above the threshold.


Login Attempts by User 7
User 7 Login with Hair Modification

The plot below displays a limitation of our design: our system is sensitive to hair arrangement. The plot shows 4 more login attempts by User 7, but this time she let her hair down before trying to log in. Her template in the database was created with her hair pulled back. Letting her hair down brings her correlation well below the threshold.


Login Attempts by User 7 with her hair let down
User 8 Normal Login

The plot below shows 5 login attempts by User 8. Four were successful. We suspect that the failed login (attempt 3) was due to User 8 slightly tilting his head during image capture. This is indicative of a limitation of our design: our system is sensitive to changes in head orientation in both pitch and roll. If we had time, we would have expanded our structure to include a place to rest the forehead. In Login Attempt 5, User 8 was smiling, but the system still recognized him.


Login Attempts by User 8
Glasses

User 11 enrolled with his glasses on and then failed to log in with his glasses on in three login attempts. The correlation was around 0.60. He removed his glasses and re-enrolled. After this, he could log in (without his glasses) repeatably. This indicates that our system is very sensitive to glasses. This makes sense since none of our training images included glasses. Our face space was chosen to exclude glasses features, so the projection onto the face space of a face with glasses is rather sensitive.
No Face

We performed three tests where we took pictures of non-face objects: hand, water bottle, and background only. In all of these tests, login was unsuccessful.


Login Attempts with no face in the picture
Speed

At the beginning of our project we identified run time as a potential problem. The major contribution to run time is calculating the unknown user template. A template is really 25 dot products, one for each eigenface. Each is a length 25168 dot product. This is 630,000 multiplies. In addition to this, we need to load all the eigenfaces from flash to the microcontroller over SPI running at about 100 kB/s. By doing integer multiplications instead of floating point, we were able to bring the run time down to about 15 seconds.
Safety

There are only two possible safety concerns in our design. One is that we ask you to touch your chin to the chin plate. There is a possibility of spreading germs. In a real system, we would probably make this plastic instead of wood, and provide disinfectant to clean the chin plate. The other possible safety concern is that we have some sharp exposed ends of screws. They are not near buttons, the chin plate, or places where you would grab our device to move it.
Usability

Given more time, we would have tried to find more optimizations to reduce runtime. That being said, 15 seconds is a reasonable time for an access control system.

You only need to keep your chin on the chin plate while the image is being captured. This is less than a second. We found that people were confused about this because they couldn't see the LED. The LED is on and only on during image capture. Given more time, we would move the LED to a more visible location.

Since we have a portable system, it can conform to people of all sizes.

Conclusions
Results vs. Expectations

Our results were better than we expected. We were able to show a reasonable login success rate (88%) with reasonable run time (15 seconds). For this we were extremely satisfied. It was a little disappointing that our system was so sensitive to hair modifications, head tilt, and other sometimes unidentifiable variables. Some of these are limitations of the eigenface method.

If we were to repeat this project from scratch, we would do a few things differently. First, we would improve mechanical structure to include a forehead rest or some other head position normalizer. Second, we would use something other than serial dataflash. This could improve run time and would have prevented the data corruptions that plagued our last week. If we improved run time, we could use more eigenfaces, further improving our successful login rate and making our system more robust. We would have also liked to use more training faces, but it was difficult to find more than 50 people in a few days to train our system with.
Conforming to Standards

We were able to communicate with flash memory and with the camera. This means that we successfully conformed to the SPI and I2C standards.
Intellectual Property

Our design poses no intellectual property problems. The eigenface method is well known in the literature. All aspects of our design that we didn't develop ourselves are properly cited in our references below.

We will try to publish our design so everything we did will be in the public domain.
Ethical Considerations

We strived to abide by the IEEE Code of Ethics during this project.

Facial recognition brings up a number of ethical issues. For many people, the mere idea of automatic facial recognition hints at George Orwell’s “Big Brother is watching you.” However, we do not see any ethical issues for our project since all participants are willing and there is no passive recognition. That is to say that no one is being recognized against their will, and every instance of recognition is initiated by the user. It’s not as if we are creating a real-time video facial recognition system using a large public database. No user can be recognized without first registering.

Another concern with many biometric systems is that the biometric template is stored in a database that could be compromised. Even though the captured picture is stored in flash during enrolling and logging in, it is erased as soon as the template is created. It is only stored in flash memory for 15 seconds. The templates themselves do not store the actual face, and the problem of recreating the face from the template and the eigenfaces is underdetermined and cannot be solved. These measures protect the user's identity from a data compromise.
Legal Considerations

Our project is a simple standalone device. It does not emit any appreciable EMF radiation and does not cause interference with other electronics. We made sure to get everyone's permission before publishing their names or images on this website.

Appendices
A. Source Code

Source files
FaceRecSystem.c (19KB) – the code for enrolling and logging in users using the flow diagrams from the project overview.
camToMatlab.c (5KB) – the code used to take images, and transfer them from flash memory to Matlab. Used to load training images to Matlab through the serial port.
MemToMatlab.c (5KB) – the code used to test sending data from flash memory to Matlab through the serial port.
MatlabToMicro.c (6KB) – the code to move the eigenfaces from Matlab through the microcontroller to the flash memory through the serial port.

The following source files are the libraries we created or borrowed for the project.
camlib.c (4KB) – the library for setting up the camera and interfacing with it, written by Brian Harding
flashmem.c (8KB) – the library that sets up the flash memory and controls the reads, writes, and erases, written by Brian Harding
Matlablib.c (2KB) – the library that sets up the data to be sent to and from Matlab, written by Cat Jubinski
uart.c (4KB) – an implementation of UART, written by Joerg Wunsch.
twimaster.c (6KB) – the I2C master library, written by Peter Fleury
lcd_lib.c (9KB) – the library for the LCD, written by Scienceprog.com
Header files
camlib.h– written by Brian Harding
flashmem.h– written by Brian Harding
MatlabLib.h– written by Cat Jubinski
uart.h – written by Joerg Wunsch
i2cmaster.h – written by Peter Fleury
lcd_lib.h – written by Scienceprog.com
Matlab scripts
SerialConnect.m (3KB) – used to receive face images from flash memory and send eigenfaces back to flash memory. Written by Cat Jubinski
getEigFaces.m (6 KB) – used to create the eigenfaces and simulate enrollments and logins in MATLAB. Written by Brian Harding
efaces0501.mat (592 KB) – Eigenfaces used in the system.
meanface0501.mat (12 KB) – Mean Face used in the system

Download all files: Code.zip (44KB)
B. Schematics


Hardware schematic [full-size image]. Download schematic file: faceAccess_schematics.sch (36KB).
C. Parts List
Part                                     Source        Unit Price  Quantity  Total Price
ATmega644 (8-bit MCU)                    ECE 4760 Lab  $8.00       1         $8.00
C3088 with OV6620 (Image Sensor Module)  ECE 4760 Lab  $0.00       1         $0.00
AT45DB321D (Serial Flash)                DigiKey       $3.87       1         $3.87
Varitronix MDLS16264 (LCD Module)        ECE 4760 Lab  $8.00       1         $8.00
PC Board + Power Regulator               ECE 4760 Lab  $0.00       1         $0.00
Wood & Screws                            Lowes         $5.00       1         $5.00
9-Volt Battery                           Lowes         $3.00       1         $3.00
Push Buttons                             lab           $0.00       2         $0.00
Header Pins                              lab           $0.05       18        $0.90
Total                                                                        $28.77

D. Tasks

This list shows specific tasks carried out by individual group members. Everything else was done together.

Brian
Interfacing with Flash Memory (SPI)
Programming the Camera (I2C)
Building Mechanical Structure
Soldering the PCB

Cat
Sending Training Images to Matlab
Sending Eigenfaces to Flash
Designing Website
E. Pictures

Below are some pictures of us working in the lab. Photos courtesy of Bruce Land.


















References

This section provides links to external reference documents, code, and websites used throughout the project.

Datasheets
ATmega644 (8-bit MCU)
C3088 (Camera Module)
OV6620 (Image Sensor)
AT45DB321D (Serial Dataflash)
LCD (Display)
Reference Code

We referenced previous projects' image sensor code to create our camlib.c library.
CMOS Camera Rock Paper Scissors Game System
3D Scanner
Tic Tac Toe with CMOS Camera

Vendor Sites
DigiKey
Atmel
Background Info
Face Recognition Using Eigenfaces: Matthew Turk and Alex Pentland
Eigenfaces for Recognition: Matthew Turk and Alex Pentland
Drexel University Eigenface Tutorial

Acknowledgements

We would like to thank ECE 4760 Professor Bruce Land and all TA staff (especially our lab TA Tom Gowing), for help and support during the labs and over the course of the project. We thank them for the long lab hours and the parts stocked in the lab. Thanks to Adam Papamarcos and Eileen McIver for brainstorming sessions and technical help.

We would also like to thank everyone who "donated" their face for creating eigenfaces and testing our system:

Jon Altiero
George Barrameda
Michael Brancato
William Bruey
Daniel Charen
Bonnie Chong
Jimmy Da
Danielle Feldman
Cameron Glass
Katie Hamren
Daniel Hare
Henry Hinnefeld
Cindy Huang
Adam Jackman
Bruce Land
Kevin Martin
Eileen McIver
James McMullen
Aaron Miller

Thomas Mitchell
Joe Montanino
Elise Newman
Adam Papamarcos
Pouria Pezeshkinam
Stephen Prizant
Hyundo Reiner
Evan Respaut
Nick Rho
Gena Rozenberg
Michael Schwendeman
Ragini Shama
Alec Story
Kevin Ullmann
Roger Varney
Elizabeth Walker
John Wright
Jeff Yates


 