Speech Recognition: the process of converting a speech signal (audio) into a sequence of words.
Most systems today use the “Hidden Markov Model”
- “A statistical model in which the system being modeled is assumed to be a “Markov process” with unknown parameters; the challenge is to determine the hidden parameters from the observable parameters.”
- The “state” is not directly visible, but variables influenced by the state are visible. A sequence of tokens generated by a Hidden Markov Model gives information about the sequence of states.
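
Recovering the hidden state sequence from the visible tokens can be sketched with the Viterbi algorithm, a standard decoding method for Hidden Markov Models. This is a minimal illustration only: the two states, the two observation symbols and every probability below are made up and do not come from a real recognizer.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


# Hypothetical toy model: hidden sound classes, visible loudness tokens.
states = ("vowel", "consonant")
start_p = {"vowel": 0.4, "consonant": 0.6}
trans_p = {
    "vowel": {"vowel": 0.3, "consonant": 0.7},
    "consonant": {"vowel": 0.6, "consonant": 0.4},
}
emit_p = {
    "vowel": {"loud": 0.8, "quiet": 0.2},
    "consonant": {"loud": 0.3, "quiet": 0.7},
}
observations = ["quiet", "loud", "quiet"]

print(viterbi(observations, states, start_p, trans_p, emit_p))
```

Only the loudness tokens are given to the decoder; the vowel/consonant labels are inferred from them, which is the sense in which the states are “hidden”.
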
An “Artificial Neural Network” can also be used
Modern systems use the “Noisy Channel Formulation”
- The task of the recognition system is to search for the most likely word sequence given the acoustic signal, i.e. the system searches for the most likely word sequence among all possible word sequences
( $\tilde{W}$ = most likely word sequence, $W^*$ = all possible word sequences, $A$ = the acoustic signal ).

$$\tilde{W} = \arg\max_{W \in W^*} \Pr(W \mid A)$$
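
In practice Pr(W|A) is usually not modelled directly. By Bayes' rule it is proportional to Pr(A|W) · Pr(W), and Pr(A) is the same for every candidate, so the search can rank candidates by an acoustic score multiplied by a language-model score. The sketch below does this over a tiny hand-written candidate list; the candidate sentences and all probabilities are invented for illustration.

```python
import math

# Hypothetical scores for three candidate word sequences W given one signal A.
# acoustic[W] stands in for Pr(A|W), language[W] stands in for Pr(W).
acoustic = {
    "recognize speech": 0.30,
    "wreck a nice beach": 0.35,
    "recognise peach": 0.10,
}
language = {
    "recognize speech": 0.020,
    "wreck a nice beach": 0.001,
    "recognise peach": 0.0005,
}


def decode(candidates):
    """Return the candidate maximising log Pr(A|W) + log Pr(W)."""
    return max(candidates, key=lambda w: math.log(acoustic[w]) + math.log(language[w]))


best = decode(acoustic.keys())
print(best)
```

Here the acoustically strongest candidate loses because the language model considers it far less likely as a sentence. Working in log-probabilities avoids underflow when many small probabilities are multiplied.
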
Audio Visual Speech Recognition
- A technique that uses “image processing” technology to lip-read, as an aid to speech recognition.
- Each system (lip reading / speech recognition) works separately, and the results are then combined at a later stage.
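
One simple way to combine the two results at a later stage is a weighted sum of the log-probabilities each recognizer assigns to a word. This is only one possible fusion rule, not necessarily the one real systems use; the word list, the scores and the weight are hypothetical.

```python
import math


def fuse(audio_scores, visual_scores, audio_weight=0.7):
    """Pick the word with the best weighted log-linear combination of scores."""
    fused = {}
    for word in audio_scores:
        fused[word] = (
            audio_weight * math.log(audio_scores[word])
            + (1.0 - audio_weight) * math.log(visual_scores[word])
        )
    return max(fused, key=fused.get)


# Hypothetical outputs: in noisy audio "meet" and "neat" are nearly tied,
# but the lip reader sees the lips close at the start, which favours "meet".
audio = {"meet": 0.45, "neat": 0.47, "seat": 0.08}
visual = {"meet": 0.60, "neat": 0.25, "seat": 0.15}

print(fuse(audio, visual))
```

Setting `audio_weight=1.0` ignores the lip reader entirely and, with these numbers, returns the audio recognizer's own first choice instead.
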
Speech Synthesis
- Artificial production of human speech.
- Text to Speech
- Symbolic linguistic representations, e.g. phonetic transcriptions
Voice Analysis
- The study of speech sounds for purposes other than linguistic content.
- E.g. analysis of the vocal quality of medical patients
- Speech therapy
Phonetics
- Sound is a series of pressure changes in the medium between the sound source and the listener
- Oscillogram / waveform = the most common representation; it plots the pressure increases / decreases of the signal over time.
- Pitch Analysis = Another representation of a speech signal
- Speech is a physical process consisting of two parts
- Production of sound by a sound source ( vocal cords )
- Filtering ( tongue, lips, teeth )
- Pitch Analysis tries to capture the “Fundamental Frequency of the sound source” by “analysing the final speech utterance”.
- Fundamental Frequency – the dominant frequency of the sound produced by the vocal cords.
- Difficult to perform.
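
Pitch analysis can be sketched with a naive autocorrelation method: the signal is compared with delayed copies of itself, and the delay that matches best is taken as one period of the fundamental. The example is an assumption-laden illustration, not how any particular tool does it. The test signal is synthetic (a 100 Hz tone plus two weaker harmonics), and the sample rate and search range are arbitrary choices.

```python
import math


def estimate_f0(samples, sample_rate, min_hz=80.0, max_hz=400.0):
    """Estimate the fundamental frequency from the autocorrelation peak."""
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    window = len(samples) - max_lag  # same number of samples compared at every lag
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[n] * samples[n + lag] for n in range(window))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


sample_rate = 8000
n_samples = 340  # 100 lags plus a 240-sample comparison window
voiced = [
    math.sin(2 * math.pi * 100 * i / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 200 * i / sample_rate)
    + 0.25 * math.sin(2 * math.pi * 300 * i / sample_rate)
    for i in range(n_samples)
]

print(estimate_f0(voiced, sample_rate))
```

On clean synthetic input like this the peak is unambiguous. Real speech is noisy and only roughly periodic, and the filtering by the vocal tract can make a harmonic look stronger than the fundamental, which is part of why pitch analysis is difficult to perform.
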
- Spectrum
- A spectrum gives a picture of the distribution of frequency and amplitude at one moment in time.
- Showing how the spectrum changes over time requires a 3D graph, i.e. a spectrogram
- Spectrogram
- Time = horizontal axis, frequency = vertical axis. Amplitude ( 3rd axis ) represented by shades of darkness.
- Voiced sounds appear more organised.
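
The relationship between a spectrum and a spectrogram can be sketched by cutting a signal into short frames and computing one magnitude spectrum per frame. This is a minimal sketch using a plain discrete Fourier transform with no window function or overlap, which real tools normally add. The signal (a 500 Hz tone followed by a 1500 Hz tone), the sample rate and the frame size are assumptions chosen so that the tones fall exactly on frequency bins.

```python
import cmath
import math


def spectrogram(samples, frame_size):
    """Return one magnitude spectrum (list of bins) per non-overlapping frame."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        spectrum = []
        for k in range(frame_size // 2 + 1):  # bins from 0 Hz to half the sample rate
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            spectrum.append(abs(total))
        frames.append(spectrum)
    return frames


def tone(hz, n, sample_rate):
    """Generate n samples of a sine wave at the given frequency."""
    return [math.sin(2 * math.pi * hz * i / sample_rate) for i in range(n)]


sample_rate = 8000
frame_size = 80  # bin spacing = sample_rate / frame_size = 100 Hz
samples = tone(500, 320, sample_rate) + tone(1500, 320, sample_rate)

frames = spectrogram(samples, frame_size)
peaks = [
    max(range(len(s)), key=s.__getitem__) * sample_rate / frame_size for s in frames
]
print(peaks)
```

Each entry of `frames` is one spectrum (frequency against amplitude); the list of frames adds the time axis. Mapping each magnitude to a shade of grey and drawing the frames side by side would give the usual spectrogram picture.
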
Some Links
“Speech Analysis Tutorial”
http://www.ling.lu.se/research/speechtutorial/tutorial.html
“The CMU Sphinx Group, Open Source Speech Recognition Engines”
http://cmusphinx.sourceforge.net/html/cmusphinx.php
“Praat : Phonetics by Computer”
http://www.fon.hum.uva.nl/praat/
“Speech Analyzer”
http://www.sil.org/computing/speechtools/speechanalyzer.htm
“lingWAVES : Signal Analysis”
http://www.lingcom.de/english/products/lingWAVES/lingWAVES_overview.htm









