How do humans hear speech (5:25) - an introduction to how evolution did it:
An organ in our inner ear called the cochlea plays a specialized role in our auditory system. It is responsive to frequency: different frequencies of sound move different specific areas along the basilar membrane. Depending on where the basilar membrane moves, different nerve impulses are triggered and inform the brain. A step in the process of extracting Mel-frequency cepstral coefficients (MFCCs, popular features for ASR), called periodogram estimation, does a very similar thing.
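A minimal sketch of that analogy in NumPy (the sample rate, FFT size, and test tone are assumed values, not from the notes): the periodogram of one 25 ms frame assigns energy to frequency bins, loosely like the basilar membrane responding at different positions for different frequencies.

```python
import numpy as np

fs = 16000                            # sample rate in Hz (assumed)
t = np.arange(int(0.025 * fs)) / fs   # one 25 ms frame
frame = np.sin(2 * np.pi * 440 * t)   # synthetic 440 Hz test tone

nfft = 512                            # FFT size, a common choice
spectrum = np.fft.rfft(frame, nfft)
periodogram = np.abs(spectrum) ** 2 / nfft   # power spectrum estimate

# The bin with the most energy should sit near 440 Hz,
# analogous to one region of the membrane moving the most.
peak_hz = np.argmax(periodogram) * fs / nfft
```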
Steps to prepare MFCCs:
- Split the audio signal into small frames of 20-40 ms, the standard is 25 ms.
- Calculate the periodogram estimate (power spectrum) for each frame.
- Take clumps of periodogram bins and sum the spectrum inside each clump to get the energy levels around different frequencies. We use a Mel filterbank to do this; the Mel scale tells us exactly how to space the filterbanks.
- Take the logarithm of the filterbank energies. Humans also don't perceive loudness on a linear scale.
- Compute the DCT of the log filterbank energies. We do this to decorrelate the filterbank energies, which are quite correlated. We compress by keeping only the first 12 or 13 coefficients.
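The steps above can be sketched end to end for a single frame. This is an illustrative NumPy-only implementation, not a reference one: the sample rate, FFT size, and filter count are assumed defaults, and the Mel conversion uses the common formula m = 2595 log10(1 + f/700).

```python
import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mfcc_frame(frame, fs=16000, nfft=512, n_filters=26, n_coeffs=13):
    # Step 2: periodogram estimate (power spectrum) of the frame
    power = np.abs(np.fft.rfft(frame, nfft)) ** 2 / nfft

    # Step 3: triangular filters equally spaced on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)  # rising edge
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)  # falling edge

    # Step 4: log of the filterbank energies (small offset avoids log(0))
    log_energies = np.log(fbank @ power + 1e-10)

    # Step 5: DCT-II to decorrelate; keep only the first n_coeffs coefficients
    m = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), m + 0.5) / n_filters)
    return basis @ log_energies

# Usage on one synthetic 25 ms frame (Step 1 would split a longer signal)
fs = 16000
frame = np.sin(2 * np.pi * 300 * np.arange(int(0.025 * fs)) / fs)
coeffs = mfcc_frame(frame, fs=fs)
```

In practice a library such as `librosa` or `python_speech_features` would also apply windowing and pre-emphasis, which this sketch omits for brevity.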
Python libraries to extract MFCCs:
- Using state-of-the-art LSTM recurrent neural networks
Detailed LSTM tutorial