The entire content of this article along with the images is copyrighted to © Uma Chandrasekhar 2023

Audio Analysis using Machine Learning- Part -3

Uma Chandrasekhar
10 min read · Dec 29, 2023


In this article, I will discuss the fundamentals involved in speech recognition and music classification. I am not exploring any specific speech recognition or music classifier algorithm in detail. Instead, I mention the various algorithms used in the process, along with the salient features that distinguish them from one another, since discussing all of these algorithms is such a wide topic that it lies beyond the boundaries of this three-part series.

Speech Recognition

The basic roles of a speech recognition algorithm are

  • To identify speech among other sounds by using phonemes and by identifying any impulse with a constant frequency of around 100 to 120 Hz
  • To identify accent and emotions such as interrogation, sarcasm, etc., from alterations in the frequency of the impulses received from the speech
  • To identify the language of the spoken words
  • To identify the voice of the speaker
  • To identify the gender of the speaker using formant locations. The male and female classification is done using the first three formant frequencies, as the audio data collected and analyzed from male and female speakers differ sufficiently at those frequencies. The formant frequencies are obtained by locating the peaks when the transfer function of the filter (the vocal tract) is applied to the sound impulses produced by the vocal cords; a minimal code sketch of this follows below.
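As a concrete illustration of the formant idea above, here is a minimal Python sketch, under my own assumptions, that fits an autoregressive (LPC) model of the vocal tract to a short speech frame and reads the first formants off the filter's resonance peaks. The file name and analysis parameters are hypothetical placeholders, not part of the original article.

```python
# Minimal sketch of formant estimation via LPC (autoregressive) analysis.
# "speech.wav" is a hypothetical recording of a sustained vowel.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)   # hypothetical file
frame = y[2048:4096] * np.hamming(2048)        # one windowed analysis frame

# Fit an autoregressive (LPC) model of the vocal tract; order ~ sr/1000 + 2
a = librosa.lpc(frame, order=int(sr / 1000) + 2)

# Formants are the resonance peaks of the all-pole filter 1/A(z):
# take the angles of the complex roots of A(z) above the real axis.
roots = np.roots(a)
roots = roots[np.imag(roots) > 0]
freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
print("First three formant estimates (Hz):", freqs[:3])
```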

Phonemes

Phonemes are the fundamental components in speech recognition. Any successful speech recognition algorithm must be able to identify phonemes reliably.

Phonemes are defined as the individual sounds that make up words. Examples of phonemes are ‘ah’, ‘oy’, ‘ei’ and ‘s’. Phonemes are categorized based on the glottal impulses, which in turn depend on the movement of the epiglottis in the throat and the distance the impulses have to travel. The distance traveled by an impulse depends on the position of the lips, teeth, or mouth. Each phoneme is identified using an autoregressive filter constructed to prevent other glottal impulses from passing through it. The resultant time-domain representation of the identified impulses appears as peaks and troughs on a graph of amplitude (Y axis) against time in seconds (X axis).


Phonemes are divided into three major categories.

  1. Vowels and pseudo-vowel consonants: There are two types.

a) Diphthongs: a combination of two vowels

b) Affricates: a combination of two consonants

2. Fricative sounds: There are two types.

a) Unvoiced: a hiss caused by the flow of the air stream, which lacks the impulsive resonances of the vocal tract vibrations typical of a voiced sound

b) Voiced: a combination of vocal tract formant resonances formed from glottal impulses and the fricative hiss

3. Stop consonants: These are characterized by a complete cessation of the air stream, identified as unvoiced or voiced stop consonants followed by a glottal stop, which can be recognized by a single click caused by a sudden burst of air through the glottis.

Speech Recognition Architecture

Speech recognition architecture falls into two categories:

  1. Automatic Speech Recognition (ASR) — Speech to Text
  2. Text to Speech.

In this article, I will focus only on ASR.

In ASR, the three most important aspects that need to be considered are

Speaking environment and microphone: points to be noted here are

  • Speaking in noisy environments such as an office, a home, or public places like malls, movie theaters, etc.
  • Speaking through a headset or in close physical proximity
  • Speaking through a landline or a mobile phone
  • Speaking while standing far away from the microphone, which introduces noise and jitter into the audio waveform

Style of speaking

  • The speaking style varies based on the person’s ability to utter words. The classifications include isolated words, connected words, fluent speaking, continuous speech with a limited vocabulary, and spontaneous speech with an unlimited vocabulary and ungrammatical words.

Speaker modeling

  • Speaker modeling is of four types: speaker-dependent, speaker-independent, average-speaker, and speaker-adaptation models.

The main aim of ASR is to build an acoustic model and a language model based on speech waveforms and human-annotated transcriptions, with extra data from spontaneous speech. ASR typically uses:

  • Hidden Markov Models (HMMs)
  • Gaussian Mixture Models (GMMs)
  • End-to-end deep models based on Connectionist Temporal Classification (CTC)


The speech recognition (speech-to-text) architecture consists of five phases:

Step 1. Feature extraction: already discussed in Part 2 (from the speech waveform to a spectrogram, time domain, or frequency domain map).

Step 2. Frame classification is achieved using Gaussian Mixture Model (probabilistic) and HMM (deterministic) based deep models. One advantage of HMMs over GMMs is that most probabilistic models make assumptions about the input data, whereas deterministic models take the data as it is and hence operate more efficiently than their probabilistic predecessors.
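To make the frame-classification step more tangible, here is a small Python sketch, under my own assumptions, that fits one Gaussian mixture per phoneme class on MFCC frames and labels new frames by maximum likelihood. The file names and class labels are hypothetical placeholders, and this is a simplification of how GMM acoustic models are actually trained.

```python
# Sketch of GMM-based frame classification: one mixture per phoneme class,
# frames assigned to the class with the highest log-likelihood.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, 13)

# Hypothetical training data: one audio file of examples per phoneme class.
train_files = {"ah": "ah_examples.wav", "s": "s_examples.wav"}
models = {}
for label, path in train_files.items():
    gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
    gmm.fit(mfcc_frames(path))
    models[label] = gmm

# Classify each frame of an unseen utterance by maximum per-frame likelihood.
test = mfcc_frames("utterance.wav")
scores = np.stack([models[k].score_samples(test) for k in models])  # (classes, frames)
labels = np.array(list(models))[scores.argmax(axis=0)]
print(labels[:20])
```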

Step 3. Sequence modeling is achieved using Hidden Markov Models. HMMs are deterministic finite state machines in which the sequence of speech states is decided. For the sentence “I have a cat”, the sequence order is:

  • the word ‘have’ follows ‘I’
  • the word ‘a’ follows ‘have’
  • the word ‘cat’ follows ‘a’, and it is in turn followed by a stop.

HMMs can be trained using

  1. Context-dependent models: usually built from the phonic sequences and grammatical formations of the language. For example, in English, most three-letter words consist of a ‘consonant-vowel-consonant’ sequence (bat) or a ‘vowel-consonant-vowel’ sequence (ate).
  2. Clustering algorithms based on a decision tree model: here the words or syllables are grouped into clusters based on their depth and intensity, and sentences or words are formed by selecting the words or letters associated with the previous one. The clustering decision is based on the proximity distance of two syllables or words, calculated from the cluster’s central mean. For example, within a cluster, the intensity of a silence is mapped further away from a starting syllable, since silence signifies a stop.

One disadvantage of HMMs is that they are slow, because their dynamic programming can consider up to 10,000 previous states, since every state depends on the feedback received from the previous states.
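To illustrate the dynamic programming over previous states mentioned above, here is a toy NumPy sketch of Viterbi decoding for a small HMM over the “I have a cat” example. The transition and emission probabilities are invented values, not the output of any trained model.

```python
# Toy Viterbi decoding for an HMM: the dynamic-programming recursion over
# previous states described above. Transition and emission tables are made up.
import numpy as np

states = ["I", "have", "a", "cat"]
start = np.log(np.array([0.7, 0.1, 0.1, 0.1]))
trans = np.log(np.array([            # P(next word | current word)
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.25, 0.25, 0.25, 0.25],
]))
emit = np.log(np.array([             # P(observation | word) for 4 acoustic frames
    [0.8, 0.1, 0.05, 0.05],
    [0.1, 0.8, 0.05, 0.05],
    [0.05, 0.1, 0.8, 0.05],
    [0.05, 0.05, 0.1, 0.8],
]))

T, N = emit.shape
dp = np.zeros((T, N))
back = np.zeros((T, N), dtype=int)
dp[0] = start + emit[0]
for t in range(1, T):
    cand = dp[t - 1][:, None] + trans          # score of every previous state
    back[t] = cand.argmax(axis=0)
    dp[t] = cand.max(axis=0) + emit[t]

path = [dp[-1].argmax()]                       # trace back the best path
for t in range(T - 1, 0, -1):
    path.append(back[t][path[-1]])
print([states[i] for i in reversed(path)])     # expected: I have a cat
```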

Sequence modeling can also be done using Connectionist Temporal Classification (CTC). This model is widely used to train on the unsegmented data sequences between words, mostly made up of silence or fill-in gibberish such as ‘hmm’ or breathing sounds, otherwise known as connecting waveforms. CTC uses a differentiable objective function and gradient-based training, and is widely employed in ‘DeepSpeech’ and ‘DeepVoice’. The gradient function denotes the intensity of these connecting sounds in comparison to the real words and hence allows the model to categorize the noise intensities.


The advantage of CTC is that it can be bidirectional. In other words, it can do both speech to text and text to speech, and its error rate is lower than that of its HMM counterparts in bidirectional modes.
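As a sketch of the differentiable objective and gradient-based training described above, the snippet below runs one CTC training step in PyTorch on random stand-in data. The network, dimensions, and label lengths are placeholder assumptions, not the DeepSpeech or DeepVoice architectures.

```python
# Minimal sketch of CTC training: a differentiable objective over unsegmented
# label sequences, optimized by gradient descent. All sizes are toy values.
import torch
import torch.nn as nn

T, N, C = 50, 4, 28             # time steps, batch size, characters + blank
model = nn.LSTM(input_size=13, hidden_size=64)
proj = nn.Linear(64, C)
ctc = nn.CTCLoss(blank=0)       # index 0 reserved for the CTC blank symbol
opt = torch.optim.Adam(list(model.parameters()) + list(proj.parameters()), lr=1e-3)

features = torch.randn(T, N, 13)                      # stand-in MFCC frames
targets = torch.randint(1, C, (N, 10))                # stand-in transcripts
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

hidden, _ = model(features)
log_probs = proj(hidden).log_softmax(dim=-1)          # (T, N, C) as CTCLoss expects
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                       # gradient-based training step
opt.step()
print(float(loss))
```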


Step 4. Lexicon model

Phoneme recognition is usually done in this penultimate step based on the amplitude of the sound. The main categories, as mentioned above, are stop sounds, fricative sounds, vowels, and consonants. Attention-based modeling is utilized for lexicon models. This type of modeling is heavily employed in end-to-end trainable speech recognition systems, through the training of recurrent networks (neural networks with feedback). Attention-based recurrent sequence generators are applied to create an output sequence of phonemes from a sequence of input vectors. The lexicon model is constructed using a three-part architecture of a neural encoder, an attention mechanism, and a neural decoder.


The neural encoder maps the variable-length input sequence X into an intermediate embedded sequence Ɛ. The attention mechanism provides the variable-length annotation vector h, computed from Ɛ, which is the hidden information used by the neural decoder to decipher the next sub-word. Combined with the posterior distribution of the neural decoder, it predicts the next sub-word of the output sequence Y until the end of the sequence is encountered.
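The snippet below is a compact sketch, under my own assumptions, of this three-part encoder / attention mechanism / decoder structure: the decoder attends over the encoder's embedded sequence to emit one sub-word per step. The layer types, sizes, and greedy decoding loop are illustrative choices only, not the exact architecture described here.

```python
# Sketch of an attention-based encoder/decoder: the decoder attends over the
# encoder's embedded sequence to predict one sub-word per step.
import torch
import torch.nn as nn

class AttentionASR(nn.Module):
    def __init__(self, n_features=13, hidden=64, vocab=100):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.GRUCell(hidden * 2, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, x, max_steps=20, sos=1, eos=2):
        enc, _ = self.encoder(x)                  # embedded sequence (B, T, H)
        h = enc.mean(dim=1)                       # initial decoder state
        token = torch.full((x.size(0),), sos, dtype=torch.long)
        outputs = []
        for _ in range(max_steps):
            # Dot-product attention: weight encoder frames by relevance to h.
            weights = torch.softmax((enc @ h.unsqueeze(-1)).squeeze(-1), dim=1)
            context = (weights.unsqueeze(-1) * enc).sum(dim=1)   # annotation vector
            h = self.decoder(torch.cat([self.embed(token), context], dim=-1), h)
            token = self.out(h).argmax(dim=-1)    # greedy next sub-word
            outputs.append(token)
            if (token == eos).all():              # stop at end-of-sequence
                break
        return torch.stack(outputs, dim=1)

print(AttentionASR()(torch.randn(2, 50, 13)).shape)   # (batch, decoded steps)
```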

Step 5. Language model

Language modeling is done using the same technique as lexicon modeling, except that instead of constructing a sub-word or syllable at each iteration, a word is added to the sentence.


Music Classifiers

Audio (music) files are stored in one of three formats:

  1. .mp3 (MPEG, Moving Picture Experts Group, audio format)
  2. .wma (Windows Media Audio)
  3. .wav (Waveform Audio File)

Once the analog audio is converted into a digital spectrogram, five features are derived from it for music genre classification:

  1. Zero crossing rate
  2. Spectral centroid
  3. Spectral roll-off
  4. Mel-frequency cepstral coefficients (MFCCs)
  5. Chroma frequencies

All of the above five features are used in music genre classification using machine learning. The top seven music genres in Western music are Pop, Rock, Jazz, Hip Hop & Rap, Classical, Country, and K-Pop. Music genre classification is usually done using ML/DL in five steps.
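A minimal Python sketch of extracting these five features with librosa is shown below. The input file name is a hypothetical placeholder, and averaging each feature over time is just one simple way to build a per-track vector for a genre classifier.

```python
# Sketch: computing the five genre-classification features listed above.
# "song.wav" is a hypothetical input file.
import numpy as np
import librosa

y, sr = librosa.load("song.wav", sr=22050)

features = {
    "zero_crossing_rate": librosa.feature.zero_crossing_rate(y),
    "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr),
    "spectral_rolloff": librosa.feature.spectral_rolloff(y=y, sr=sr),
    "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
    "chroma": librosa.feature.chroma_stft(y=y, sr=sr),
}

# Summarize each frame-level feature by its mean to get one vector per track,
# which can then be fed to a genre classifier.
vector = np.concatenate([np.mean(v, axis=1) for v in features.values()])
print(vector.shape)
```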

Copyrighted © Uma Chandrasekhar 2023

The ML algorithms analyze the data sets, extract features of the audio waveform, and use them to categorize or group the audio. Recent ML/DL work (Panagakis and Kotropoulos) considered a framework based on the properties of the human auditory perception system, i.e., 2D auditory temporal modulations with a sparse representation (pixel-wise residual energy-based classification).

To extract the beat-related characteristics, wavelet transforms are used to obtain 2D beat histograms. For the timbre characteristics, the relative amplitudes of the harmonics of all detected beats are considered and their histograms are computed. There are other good ML/DL algorithms that can also do the job, among them K-Nearest Neighbors, CNNs, and Naive Bayes. Most of them are supervised algorithms and are similar to image classification using spectrograms.

In a recent study (Antonio Jose et al.), Support Vector Machines were used for classification and achieved 80 to 90% accuracy using the entropy of the audio frames. The entropy of each audio frame is obtained from its spectrogram. Based on the entropy (total energy) values, five feature vectors (the average of the entropies, the standard deviation of the entropies, the maximum and minimum of all the entropies, and the maximum entropy difference between consecutive frames of the music signal) are calculated for each music signal and used in the SVM algorithm for classification.
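Below is a rough Python sketch, under my own assumptions, of this entropy-based approach: compute a spectral entropy per frame from the spectrogram, reduce the entropies to the five summary statistics, and train an SVM on them. The file names and labels are placeholders, and this is not the cited authors' actual pipeline.

```python
# Sketch of the entropy-feature SVM idea: per-frame spectral entropies ->
# five summary statistics -> SVM classifier. Files and labels are placeholders.
import numpy as np
import librosa
from sklearn.svm import SVC

def entropy_features(path):
    y, sr = librosa.load(path, sr=22050)
    S = np.abs(librosa.stft(y)) ** 2                    # power spectrogram
    p = S / (S.sum(axis=0, keepdims=True) + 1e-12)      # per-frame distribution
    ent = -(p * np.log2(p + 1e-12)).sum(axis=0)         # entropy of each frame
    return np.array([
        ent.mean(),                  # average of the entropies
        ent.std(),                   # standard deviation of the entropies
        ent.max(),                   # maximum entropy
        ent.min(),                   # minimum entropy
        np.abs(np.diff(ent)).max(),  # max entropy difference between consecutive frames
    ])

# Hypothetical labelled clips.
files = [("jazz1.wav", "jazz"), ("rock1.wav", "rock"), ("jazz2.wav", "jazz")]
X = np.stack([entropy_features(f) for f, _ in files])
labels = [g for _, g in files]

clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(entropy_features("unknown.wav").reshape(1, -1)))
```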

Applications of Audio Signal Classification (ASC)

1) In computing tools, ASC is used for NLP, emotion mining, chatbots, etc.

2) In consumer electronics, ASC is used in SoCs and embedded devices in telephones, televisions, IoT consumer devices, etc.

3) In automatic equalizers, as used in audio filters, the original signal is altered using either amplification or attenuation, and the resultant signal is of higher quality than the original.

4) In networking, ASC is used for automatic bandwidth allocation. Multiplexing technologies like TDMA (Time Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), or OFDMA (Orthogonal Frequency Division Multiple Access) can allocate more or less bandwidth depending on the type of audio signal transmitted or received. For example, more bandwidth is required to transmit music than speech, and no bandwidth is required if the signal is mere noise. This is also useful for audio data streaming on the Internet (VoIP or MoIP).

5) In music archiving, ASC is used for Audio Database Indexing (ADI). ADI is also used for background soundtrack archiving in movies and in the broadcasting facilities of other entertainment industries, as well as for music streaming on the internet.

I have tried my best to give a gist of my work in audio classification (both speech and music) so that my audience can benefit from it. Nonetheless, the topic is so vast and deep that I have barely scratched the surface with this three-part series; covering it properly would require a series of 20 to 25 articles, given the volume of information involved.


Uma Chandrasekhar

I live and work as an executive technical innovator in Silicon Valley, California. I love working on autonomous systems, including AVs.