Consultancy & Research

Consultancy

At Voice and Speech Systems (VSS) we also undertake consultancy for customer-specific needs.
Some examples of past consultancy projects: helium speech unscrambling, LP vocoder evaluation, acoustic environment simulation, surround sound simulation, concatenation-based text-to-speech synthesis and customized hardware development.

Research

A new technique for robust epoch extraction was recently published in an international journal (A. P. Prathosh, T. V. Ananthapadmanabha and A. G. Ramakrishnan, "Epoch extraction based on ILPR using the plosion index," IEEE Trans. Audio, Speech and Language Processing, 2013).

A new technique for the detection of stops and the voiced/unvoiced (V/UV) decision has been accepted by an international journal (T. V. Ananthapadmanabha, A. P. Prathosh and A. G. Ramakrishnan, "Detection of the closure-burst transitions of stops and affricates from continuous speech using the plosion index," accepted for publication in J. Acoust. Soc. Am.).
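Both papers are built around the plosion index, a measure of how sharply a sample's amplitude stands out against the recent past of the signal. A minimal sketch of the idea is given below; the window offsets m1 and m2 are illustrative values, not the published settings.

```python
# Rough sketch of the plosion index: the ratio of a sample's magnitude
# to the average magnitude over a preceding window. Peaks of this ratio
# mark abrupt events such as epochs (when applied to the integrated LP
# residual, ILPR) or closure-burst transitions of stops. The offsets
# m1 and m2 below are illustrative, not the values from the papers.
import numpy as np

def plosion_index(x: np.ndarray, n: int, m1: int = 80, m2: int = 480) -> float:
    """Plosion index of sample n against the window [n-m2, n-m1)."""
    past = np.abs(x[max(0, n - m2): max(0, n - m1)])
    denom = past.mean() if past.size else np.finfo(float).eps
    return abs(x[n]) / max(denom, np.finfo(float).eps)
```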

VSS has developed some original techniques and algorithms in the areas mentioned below. Some of these techniques have already been incorporated into the products Vagmi Therapy and Speech Science Lab.

Automatic Speech Recognition: The emphasis of research in this area is on an acoustic-phonetic, knowledge-based approach for the front end: finding robust, speaker-independent, context-independent acoustic correlates of various phonetic features. Recent collaboration with the EE Department, IISc, Bangalore on select problems in this area has resulted in publications of international standard.

Text-to-Speech Synthesis: Emphasis is on articulatory modeling and the development of a synthesis model.

Speech Coding: Use of a voice source model in an LP vocoder (Voice Source excited Linear Prediction, VSLP) along with vector quantization. Formant-based synthesis tools have also been developed.

A brief presentation of the major findings in the above research areas is given below:

1. Automatic Speech Recognition

A hierarchical classification is proposed. A part of the tree structure is shown below.

V/U/S/B classification and Segment Boundary Detection: Techniques have been developed for identifying the broad classes of sounds: voiced, unvoiced, silence and burst. In the first phase, segment boundaries are located using uniform frame-rate analysis; in the second phase, the boundaries are fine-tuned. In this approach, non-speech regions are identified based on certain spectral properties. Segment boundaries are detected using a new definition of the slope of intensity and the Euclidean distance between the MFCCs of successive frames, along with some rules.
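A minimal sketch of the second cue, the Euclidean distance between the MFCC vectors of successive frames, is given below. It assumes the librosa library for MFCC computation; the frame sizes and threshold are illustrative, not the values used in the product.

```python
# Segment-boundary candidates from the Euclidean distance between the
# MFCC vectors of successive frames (one of the cues described above).
# The frame/hop sizes and the threshold are illustrative assumptions.
import numpy as np
import librosa

def boundary_candidates(wav_path: str, thresh: float = 25.0) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(0.010 * sr)  # uniform frame-rate analysis: 10 ms hop
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=hop)
    # Spectral-change measure: distance between successive MFCC vectors.
    d = np.linalg.norm(np.diff(mfcc, axis=1), axis=0)
    # Frames where the change exceeds the threshold are proposed as
    # boundary candidates (to be fine-tuned in the second phase).
    return np.flatnonzero(d > thresh) * hop / sr  # times in seconds
```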

These techniques have been incorporated into the SSL software product of VSS. An example for the utterance 'speech communication' is shown below. For graphical illustration, unvoiced segments are shown as red bars, voiced as cyan bars, stops as yellow bars, and silence and non-speech segments as black bars. Vertical lines show the proposed segment boundaries. Given the input phone string for an utterance, forced alignment is done based on the V/UV/B/S detection and the segment boundaries. The assigned phone labels are also shown in the figure. The assigned segment boundaries may be edited using a user-friendly tool. Phone labels and default durations may be defined by the user.

Invariant acoustic cues for the classification of vowels: Recognition is based on robust 'acoustic properties' referred to as computational distinctive features. Vowel identification is based on the computed features high or low, front or back, and jaw open or jaw close. Each computed feature is compared against a threshold that is independent of speaker, gender and context; the approach is therefore speaker independent and language independent. Highly successful vowel identification has been achieved.
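The computational distinctive features themselves are not spelled out here, so the sketch below uses formant-based proxies that behave similarly (low F1 correlates with High vowels, high F2 with Front vowels); the thresholds are hypothetical round numbers, not VSS's speaker-independent values.

```python
# Illustrative stand-in for the thresholded feature test described
# above. Formants F1/F2 (estimated elsewhere) serve as rough proxies
# for the high-low and front-back features; the thresholds below are
# hypothetical.
HIGH_LOW_THRESH_HZ = 500.0     # F1 below this -> High, else Low
FRONT_BACK_THRESH_HZ = 1500.0  # F2 above this -> Front, else Back

def classify_vowel(f1_hz: float, f2_hz: float) -> str:
    height = "High" if f1_hz < HIGH_LOW_THRESH_HZ else "Low"
    backness = "Front" if f2_hz > FRONT_BACK_THRESH_HZ else "Back"
    return f"{height}-{backness}"

# 'ee' as in "speech" (F1 ~ 300 Hz, F2 ~ 2300 Hz) -> "High-Front"
print(classify_vowel(300.0, 2300.0))
# 'aa' (F1 ~ 750 Hz, F2 ~ 1100 Hz) -> "Low-Back"
print(classify_vowel(750.0, 1100.0))
```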

The figure below shows the computed high-low and front-back features for the vowels 'aa', 'ee', 'A', 'O' and 'oo'. The red line indicates the threshold. Note that the vowel 'aa' is correctly identified as Back-Low; 'ee' as High-Front; 'A' as diphthongised; 'O' as Mid-Back; and 'oo' as High-Back.

The figure below shows the computed high-low, front-back and jaw open-close features for the vowel 'ee' in the context of 'speech' spoken by four different speakers. The red line indicates the threshold. Note that the vowel 'ee' is correctly identified as High-Front for all speakers.

The figure below shows the vowel 'ee' in different contexts (the E-set: bee, cee, dee, etc.) spoken by the same speaker. Note that the vowel 'ee' is correctly identified as High-Front in all contexts.

The above technique has been used in Vagmi Therapy for teaching the proper pronunciation of vowels. As the client utters a vowel, the high/low and front/back features are computed almost in real time, and a small filled block moves within the vowel space, which is divided into the broad regions of the feature space, as shown below:

The high/low and front/back features are also mapped onto the tongue body profile, and the estimated tongue body shape is shown superposed on the model almost in real time, as shown below:
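A minimal sketch of the display mapping is given below: the two feature values are normalised to a point in a unit square whose quadrants are the broad vowel-space regions. The feature ranges and region names are illustrative assumptions.

```python
# Map the two computed features to a point in the 2-D vowel space for
# the moving-block display. Feature ranges and thresholds here are
# illustrative assumptions; the real inputs are the features above.
def vowel_space_position(high_low: float, front_back: float,
                         lo: float = -1.0, hi: float = 1.0):
    """Return (x, y) in the unit square plus the broad region label."""
    x = (front_back - lo) / (hi - lo)  # 0 = Back edge, 1 = Front edge
    y = (high_low - lo) / (hi - lo)    # 0 = Low edge, 1 = High edge
    region = ("High" if high_low > 0 else "Low") + "-" + \
             ("Front" if front_back > 0 else "Back")
    return x, y, region

print(vowel_space_position(0.8, 0.6))  # block lands in the High-Front region
```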

Place classification of fricatives:

Highly successful identification of 's' vs. 'sh' and 's' vs. 'z' has been achieved. This result is used in pronunciation therapy for fricatives: a picture of a snake, a sheep or a zebra is displayed when 's', 'sh' or 'z' is pronounced, respectively.
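The classifier itself is not published here; the sketch below uses two standard cues that separate these fricatives in practice: spectral centroid for 's' vs. 'sh' (the 's' noise sits higher in frequency) and a crude periodicity cue for voicing ('s' vs. 'z'). All thresholds are hypothetical.

```python
# Illustrative 's' / 'sh' / 'z' separation using two standard cues;
# VSS's actual features and thresholds are not reproduced here.
import numpy as np

def classify_fricative(frame: np.ndarray, sr: int) -> str:
    # Cue 1: spectral centroid ('s' noise sits higher than 'sh').
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    centroid = np.sum(freqs * spectrum) / np.sum(spectrum)
    # Cue 2: crude voicing cue via normalised autocorrelation, 80-400 Hz.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    voiced = ac[int(sr / 400): int(sr / 80)].max() / ac[0] > 0.3
    if voiced:
        return "z"                             # -> picture of a zebra
    return "s" if centroid > 5000.0 else "sh"  # -> snake or sheep
```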

Speaker adaptation: A technique has also been developed for the adaptive estimation and cancellation of the spectral tilt due to the influence of the speaker's voice.
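One way to realise such tilt cancellation is sketched below: the tilt is estimated per frame as a first-order all-pole fit (from the first normalised autocorrelation coefficient) and removed with the corresponding inverse filter, i.e. adaptive rather than fixed pre-emphasis. VSS's actual adaptive estimator may differ.

```python
# Adaptive spectral-tilt cancellation, sketched as first-order inverse
# filtering; the coefficient is re-estimated per frame instead of using
# a fixed pre-emphasis factor such as 0.97. This is an assumed
# realisation, not necessarily VSS's estimator.
import numpy as np

def cancel_spectral_tilt(frame: np.ndarray) -> np.ndarray:
    r0 = np.dot(frame, frame)
    r1 = np.dot(frame[:-1], frame[1:])
    a1 = r1 / r0 if r0 > 0 else 0.0   # one-pole fit to the spectrum
    out = np.copy(frame)
    out[1:] -= a1 * frame[:-1]        # inverse filter 1 - a1*z^-1
    return out
```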

2. Text-to-Speech Synthesis based on an Articulatory Model

This is an ongoing research project. At present, word- and sentence-level utterances may be synthesized using default variables. The model is also available as a development tool in SSL-Workbench for Articulatory Synthesis.

The main features of the model are:

  • Default articulatory positions and their dynamics are saved in a database.
  • For a given phone input, the articulatory parameters are accessed from the database.
  • Articulatory parameters are interpolated (see the sketch after this list).
  • Formant data are computed.
  • A default intonation pattern is used.
  • Anantha's voice source model is used.
  • Default segment durations are used.
  • The source parameters, default articulatory positions, rates of transition, segment durations, etc. may be edited by the user.
  • The rules for generating the articulatory dynamics and source dynamics are in a script notation in a text file, which may be edited by the user.
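A minimal sketch of the database lookup and interpolation steps is given below. The parameter names, target values and linear interpolation are illustrative assumptions; the actual rules live in the editable script file mentioned above.

```python
# Sketch of the lookup-and-interpolate step: per-phone articulatory
# targets from a (hypothetical) database are linearly interpolated into
# smooth parameter tracks. Names, values and timing are illustrative.
import numpy as np

TARGETS = {  # hypothetical per-phone articulatory targets
    "aa": {"tongue_height": 0.2, "tongue_front": 0.3, "jaw": 0.9},
    "ee": {"tongue_height": 0.9, "tongue_front": 0.9, "jaw": 0.2},
}

def articulatory_track(phones, trans_ms=100, step_ms=5):
    """Linearly interpolated parameter frames for a phone sequence."""
    track = []
    for p0, p1 in zip(phones[:-1], phones[1:]):
        t0, t1 = TARGETS[p0], TARGETS[p1]
        for a in np.arange(0.0, 1.0, step_ms / trans_ms):
            track.append({k: (1 - a) * t0[k] + a * t1[k] for k in t0})
    track.append(dict(TARGETS[phones[-1]]))
    return track

print(len(articulatory_track(["aa", "ee"])))  # 21 frames of 5 ms
```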

Sample outputs (wave files) from the articulatory synthesizer:


Ex. 1: Some selected English words (Alcohol, Bucket, Capacity, Elastic, Post-office, Typical, Traffic Police)

Text-to-speech synthesis using an articulatory model – Audio Demo – English words

Ex. 2: An English sentence (Calcutta is a big city.)

Text-to-speech synthesis using an articulatory model – Audio Demo – An English sentence

Ex. 3: A Hindi sentence in the context of an announcement on a railway platform (Bhoopaal Calcutta express chaar per aa rahee hain.)

Text-to-speech synthesis using an articulatory model – Audio Demo – A Hindi sentence

3. Speech Coding based on the VSLP Model

This uses the voice-source-excited LP model shown in the figure below. Compressed-format storage of the parameters has also been built into the program. A range of bit rates is possible using scalar quantization, vector quantization and segment coding. This is available as the product SSL-Workbench for Speech Coding.
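To show where the range of bit rates comes from, here is a minimal sketch of the analysis side of an LP coder with scalar quantization of the predictor coefficients; the VSLP voice-source model for the excitation is not reproduced, and the order, frame size and bit allocation are illustrative.

```python
# Analysis side of an LP coder with uniform scalar quantization of the
# predictor coefficients; all sizes and allocations are illustrative.
import numpy as np

def lpc(frame: np.ndarray, order: int = 10):
    """LP coefficients via the Levinson-Durbin recursion."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / err
        a[:i + 1] += k * a[:i + 1][::-1]  # RHS is a copy, so no aliasing
        err *= 1.0 - k * k
    return a, err

def quantize(x: np.ndarray, bits: int, lo: float = -2.0, hi: float = 2.0):
    """Uniform scalar quantizer mapping each value to a `bits`-bit code."""
    levels = (1 << bits) - 1
    return np.clip(np.round((x - lo) / (hi - lo) * levels), 0, levels).astype(int)

# One 20 ms frame at 8 kHz: 10 coefficients at 4 bits each plus, say,
# 7 bits for gain and pitch gives 47 bits/frame, about 2350 bits/s;
# coarser or vector quantization lowers the rate, finer raises it.
frame = np.random.randn(160)
a, gain = lpc(frame)
codes = quantize(a[1:], bits=4)
```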

The effectiveness of the VSLP model is demonstrated by means of an example.

Original

Synthesized: 4700 bits per second, scalar quantization: Audio

Some examples of coded speech:

4700 bits per second, scalar quantization: Audio

2400 bits per second, vector quantization: Audio

474 bits per second, segment coding: Audio

An example of speaker transformation using formant coding:

By scaling the formants, it is possible to experiment with speaker transformation.
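A minimal sketch of one such transformation is given below: the complex pole angles of the LP polynomial, which set the formant frequencies, are multiplied by a warp factor and the polynomial is rebuilt. The warp factor 1.15 is an illustrative choice, not a value from the demo.

```python
# Speaker transformation by formant scaling: warp the pole angles of
# the LP polynomial while keeping the pole radii (bandwidths).
import numpy as np

def scale_formants(a: np.ndarray, warp: float = 1.15) -> np.ndarray:
    """Warp the pole angles of an LP polynomial a (with a[0] == 1)."""
    warped = []
    for z in np.roots(a):
        if abs(z.imag) > 1e-9:  # complex pole: a formant resonance
            # Keep the radius (bandwidth), scale the angle (frequency).
            ang = np.clip(np.angle(z) * warp, -np.pi, np.pi)
            warped.append(abs(z) * np.exp(1j * ang))
        else:
            warped.append(z)    # real poles carry the spectral tilt
    return np.real(np.poly(warped))

# Resynthesis: filter the original LP residual through 1/A_warped(z)
# to obtain the transformed speech.
```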

Original

Transformed

In fact, historically, the very first order for VSS was a consultancy contract from Ericsson, Sweden for improving the quality of an LPC vocoder.
The very first product developed at VSS was a token announcement system with two voices or two languages, widely used in banks and clinics; the know-how for this VSS invention was transferred to a company for manufacturing and marketing.