Voice recognition in business and industry - TOP Tech Talk

September 2, 2024

Although analysts debate the exact figures for the growth of voice recognition devices, the global market is set to grow from around $10 billion last year to just under $50 billion by 2029, a growth rate of approximately 24% per year (1).

In the last few years, voice recognition technologies have reached a level of maturity, primarily because of extensive investment over the last ten to fifteen years. COVID-19 reinforced this trend by accelerating the development of many technologies that permit social distancing.

In addition, voice recognition has become a dominant technology in the consumer electronics segment, which has augmented the development, investment, consumer interest and uptake.

Voice recognition enables device control without the requirement to touch surfaces, and hence has been an area of intensified development and industry focus.

In the home and even in working environments, there has been growing acceptance of the value of voice control technologies, which have evolved to become more sophisticated and useful. This is because the hardware, firmware, back-end cloud and edge processing that power them have become significantly more powerful over the last few years.

Figure 1 - Voice Recognition - Audio Control - TOP Tech Talk

In the realm of artificial intelligence and signal processing, voice recognition technologies have evolved into sophisticated systems that exhibit remarkable precision and versatility. At the core of these advancements lies the intersection of acoustic modeling, language processing, and machine learning algorithms.

Acoustic modeling forms the foundational layer, delving into the intricacies of speech signal processing. Through the deployment of techniques such as Hidden Markov Models (HMMs) and deep neural networks (DNNs), modern voice recognition systems excel in capturing and interpreting the subtle nuances of spoken language. The ability to discern phonetic variations and acoustic cues enables these systems to achieve unprecedented accuracy in speech-to-text conversion.
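To make the HMM idea above more concrete, here is a minimal sketch of Viterbi decoding, the classic way to find the most likely sequence of hidden states behind a series of acoustic observations. The two-state model ("silence" and "speech"), the "low"/"high" energy observations and every probability are illustrative assumptions, not values from any real recognizer. Production systems work with far larger state spaces and learned parameters.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for a list of observations."""
    # Each column maps a state to (log-probability of the best path ending
    # in that state, predecessor state on that path).
    columns = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, columns[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        columns.append(column)
    # Backtrack from the best final state to recover the full path.
    state = max(states, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Toy model: all names and numbers below are hypothetical.
STATES = ["silence", "speech"]
START_P = {"silence": 0.8, "speech": 0.2}
TRANS_P = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
}
EMIT_P = {
    "silence": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
}

frames = ["low", "high", "high", "low"]  # per-frame energy level
print(viterbi(frames, STATES, START_P, TRANS_P, EMIT_P))
# → ['silence', 'speech', 'speech', 'silence']
```

Working in log-probabilities avoids numerical underflow on long utterances, which is why the sketch adds logarithms instead of multiplying probabilities.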

Language processing, another critical facet, involves the integration of natural language understanding (NLU) and natural language processing (NLP) techniques. This layer empowers voice recognition systems to comprehend context, syntax, and semantics, thereby enhancing their capability to interpret user commands, queries, and conversational nuances accurately. The utilization of recurrent neural networks (RNNs) and transformer models has notably propelled the contextual understanding capabilities of these systems.

Machine learning algorithms, particularly deep learning architectures such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs), play a pivotal role in refining the performance of voice recognition technologies. Trained on extensive datasets, these models can adapt to diverse accents, languages, and speaking styles. The evolution from traditional Gaussian Mixture Models (GMMs) to advanced deep learning architectures has significantly improved the efficiency and robustness of voice recognition systems.

In addition to these core components, the incorporation of speaker diarization, where the system distinguishes between multiple speakers in a conversation, and voice biometrics for user authentication further accentuate the multifaceted capabilities of contemporary voice recognition technologies.

As we navigate the intricacies of this technical landscape, it's evident that voice recognition technologies are not just confined to the transcription of spoken words. They represent a dynamic fusion of disciplines, continually pushing the boundaries of what is achievable in human-computer interaction. The ongoing research and development in this domain promise a future where voice interfaces seamlessly integrate into diverse applications, ranging from smart homes to complex enterprise solutions.

Product developments in voice capture

Tempo Semiconductor Incorporated has entered several partnerships to augment its digital mic array solutions. In collaboration with one of these partners, Tempo’s TSDP1808xx is one of the highest-performance digital multi-mic solutions in the industry.

With Tempo’s TSDP1808xx hardware decimator (>142 dB), it is possible to support true 24-bit audio for smart speakers and video conferencing solutions.

The TSDP1808xx hardware decimator includes a digital mic 8:1 aggregator that enables lower cost and lower power consumption than any of the alternative design approaches.

With a partner, Tempo’s TSDP1808xx provides a complete voice support solution, for example for Google Voice Assistance and Amazon Voice Services.

The TSDP1808xx is a scalable solution for use with NXP’s i.MX 8M Family and provides high dynamic range to enable >16-bit audio support. For more information on these products, please reach out to us.
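The link between dynamic range and bit depth mentioned above can be illustrated with the usual rule of thumb of roughly 6 dB per bit. The sketch below computes the theoretical dynamic range of an ideal linear PCM word. The function name is our own, and the formula describes an ideal quantizer only. Real devices are specified by measured figures, which this calculation does not capture.

```python
import math

def dynamic_range_db(bits):
    """Theoretical dynamic range (in dB) of an ideal linear PCM word of `bits` bits."""
    # Ratio between the largest and smallest representable amplitude steps.
    return 20 * math.log10(2 ** bits)

for bits in (16, 24):
    print(bits, "bits:", round(dynamic_range_db(bits), 1), "dB")
# → 16 bits: 96.3 dB
# → 24 bits: 144.5 dB
```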

The challenges of far-field voice capture

One of the big technical challenges associated with voice recognition is to capture and recognize voice in difficult acoustic conditions such as high ambient noise, reverberation, around corners or simply when the person is a long distance away (> 3m) from the device.

Figure 2 - the challenge of voice recognition in difficult acoustic conditions - Audio Control - TOP Tech Talk

This is a challenge because it requires specialised hardware, firmware and knowledge of microphone-array echo cancellation and signal processing. One company that has specialised in this area is ArkX Laboratories, which makes a range of Audio-Front-End modules that provide reliable voice recognition at distances greater than 9m with just 3 or 4 microphones.
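One of the basic microphone-array techniques behind far-field capture is delay-and-sum beamforming: each channel is shifted so that sound from the target direction lines up, and the channels are then averaged. The sketch below is a simplified illustration with whole-sample delays and hypothetical signals. It is not a description of how any particular product, including those mentioned here, implements its front end.

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel by its steering delay (in samples) and average."""
    # Only keep the samples for which every shifted channel still has data.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]

# Hypothetical example: the same pulse reaches three microphones
# 0, 1 and 2 samples late, depending on their distance from the talker.
source = [0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]
delays = [0, 1, 2]
mics = [[0.0] * d + source for d in delays]

aligned = delay_and_sum(mics, delays)
print(aligned == source)
# → True
```

Because the wanted signal adds up coherently after alignment while uncorrelated noise on each microphone tends to average out, adding microphones improves the signal-to-noise ratio for the steered direction.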

As well as software development, there has been tremendous interest in creating AI-optimised hardware that is increasingly capable of the local processing needed for seamless interaction with web and edge-based back-end services for speech recognition.

Ambiq is one of the companies investing heavily in this area. Ambiq makes smart AI-enabled endpoints with its Subthreshold Power Optimized Technology (SPOT), enabling the lowest power compute in the industry. Interestingly, some consumer brands have integrated AI cores within their processors for the same reason.

It is also worth noting that TOP represents Goermicro, which offers a broad range of digital and analog MEMS microphones that can support consumer, medical, industrial and automotive applications. TOP is also the distributor for Summit Electronics and CUI Devices, which offer loudspeakers, piezoelectric transducers and buzzers.

Evolution of Voice Recognition Use Cases

In recent years, smart loudspeakers have gone beyond basic automation tasks to respond more like digital assistants. In parallel, many other new use cases have been emerging in business and industry that further expand what we consider speech recognition devices can do. Let’s consider some of these for a moment.

Smartphones, TVs, Set-top boxes, PCs and Smart Speakers

Voice recognition technology is being integrated into a huge number of consumer electronics platforms as a way to simplify their user interfaces. Many of these devices do not have easy access to keyboards, so this technology provides an easier control mechanism to expand their functionality. The huge production volumes involved are driving forward the whole voice control industry and accelerating developments in all parts of the signal and processing chains.

Smart Speakers in Automotive

Figure 3 - Automotive use cases - Audio Control - TOP Tech Talk


A key growth area for voice recognition is in automotive use cases. An increasing number of new vehicles have voice recognition built-in, and people are much more willing to use these features in a vehicular environment because of the clear safety advantages.

They permit the driver to stay focused on the road while performing other tasks such as:

  • selecting music,
  • initiating or responding to phone calls,
  • managing other tasks like sending and receiving texts.

People see the immediate value in being able to perform these tasks safely and so they are willing to learn how to use these features effectively.

Finance and Banking

One interesting use of the technology has been in voice biometrics for the purposes of authentication. In a similar way to biometric authentication via fingerprint, facial or retina recognition, voice biometrics uses the human voice to validate identity. This is being used in the finance and banking sectors in conjunction with traditional authentication techniques as an additional way to authorise transactions.

This technology can also be used to unlock smart devices as a more convenient alternative to traditional passwords.
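A common way to frame voice biometric verification is to turn each utterance into a fixed-length "voiceprint" vector and compare a new attempt against the enrolled one. The sketch below assumes such vectors already exist. The three-dimensional embeddings, the function names and the 0.8 threshold are all hypothetical, and real systems use much longer vectors produced by a trained model, with thresholds tuned on real data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_same_speaker(enrolled, attempt, threshold=0.8):
    """Accept the attempt if its voiceprint is close enough to the enrolled one."""
    return cosine_similarity(enrolled, attempt) >= threshold

# Hypothetical voiceprints (illustrative values only).
enrolled = [0.90, 0.10, 0.40]
genuine = [0.85, 0.15, 0.45]
impostor = [0.10, 0.90, 0.20]

print(is_same_speaker(enrolled, genuine))   # → True
print(is_same_speaker(enrolled, impostor))  # → False
```

The threshold trades off false accepts against false rejects, which is one reason voice biometrics is typically combined with other authentication factors, as described above.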

Speech to text

The massive investments and growth in the speech recognition industry have created a genre of products which are genuinely “good enough to use” for many speech-to-text applications. At the same time there are certain showcase use-cases that have sparked massive interest and attention within the ‘big tech’ companies.

Doctor Note Taking

A study in 2016 (2) revealed that doctors on average spend 27% of their time with patients, and nearly half their time is spent doing desk work.

That’s one of the reasons why Amazon, Google and Microsoft have all created voice recognition products for doctors, to reduce the amount of time they are obliged to spend typing. Technology companies have tremendous development efforts underway as they race to fully enable this popular milestone use-case.

Figure 4 - A doctor taking a note - Audio Control - TOP Tech Talk


Industrial Applications

Voice control technology is becoming increasingly valuable in industrial environments, providing a hands-free solution for managing machinery and processes. In settings such as manufacturing plants, warehouses, and assembly lines, voice commands enable operators to control equipment, monitor system status, and execute tasks without needing to disengage from their workstations.

This leads to enhanced productivity, reduced operational errors, and improved safety, particularly in hazardous environments where manual control could pose risks. As industries seek to optimize workflows and embrace smart manufacturing, voice control is emerging as a crucial tool for enhancing efficiency and safety in industrial applications.

Legal Speech recognition and AI

There are multiple use-cases in the legal world where speech recognition is gaining momentum, for example in court reporting, depositions or interrogations, where it is used to transcribe what is said.

Another interesting use case is Natural Language Processing and AI being used to review legal documents to see whether they meet regulatory criteria.

Public Areas

Post COVID-19, voice recognition use cases in public areas have become more important due to the desire to reduce contact with surfaces. Examples of use cases being developed include voice recognition for elevator control.

Figure 5 - Public areas are a key concern post COVID-19 - Audio Control - TOP Tech Talk


Voice control combined with biometric detection is also being used for access control in hospitals and senior care facilities as an alternative to RFID because it permits contactless access. Interactive vending machines are another emerging use-case.

Nearly all the use-cases mentioned are multifunction: they offer traditional controls, with voice recognition available for people to try out.
 

Conclusions

Speech recognition systems have reached a level of maturity in the last few years due to a long development cycle that was accelerated by the combined effect of COVID-19 and consumer-level production volumes.

A big emphasis has been to provide continuously upgraded, smarter edge and cloud services which provide the processing backbone for consumer smart loudspeakers and voice recognition. However, all aspects of the signal chain have been subject to accelerated developments due to the production volumes and the integration of these services into many TV sets, set top boxes etc.

The practical upshot of these combined developments is that smart loudspeakers, set top boxes and TVs have become much smarter over the past few years, as all aspects of the signal chain have been improved and the processing backbones have been upgraded.

This in turn spurred interest in a few corner cases that require larger, more specialised vocabularies, for example doctors’ note-taking. These use-cases are important because they push the limits of what the technology can do and help close the gap between 90% and 99% correct recognition. This makes all the difference from a usability point of view, both in these corner cases and in nearly all target markets for voice recognition.
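Recognition accuracy of this kind is commonly reported as word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the recognized text into the reference, divided by the number of reference words. The sketch below assumes the percentages above are read as word-level accuracy, which is our interpretation, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two texts, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# Hypothetical dictation example: one wrong word out of ten.
reference = "the patient reports mild chest pain after exercise since monday"
recognized = "the patient reports mild chess pain after exercise since monday"
print(word_error_rate(reference, recognized))
# → 0.1
```

At a 10% error rate, a clinician would have to correct about one word in every ten, whereas at 1% the corrections become occasional, which is why the last few percentage points matter so much for usability.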

For comfort and usability in all acoustic environments, the industry will also see continued growth in ever farther-field voice capture, with greater bit depths being used, combined with highly specialized processing. We look forward to following the rapid evolution of these interesting technologies as voice recognition becomes more capable, accepted and integrated into our daily lives.

We love to connect… engineers and technology

For more information on voice control or the Production-Ready Voice Processing Modules, reach out to TOP-electronics through [email protected] or by phone: +31 (0)180 - 580 492.

 

(1) Source Fortune Business Insights, speech and voice recognition market

(2) Source ACP Journals, Allocation of Physician Time in Ambulatory Practice
