September 12, 2024
Despite the meteoric rise in omnichannel outreach, voice interactions and telecalling remain the primary choice for customer engagement.
Microsoft’s ‘State of Global Customer Service’ survey reported that nearly 40% of global consumers preferred phone or voice channels for customer support. On average, a customer spends six minutes with a telesales agent.
At SquadStack, we make more than one lakh telesales calls in a day and have 50k+ hours of conversation. We handle over 40 million calls/year across different industries in India’s massive consumer market.
This means the telecalling industry has exabytes of consumer data, much of which remains unanalyzed and underutilized.
While text-based interactions are often used to train analytical engines, voice data is left untouched. As per research by Call Center Helper, less than 3% of telecalling-based customer interactions are analyzed.
The telesales industry needs to utilize this vast amount of data to boost customer experience, improve agent performance, and provide top-notch services to its clients. This opens a wave of possibilities for engineering and AI innovations involving linguistics, NLP, and audio processing. Automatic speech recognition (ASR) is a prerequisite for most applications.
However, the quality of existing ASR technologies for Indic languages, further complicated by code-mixed speech, is a barrier to developing these applications.
This is where SquadStack’s Speech Recognition technology comes in. Our goal is to create a dynamic solution tailored to the nuances of the country's diverse languages, providing unparalleled accuracy in transcribing these mixed-language calls.
Let’s dive deeper into how our in-house ASR system is revolutionizing India’s telecalling landscape and discover how it can transform your business.
The Speech Recognition system at SquadStack supports a broad array of downstream applications. Our speech recognition models convert the call recordings stored in an AWS S3 bucket into transcripts. These transcripts, together with the original audio recordings, then serve as input to the downstream applications.
Creating an ASR model for Indic languages has multiple challenges.
India is a diverse, multilingual country: 30 languages each have close to a million native speakers, and more than 19,500 dialects are spoken across its population. The problem is aggravated by the fact that most of these languages share sounds and vocabulary because of the geographical and historical proximity of their native speakers.
Additionally, many Indians are multilingual, making it challenging to train language-processing models.
Apart from linguistic challenges, we faced three primary concerns in getting the model to baseline performance: data, infrastructure, and algorithm selection.
The first challenge in this journey was understanding the data requirements.
While data is crucial for any machine learning problem, it holds even greater significance for speech recognition models due to their high annotation complexity, resulting in significantly elevated annotation costs.
Annotating a call recording (as shown in figure 2) involves listening to the call multiple times to gather context. Additionally, due to the code-switched nature of the calls, annotators must continuously switch keyboard languages, increasing the time required for annotation and escalating the associated costs.
Moreover, phone calls often contain noise and overlapping speech, further complicating the annotation process. Although we can mitigate noise to some extent using techniques like noise gating, equalization, and frequency cutoffs, these methods come with their own set of challenges and expenses.
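To make the idea of noise gating concrete, here is a minimal sketch in plain Python. It is an illustration only, not SquadStack's preprocessing pipeline: the function name, the threshold, and the 160-sample frame (20 ms at an assumed 8 kHz telephony sample rate) are all assumptions chosen for the example.

```python
import math

def noise_gate(samples, threshold=0.02, frame_size=160):
    """Silence every frame whose RMS energy is below `threshold`.

    `samples` is a list of floats in [-1.0, 1.0]. The threshold and
    frame size are illustrative assumptions, not tuned values.
    """
    gated = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        # Root-mean-square energy of the current frame.
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            gated.extend(frame)                # loud enough: keep as-is
        else:
            gated.extend([0.0] * len(frame))   # too quiet: mute the frame
    return gated

# One frame of low-level hiss followed by one frame of louder speech.
hiss = [0.001] * 160
speech = [0.5] * 160
cleaned = noise_gate(hiss + speech)
print(cleaned[0], cleaned[160])  # the hiss frame is muted, the speech frame is kept
```

A hard gate like this also shows why such methods bring their own challenges: a threshold set too high will mute soft-spoken words along with the noise.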
Even a few hundred hours of annotated data demand an investment of tens of thousands of dollars, so it's essential to carefully analyze your data needs.
Choosing the right infrastructure for training and inference is equally essential.
The Graphics Processing Unit (GPU) forms an integral part of the training infrastructure. GPUs are specialized hardware designed for fast and efficient parallel processing, making them highly useful for speech recognition systems by handling large volumes of data-intensive tasks simultaneously.
Several questions become very important here: which GPU is required, what factors determine the GPU memory requirements, whether a single GPU is enough or multiple GPUs are needed, and what the estimated inference cost is. The answers directly impact experimentation and inference costs.
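As a rough illustration of how GPU memory requirements can be estimated, the sketch below applies a common rule of thumb: training with an Adam-style optimizer keeps the weights, the gradients, and two optimizer states per parameter, while inference only needs the weights. The function names and the 100-million-parameter model are hypothetical, and the estimate deliberately ignores activations, which grow with batch size and audio length.

```python
def training_memory_gib(num_params, bytes_per_param=4, optimizer_states=2):
    """Lower-bound memory for training, excluding activations.

    Counts one copy each for weights and gradients, plus
    `optimizer_states` extra copies (Adam keeps two per parameter).
    """
    copies = 1 + 1 + optimizer_states
    return num_params * bytes_per_param * copies / 2**30

def inference_memory_gib(num_params, bytes_per_param=2):
    """Memory for the weights alone, assuming 16-bit weights by default."""
    return num_params * bytes_per_param / 2**30

# Hypothetical 100M-parameter model, chosen only for illustration.
params = 100_000_000
print(f"{training_memory_gib(params):.2f} GiB")   # 1.49 GiB
print(f"{inference_memory_gib(params):.2f} GiB")  # 0.19 GiB
```

The gap between the two numbers is one reason the training and inference GPUs can be sized separately.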
To find the best option, we compared cloud GPU providers like AWS, Google Cloud, and E2E Networks, focusing on availability, security, and cost. For example, AWS had good security but limited GPU availability in India at our scale, and higher-end GPUs were very expensive.
Although we initially operated on AWS due to existing systems, we were aware that scaling our experiments to thousands of hours of recordings or running inference across all calls daily might pose significant challenges. Setting up infrastructure in-house would have required substantial maintenance effort, which ruled it out.
In our exploration of speech recognition algorithms, we considered Convolutional Recurrent Neural Network (Conv-RNN), Transformer, and Conformer-based architectures.
While Conv-RNNs combine pattern recognition with sequence understanding, Transformers excel in efficiently processing and deciphering speech data. Conformers, the latest advancement, offer superior performance in capturing speech nuances.
The optimal choice among these algorithms depends on the specific characteristics of the speech data. Given the significant training time involved, careful consideration of algorithm selection is vital to optimize resources and time.
Through meticulous review and analysis, we concluded that a Transformer-based algorithm is best suited for our model.
Transformers are built on the encoder-decoder mechanism. The encoder maps the input signal into a problem-specific feature representation. The decoder acts as an auto-regressive model, generating the output one step at a time while leveraging the context of the steps it has already produced.
However, what sets Transformers apart from Recurrent Neural Networks (RNNs) is the self-attention mechanism.
Think of it as querying a database: each position in the sequence issues a query that is matched against every other position to decide which information is relevant to it. Because every position is compared with all others directly, this self-attention mechanism enables Transformers to capture long-range dependencies at a lower runtime cost, enhancing our model's efficiency and performance.
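The self-attention idea can be sketched in a few lines of plain Python. This is a simplified illustration of scaled dot-product attention, not our production model: it uses the input vectors directly as queries, keys, and values, whereas a real Transformer first passes them through learned projection matrices. The function names and the toy vectors are assumptions made for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Returns the attended outputs and the attention weights per position.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Compare this position's query against every position's key.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is the weighted average of all value vectors.
        output = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy frames: the first and third are identical, the second differs.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
_, weights = self_attention(frames)
print(weights[0])  # position 0 attends equally to 0 and 2, less to 1
```

Note that position 0 attends to position 2 just as strongly as to itself, regardless of how far apart they are, which is what lets the mechanism capture long-range dependencies.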
We meticulously evaluated various multilingual speech recognition models from leading providers such as Google, OpenAI (Whisper), and Amazon, using SquadStack's proprietary Indian telesales data.
Unfortunately, there’s a significant gap in the available general-purpose services: none were specifically designed for Indic languages or code-switched Hinglish. Most existing solutions were tailored solely for the English language.
This resulted in a high word error rate (WER), with even the best-performing general model (Amazon Transcribe at 33.17%) falling short. Our telesales-specific model surpassed these general-purpose services, offering a substantial performance boost and cost reduction.
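For readers unfamiliar with the metric, word error rate is the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below computes it with a standard edit-distance table. The function name and the Hinglish sentence are illustrative assumptions, not samples from our evaluation set.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must not be empty")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1       # reference word missing in hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)

# Hypothetical code-switched example: the model drops one word out of five.
reference = "mujhe loan ki details chahiye"
hypothesis = "mujhe loan details chahiye"
print(word_error_rate(reference, hypothesis))  # 0.2
```

A WER of 33.17% therefore means roughly one word in three had to be corrected, which is too noisy for most downstream analysis.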
To improve customer interactions, telesales teams require data-driven decision-making.
Our calling operations are conducted through a crowdsourced telecalling workforce of 10K+ callers across India, carefully selected and trained by SquadStack. Our speech recognition model is key to unlocking new values and insights from these customer interactions.
Transcripts are easy to analyze, and the analysis can be scaled to cover 100% of our calls. From this analysis, we can effectively outline trends and patterns that can boost customer satisfaction. This data, in turn, will be used to improve our agent training, monitoring, and feedback systems.
Here’s how SquadStack’s speech recognition model can enhance your sales processes.
One of the primary benefits of ASR technology is automated transcriptions. Every word spoken during a call will be transcribed accurately, so telecalling agents do not need to rely on hurriedly scribbled notes. Additionally, it helps build a trusted and credible archive of all the conversations for analytical and compliance purposes.
According to Zendesk’s Trend Reports 2024, 81% of customers say that quick resolution and support by the sales team heavily influence their purchase decision. A speech recognition model can help agents offer faster support and service to your customers.
For instance, speech recognition can be leveraged to build chatbots that offer round-the-clock service to customers. Additionally, automatic transcriptions can make it easier for agents to summarize sales calls so they can help the next customer faster.
SquadStack’s ASR provides a detailed analysis of transcribed call recordings, giving our clients better insights into customer behavior, agent performance, and product-related feedback.
Speech recognition can analyze customers’ and agents’ voices and pitch. It can also pick up the context of conversations and prompt agents on how to resolve client issues.
ASR allows quick quality audits by identifying trigger words and phrases in customer interactions. This enables supervisors to review and assess calls for compliance and quality assurance promptly.
Moreover, the transcripts of such calls can be used to train telecalling agents to respond to customers with more empathy and understanding. Transcripts will also serve as objective accounts of the customer’s complaints and the steps taken to resolve the issues.
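Identifying trigger words and phrases in a transcript can be as simple as a case-insensitive phrase search. The sketch below is a minimal illustration under that assumption: the function name, the phrase list, and the sample transcript are hypothetical, and a production audit would likely need fuzzier matching to cope with transcription errors and code-switched spellings.

```python
import re

def find_trigger_phrases(transcript, phrases):
    """Return (phrase, character offset) for every whole-word match."""
    lowered = transcript.lower()
    hits = []
    for phrase in phrases:
        # \b anchors keep "call" from matching inside "recall".
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        for match in re.finditer(pattern, lowered):
            hits.append((phrase, match.start()))
    # Order hits by where they occur in the call.
    return sorted(hits, key=lambda hit: hit[1])

# Hypothetical compliance phrases and transcript, for illustration only.
phrases = ["guaranteed return", "do not call"]
transcript = "Sir, this is a Guaranteed Return plan, shall I recall tomorrow?"
for phrase, offset in find_trigger_phrases(transcript, phrases):
    print(f"{phrase!r} at character {offset}")
```

Supervisors could then jump straight to the flagged offsets instead of listening to the whole recording.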
Conducting quality audits can be challenging, especially with the high volume of calls in the telecalling industry. Our ASR technology facilitates quick and accurate quality monitoring by providing detailed transcripts of all calls. These transcripts enable quality assurance teams to conduct thorough audits more efficiently, ensuring compliance and performance standards are met.
Additionally, these transcripts serve as resources for effectively training telecalling agents.
By analyzing the transcripts of successful calls, agents can learn best practices, understand customer pain points, and improve their communication skills. With ASR-generated transcripts, the training process becomes more targeted and efficient, significantly reducing the time required for agent training.
Analyzing call recordings gives our clients valuable insights into their potential customers. These insights can be used to create sales strategies that appeal to customer needs.
Using our ASR technologies, clients can:
Speech recognition technology is disrupting India’s telecalling landscape, and SquadStack is right in the thick of it. From enhancing telesales services to boosting customer satisfaction, the journey of creating ASR tech for diverse Indic languages has been revolutionary.
Our primary focus is on building downstream applications for agent coaching, automated quality, and lead routing, using conversation insights gathered from the calls and enriched by transcripts from speech recognition. We also aspire to continuously improve our speech recognition engine: we aim to expand into more vernacular languages and to further reduce the word error rate.
Ready to explore the future of telecalling with SquadStack? Schedule a demo now to learn more.