Speech-to-Text AI: speech recognition and transcription
Turn speech into text using Google AI
Convert audio into text transcriptions and integrate speech recognition into applications with easy-to-use APIs.
New customers also get up to $300 in free credits to try Speech-to-Text and other Google Cloud products.
Features
Advanced speech AI
Speech-to-Text can use Chirp 3, Google Cloud’s foundation model for speech, trained on millions of hours of audio data and billions of text sentences. This contrasts with traditional speech recognition techniques, which rely on large amounts of language-specific supervised data. The foundation-model approach gives users improved recognition and transcription for more spoken languages and accents.
Support for 85+ languages and variants
Build for a global user base with extensive language support. Transcribe short, long, and even streaming audio data. With Chirp 3, the next generation of universal speech models, Speech-to-Text also offers more accurate transcription across global deployments.
Chirp 3: Transcription was built using self-supervised training on millions of hours of audio and 28 billion sentences of text spanning 100+ languages.
Streaming speech recognition
Receive real-time speech recognition results as the API processes the audio input streamed from your application’s microphone or sent from a prerecorded audio file (inline or through Cloud Storage).
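As a rough sketch of the client side of streaming, an application can split its audio source into small chunks and send them one at a time. The helper name `iter_audio_chunks` and the 3,200-byte chunk size (100 ms of 16 kHz, 16-bit mono audio) are illustrative assumptions, not values the service prescribes, and the example stops short of calling the API.

```python
import io
from typing import BinaryIO, Iterator


def iter_audio_chunks(stream: BinaryIO, chunk_bytes: int = 3200) -> Iterator[bytes]:
    """Yield successive chunks of raw audio until the stream is exhausted."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:  # empty bytes means end of stream
            break
        yield chunk


# Stand-in for microphone or file input: one second of silent
# 16 kHz, 16-bit mono audio (32,000 bytes).
fake_audio = io.BytesIO(b"\x00" * 32000)
chunks = list(iter_audio_chunks(fake_audio))
print(len(chunks), "chunks of", len(chunks[0]), "bytes")
```

Each yielded chunk would become the audio payload of one streaming request. Smaller chunks lower latency at the cost of more requests.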
AI-powered speech recognition and transcription
Speech-to-Text uses model adaptation to improve the accuracy of frequently used words, expand the vocabulary available for transcription, and improve transcription from noisy audio. Model adaptation lets users customize Speech-to-Text to recognize specific words or phrases more frequently than other options that might otherwise be suggested. For example, you could bias Speech-to-Text towards transcribing "weather" over "whether."
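To make the idea of biasing concrete, here is a toy illustration in which each candidate transcript has a score and a boost is added when a preferred word appears. This is a conceptual sketch only: the function `pick_transcript`, the scores, and the boost value are hypothetical, and it does not represent how Speech-to-Text implements model adaptation internally.

```python
from typing import Dict, List, Tuple


def pick_transcript(
    candidates: List[Tuple[str, float]],
    boosts: Dict[str, float],
) -> str:
    """Return the candidate with the highest score after adding word boosts."""

    def boosted(item: Tuple[str, float]) -> float:
        text, score = item
        words = text.lower().split()
        # Add the boost of every preferred word present in the candidate.
        return score + sum(b for word, b in boosts.items() if word in words)

    return max(candidates, key=boosted)[0]


# Hypothetical recognizer output: two similar-sounding hypotheses.
candidates = [
    ("what is the whether today", 0.52),
    ("what is the weather today", 0.48),
]

print(pick_transcript(candidates, boosts={}))
print(pick_transcript(candidates, boosts={"weather": 0.10}))
```

Without a boost, the higher-scoring "whether" hypothesis wins. With the boost, "weather" overtakes it.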
Out-of-the-box regulatory and security compliance
Speech-to-Text API v2 helps enterprise and business customers meet security and regulatory requirements out of the box. Data residency lets you invoke transcription models through a fully regionalized service in Google Cloud regions such as Singapore and Belgium. Logs for resource generation and transcription are readily available in the Google Cloud console. Speech-to-Text API v2 also offers enterprise-grade encryption with customer-managed encryption keys for all resources as well as batch transcription.
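A minimal sketch of what regionalized access can look like from client code follows. It assumes the V2 API's regional endpoint pattern (`LOCATION-speech.googleapis.com`) and recognizer resource-name pattern. Treat both as assumptions to confirm against the API reference. The project ID is hypothetical and the helper names are not part of any library.

```python
# Region IDs for the two regions named above.
REGIONS = {"Singapore": "asia-southeast1", "Belgium": "europe-west1"}


def regional_endpoint(location: str) -> str:
    """Build the assumed regional API endpoint for a location."""
    return f"{location}-speech.googleapis.com"


def recognizer_path(project_id: str, location: str, recognizer_id: str = "_") -> str:
    """Build the assumed recognizer resource name ("_" stands for a default recognizer)."""
    return f"projects/{project_id}/locations/{location}/recognizers/{recognizer_id}"


location = REGIONS["Belgium"]
print(regional_endpoint(location))
print(recognizer_path("my-project", location))  # "my-project" is a placeholder
```

Keeping the endpoint and the resource name in the same location is what keeps requests within the chosen region in this sketch.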
Speech adaptation
Speech-to-Text On-Prem
Have full control over your infrastructure and protected speech data while leveraging Google’s speech recognition technology on-premises, right in your own private data centers. Contact sales to get started.
Multichannel recognition
Speech-to-Text can recognize distinct channels in multichannel situations (for example, video conference) and annotate the transcripts to preserve the order.
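The ordering idea can be sketched as merging per-channel transcript segments by start time. The `Segment` type, its field names, and the sample data are hypothetical and do not mirror the API's response format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    channel: int    # audio channel the segment came from
    start_s: float  # start time in seconds
    text: str       # transcribed text


def interleave(segments: List[Segment]) -> List[str]:
    """Order segments by start time (ties broken by channel) and label each one."""
    ordered = sorted(segments, key=lambda s: (s.start_s, s.channel))
    return [f"[ch{s.channel}] {s.text}" for s in ordered]


# Hypothetical two-channel video conference.
segments = [
    Segment(1, 0.0, "Can everyone hear me?"),
    Segment(1, 4.2, "Great, let's begin."),
    Segment(2, 2.1, "Yes, loud and clear."),
]

for line in interleave(segments):
    print(line)
```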
Noise robustness
Speech-to-Text can handle noisy audio from many environments without requiring additional noise cancellation.
Domain-specific models
Choose from a selection of trained models for voice control, phone call transcription, and video transcription, each optimized for domain-specific quality requirements. For example, our enhanced phone call model is tuned for audio that originates from telephony, such as phone calls recorded at an 8 kHz sampling rate.
Content filtering
Profanity filter helps you detect inappropriate or unprofessional content in your audio data and filter out profane words in text results.
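As an illustration of the kind of output a filter like this produces, the sketch below masks listed words by keeping the first letter and replacing the rest with asterisks. The function `mask_words`, the word list, and the exact masking style are assumptions for demonstration. The service applies its own filtering on the server side.

```python
import re
from typing import Iterable


def mask_words(text: str, blocked: Iterable[str]) -> str:
    """Replace all but the first letter of each blocked word with asterisks."""
    blocked_set = {w.lower() for w in blocked}

    def mask(match: re.Match) -> str:
        word = match.group(0)
        if word.lower() in blocked_set:
            return word[0] + "*" * (len(word) - 1)
        return word

    # Match runs of letters and apostrophes so punctuation is left untouched.
    return re.sub(r"[A-Za-z']+", mask, text)


# "darn" is a mild placeholder for a blocked word.
print(mask_words("That darn printer jammed again.", ["darn"]))
```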
Transcription evaluation
Upload your own voice data and have it transcribed with no code. Evaluate quality by iterating on your configuration.
Automatic punctuation (beta)
Speech-to-Text accurately punctuates transcriptions, such as by providing commas, question marks, and periods.
Speaker diarization
Know who said what by receiving automatic predictions about which of the speakers in a conversation spoke each utterance.
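A minimal sketch of consuming diarization output follows, assuming each recognized word arrives with a speaker label. The `(word, speaker)` tuples and the function `to_turns` are hypothetical stand-ins for the API's response fields.

```python
from itertools import groupby
from typing import List, Tuple


def to_turns(words: List[Tuple[str, int]]) -> List[Tuple[int, str]]:
    """Collapse consecutive words with the same speaker label into one turn."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w[1]):
        turns.append((speaker, " ".join(word for word, _ in group)))
    return turns


# Hypothetical word-level output: (word, speaker label).
words = [
    ("hello", 1), ("there", 1),
    ("hi", 2),
    ("how", 1), ("are", 1), ("you", 1),
]

for speaker, utterance in to_turns(words):
    print(f"Speaker {speaker}: {utterance}")
```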
Compare Speech-to-Text Chirp model in API and Vertex AI Studio
| Product | What is it | Best for | Key features |
|---|---|---|---|
| Chirp 3: Transcription in Vertex AI | A simple-to-use, no-code, web-based graphical user interface. | Rapidly testing audio files, quick prototyping, creating audio transcriptions, and uploading audio or recordings directly into a web browser. | Enhanced multilingual language detection and transcription; supports transcription in 85+ languages and variants; supports speaker diarization and model adaptation; automatic speech recognition, transcribing audio into text; multilingual language detection and transcription |
| Chirp 3: Transcription on Speech-to-Text V2 API | An API that is the next generation of Google's universal Speech-to-Text model, unifying data from multiple languages. | Building scalable, enterprise-grade applications. Easy transcription integration into existing software. | Enhanced multilingual language detection and transcription; supports transcription in 85+ languages and variants; supports speaker diarization and model adaptation; automatic speech recognition, transcribing audio into text; multilingual language detection and transcription |
How It Works
Speech-to-Text has three main methods to perform speech recognition: synchronous, asynchronous, and streaming. Each method returns text results, and which one fits depends on whether you need the transcription after processing is complete, periodically, or in real time. Simply put, you input audio data and receive a text-based response.
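The choice among the three methods can be sketched as a small decision helper. The function `choose_method` is hypothetical, and the 60-second cutoff for synchronous requests is an assumption used for illustration. Check the current quotas and limits before relying on it.

```python
def choose_method(
    duration_s: float,
    needs_live_results: bool,
    sync_limit_s: float = 60.0,  # assumed cutoff for synchronous requests
) -> str:
    """Suggest a recognition method from audio length and latency needs."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    if needs_live_results:
        return "streaming"      # results arrive while audio is still being sent
    if duration_s <= sync_limit_s:
        return "synchronous"    # short clip: one request, one response
    return "asynchronous"       # long audio: start a job and collect results later


print(choose_method(20, needs_live_results=False))
print(choose_method(3600, needs_live_results=False))
print(choose_method(3600, needs_live_results=True))
```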
Demo
Test out the Speech-to-Text API
Quickly create an audio transcription from a file upload or by speaking directly into a microphone.
Common Uses
Transcribe audio
Tutorials, quickstarts, & labs
Caption videos using AI
Tutorials, quickstarts, & labs
Add Speech-to-Text to apps
How to add Speech-to-Text to apps
Learn how to enable Speech-to-Text for your application with Google Cloud. This video covers how to add AI to your application without extensive machine learning experience by using the pretrained Speech-to-Text API.
Pricing
How Speech-to-Text pricing works
Speech-to-Text pricing is based on the API version, channels, batch methods, and any additional Google Cloud service costs like storage.

| API version | Service and capability | Pricing |
|---|---|---|
| Speech-to-Text V2 API | V2 offers data residency for multi-region and single-region deployments of Chirp 3. V2 includes audit logging and support for customer-managed encryption keys. | $0.016 per min |
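For a back-of-the-envelope estimate, the $0.016 per minute rate listed for the V2 API can be applied to a monthly audio volume. The sketch assumes each audio channel is billed separately and ignores free credits, region-specific pricing, and other service costs such as storage. The function `estimate_cost` is illustrative and is not the pricing calculator.

```python
from decimal import Decimal, ROUND_HALF_UP

RATE_PER_MINUTE = Decimal("0.016")  # V2 API rate listed on this page (USD)


def estimate_cost(audio_minutes: float, channels: int = 1) -> Decimal:
    """Estimate cost in USD, assuming every channel is billed per minute."""
    if audio_minutes < 0 or channels < 1:
        raise ValueError("audio_minutes must be >= 0 and channels >= 1")
    billed_minutes = Decimal(str(audio_minutes)) * channels
    total = billed_minutes * RATE_PER_MINUTE
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(estimate_cost(1000))            # 1,000 minutes of mono audio
print(estimate_cost(90, channels=2))  # 90 minutes of two-channel audio
```

`Decimal` is used instead of floats so the rounding to cents is exact.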
Pricing calculator
Estimate your monthly Speech-to-Text costs, including region-specific pricing and fees.
Custom quote
Connect with our sales team to get a custom quote for your organization.
Start your proof of concept
New customers get up to $300 in free credits to try Speech-to-Text and other Google Cloud products