Unlocking Free Speech-to-Text and Translation with Whisper AI
If you're familiar with my work, you know I often rely on ChatGPT 3.5 for assistance in writing, programming, and various tasks. However, the free version is limited to text, while a significant amount of information today is presented in audio or video formats. This limitation makes it challenging to integrate that data into ChatGPT unless you opt for a paid subscription.
Several software companies provide transcription services, such as Otter.ai, which offers a free basic plan with restricted transcription minutes and file uploads. While this might suffice for some users, those needing to transcribe more files monthly may find themselves needing a paid plan. Alternatively, you can utilize Whisper AI, an open-source neural network from OpenAI, which facilitates unlimited speech-to-text transcription and translation at no cost!
In this post, I will guide you through the process of installing Whisper AI on your computer and executing it from the command line (I prefer using the Anaconda prompt, but command line procedures will be similar). Additionally, I'll show you how to run Whisper in the cloud, eliminating the need for local Python installation or a GPU. Let's start with local installation.
Installing and Setting Up Whisper AI
You can find Whisper's installation guidelines on OpenAI's official GitHub repository, but I found the instructions somewhat perplexing. Therefore, I will simplify the process by providing clear, step-by-step directions for installing and utilizing Whisper on your device.
- Install Python: First, if you haven't done so already, download a version of Python from the Python Downloads page. I recommend using version 3.11 or lower to ensure compatibility with the package discussed in the next step.
- Install PyTorch: Next, you'll need to install the PyTorch library, essential for running Whisper. Visit the “Start Locally” page and choose the settings that align with your system. For instance, these were my selections:
I recommend opting for the stable build of PyTorch. Select your operating system (mine is Windows). Although I typically use pip for library installations, Conda works equally well. Choose Python as your programming language, and for the compute platform, select CPU. If you're not sure what CUDA is, pick CPU: CUDA only applies if your machine has a compatible NVIDIA GPU, which most casual users won't have.
Once you have made your selections, copy the command from PyTorch's "Run this Command" field. You can paste it into your command line (type “cmd” in your search bar) or Anaconda prompt. Hit Enter to install PyTorch.
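To confirm that PyTorch installed correctly, you can run a quick check from the same prompt (a minimal sanity check; your version number will differ):

python -c "import torch; print(torch.__version__)"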
- Download a Package Manager: Before installing ffmpeg, you will need a package manager for your operating system. Check Whisper's instruction page for the appropriate one:
For Windows, Chocolatey is recommended, accessible at chocolatey.org. Open the Chocolatey website and click on the “Install” tab in the upper right:
Select the “Individual” use option:
Follow the installation instructions provided. To summarize, search for “PowerShell” in your computer's search bar, right-click it, and select “Run as Administrator”:
Copy the command from the Chocolatey webpage, paste it into PowerShell, and press Enter:
- Install `ffmpeg`: With the package manager in place, install the ffmpeg library using the command below in the command line or Anaconda prompt:
# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
- Install Whisper: Finally, you can install Whisper by entering the command below into the command line or Anaconda prompt and pressing Enter:
pip install -U openai-whisper
The -U option upgrades Whisper to the latest version if you already have it installed. Congratulations! You're now ready to utilize Whisper on your machine.
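As an optional sanity check, you can open Python and list the model names Whisper knows about; available_models() is part of the openai-whisper package:

import whisper

# Prints the downloadable model names, e.g. tiny, base, small, medium, large (and their English-only .en variants)
print(whisper.available_models())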
Using Whisper AI Locally
To transcribe audio via the command line or Anaconda prompt, use the following command:
whisper "Test recording 1.m4a" --model base
The --model parameter allows you to choose from Whisper's available models. There are five to select from: tiny, base, small, medium, and large.
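At the time of writing, OpenAI's README summarizes them roughly as follows (figures are approximate and may change between releases):

- tiny: 39M parameters, ~1 GB of VRAM, ~32x relative speed
- base: 74M parameters, ~1 GB of VRAM, ~16x relative speed
- small: 244M parameters, ~2 GB of VRAM, ~6x relative speed
- medium: 769M parameters, ~5 GB of VRAM, ~2x relative speed
- large: 1550M parameters, ~10 GB of VRAM, 1x relative speed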
In essence, larger models (from tiny to large) yield better transcription quality but require more processing time, resulting in a trade-off between quality and speed. I've found the medium model offers excellent transcription quality, but it operates slowly on CPU, so using the base or small model may be advisable.
One of Whisper's remarkable features is its ability to translate audio and video files into English as it transcribes them. To specify the original language of the audio, add --language ru (for Russian) along with --task translate; Whisper will then translate the Russian audio and output the text in English:
whisper "Test recording 2.m4a" --model base --language ru --task translate
If you're uncertain about the original language, there's no need to specify it, as Whisper can automatically detect it. However, this may increase the time required to produce results. To see a list of supported languages, you can enter:
whisper --help
This command will present you with all parameters and options that Whisper accepts. Note that while Whisper can handle video files, including .mp4 format, I recommend converting them to .mp3 first to reduce processing time, especially if you lack a powerful GPU.
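An online converter works, but since you already installed ffmpeg, a one-line command does the same thing locally (the file names here are just examples):

# -map a keeps only the audio streams; -q:a 0 uses the highest VBR audio quality
ffmpeg -i "Test video.mp4" -q:a 0 -map a "Test video.mp3"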
If command-line operations aren't your preference, you can also use Whisper in your favorite IDE like this:
import whisper
model = whisper.load_model("base")
# Transcribe the recording
result1 = model.transcribe("C:/path/Test recording 1.m4a")

# Translate the recording from Russian to English
result2 = model.transcribe("C:/path/Test recording 2.m4a", language="ru", task="translate")

# Display the transcribed and translated text
print(result1["text"])
print(result2["text"])
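If you're curious which language Whisper detects before running a full transcription, the openai-whisper package also exposes a lower-level API; this sketch follows the example in OpenAI's README:

import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window Whisper operates on
audio = whisper.load_audio("C:/path/Test recording 2.m4a")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram and move it to the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language from the first 30 seconds
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")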
Finally, while using Whisper, you may notice files generated with various extensions: .txt, .json, .srt, .tsv, and .vtt. These represent different formats for your transcriptions:
- .txt: Plain-text transcription
- .json: Full transcription with segment-level detail in JSON format
- .srt: SubRip subtitle file with timestamped lines
- .tsv: Tab-separated file with start/end timestamps (in milliseconds) and text
- .vtt: WebVTT subtitle file, a web-friendly counterpart to .srt
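If you only want one of these formats, or want them written to a specific folder, the command line accepts --output_format and --output_dir options, for example:

whisper "Test recording 1.m4a" --model base --output_format srt --output_dir transcripts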
Utilizing Whisper in the Cloud
As mentioned earlier, using a CPU can significantly slow down Whisper's performance. A viable solution is to use a cloud platform like Google Colaboratory. If you have a Google account, head over to your Google Drive and select “New.”
Next, navigate to "More." If you don't see “Google Colaboratory” listed, click on “Connect more apps”:
Search for “Colaboratory” and install it:
Return to Google Drive, select “New” -> “More” -> “Google Colaboratory” to open a new workbook.
In the new workbook, go to “Runtime” -> “Change runtime type”:
In the pop-up, select “T4 GPU” and click “Save”:
Now you can leverage a GPU (albeit limited) to run Whisper! In Google Colab, you can use both command-line and Python commands to execute Whisper. To install Whisper, use the following commands (as per Whisper's instructions for Ubuntu, which is the operating system Colab uses):
!pip install git+https://github.com/openai/whisper.git
!sudo apt update && sudo apt install ffmpeg
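To double-check that the notebook is actually using the GPU, you can ask PyTorch, which comes preinstalled in Colab:

import torch

# Should print True when the T4 GPU runtime is active
print(torch.cuda.is_available())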
Click the “Files” icon on the left side to expand the file panel:
Here, you can drag and drop the files you wish to transcribe or translate:
Once your files are uploaded, you can use Whisper with the same command-line commands mentioned earlier—just remember to prepend “!” to the commands:
!whisper "Test recording 1.m4a" --model medium.en !whisper "Test recording 2.m4a" --model medium --task translate --language ru
Or using Python syntax:
import whisper
model = whisper.load_model("medium") result1 = model.transcribe("Test recording 1.m4a") result2 = model.transcribe("Test recording 2.m4a", language="ru", task="translate") print(result1["text"]) print(result2["text"])
Utilizing a GPU should significantly enhance processing speed. After completion, you can find all the different file types generated in the “Files” tab. Be sure to download any files you want to keep before closing Colab, as all files will be deleted automatically at the end of the session.
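Besides downloading from the “Files” panel by hand, you can save a transcript programmatically with Colab's files helper (the file name below assumes Whisper's default naming, which mirrors the input file name):

from google.colab import files

# Assumes the default output name produced for "Test recording 1.m4a"
files.download("Test recording 1.srt")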
That's all for now! If you found this post helpful, you might also enjoy my articles comparing generative AI chatbots and using ChatGPT to learn Python. As always, feel free to share your thoughts, suggestions, or ideas for future posts. Subscribe to my email list to stay updated on new content, usually published weekly on Sundays!