User Guide
DuRT currently provides real-time speech recognition and real-time translation of recognition results. It can also record audio, save recognition and translation results, and copy results to the clipboard during recognition.
This document explains how to use DuRT, including pre-use considerations and main features.
Speech Recognition
DuRT currently integrates three mainstream types of speech recognition: streaming speech recognition, non-streaming speech recognition, and Apple's built-in speech recognition.
Below is a comparison of these three types of speech recognition.
Recognition Type | Streaming | Non-Streaming | Apple |
---|---|---|---|
Recognition Quality | Good | Excellent | Excellent |
Model Download | Streaming Recognition Model | Whisper Model | No download required |
Punctuation | Not Supported | Supported | Supported |
Recognition Speed | Real-Time | Near Real-Time | Real-Time |
Supported Languages | 4 | 30+ | 30+ |
Language Switching During Recognition | Not Supported | Supported | Supported |
Local Only | Yes | Yes | Depends on Conditions |
Translation | Supported | Supported | Supported |
Save Recording | Supported | Supported | Supported |
Save Recognition Result | Supported | Supported | Supported |
Save Translation Result | Supported | Supported | Supported |
In our experience, Apple recognition and Whisper non-streaming recognition are the most accurate. Accuracy varies by language, so we recommend trying each method with the language you need.
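To narrow the choice, the comparison table can be treated as data and filtered by the features you need. The sketch below is only an illustration: the dictionary layout, the key names, and the `pick_types` helper are hypothetical and not part of DuRT, and "30+" is stored as 30.

```python
# Feature data transcribed from the comparison table above.
# The layout and key names are illustrative assumptions, not a DuRT API.
RECOGNITION_TYPES = {
    "Streaming": {
        "punctuation": False,
        "languages": 4,
        "language_switching": False,
        "always_local": True,
        "needs_download": True,
    },
    "Non-Streaming": {
        "punctuation": True,
        "languages": 30,  # the table lists "30+"
        "language_switching": True,
        "always_local": True,
        "needs_download": True,
    },
    "Apple": {
        "punctuation": True,
        "languages": 30,  # the table lists "30+"
        "language_switching": True,
        "always_local": False,  # "Depends on Conditions" in the table
        "needs_download": False,
    },
}


def pick_types(**required):
    """Return the recognition types whose features equal every requirement."""
    return [
        name
        for name, features in RECOGNITION_TYPES.items()
        if all(features[key] == value for key, value in required.items())
    ]


# Which types give punctuation while always staying on the device?
print(pick_types(punctuation=True, always_local=True))
```

With the table as written, only non-streaming recognition satisfies both requirements in this example.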
Permission Requests
Permissions need to be set before using speech recognition. DuRT supports both device audio and microphone audio recognition.
For recognizing device audio, Screen Recording and System Audio Permissions are required. Go to Settings > Privacy & Security > Screen Recording & System Audio to allow DuRT to use screen and audio recording, as shown below:
For recognizing microphone audio, Microphone Permission is required. Go to Settings > Privacy & Security > Microphone to allow DuRT to access the microphone, as shown below:
To save recordings, recognition results, or translation results, you need to choose a directory for storage. Set this directory in DuRT’s settings page.
For security, DuRT uses the screen recording and microphone permissions only while recognition is running.
Streaming Speech Recognition
Audio Source: Choose between device audio and microphone audio.
Recognition Type: Select "Streaming" when using streaming speech recognition.
Language Selection: Choose from languages supported by downloaded streaming models.
Model Selection: Select the streaming model to use.
Save Audio: Saves the captured audio to the selected directory once recognition starts. You can set this directory on the settings page. If no directory is set, DuRT prompts you to choose one when you enable this feature.
Save Recognition Results: Saves recognized results as a .txt file in the selected directory.
Enable Translation: Translates recognition results into the selected language.
Translation Language Selection: Choose the target language for translation.
Save Translation Results: Saves translated results as a .txt file in the selected directory.
Display Floating Window: Shows recognition and translation results (if enabled) in a floating window.
Start: Starts recognition. The first start takes about 1–3 seconds while the model loads.
Memory Usage
Streaming recognition requires downloading the corresponding model, which is between 200 MB and 500 MB in size. While running, it uses about twice the model size in memory.
Model Information
Currently, DuRT supports streaming models for English, Chinese, Korean, and French. You must download the corresponding model to recognize the desired language. See Model Download.
Non-Streaming Speech Recognition
Recognition Type: Select "Non-Streaming" for non-streaming speech recognition.
Recognition Interval: Sets the interval between recognition passes, from 2 to 10 seconds. An interval of 3–5 seconds usually gives the best experience.
Other settings are the same as for streaming recognition; see the descriptions above.
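One way to picture the recognition interval is as the length of the audio window handed to the model on each pass. The sketch below assumes exactly that; how DuRT buffers audio internally is not documented here, and the function names and the 16 kHz sample rate are assumptions made for illustration.

```python
def clamp_interval(seconds, low=2.0, high=10.0):
    """Keep the interval inside the 2 to 10 second range the setting allows."""
    return max(low, min(high, seconds))


def split_into_windows(samples, sample_rate, interval_seconds):
    """Cut a list of audio samples into consecutive windows of one interval each.

    The last window may be shorter than the interval.
    """
    window = int(sample_rate * clamp_interval(interval_seconds))
    return [samples[i:i + window] for i in range(0, len(samples), window)]


# 12 seconds of silence at an assumed 16 kHz sample rate, 5 second interval.
sample_rate = 16000
audio = [0.0] * (sample_rate * 12)
windows = split_into_windows(audio, sample_rate, 5)
print([len(w) / sample_rate for w in windows])  # [5.0, 5.0, 2.0]
```

Under this model, a shorter interval shows text sooner but gives the recognizer less audio per pass, which is a plausible reason why the middle of the range tends to feel best.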
Memory Usage
Non-streaming recognition requires downloading a Whisper model, which is between 200 MB and 1 GB in size. While running, it uses about twice the model size in memory.
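The "about twice the model size" rule makes it easy to estimate memory needs before downloading a model. The helper below is a rough sketch of that arithmetic only; the function name is hypothetical, the factor of 2 is the approximate figure given above, and 1 GB is taken as 1024 MB.

```python
def estimated_memory_mb(model_size_mb, factor=2.0):
    """Rough memory estimate: model size times the approximate factor of 2."""
    return model_size_mb * factor


# Smallest and largest model sizes mentioned in this guide.
for label, size_mb in [("200 MB", 200), ("500 MB", 500), ("1 GB", 1024)]:
    print(f"{label} model: about {estimated_memory_mb(size_mb):.0f} MB in memory")
```

Treat the result as a ballpark figure; actual usage depends on the model and on what else is running.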
Model Information
Non-streaming recognition uses Whisper models, which are available in sizes such as tiny, base, small, medium, large, and turbo.
Models with "-en" in the name are limited to English recognition.
Generally, larger models provide better recognition accuracy.
Supported languages for Whisper include: Arabic, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Italian, Japanese, Korean, Macedonian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Tamil, Thai, Turkish, Ukrainian, Urdu, and Vietnamese.
To use non-streaming speech recognition, download the Whisper model. See Model Download.
The recommended model is whisper-large-v3-turbo.
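The "-en" naming rule can be checked mechanically when deciding whether a model suits your language. The sketch below is illustrative only: `whisper-base-en` is a hypothetical name used to show the rule, and the helper functions are not part of DuRT.

```python
def is_english_only(model_name):
    """True if the name contains an "en" part set off by hyphens.

    Splitting on "-" avoids false matches on names that merely contain
    the letters "-en" inside a longer word.
    """
    return "en" in model_name.lower().split("-")


def usable_models(model_names, language):
    """Keep models that can recognize the given language."""
    return [
        name
        for name in model_names
        if language == "English" or not is_english_only(name)
    ]


# "whisper-base-en" is a made-up example name for an English-only model.
names = ["whisper-base", "whisper-base-en", "whisper-large-v3-turbo"]
print(usable_models(names, "Japanese"))
```

For a Japanese speaker, the English-only entry is filtered out and the multilingual models remain.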
Apple Speech Recognition
Apple speech recognition uses the speech recognition built into macOS. In addition to the screen recording and microphone permissions, it requires two more settings.
Enable Speech Recognition under Settings > Privacy & Security > Speech Recognition to allow DuRT access, as shown below:
Enable Keyboard Dictation in Settings > Keyboard > Dictation, as shown below:
Apple speech recognition has two modes: local and server-based, depending on Apple’s configuration:
Mode | Local | Server |
---|---|---|
Accuracy | Good | Good |
Limitations | None | Daily request limits |
Supported Languages | 10+ | 50+ |
Note: Server-based recognition may send audio data to Apple's servers.
DuRT indicates if a language supports local recognition, as shown above with "Run: Local."
If local recognition is available for the selected language, DuRT uses it; otherwise, it falls back to server-based recognition. If server-based recognition returns no results, you may have reached the daily request limit.
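The fallback rule can be summarized as a small decision function. This is a sketch of the behavior described above, not DuRT's implementation: the function names are hypothetical, and the set of locally supported languages is a placeholder, since the real list is determined by macOS.

```python
def choose_mode(language, local_languages):
    """Prefer local recognition; fall back to server-based recognition."""
    return "local" if language in local_languages else "server"


def diagnose_empty_result(mode):
    """Suggest a likely cause when recognition returns nothing."""
    if mode == "server":
        return "Possibly the daily request limit; try again later."
    return "Check the audio source and permissions."


# Placeholder set; the real list of local languages comes from macOS.
local_languages = {"English", "Chinese"}

for language in ["English", "Thai"]:
    mode = choose_mode(language, local_languages)
    print(language, "->", mode)
```

In practice you do not need to choose the mode yourself; the "Run: Local" indicator tells you which one applies to the selected language.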
Memory Usage
Apple’s recognition uses the system’s speech recognition service, so memory usage is minimal.
Model Information
Apple’s speech recognition doesn’t require downloading a model; you only need to grant permissions.
Text Translation
To use text translation, first download the translation model from the download page.
Enable Translation in the recognition interface and select the target language. As shown below, you can switch the translation language during recognition.
Memory Usage
Enabling translation requires around 1.5 GB of memory.
Model Information
Translation requires downloading the translation model. See Model Download.
Usage Tips
First, grant screen recording and microphone permissions for the app to function.
DuRT includes the non-streaming Whisper base model by default. With a suitable recognition interval, it provides near real-time recognition for 30+ languages.
Granting system speech recognition permissions and enabling keyboard dictation allows Apple speech recognition, supporting numerous languages and punctuation.
After downloading a translation model, the translation feature is available.
After downloading a streaming model, you can use streaming recognition for the language that model supports.
For improved recognition accuracy, download a larger Whisper model.
Downloading both Whisper and translation models enables full Whisper recognition and translation functionality.