User Guide

DuRT currently provides real-time speech recognition and real-time translation of the recognized text. It can also record audio, save recognition and translation results, and copy results to the clipboard during recognition.

This document explains how to use DuRT, covering what to set up before first use and the main features.

Speech Recognition

DuRT currently integrates three mainstream types of speech recognition: streaming speech recognition, non-streaming speech recognition, and Apple's built-in speech recognition.

Below is a comparison of these three types of speech recognition.

| Recognition Type | Streaming | Non-Streaming | Apple |
| --- | --- | --- | --- |
| Recognition Quality | Good | Excellent | Excellent |
| Model Download | Streaming Recognition Model | Whisper Model | No download required |
| Punctuation | Not Supported | Supported | Supported |
| Recognition Speed | Real-Time | Near Real-Time | Real-Time |
| Supported Languages | 4 | 30+ | 30+ |
| Language Switching During Recognition | Not Supported | Supported | Supported |
| Local Only | Yes | Yes | Depends on Conditions |
| Translation | Supported | Supported | Supported |
| Save Recording | Supported | Supported | Supported |
| Save Recognition Result | Supported | Supported | Supported |
| Save Translation Result | Supported | Supported | Supported |

In our experience, Apple and Whisper (non-streaming) recognition provide the best accuracy. Accuracy varies by language, so it is worth trying each method with the language you need.
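The comparison above can also be read as a small lookup. The following is a minimal sketch, not part of DuRT: it transcribes three rows of the table into data and filters the recognition types by requirement. The class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recognizer:
    name: str
    punctuation: bool         # "Punctuation" row
    language_switching: bool  # "Language Switching During Recognition" row
    always_local: bool        # "Local Only" row; Apple "depends on conditions"


# Values transcribed from the comparison table in this guide.
RECOGNIZERS = [
    Recognizer("Streaming", punctuation=False, language_switching=False, always_local=True),
    Recognizer("Non-Streaming", punctuation=True, language_switching=True, always_local=True),
    Recognizer("Apple", punctuation=True, language_switching=True, always_local=False),
]


def candidates(need_punctuation=False, need_language_switching=False, must_stay_local=False):
    """Return the names of recognition types that meet every requirement."""
    return [
        r.name
        for r in RECOGNIZERS
        if (r.punctuation or not need_punctuation)
        and (r.language_switching or not need_language_switching)
        and (r.always_local or not must_stay_local)
    ]


print(candidates(need_punctuation=True, must_stay_local=True))  # ['Non-Streaming']
```

A filter like this only narrows the options; recognition quality for a particular language still has to be checked by trying it.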

Permission Requests

Grant the required permissions before using speech recognition. DuRT can recognize both device audio and microphone audio.

For recognizing device audio, Screen Recording and System Audio Permissions are required. Go to Settings > Privacy & Security > Screen Recording & System Audio to allow DuRT to use screen and audio recording, as shown below:

Image description

For recognizing microphone audio, Microphone Permission is required. Go to Settings > Privacy & Security > Microphone to allow DuRT to access the microphone, as shown below:

Image description

To save recordings, recognition results, or translation results, you need to choose a directory for storage. Set this directory in DuRT’s settings page.

For security, DuRT uses screen recording or microphone access only while recognition is running.

Streaming Speech Recognition

Image description

Audio Source: Choose between device audio and microphone audio.

Recognition Type: Select "Streaming" when using streaming speech recognition.

Language Selection: Choose from languages supported by downloaded streaming models.

Model Selection: Select the streaming model to use.

Save Audio: Saves the captured audio to the selected directory once recognition starts. Set this directory on the settings page; if none is set, DuRT prompts you to choose one when you enable this option.

Save Recognition Results: Saves recognized results as a .txt file in the selected directory.

Enable Translation: Translates recognition results into the selected language.

Translation Language Selection: Choose the target language for translation.

Save Translation Results: Saves translated results as a .txt file in the selected directory.

Display Floating Window: Shows recognition and translation results (if enabled) in a floating window.

Start: Starts recognition. The first start takes about 1–3 seconds while the model loads.

Memory Usage

Streaming recognition requires downloading the corresponding model, which ranges from 200MB to 500MB. During operation, it requires about twice the model size in memory.

Model Information

Currently, DuRT supports streaming models for English, Chinese, Korean, and French. You must download the corresponding model to recognize the desired language. See Model Download.

Non-Streaming Speech Recognition

Image description

Recognition Type: Select "Non-Streaming" for non-streaming speech recognition.

Recognition Interval: Sets how often accumulated audio is sent for recognition, from 2 to 10 seconds. An interval of 3–5 seconds usually gives the best experience.
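To make the interval setting concrete, here is a minimal sketch of fixed-interval chunking. It is an illustration under stated assumptions, not DuRT's actual implementation: the function names are hypothetical, and the 16 kHz sample rate is the input rate Whisper models expect.

```python
SAMPLE_RATE_HZ = 16_000                 # assumption: 16 kHz mono input, as Whisper expects
MIN_INTERVAL_S, MAX_INTERVAL_S = 2, 10  # range stated in this guide


def clamp_interval(seconds):
    """Keep the interval inside the 2-10 second range."""
    return max(MIN_INTERVAL_S, min(MAX_INTERVAL_S, seconds))


def split_into_chunks(samples, interval_s):
    """Yield consecutive chunks of at most interval_s seconds of audio."""
    chunk_len = int(clamp_interval(interval_s) * SAMPLE_RATE_HZ)
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


audio = [0.0] * (SAMPLE_RATE_HZ * 7)  # 7 seconds of silence
chunks = list(split_into_chunks(audio, interval_s=3))
print([len(c) / SAMPLE_RATE_HZ for c in chunks])  # [3.0, 3.0, 1.0]
```

A shorter interval returns text sooner but gives the model less context per pass; a longer one does the opposite, which is consistent with the 3–5 second recommendation.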

Other settings are the same as those in streaming recognition; refer to the above explanations.

Memory Usage

Non-streaming recognition requires downloading the Whisper model, which ranges from 200MB to 1GB. It requires about twice the model size in memory during operation.

Model Information

Non-streaming recognition uses Whisper models, which are available in sizes such as tiny, base, small, medium, large, and turbo.

Models with "-en" in the name are limited to English recognition.

Generally, larger models provide better recognition accuracy.

Supported languages for Whisper include: Arabic, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Italian, Japanese, Korean, Macedonian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Tamil, Thai, Turkish, Ukrainian, Urdu, and Vietnamese.

To use non-streaming speech recognition, download the Whisper model. See Model Download.

The recommended model is whisper-large-v3-turbo.

Apple Speech Recognition

Image description

Apple speech recognition uses the macOS built-in feature. In addition to screen recording and microphone permissions, two more permissions are required.

Enable Speech Recognition in Settings > Privacy & Security > Speech Recognition to allow DuRT access, as shown below:

Image description

Enable Keyboard Dictation in Settings > Keyboard > Dictation, as shown below:

Image description

Apple speech recognition has two modes: local and server-based, depending on Apple’s configuration:

| Mode | Local | Server |
| --- | --- | --- |
| Accuracy | Good | Good |
| Limitations | None | Daily request limits |
| Supported Languages | 10+ | 50+ |

Note: Server-based recognition may send audio data to Apple's servers.

DuRT indicates if a language supports local recognition, as shown above with "Run: Local."

If local recognition is available, DuRT will use it; otherwise, it defaults to server-based recognition. If server-based recognition doesn’t return results, it may be due to reaching the daily limit.
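The fallback rule described above can be sketched as follows. This is a hypothetical illustration only: the language set is a placeholder (DuRT shows the real per-language status as "Run: Local"), and the function names are not taken from DuRT or from Apple's APIs.

```python
# Hypothetical placeholder set; the real list depends on Apple's configuration.
LOCAL_LANGUAGES = {"en-US", "zh-CN"}


def choose_mode(language, local_languages=LOCAL_LANGUAGES):
    """Prefer local recognition; otherwise fall back to server-based recognition."""
    return "local" if language in local_languages else "server"


def explain_empty_result(mode):
    """Suggest a likely cause when a recognition pass returns no text."""
    if mode == "server":
        return "No result from the server: the daily request limit may have been reached."
    return "No result from local recognition."


mode = choose_mode("fr-FR")
print(mode)  # server
print(explain_empty_result(mode))
```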

Memory Usage

Apple’s recognition uses the system’s speech recognition service, so memory usage is minimal.

Model Information

Apple’s speech recognition doesn’t require downloading a model; you only need to grant permissions.

Text Translation

To use text translation, first download the translation model from the download page.

Enable Translation in the recognition interface and select the target language. As shown below, you can switch the translation language during recognition.

Image description

Memory Usage

Enabling translation requires around 1.5GB of memory.
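The memory figures in this guide can be combined into a rough estimate. The sketch below assumes the two rules of thumb stated here (about twice the model size for recognition, around 1.5GB for translation) simply add up; the function name is made up for illustration, and actual usage will vary.

```python
TRANSLATION_MEMORY_MB = 1500  # "around 1.5GB" per this guide
MODEL_MEMORY_FACTOR = 2       # "about twice the model size" per this guide


def estimate_memory_mb(model_size_mb, translation=False):
    """Rough memory estimate in MB.

    Pass model_size_mb=0 for Apple recognition, which uses the system
    service and no downloaded model (an approximation of "minimal").
    """
    total = model_size_mb * MODEL_MEMORY_FACTOR
    if translation:
        total += TRANSLATION_MEMORY_MB
    return total


print(estimate_memory_mb(500))                     # 1000
print(estimate_memory_mb(1000, translation=True))  # 3500
```

For example, a 1GB Whisper model with translation enabled would need roughly 3.5GB by this estimate.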

Model Information

Translation requires downloading the translation model. See Model Download.

Usage Tips

First, grant screen recording and microphone permissions for the app to function.

DuRT includes the non-streaming Whisper-base model by default, which allows near real-time recognition for 30+ languages when you set a short recognition interval.

Granting the system Speech Recognition permission and enabling keyboard dictation enables Apple speech recognition, which supports numerous languages and punctuation.

After downloading a translation model, the translation feature is available.

After downloading a streaming model, you can use streaming recognition for that model's language.

For improved recognition accuracy, download a larger Whisper model.

Downloading both Whisper and translation models enables full Whisper recognition and translation functionality.