Sarvam AI Voice-to-Text Integration - Implementation Summary

Overview

Delivered a user-configurable voice dictation and translation system built on Sarvam AI, so that voice dictation no longer requires a Sypha account.

Implementation Date

November 2025

What Was Implemented

1. Protocol Buffers & State Management

Files Modified:

proto/sypha/state.proto
src/shared/DictationSettings.ts
src/shared/storage/state-keys.ts
src/core/storage/utils/state-helpers.ts

Changes:

Extended DictationSettings message with:
- transcription_provider (string): "sypha" or "sarvam"
- transcription_language (string): Language code for transcription
- enable_translation (bool): Whether to translate transcribed text
- translation_target_language (string): Target language for translation
Added sarvam_api_key to Secrets message for secure API key storage
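
As a rough illustration of the extended settings, the sketch below mirrors the four new fields in TypeScript. The camelCase property names, the default values, and the normalizeDictationSettings helper are assumptions made for this example; the actual type in src/shared/DictationSettings.ts may differ and likely carries other fields as well.

```typescript
// Sketch only: shows the four new fields, not the full DictationSettings type.
type TranscriptionProvider = "sypha" | "sarvam";

interface DictationSettingsSketch {
  transcriptionProvider: TranscriptionProvider;
  transcriptionLanguage: string; // language code, e.g. "hi-IN"
  enableTranslation: boolean;
  translationTargetLanguage: string; // language code, e.g. "en-IN"
}

// ASSUMPTION: defaults chosen for illustration. Sypha stays the default
// provider and translation is off, matching the backward-compatibility goal.
const DEFAULT_DICTATION_SETTINGS: DictationSettingsSketch = {
  transcriptionProvider: "sypha",
  transcriptionLanguage: "en-IN",
  enableTranslation: false,
  translationTargetLanguage: "en-IN",
};

// Type guard so an arbitrary stored value is never trusted as a provider.
function isTranscriptionProvider(value: unknown): value is TranscriptionProvider {
  return value === "sypha" || value === "sarvam";
}

// Hypothetical helper: fill in missing or invalid fields of a stored
// (possibly older) settings object with the defaults above.
function normalizeDictationSettings(
  stored: Partial<Record<keyof DictationSettingsSketch, unknown>>,
): DictationSettingsSketch {
  const d = DEFAULT_DICTATION_SETTINGS;
  return {
    transcriptionProvider: isTranscriptionProvider(stored.transcriptionProvider)
      ? stored.transcriptionProvider
      : d.transcriptionProvider,
    transcriptionLanguage:
      typeof stored.transcriptionLanguage === "string"
        ? stored.transcriptionLanguage
        : d.transcriptionLanguage,
    enableTranslation:
      typeof stored.enableTranslation === "boolean"
        ? stored.enableTranslation
        : d.enableTranslation,
    translationTargetLanguage:
      typeof stored.translationTargetLanguage === "string"
        ? stored.translationTargetLanguage
        : d.translationTargetLanguage,
  };
}
```

Normalizing on read means settings saved before this change, which lack the new fields, still load with the Sypha provider selected.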

2. Backend Services

New Files Created:

src/services/dictation/ITranscriptionService.ts

Interface for transcription services
Defines TranscriptionResult and ITranscriptionService interface
Enables provider abstraction

src/services/dictation/SyphaTranscriptionService.ts

Refactored from VoiceTranscriptionService.ts
Implements ITranscriptionService interface
Maintains backward compatibility with Sypha provider

src/services/dictation/SarvamTranscriptionService.ts

Implements Sarvam AI speech-to-text transcription
Supports Indian languages
Comprehensive error handling
API endpoint: https://api.sarvam.ai/speech-to-text

src/services/dictation/SarvamTranslationService.ts

Implements Sarvam AI text translation
Translates between Indian languages
Supports batch translation (future-ready)
API endpoint: https://api.sarvam.ai/translate
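
For orientation, a translation request could be assembled roughly as follows. Only the endpoint URL comes from this document. The api-subscription-key header, the JSON field names, and the translated_text response field are assumptions that should be checked against Sarvam AI's API reference; buildTranslateRequest and parseTranslateResponse are hypothetical helper names.

```typescript
// Plain description of an HTTP request, so the sketch needs no fetch typings.
interface HttpRequestSketch {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

const SARVAM_TRANSLATE_URL = "https://api.sarvam.ai/translate"; // from this document

// ASSUMPTION: the header name and body field names are not confirmed by this
// document; verify them against Sarvam AI's API reference before use.
function buildTranslateRequest(
  apiKey: string,
  text: string,
  sourceLanguage: string,
  targetLanguage: string,
): HttpRequestSketch {
  if (apiKey.trim() === "") {
    throw new Error("Sarvam AI API key is required");
  }
  return {
    url: SARVAM_TRANSLATE_URL,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "api-subscription-key": apiKey,
    },
    body: JSON.stringify({
      input: text,
      source_language_code: sourceLanguage,
      target_language_code: targetLanguage,
    }),
  };
}

// ASSUMPTION: the response carries the result in a `translated_text` field.
function parseTranslateResponse(json: unknown): string {
  if (typeof json === "object" && json !== null && "translated_text" in json) {
    const value = (json as { translated_text: unknown }).translated_text;
    if (typeof value === "string") {
      return value;
    }
  }
  throw new Error("Unexpected translation response shape");
}
```

Keeping request construction separate from the network call makes both halves testable without a real API key.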

src/services/dictation/TranscriptionServiceFactory.ts

Factory pattern for provider selection
Returns appropriate service based on provider string
Validates API keys and requirements
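
A minimal sketch of how the interface and factory might fit together is shown below. The interface members, the stubbed service bodies, the createTranscriptionService name, and the sarvamApiKey secret name are assumptions for illustration; the real classes make network calls and may expose different signatures.

```typescript
interface TranscriptionResult {
  text?: string;
  error?: string;
}

// ASSUMPTION: a single-method interface; the real one may have more members.
interface ITranscriptionService {
  transcribeAudio(audioBase64: string, language: string): Promise<TranscriptionResult>;
}

// Stub standing in for the Sypha-backed service.
class SyphaTranscriptionService implements ITranscriptionService {
  async transcribeAudio(_audioBase64: string, _language: string): Promise<TranscriptionResult> {
    return { error: "Sypha transcription is stubbed out in this sketch" };
  }
}

// Stub standing in for the Sarvam-backed service.
class SarvamTranscriptionService implements ITranscriptionService {
  constructor(private readonly apiKey: string) {}

  hasApiKey(): boolean {
    return this.apiKey.length > 0;
  }

  async transcribeAudio(_audioBase64: string, language: string): Promise<TranscriptionResult> {
    // A real implementation would POST the audio to the speech-to-text endpoint.
    return { error: `Sarvam transcription for ${language} is stubbed out in this sketch` };
  }
}

// Factory: picks the service from the provider string and fails fast when a
// requirement (here, the Sarvam API key) is missing.
function createTranscriptionService(
  provider: string,
  secrets: { sarvamApiKey?: string },
): ITranscriptionService {
  switch (provider) {
    case "sypha":
      return new SyphaTranscriptionService();
    case "sarvam": {
      const apiKey = (secrets.sarvamApiKey ?? "").trim();
      if (apiKey === "") {
        throw new Error("Sarvam AI API key is not configured");
      }
      return new SarvamTranscriptionService(apiKey);
    }
    default:
      throw new Error(`Unknown transcription provider: ${provider}`);
  }
}
```

Because validation happens in the factory, the controller can treat a thrown error as a configuration problem before any audio is sent.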

src/shared/sarvam/constants.ts

Sarvam AI API endpoints
Language code mappings (internal codes → Sarvam format)
Supported languages list
Utility functions for language validation
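
The constants module could look roughly like the sketch below. The two endpoint URLs come from this document. The two-letter internal codes, the region-tagged Sarvam codes (modeled on the hi-IN and en-IN examples used elsewhere in this summary, with od-IN assumed for Odia), and the helper names are assumptions to verify against Sarvam AI's documentation.

```typescript
const SARVAM_API_BASE_URL = "https://api.sarvam.ai";

const SARVAM_ENDPOINTS = {
  speechToText: `${SARVAM_API_BASE_URL}/speech-to-text`,
  translate: `${SARVAM_API_BASE_URL}/translate`,
} as const;

// ASSUMPTION: internal codes are ISO 639-1 style; Sarvam codes are region-tagged.
const LANGUAGE_PAIRS: ReadonlyArray<readonly [string, string]> = [
  ["en", "en-IN"], // English (India)
  ["hi", "hi-IN"], // Hindi
  ["bn", "bn-IN"], // Bengali
  ["gu", "gu-IN"], // Gujarati
  ["kn", "kn-IN"], // Kannada
  ["ml", "ml-IN"], // Malayalam
  ["mr", "mr-IN"], // Marathi
  ["or", "od-IN"], // Odia (assumed Sarvam code)
  ["pa", "pa-IN"], // Punjabi
  ["ta", "ta-IN"], // Tamil
  ["te", "te-IN"], // Telugu
];

const INTERNAL_TO_SARVAM = new Map<string, string>(LANGUAGE_PAIRS);
const SUPPORTED_SARVAM_LANGUAGES = new Set<string>(LANGUAGE_PAIRS.map((pair) => pair[1]));

// Map an internal code to the Sarvam format; undefined if unsupported.
function toSarvamLanguageCode(internalCode: string): string | undefined {
  return INTERNAL_TO_SARVAM.get(internalCode.toLowerCase());
}

// True when the code is already a supported Sarvam-format language code.
function isSupportedSarvamLanguage(code: string): boolean {
  return SUPPORTED_SARVAM_LANGUAGES.has(code);
}
```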

Files Modified:

src/core/controller/dictation/transcribeAudio.ts

Updated to use factory pattern for service selection
Reads provider from dictation settings
Fetches API keys from secrets
Implements translation pipeline
Enhanced error handling and telemetry

src/core/controller/state/updateSettings.ts

Extended dictation settings handler
Stores new fields: provider, transcription language, translation settings

3. Frontend Components

New Files Created:

webview-ui/src/components/settings/common/ApiKeyField.tsx

Reusable password-style input component
Show/hide toggle for API keys
Help text with external links
Secure input handling

webview-ui/src/components/settings/sections/DictationSettingsSection.tsx

Comprehensive UI for dictation configuration
Provider selection dropdown
API key input (for Sarvam AI)
Transcription language selection
Translation toggle and target language selection
Context-sensitive help tooltips

Files Modified:

webview-ui/src/components/settings/sections/FeatureSettingsSection-sypha.tsx

Integrated DictationSettingsSection component
Replaced old dictation language dropdown with new comprehensive settings

webview-ui/src/components/settings/utils/settingsHandlers.ts

Added updateSecret() function for API key storage
Sends secure messages to extension for secret storage

4. Message Handling

Files Modified:

src/shared/WebviewMessage.ts

Added "updateSecret" message type
Added secretKey and value fields for secret updates

src/shared/ExtensionMessage.ts

Added "updateSecret" to message type union
Added corresponding fields for secret handling

src/hosts/vscode/VscodeWebviewProvider.ts

Added message handler for "updateSecret" type
Stores secrets securely in VSCode secrets storage
Proper error handling and logging

5. Documentation

docs/features/voice-dictation-sarvam.md

Comprehensive user guide
Setup instructions
Language support reference
Translation feature explanation
Troubleshooting guide
FAQ section
Best practices
Privacy and security information

Key Features Implemented

1. Provider Selection

Users can choose between:
- Sypha (deprecated, requires Sypha account)
- Sarvam AI (recommended, requires API key)

2. Multi-Language Support

Sarvam AI supports 11+ languages:

English (India)
Hindi, Bengali, Gujarati, Kannada
Malayalam, Marathi, Odia, Punjabi
Tamil, Telugu

3. Translation Pipeline

Optional translation after transcription
Speak in one language, send in another
Supports all Sarvam language pairs
Graceful fallback on translation errors

4. Secure API Key Management

API keys stored in VSCode encrypted secrets
Never exposed in logs or telemetry
Show/hide toggle in UI
Secure transmission from webview to extension

5. Error Handling

Provider-specific error messages
Network error detection
API key validation
Rate limiting handling
User-friendly error descriptions
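
One way to turn raw failures into the user-facing messages described above is sketched here. The status-code groupings follow general HTTP conventions (401/403 for authentication, 429 for rate limiting) and are assumptions about how Sarvam AI reports errors; the function names and message wording are hypothetical.

```typescript
type SarvamErrorKind = "network" | "invalid_api_key" | "rate_limited" | "server" | "unknown";

// Classify a failed request. `status` is undefined when no HTTP response was
// received at all (for example, the network is down).
function classifySarvamFailure(status?: number): SarvamErrorKind {
  if (status === undefined) return "network";
  if (status === 401 || status === 403) return "invalid_api_key";
  if (status === 429) return "rate_limited";
  if (status >= 500) return "server";
  return "unknown";
}

// Hypothetical message table; the wording in the extension may differ.
const SARVAM_ERROR_MESSAGES: Record<SarvamErrorKind, string> = {
  network: "Could not reach Sarvam AI. Check your internet connection.",
  invalid_api_key: "Sarvam AI rejected the API key. Check the key in settings.",
  rate_limited: "Sarvam AI rate limit reached. Wait a moment and try again.",
  server: "Sarvam AI is temporarily unavailable. Try again later.",
  unknown: "Transcription failed with an unexpected error.",
};

function describeSarvamFailure(status?: number): string {
  return SARVAM_ERROR_MESSAGES[classifySarvamFailure(status)];
}
```

None of the messages include the API key or the transcript, which keeps them safe to log.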

6. Telemetry

Existing telemetry captures provider information
Tracks transcription success/failure by provider
Tracks translation success/failure
No sensitive data logged

API Flow

Transcription Flow:

1. User clicks microphone → Audio recorded
2. Audio (base64) sent to backend
3. Backend reads dictation settings:
   - Provider: sarvam
   - API Key: from secrets
   - Language: hi-IN
4. Factory creates SarvamTranscriptionService
5. Service calls Sarvam AI API
6. Transcript returned: "नमस्ते"
7. If translation enabled:
   - Create SarvamTranslationService
   - Translate to target language (en-IN)
   - Return: "Hello"
8. Text appears in chat input
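
Steps 5 to 7 of the transcription flow, including the fallback to the untranslated transcript, can be sketched as follows. The Transcriber and Translator interfaces and the function name are hypothetical stand-ins for the real services, injected here so the logic runs without network access.

```typescript
interface TranscriptionResult {
  text?: string;
  error?: string;
}

interface Transcriber {
  transcribeAudio(audioBase64: string, language: string): Promise<TranscriptionResult>;
}

interface Translator {
  translate(text: string, sourceLanguage: string, targetLanguage: string): Promise<string>;
}

interface PipelineSettings {
  transcriptionLanguage: string;
  enableTranslation: boolean;
  translationTargetLanguage: string;
}

async function transcribeAndMaybeTranslate(
  audioBase64: string,
  settings: PipelineSettings,
  transcriber: Transcriber,
  translator?: Translator,
): Promise<TranscriptionResult> {
  const result = await transcriber.transcribeAudio(audioBase64, settings.transcriptionLanguage);

  // Transcription errors and empty transcripts are returned unchanged.
  if (result.error !== undefined || result.text === undefined || result.text === "") {
    return result;
  }
  const original = result.text;

  // Translation is skipped when disabled, unavailable, or pointless.
  const sameLanguage = settings.translationTargetLanguage === settings.transcriptionLanguage;
  if (!settings.enableTranslation || translator === undefined || sameLanguage) {
    return result;
  }

  try {
    const translated = await translator.translate(
      original,
      settings.transcriptionLanguage,
      settings.translationTargetLanguage,
    );
    return { text: translated };
  } catch {
    // Graceful fallback: keep the original transcript if translation fails.
    return { text: original };
  }
}
```

With this shape, a translation outage degrades the feature to plain transcription instead of losing the user's dictated text.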

Settings Update Flow:

1. User enters API key in settings
2. updateSecret() called in frontend
3. Message sent to extension
4. VscodeWebviewProvider receives message
5. Secret stored via context.secrets.store()
6. API key available for next transcription
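
The extension side of this flow might be handled along these lines. The message shape follows the secretKey and value fields described under Message Handling; the allow-list, the sarvamApiKey key name, and the handleUpdateSecret function are assumptions for this sketch. The SecretStore interface is a local stand-in covering the store and delete methods of VSCode's SecretStorage (context.secrets).

```typescript
interface UpdateSecretMessage {
  type: "updateSecret";
  secretKey: string;
  value: string;
}

// Minimal local stand-in for the parts of VSCode's SecretStorage used here.
interface SecretStore {
  store(key: string, value: string): PromiseLike<void>;
  delete(key: string): PromiseLike<void>;
}

// ASSUMPTION: only known secret names are accepted from the webview, so a
// compromised page cannot write arbitrary keys.
const ALLOWED_SECRET_KEYS = new Set<string>(["sarvamApiKey"]);

// Returns true when the secret was stored or cleared, false when rejected.
async function handleUpdateSecret(
  message: UpdateSecretMessage,
  secrets: SecretStore,
): Promise<boolean> {
  if (!ALLOWED_SECRET_KEYS.has(message.secretKey)) {
    return false;
  }
  const value = message.value.trim();
  if (value === "") {
    // An empty value clears the stored key instead of saving a blank secret.
    await secrets.delete(message.secretKey);
  } else {
    await secrets.store(message.secretKey, value);
  }
  return true;
}
```

In the extension, context.secrets would be passed as the second argument; the value itself is never logged.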

Architecture Decisions

1. Service Abstraction

Decision: Created ITranscriptionService interface

Rationale:

Easy to add new providers in future
Testable architecture
Clean separation of concerns
Provider-agnostic controller code

2. Factory Pattern

Decision: Used factory for service instantiation

Rationale:

Centralized provider validation
Easy to extend with new providers
API key validation at creation time
Fail-fast approach

3. Optional Translation

Decision: Translation as separate, optional step

Rationale:

Not all users need translation
Keeps transcription and translation separate
Graceful degradation if translation fails
Clear user control

4. Secure Secret Storage

Decision: Use VSCode secrets API instead of regular settings

Rationale:

API keys are sensitive data
Encrypted at rest
Kept out of settings files, so keys cannot be committed to a repository by accident
Backed by the operating system's credential store

5. Backward Compatibility

Decision: Keep Sypha provider as default

Rationale:

Existing users not affected
Smooth migration path
No breaking changes
Deprecation warnings guide users

Testing Considerations

Manual Testing Required:

✅ Provider selection (Sypha ↔ Sarvam)
✅ API key storage and retrieval
✅ Transcription with Sarvam AI (requires real API key)
✅ Translation with different language pairs
✅ Error handling (invalid API key, network errors)
✅ Settings persistence
✅ UI rendering on different screen sizes

Edge Cases to Test:

Empty API key
Invalid API key
Network disconnection during transcription
Translation failure (should fallback to original)
Very long audio recordings
Multiple rapid transcriptions
Provider switching mid-session

Known Limitations

Translation only with Sarvam AI: Sypha provider doesn't support translation
Indian Languages Focus: Sarvam AI specializes in Indian languages
API Key Required: Users must obtain their own Sarvam AI key
No Offline Mode: Requires internet connection
Audio Format: Currently supports WebM; may need conversion for other formats