Read Me First
Operating Local Models with Sypha
Local models have reached a real milestone: for the first time, Sypha can run entirely offline on genuinely capable models. That means no API costs, data that never leaves your machine, and no internet requirement.
Success depends on selecting the appropriate model for your hardware and configuring it correctly.
Essential Information
System Requirements
Your available RAM determines which models you can run:
| RAM Tier | Recommended Model | Quantization | What You Get |
|---|---|---|---|
| 32GB | Qwen3 Coder 30B | 4-bit | Entry-level local coding |
| 64GB | Qwen3 Coder 30B | 8-bit | Full Sypha features |
| 128GB+ | GLM-4.5-Air | 4-bit | Cloud-competitive performance |
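Not sure which tier you're in? A small sketch like the following maps total system RAM to the table above. It assumes the third-party psutil package (`pip install psutil`), which is just a convenience for this example, not something Sypha requires:

```python
# Map detected system RAM to the model tiers in the table above.
import psutil  # third-party: pip install psutil

def recommend_model() -> str:
    """Suggest a model and quantization based on total system RAM."""
    ram_gb = psutil.virtual_memory().total / (1024 ** 3)
    if ram_gb >= 128:
        return "GLM-4.5-Air @ 4-bit (cloud-competitive performance)"
    if ram_gb >= 64:
        return "Qwen3 Coder 30B @ 8-bit (full Sypha features)"
    if ram_gb >= 32:
        return "Qwen3 Coder 30B @ 4-bit (entry-level local coding)"
    return "Below the 32GB tier -- local models are not recommended"

print(recommend_model())
```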
The Proven Model: Qwen3 Coder 30B
After extensive testing, Qwen3 Coder 30B is the only model under 70B parameters that works reliably with Sypha. It provides:
- 256K native context window
- Robust tool-use capabilities
- Repository-scale understanding
- Dependable command execution
Most smaller models (7B–20B) fail with Sypha: they produce malformed output, refuse to run commands, or break down on tool calling.
Essential Configuration
Operating local models demands specific settings:
For LM Studio:
- Context Length: 262,144 (upper limit)
- KV Cache Quantization: OFF (essential)
- Flash Attention: ON (if supported)
For All Local Models:
- Activate "Use Compact Prompt" in Sypha settings
- This decreases prompt size by 90% while preserving core functionality
- Critical for local inference performance
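Once the server side is configured, it's worth a quick smoke test to confirm it accepts requests before pointing Sypha at it. A minimal sketch against an OpenAI-compatible endpoint, assuming LM Studio's default port (1234); the model id is a placeholder, so substitute the id your server actually reports:

```python
# Smoke-test a local OpenAI-compatible server with one tiny completion.
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio default; yours may differ

payload = {
    "model": "qwen3-coder-30b",  # placeholder id -- use the id your server lists
    "messages": [{"role": "user", "content": "Reply with the single word: OK"}],
    "max_tokens": 8,
}
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=120) as resp:
    body = json.load(resp)

print(body["choices"][0]["message"]["content"])
```

If this returns a sensible reply, the server, port, and model are all in working order.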
Understanding Quantization
Quantization reduces a model's numeric precision so it fits on consumer hardware. Think of it as compression:
- 4-bit: ~75% size reduction. Fully functional for coding tasks.
- 8-bit: ~50% size reduction. Enhanced quality, more detailed responses.
- 16-bit: Full precision. Equals cloud APIs but demands 4x the memory.
For Qwen3 Coder 30B:
- 4-bit: ~17GB download
- 8-bit: ~32GB download
- 16-bit: ~60GB download
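These figures follow from simple arithmetic: weight size is roughly parameters × bits per weight ÷ 8 bytes, with actual files a couple of GB larger because of embeddings, metadata, and layers kept at higher precision. A quick check:

```python
# Back-of-envelope weight sizes for a 30B-parameter model:
# parameters x bits-per-weight / 8 = bytes of weights.
PARAMS = 30e9  # Qwen3 Coder 30B

for bits in (4, 8, 16):
    size_gb = PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit: ~{size_gb:.0f} GB of weights")

# Prints ~15, ~30, and ~60 GB -- close to the ~17/32/60 GB downloads
# above once format overhead and mixed-precision layers are included.
```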
Model Format Selection
Select according to your platform:
MLX (Mac only)
- Tailored for Apple Silicon
- Utilizes Metal and AMX acceleration
- Accelerated inference on M1/M2/M3 chips
GGUF (Universal)
- Compatible with Windows, Linux, and Mac
- Wide-ranging quantization options
- Enhanced tool compatibility
Expected Performance Behavior
Local models operate differently from cloud APIs:
Expect:
- A delay the first time the model loads (normal, happens once)
- Slower inference than cloud models (see the throughput check below)
- Slower context processing on large repositories
Don't expect:
- Instant responses like cloud APIs
- Unlimited context-processing speed
- Zero-configuration setup
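If you want a concrete number for your own hardware, Ollama's generate endpoint returns token counts and timings in its final response, which you can convert to tokens per second. A rough sketch, assuming Ollama on its default port (11434); the model tag is illustrative, so use whatever `ollama list` shows on your machine:

```python
# Rough throughput check via Ollama's /api/generate endpoint.
# The non-streaming response includes eval_count (generated tokens)
# and eval_duration (nanoseconds spent generating them).
import json
import urllib.request

payload = {
    "model": "qwen3-coder:30b",  # illustrative tag -- check `ollama list`
    "prompt": "Write a haiku about compilers.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",  # Ollama's default port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=600) as resp:
    stats = json.load(resp)

tokens_per_sec = stats["eval_count"] / (stats["eval_duration"] / 1e9)
print(f"~{tokens_per_sec:.1f} tokens/sec")
```

Treat the result as a baseline for comparing quantizations and settings, not as a problem to fix.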
Optimal Use Cases for Local Models
Deploy local models for:
- Development environments with unreliable connectivity
- Privacy-focused projects requiring data containment
- Budget-conscious development avoiding API charges
- Learning and experimentation requiring unlimited usage
Optimal Use Cases for Cloud Models
Cloud models maintain advantages for:
- Extensive repositories surpassing local context limitations
- Prolonged refactoring sessions demanding maximum context
- Teams needing uniform performance across varied hardware
- Tasks demanding cutting-edge model capabilities
Frequent Problems
"Shell integration unavailable" or command execution failures
Switch to a simpler shell in Sypha's settings: go to Sypha Settings → Terminal → Default Terminal Profile and select "bash". This resolves 90% of terminal integration problems.
"No connection could be made"
Your local server (Ollama or LM Studio) isn't running, or is listening on a different port. Check that:
- The server is actually running
- The Base URL in Sypha settings matches your server's address
- No firewall is blocking the connection
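To check the first two points quickly, you can probe the servers' endpoints directly. A minimal sketch, assuming the default ports (1234 for LM Studio, 11434 for Ollama); adjust them if your setup differs:

```python
# Probe the default local endpoints to see which server is reachable.
# Ports below are the defaults only -- match them to your actual setup.
import urllib.request

ENDPOINTS = {
    "LM Studio": "http://localhost:1234/v1/models",
    "Ollama": "http://localhost:11434/api/tags",
}

for name, url in ENDPOINTS.items():
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            print(f"{name}: reachable (HTTP {resp.status})")
    except OSError as err:  # covers refused connections, timeouts, DNS errors
        print(f"{name}: not reachable ({err})")
```

If both come back unreachable, start the server (or fix the port) before changing any Sypha settings.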
Slow or partial responses
This is normal for local models; they are considerably slower than cloud APIs. If responses are unbearably slow:
- Try a lower quantization (4-bit instead of 8-bit)
- Reduce the context window size
- Enable compact prompts if you haven't already
Model seems confused or produces errors
Confirm that you have:
- Compact prompts enabled
- KV Cache Quantization disabled (LM Studio)
- Context length set to the maximum
- Enough RAM for your chosen quantization
Getting Started
1. Pick your runtime: LM Studio or Ollama
2. Download Qwen3 Coder 30B at the quantization that fits your RAM
3. Apply the critical settings described above
4. Enable compact prompts in Sypha settings
5. Start coding offline
The Truth About Local Models
Local models have become genuinely useful for coding, but they're not magic. You trade some convenience and speed for privacy and cost savings. Configuration takes care, and performance won't match premium cloud APIs.
But for the first time, you can run a capable coding agent entirely on your laptop. That's a real milestone.
Need Help?
- Join our Discord community
- Visit r/sypha on Reddit
- Consult the LM Studio guide for comprehensive setup
- Review the Ollama guide for alternative setup