Local Models Overview
Operating Sypha with Local Models
Run Sypha entirely offline using capable models on your own hardware: no API costs, no data leaving your machine, no internet connection required.
Local models have reached the point where they are viable for real development work. This guide covers everything you need to run Sypha with local models.
Getting Started Quickly
- Check your hardware - At least 32GB of RAM is required
- Select your runtime - LM Studio or Ollama
- Obtain Qwen3 Coder 30B - Our recommended model
- Adjust settings - Activate compact prompts, configure maximum context
- Begin coding - Entirely offline
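The steps above, as a minimal end-to-end sketch for the Ollama path. The install URL follows Ollama's published convention, but the model tag is an assumption; check the Ollama model library for the exact name.

```bash
# Minimal quick-start sketch (Ollama path). The model tag below is an
# assumption; confirm the exact name in the Ollama model library.
curl -fsSL https://ollama.com/install.sh | sh   # install Ollama (Linux/macOS)
ollama pull qwen3-coder:30b                     # download the model
ollama serve                                    # start the server on port 11434
# In Sypha: enable "Use Compact Prompt" and set the Base URL to
# http://localhost:11434, then start coding offline.
```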
System Requirements
Your available RAM dictates which models can operate effectively:
| RAM | Recommended Model | Quantization | Performance Level |
|---|---|---|---|
| 32GB | Qwen3 Coder 30B | 4-bit | Entry-level local coding |
| 64GB | Qwen3 Coder 30B | 8-bit | Full Sypha features |
| 128GB+ | GLM-4.5-Air | 4-bit | Cloud-competitive performance |
Optimal Model Selection
Top Choice: Qwen3 Coder 30B
After extensive testing, Qwen3 Coder 30B has proven to be the most dependable model under 70B parameters for use with Sypha:
- 256K native context window - Process complete repositories
- Robust tool-use capabilities - Dependable command execution
- Repository-scale understanding - Preserves context throughout files
- Established reliability - Consistent outputs with Sypha's tool format
File sizes:
- 4-bit: ~17GB (optimal for 32GB RAM)
- 8-bit: ~32GB (optimal for 64GB RAM)
- 16-bit: ~60GB (needs 128GB+ RAM)
Why Smaller Models Fall Short
The majority of models below 30B parameters (7B-20B) don't work well with Sypha due to:
- Malformed tool-use output
- Refusal to execute commands
- Failure to preserve conversation context
- Difficulty with complex coding tasks
Available Runtime Platforms
LM Studio
- Advantages: Intuitive GUI, straightforward model management, integrated server
- Drawbacks: UI memory consumption, single model limitation
- Ideal for: Desktop users seeking simplicity
- Setup Guide →
Ollama
- Advantages: Terminal-based operation, reduced memory footprint, automation-friendly
- Drawbacks: Terminal proficiency needed, manual model handling
- Ideal for: Advanced users and server implementations
- Setup Guide →
Essential Configuration
Mandatory Settings
Within Sypha:
- ✅ Activate "Use Compact Prompt" - Decreases prompt size by 90%
- ✅ Configure appropriate model in settings
- ✅ Set Base URL to correspond with your server
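Before pointing Sypha at the server, it can help to smoke-test the endpoint from a shell. This sketch assumes LM Studio's default OpenAI-compatible server on port 1234; the model name is a placeholder, so use whatever your server actually reports.

```bash
# Smoke test the full inference path (model name is a placeholder).
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-coder-30b",
        "messages": [{"role": "user", "content": "Say hello"}]
      }'
```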
Within LM Studio:
- Context Length: `262144` (upper limit)
- KV Cache Quantization: `OFF` (essential for proper operation)
- Flash Attention: `ON` (if your hardware permits)
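If you prefer to script LM Studio rather than click through the GUI, it also ships an `lms` command-line tool. The exact flags below are assumptions, so verify them with `lms load --help` on your version.

```bash
# Scriptable alternative to the GUI settings above.
# Flag and model names are assumptions; check `lms load --help`.
lms server start                                    # start the local API server
lms load qwen3-coder-30b --context-length 262144    # load with max context
```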
Within Ollama:
- Configure the context window: `num_ctx 262144` (see the sketch below)
- Enable flash attention if supported
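One way to make the `num_ctx` setting persistent is a small Modelfile that derives a new model from the base tag (the tag itself is an assumption; adjust it to whatever you pulled). Flash attention is toggled with an environment variable on the server process.

```bash
# Persist the context window in a derived model (base tag is an assumption).
cat > Modelfile <<'EOF'
FROM qwen3-coder:30b
PARAMETER num_ctx 262144
EOF
ollama create qwen3-coder-ctx -f Modelfile

# Enable flash attention (if your build and hardware support it):
OLLAMA_FLASH_ATTENTION=1 ollama serve
```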
Quantization Explained
Quantization decreases model precision to accommodate consumer hardware:
| Type | Size Reduction | Quality | Use Case |
|---|---|---|---|
| 4-bit | ~75% | Good | Most coding tasks, limited RAM |
| 8-bit | ~50% | Better | Professional work, more nuance |
| 16-bit | None | Best | Maximum quality, requires high RAM |
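As a rough sanity check on these numbers: a 30B-parameter model at 4 bits per weight needs about 30 billion × 0.5 bytes ≈ 15 GB, which lines up with the ~17 GB 4-bit file listed earlier once runtime overhead and context buffers are added.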
Available Model Formats
GGUF (Universal)
- Compatible with all platforms (Windows, Linux, Mac)
- Wide-ranging quantization options
- Enhanced tool compatibility
- Suggested for most users
MLX (Mac only)
- Tailored for Apple Silicon (M1/M2/M3)
- Utilizes Metal and AMX acceleration
- Accelerated inference on Mac
- Requires macOS 13+
Expected Performance Characteristics
Typical Behavior
- Initial load time: 10-30 seconds for model initialization
- Token generation: 5-20 tokens/second on standard hardware
- Context processing: Reduced speed with extensive codebases
- Memory usage: Roughly matches the quantized model file size
Optimization Strategies
- Utilize compact prompts - Critical for local inference
- Constrain context where feasible - Begin with reduced windows
- Select appropriate quantization - Balance quality versus speed
- Terminate other applications - Release RAM for the model
- Employ SSD storage - Accelerated model loading
Application Scenarios Compared
Situations Favoring Local Models
✅ Optimal for:
- Development environments without connectivity
- Projects with privacy requirements
- Learning experiences without API expenses
- Unrestricted experimentation
- Isolated network environments
- Budget-conscious development
Situations Favoring Cloud Models
☁️ Preferable for:
- Extensive codebases (>256K tokens)
- Extended refactoring sessions
- Teams requiring uniform performance
- Cutting-edge model features
- Projects with tight deadlines
Problem Resolution
Frequent Issues & Remedies
"Shell integration unavailable"
- Change to bash in Sypha Settings → Terminal → Default Terminal Profile
- Addresses 90% of terminal integration issues
"No connection could be made"
- Confirm server is operational (LM Studio or Ollama)
- Validate Base URL corresponds to server address
- Verify no firewall obstruction exists
- Standard ports: LM Studio (1234), Ollama (11434)
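A quick way to confirm the server is reachable on those ports is to query each runtime's model-listing endpoint; both paths below are the standard defaults.

```bash
# Connectivity checks against the default ports.
curl http://localhost:1234/v1/models    # LM Studio: lists loaded models
curl http://localhost:11434/api/tags    # Ollama: lists pulled models
# "Connection refused" means the server is not running or is
# listening on a different port.
```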
Delayed or partial responses
- Expected for local models (5-20 tokens/sec standard)
- Attempt lower quantization (4-bit rather than 8-bit)
- Activate compact prompts if inactive
- Decrease context window dimensions
Model confusion or failures
- Confirm KV Cache Quantization is OFF (LM Studio)
- Verify compact prompts are activated
- Confirm context length is set to the maximum
- Check adequate RAM for quantization
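When the model misbehaves, it is also worth confirming what the runtime actually loaded. With Ollama, `ollama show` prints the model's architecture, context length, and quantization (the tag below is an assumption; use the one you pulled).

```bash
# Inspect the loaded model's context length and quantization.
ollama show qwen3-coder:30b
```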
Performance Enhancement
For accelerated inference:
- Apply 4-bit quantization
- Activate Flash Attention
- Decrease context window if unnecessary
- Terminate non-essential applications
- Utilize NVMe SSD for model storage
For enhanced quality:
- Apply 8-bit or superior quantization
- Expand context window to maximum
- Maintain adequate cooling
- Dedicate maximum RAM to model
Advanced Setup
Multi-GPU Configuration
With multiple GPUs available, model layers can be distributed:
- LM Studio: Automated GPU detection
- Ollama: Set the `num_gpu` parameter (see the sketch below)
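As a sketch of how that looks in practice: `num_gpu` controls how many layers Ollama offloads to the GPU. The value below is illustrative, so tune it to your VRAM, and the base tag is an assumption.

```bash
# Offload a fixed number of layers to the GPU (value is illustrative).
cat > Modelfile <<'EOF'
FROM qwen3-coder:30b
PARAMETER num_gpu 32
EOF
ollama create qwen3-coder-gpu -f Modelfile
```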
Alternative Models
While Qwen3 Coder 30B is the recommended choice, you can also experiment with these:
- DeepSeek Coder V2
- Codestral 22B
- StarCoder2 15B
Note: These alternatives may demand extra configuration and testing.
Community Resources & Assistance
- Discord: Join our community for immediate assistance
- Reddit: r/sypha for community discussions
- GitHub: Report issues
Next Steps
Ready to begin? Choose your approach:
LM Studio Setup
User-friendly GUI approach with detailed configuration guide
Ollama Setup
Command-line setup for power users and automation
Conclusion
Local models with Sypha have become genuinely practical. They can't match premium cloud APIs for speed, but they deliver total privacy, zero ongoing cost, and offline operation. With appropriate configuration and suitable hardware, Qwen3 Coder 30B handles the majority of coding tasks effectively.
Success hinges on proper setup: sufficient RAM, correct configuration, and realistic expectations. Follow this guide and you'll have a capable coding assistant running entirely on your own hardware.