Operating Sypha with Local Models

Run Sypha entirely offline with capable models on your own hardware. Eliminate API costs, keep data on your machine, and work without an internet connection.

Local models have reached the point where they are viable for real development work. This guide covers everything you need to run Sypha with a local model.

Getting Started Quickly

  1. Verify your hardware - 32GB of RAM or more
  2. Select your runtime - LM Studio or Ollama
  3. Obtain Qwen3 Coder 30B - Our recommended model
  4. Adjust settings - Activate compact prompts, configure maximum context
  5. Begin coding - Entirely offline

System Requirements

Your available RAM dictates which models can operate effectively:

RAM       Recommended Model   Quantization   Performance Level
32GB      Qwen3 Coder 30B     4-bit          Entry-level local coding
64GB      Qwen3 Coder 30B     8-bit          Full Sypha features
128GB+    GLM-4.5-Air         4-bit          Cloud-competitive performance

Optimal Model Selection

Top Choice: Qwen3 Coder 30B

Following comprehensive testing, Qwen3 Coder 30B proves to be the most dependable model below 70B parameters for Sypha:

  • 256K native context window - Process complete repositories
  • Robust tool-use capabilities - Dependable command execution
  • Repository-scale understanding - Preserves context throughout files
  • Established reliability - Consistent outputs with Sypha's tool format

File sizes:

  • 4-bit: ~17GB (optimal for 32GB RAM)
  • 8-bit: ~32GB (optimal for 64GB RAM)
  • 16-bit: ~60GB (needs 128GB+ RAM)

Why Smaller Models Fall Short

Most models under 30B parameters (7B-20B) don't work well with Sypha. Typical failure modes:

  • Malformed tool-use output
  • Refusing to execute commands
  • Losing track of conversation context
  • Struggling with complex coding tasks

Available Runtime Platforms

LM Studio

  • Advantages: Intuitive GUI, straightforward model management, integrated server
  • Drawbacks: the GUI adds memory overhead, only one model can be loaded at a time
  • Ideal for: Desktop users seeking simplicity
  • Setup Guide →

Ollama

  • Advantages: Terminal-based operation, reduced memory footprint, automation-friendly
  • Drawbacks: Terminal proficiency needed, manual model handling
  • Ideal for: Advanced users and server implementations
  • Setup Guide →

Essential Configuration

Mandatory Settings

Within Sypha:

  • ✅ Activate "Use Compact Prompt" - Decreases prompt size by 90%
  • ✅ Configure appropriate model in settings
  • ✅ Set Base URL to correspond with your server
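
For reference, the Base URL normally points at the local server's default address (assuming you haven't changed the port). Whether Ollama needs the /v1 suffix depends on whether Sypha talks to its OpenAI-compatible endpoint or its native API:

    LM Studio:  http://localhost:1234/v1
    Ollama:     http://localhost:11434  (or http://localhost:11434/v1 for the OpenAI-compatible endpoint)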

Within LM Studio:

  • Context Length: 262144 (upper limit)
  • KV Cache Quantization: OFF (essential for proper operation)
  • Flash Attention: ON (if your hardware permits)
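
If your LM Studio build ships the lms command-line tool, you can also start the local server from a terminal instead of the GUI; this is a convenience sketch, not a required step:

    # start LM Studio's local server (default port 1234)
    lms server start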

Within Ollama:

  • Configure context window: num_ctx 262144
  • Activate flash attention if supported
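
One way to bake these settings into Ollama is a small Modelfile; the model tag below is an assumption, so substitute whatever Qwen3 Coder build you actually pulled:

    # Modelfile: a variant of the base model with a maxed-out context window
    FROM qwen3-coder:30b
    PARAMETER num_ctx 262144

    # build and run the variant
    ollama create qwen3-coder-maxctx -f Modelfile
    ollama run qwen3-coder-maxctx

    # flash attention is toggled on the server process via an environment variable
    OLLAMA_FLASH_ATTENTION=1 ollama serve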

Quantization Explained

Quantization decreases model precision to accommodate consumer hardware:

Type     Size Reduction   Quality   Use Case
4-bit    ~75%             Good      Most coding tasks, limited RAM
8-bit    ~50%             Better    Professional work, more nuance
16-bit   None             Best      Maximum quality, requires high RAM
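
As a rough sanity check on the file sizes listed earlier: 30B parameters at 4-bit is about 30B × 0.5 bytes ≈ 15GB of weights, and real GGUF files come in a little higher (~17GB) once metadata and the layers kept at higher precision are counted. The same arithmetic at 8-bit (1 byte per parameter) gives ~30GB, in line with the ~32GB figure above.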

Available Model Formats

GGUF (Universal)

  • Compatible with all platforms (Windows, Linux, Mac)
  • Wide-ranging quantization options
  • Enhanced tool compatibility
  • Suggested for most users

MLX (Mac only)

  • Tailored for Apple Silicon (M1/M2/M3)
  • Utilizes Metal and AMX acceleration
  • Accelerated inference on Mac
  • Demands macOS 13+

Expected Performance Characteristics

Typical Behavior

  • Model load time: 10-30 seconds at startup
  • Token generation: 5-20 tokens/second on typical hardware
  • Context processing: slower on large codebases
  • Memory usage: roughly matches the quantized model's file size

Optimization Strategies

  1. Utilize compact prompts - Critical for local inference
  2. Constrain context where feasible - Begin with reduced windows
  3. Select appropriate quantization - Balance quality versus speed
  4. Terminate other applications - Release RAM for the model
  5. Employ SSD storage - Accelerated model loading

Application Scenarios Compared

Situations Favoring Local Models

Optimal for:

  • Development environments without connectivity
  • Projects with privacy requirements
  • Learning experiences without API expenses
  • Unrestricted experimentation
  • Isolated network environments
  • Budget-conscious development

Situations Favoring Cloud Models

☁️ Preferable for:

  • Extensive codebases (>256K tokens)
  • Extended refactoring sessions
  • Teams requiring uniform performance
  • Cutting-edge model features
  • Projects with tight deadlines

Problem Resolution

Frequent Issues & Remedies

"Shell integration unavailable"

  • Change to bash in Sypha Settings → Terminal → Default Terminal Profile
  • Addresses 90% of terminal integration issues

"No connection could be made"

  • Confirm server is operational (LM Studio or Ollama)
  • Validate Base URL corresponds to server address
  • Verify no firewall obstruction exists
  • Standard ports: LM Studio (1234), Ollama (11434)
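
A quick way to confirm the server is actually listening (assuming the default ports above):

    # LM Studio: should return a JSON list of available models
    curl http://localhost:1234/v1/models

    # Ollama: should return the models you have pulled
    curl http://localhost:11434/api/tags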

Delayed or partial responses

  • Expected for local models (5-20 tokens/sec standard)
  • Attempt lower quantization (4-bit rather than 8-bit)
  • Activate compact prompts if inactive
  • Decrease context window dimensions

Model disorientation or failures

  • Confirm KV Cache Quantization is OFF (LM Studio)
  • Verify compact prompts are activated
  • Validate context length configured to maximum
  • Check adequate RAM for quantization

Performance Enhancement

For accelerated inference:

  1. Apply 4-bit quantization
  2. Activate Flash Attention
  3. Decrease context window if unnecessary
  4. Terminate non-essential applications
  5. Utilize NVMe SSD for model storage

For enhanced quality:

  1. Apply 8-bit or superior quantization
  2. Expand context window to maximum
  3. Maintain adequate cooling
  4. Dedicate maximum RAM to model

Advanced Setup

Multi-GPU Configuration

With multiple GPUs available, model layers can be distributed:

  • LM Studio: Automated GPU detection
  • Ollama: Configure num_gpu parameter
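
For Ollama, num_gpu sets how many layers are offloaded to the GPU(s). A minimal sketch, with an intentionally high value so as many layers as fit are offloaded (the model tag is an assumption):

    # Modelfile
    FROM qwen3-coder:30b
    PARAMETER num_gpu 999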

Alternative Models

While Qwen3 Coder 30B is the recommended choice, you can experiment with these:

  • DeepSeek Coder V2
  • Codestral 22B
  • StarCoder2 15B

Note: These alternatives may demand extra configuration and testing.
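
If you want to try them through Ollama, the pulls look roughly like this; exact tag names vary, so check your runtime's model library first:

    ollama pull deepseek-coder-v2
    ollama pull codestral
    ollama pull starcoder2:15b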

Conclusion

Local models integrated with Sypha have become truly practical. Though they can't compete with premium cloud APIs in speed, they deliver total privacy, zero expenses, and offline functionality. With appropriate configuration and suitable hardware, Qwen3 Coder 30B handles the majority of coding tasks effectively.

Success hinges on proper setup: sufficient RAM, accurate configuration, and reasonable expectations. Adhere to this guide, and you'll have a competent coding assistant operating completely on your hardware.
