
Operating Local Models with Sypha

Local models have crossed a real threshold: for the first time, Sypha can run entirely offline on genuinely capable models. That means no API costs, no data leaving your machine, and no internet requirement.

Success depends on selecting the appropriate model for your hardware and configuring it correctly.

Essential Information

System Requirements

How much RAM you have determines which models you can run:

RAM Tier   Recommended Model   Quantization   What You Get
32GB       Qwen3 Coder 30B     4-bit          Entry-level local coding
64GB       Qwen3 Coder 30B     8-bit          Full Sypha features
128GB+     GLM-4.5-Air         4-bit          Cloud-competitive performance
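
To check which tier a machine falls into without opening a system monitor, here is a minimal sketch; the recommend helper is hypothetical and simply mirrors the table above (requires the psutil package):

```python
# Minimal sketch: map installed RAM to the tiers in the table above.
# `recommend` is a hypothetical helper, not part of Sypha.
# Requires `pip install psutil`.
import psutil

def recommend(ram_gb: float) -> str:
    if ram_gb >= 128:
        return "GLM-4.5-Air @ 4-bit (cloud-competitive)"
    if ram_gb >= 64:
        return "Qwen3 Coder 30B @ 8-bit (full Sypha features)"
    if ram_gb >= 32:
        return "Qwen3 Coder 30B @ 4-bit (entry-level local coding)"
    return "below the 32GB minimum for reliable local Sypha use"

total_gb = psutil.virtual_memory().total / 1e9
print(f"{total_gb:.0f} GB RAM detected -> {recommend(total_gb)}")
```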

The Proven Model: Qwen3 Coder 30B

After extensive testing, Qwen3 Coder 30B is the only model under 70B parameters that works reliably with Sypha. It offers:

  • 256K native context window
  • Robust tool-use capabilities
  • Repository-scale understanding
  • Dependable command execution

Most smaller models (7B-20B) fail with Sypha: they produce malformed output, refuse to execute commands, or break down on tool calls.

Essential Configuration

Local models need a few specific settings to work well:

For LM Studio:

  1. Context Length: 262,144 (upper limit)
  2. KV Cache Quantization: OFF (essential)
  3. Flash Attention: ON (if supported)

For All Local Models:

  • Activate "Use Compact Prompt" in Sypha settings
  • This decreases prompt size by 90% while preserving core functionality
  • Critical for local inference performance
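
Before pointing Sypha at the server, it is worth confirming the endpoint responds at all. Below is a minimal smoke test, assuming LM Studio's default address (http://localhost:1234/v1) and a placeholder model id; substitute whatever id your server actually reports (requires the openai Python package):

```python
# Smoke test against LM Studio's OpenAI-compatible server. The base URL
# is LM Studio's default; the model id below is an assumption -- use the
# id your server reports (see /v1/models).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

reply = client.chat.completions.create(
    model="qwen3-coder-30b",  # hypothetical id; check your loaded model
    messages=[{"role": "user", "content": "Say OK if you can read this."}],
    max_tokens=8,
)
print(reply.choices[0].message.content)
```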

Understanding Quantization

Quantization decreases model precision to accommodate consumer hardware. Consider it compression:

  • 4-bit: ~75% size reduction. Fully functional for coding tasks.
  • 8-bit: ~50% size reduction. Enhanced quality, more detailed responses.
  • 16-bit: Full precision. Equals cloud APIs but demands 4x the memory.

For Qwen3 Coder 30B:

  • 4-bit: ~17GB download
  • 8-bit: ~32GB download
  • 16-bit: ~60GB download
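
These sizes follow directly from the parameter count: weights take roughly parameters × bits ÷ 8 bytes, and downloads run a bit larger because of embeddings and format overhead. A quick back-of-envelope check:

```python
# Back-of-envelope: weight size is parameters x bits per weight / 8.
# Downloads run slightly larger than this because of embeddings,
# tokenizer files, and format overhead.
PARAMS = 30e9  # Qwen3 Coder 30B

for bits in (4, 8, 16):
    gb = PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit: ~{gb:.0f} GB of weights")

# Prints ~15, ~30, ~60 GB -- in line with the download sizes above.
```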

Model Format Selection

Choose based on your platform:

MLX (Mac only)

  • Optimized for Apple Silicon
  • Uses Metal and AMX acceleration
  • Faster inference on M1/M2/M3 chips

GGUF (Universal)

  • Compatible with Windows, Linux, and Mac
  • Wide-ranging quantization options
  • Broader tool compatibility
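
If you script your setup, the choice reduces to a simple platform check. A trivial sketch (the helper name is illustrative):

```python
# Pick a model format by platform, following the guidance above:
# MLX only benefits Apple Silicon Macs; GGUF runs everywhere.
import platform

def preferred_format() -> str:
    on_apple_silicon = (
        platform.system() == "Darwin" and platform.machine() == "arm64"
    )
    return "MLX" if on_apple_silicon else "GGUF"

print(f"Preferred format for this machine: {preferred_format()}")
```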

Expected Performance Behavior

Local models operate differently from cloud APIs:

Expect:

  • A load delay the first time the model starts (normal, happens once)
  • Slower inference than cloud models
  • Context processing that slows down on large repositories

Don't Expect:

  • Instant, cloud-API response times
  • Unlimited context-processing speed
  • A zero-configuration setup
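
To put "slower" in concrete numbers for your own hardware, time a streamed completion against your local server. A rough sketch, again assuming LM Studio's default endpoint and a placeholder model id (adjust both to your setup; requires the openai package):

```python
# Rough throughput check: stream a completion and count chunks per
# second (one chunk is approximately one token on most servers).
# Endpoint and model id are assumptions -- match them to your setup.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.monotonic()
received = 0
stream = client.chat.completions.create(
    model="qwen3-coder-30b",  # hypothetical id; use your loaded model
    messages=[{"role": "user", "content": "Explain what a mutex is."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        received += 1
elapsed = time.monotonic() - start
print(f"~{received / elapsed:.1f} tokens/sec over {elapsed:.1f}s")
```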

When to Use Local Models

Use local models for:

  • Development environments with unreliable connectivity
  • Privacy-focused projects requiring data containment
  • Budget-conscious development avoiding API charges
  • Learning and experimentation requiring unlimited usage

When to Use Cloud Models

Cloud models still hold the advantage for:

  • Extensive repositories surpassing local context limitations
  • Prolonged refactoring sessions demanding maximum context
  • Teams needing uniform performance across varied hardware
  • Tasks demanding cutting-edge model capabilities

Common Problems

"Shell integration unavailable" or command execution failures

Switch to a simpler shell in Sypha settings: go to Sypha Settings → Terminal → Default Terminal Profile and select "bash". This resolves 90% of terminal integration issues.

"No connection could be made"

Your local server (Ollama or LM Studio) isn't running, or is listening on a different port. Check that:

  • The server is actually running
  • The Base URL in Sypha settings matches your server's address
  • No firewall is blocking the connection
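
A quick way to test all three points at once is to probe the default ports directly (LM Studio: 1234, Ollama: 11434; adjust the URLs if you've customized your setup; requires the requests package):

```python
# Probe the default local-server ports. If your server uses a custom
# port, change the URLs to match the Base URL in Sypha settings.
import requests

endpoints = [
    ("LM Studio", "http://localhost:1234/v1/models"),
    ("Ollama", "http://localhost:11434/api/tags"),
]

for name, url in endpoints:
    try:
        status = requests.get(url, timeout=3).status_code
        print(f"{name}: reachable (HTTP {status})")
    except requests.RequestException:
        print(f"{name}: not reachable -- is the server running on this port?")
```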

Slow or incomplete responses

This is normal for local models; they're considerably slower than cloud APIs. If it's unusably slow:

  • Try a lower quantization (4-bit instead of 8-bit)
  • Reduce the context window size
  • Enable compact prompts if you haven't already

Model seems confused or produces errors

Verify that you have:

  • Compact prompts enabled
  • KV Cache Quantization turned off (LM Studio)
  • Context length set to the maximum
  • Enough RAM for your chosen quantization

Getting Started

  1. Pick your runtime: LM Studio or Ollama
  2. Download Qwen3 Coder 30B at the quantization that fits your RAM (a scripted example follows this list)
  3. Apply the essential settings described above
  4. Enable compact prompts in Sypha settings
  5. Start coding offline
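
If you use Ollama, step 2 can be scripted against its pull API, which streams progress as JSON lines. This is a hedged sketch: the exact model tag below is an assumption, so verify the current name in the Ollama library before running (requires the requests package):

```python
# Sketch of step 2 via Ollama's pull API. The model tag is an
# assumption -- verify the exact name in the Ollama library first.
import json
import requests

resp = requests.post(
    "http://localhost:11434/api/pull",
    json={"model": "qwen3-coder:30b"},  # hypothetical tag; verify first
    stream=True,  # Ollama streams progress as JSON lines
    timeout=None,
)
resp.raise_for_status()
for line in resp.iter_lines():
    if line:
        print(json.loads(line).get("status", ""))
```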

The Truth About Local Models

Local models have become genuinely useful for coding, but they aren't magic. You trade some speed and convenience for privacy and cost savings; configuration takes care, and performance won't match premium cloud APIs.

But for the first time, you can run a capable coding agent entirely on your laptop. That's a real milestone.
