
Operating Local Models with Sypha

Local models have crossed a real threshold: for the first time, Sypha can run entirely offline on genuinely capable models. That means no API costs, no data leaving your machine, and no internet requirement.

Success depends on selecting the appropriate model for your hardware and configuring it correctly.

Essential Information

System Requirements

How much RAM you have determines which models you can run:

RAM Tier   Recommended Model   Quantization   What You Get
32GB       Qwen3 Coder 30B     4-bit          Entry-level local coding
64GB       Qwen3 Coder 30B     8-bit          Full Sypha features
128GB+     GLM-4.5-Air         4-bit          Cloud-competitive performance
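
To check which tier a machine falls into without opening a system monitor, here is a minimal sketch; the recommend helper is hypothetical and simply mirrors the table above (requires the psutil package):

```python
# Minimal sketch: map installed RAM to the tiers in the table above.
# `recommend` is a hypothetical helper, not part of Sypha.
# Requires `pip install psutil`.
import psutil

def recommend(ram_gb: float) -> str:
    if ram_gb >= 128:
        return "GLM-4.5-Air @ 4-bit (cloud-competitive)"
    if ram_gb >= 64:
        return "Qwen3 Coder 30B @ 8-bit (full Sypha features)"
    if ram_gb >= 32:
        return "Qwen3 Coder 30B @ 4-bit (entry-level local coding)"
    return "below the 32GB minimum for reliable local Sypha use"

total_gb = psutil.virtual_memory().total / 1e9
print(f"{total_gb:.0f} GB RAM detected -> {recommend(total_gb)}")
```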

The Proven Model: Qwen3 Coder 30B

After extensive testing, Qwen3 Coder 30B is the only model under 70B parameters that works reliably with Sypha. It offers:

  • 256K native context window
  • Robust tool-use capabilities
  • Repository-scale understanding
  • Dependable command execution

Most smaller models (7B-20B) fail with Sypha: they produce malformed output, refuse to execute commands, or break down on tool calls.

Essential Configuration

Local models need a few specific settings to work well:

For LM Studio:

  1. Context Length: 262,144 (upper limit)
  2. KV Cache Quantization: OFF (essential)
  3. Flash Attention: ON (if supported)

For All Local Models:

  • Activate "Use Compact Prompt" in Sypha settings
  • This decreases prompt size by 90% while preserving core functionality
  • Critical for local inference performance
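
Before pointing Sypha at the server, it is worth confirming the endpoint responds at all. Below is a minimal smoke test, assuming LM Studio's default address (http://localhost:1234/v1) and a placeholder model id; substitute whatever id your server actually reports (requires the openai Python package):

```python
# Smoke test against LM Studio's OpenAI-compatible server. The base URL
# is LM Studio's default; the model id below is an assumption -- use the
# id your server reports (see /v1/models).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

reply = client.chat.completions.create(
    model="qwen3-coder-30b",  # hypothetical id; check your loaded model
    messages=[{"role": "user", "content": "Say OK if you can read this."}],
    max_tokens=8,
)
print(reply.choices[0].message.content)
```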

Understanding Quantization

Quantization decreases model precision to accommodate consumer hardware. Consider it compression:

  • 4-bit: ~75% size reduction. Fully functional for coding tasks.
  • 8-bit: ~50% size reduction. Enhanced quality, more detailed responses.
  • 16-bit: Full precision. Equals cloud APIs but demands 4x the memory.

For Qwen3 Coder 30B:

  • 4-bit: ~17GB download
  • 8-bit: ~32GB download
  • 16-bit: ~60GB download
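
These sizes follow directly from the parameter count: weights take roughly parameters × bits ÷ 8 bytes, and downloads run a bit larger because of embeddings and format overhead. A quick back-of-envelope check:

```python
# Back-of-envelope: weight size is parameters x bits per weight / 8.
# Downloads run slightly larger than this because of embeddings,
# tokenizer files, and format overhead.
PARAMS = 30e9  # Qwen3 Coder 30B

for bits in (4, 8, 16):
    gb = PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit: ~{gb:.0f} GB of weights")

# Prints ~15, ~30, ~60 GB -- in line with the download sizes above.
```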

Model Format Selection

Choose based on your platform:

MLX (Mac only)

  • Optimized for Apple Silicon
  • Uses Metal and AMX acceleration
  • Faster inference on M1/M2/M3 chips

GGUF (Universal)

  • Compatible with Windows, Linux, and Mac
  • Wide-ranging quantization options
  • Broader tool compatibility
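
If you script your setup, the choice reduces to a simple platform check. A trivial sketch (the helper name is illustrative):

```python
# Pick a model format by platform, following the guidance above:
# MLX only benefits Apple Silicon Macs; GGUF runs everywhere.
import platform

def preferred_format() -> str:
    on_apple_silicon = (
        platform.system() == "Darwin" and platform.machine() == "arm64"
    )
    return "MLX" if on_apple_silicon else "GGUF"

print(f"Preferred format for this machine: {preferred_format()}")
```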

Expected Performance Behavior

Local models operate differently from cloud APIs:

Expect:

  • A load delay the first time the model starts (normal, happens once)
  • Slower inference than cloud models
  • Context processing that slows down on large repositories

Don't Expect:

  • Instant, cloud-API response times
  • Unlimited context-processing speed
  • A zero-configuration setup
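
To put "slower" in concrete numbers for your own hardware, time a streamed completion against your local server. A rough sketch, again assuming LM Studio's default endpoint and a placeholder model id (adjust both to your setup; requires the openai package):

```python
# Rough throughput check: stream a completion and count chunks per
# second (one chunk is approximately one token on most servers).
# Endpoint and model id are assumptions -- match them to your setup.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.monotonic()
received = 0
stream = client.chat.completions.create(
    model="qwen3-coder-30b",  # hypothetical id; use your loaded model
    messages=[{"role": "user", "content": "Explain what a mutex is."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        received += 1
elapsed = time.monotonic() - start
print(f"~{received / elapsed:.1f} tokens/sec over {elapsed:.1f}s")
```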

When to Use Local Models

Use local models for:

  • Development environments with unreliable connectivity
  • Privacy-focused projects requiring data containment
  • Budget-conscious development avoiding API charges
  • Learning and experimentation requiring unlimited usage

When to Use Cloud Models

Cloud models still hold the advantage for:

  • Extensive repositories surpassing local context limitations
  • Prolonged refactoring sessions demanding maximum context
  • Teams needing uniform performance across varied hardware
  • Tasks demanding cutting-edge model capabilities

Common Problems

"Shell integration unavailable" or command execution failures

Switch to a simpler shell in Sypha settings: go to Sypha Settings → Terminal → Default Terminal Profile and select "bash". This resolves 90% of terminal integration issues.

"No connection could be made"

Your local server (Ollama or LM Studio) isn't running, or is listening on a different port. Check that:

  • The server is actually running
  • The Base URL in Sypha settings matches your server's address
  • No firewall is blocking the connection
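
A quick way to test all three points at once is to probe the default ports directly (LM Studio: 1234, Ollama: 11434; adjust the URLs if you've customized your setup; requires the requests package):

```python
# Probe the default local-server ports. If your server uses a custom
# port, change the URLs to match the Base URL in Sypha settings.
import requests

endpoints = [
    ("LM Studio", "http://localhost:1234/v1/models"),
    ("Ollama", "http://localhost:11434/api/tags"),
]

for name, url in endpoints:
    try:
        status = requests.get(url, timeout=3).status_code
        print(f"{name}: reachable (HTTP {status})")
    except requests.RequestException:
        print(f"{name}: not reachable -- is the server running on this port?")
```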

Slow or incomplete responses

This is normal for local models; they're considerably slower than cloud APIs. If it's unusably slow:

  • Try a lower quantization (4-bit instead of 8-bit)
  • Reduce the context window size
  • Enable compact prompts if you haven't already

Model seems confused or produces errors

Verify that you have:

  • Compact prompts enabled
  • KV Cache Quantization turned off (LM Studio)
  • Context length set to the maximum
  • Enough RAM for your chosen quantization

Getting Started

  1. Pick your runtime: LM Studio or Ollama
  2. Download Qwen3 Coder 30B at the quantization that fits your RAM (a scripted example follows this list)
  3. Apply the essential settings described above
  4. Enable compact prompts in Sypha settings
  5. Start coding offline
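
If you use Ollama, step 2 can be scripted against its pull API, which streams progress as JSON lines. This is a hedged sketch: the exact model tag below is an assumption, so verify the current name in the Ollama library before running (requires the requests package):

```python
# Sketch of step 2 via Ollama's pull API. The model tag is an
# assumption -- verify the exact name in the Ollama library first.
import json
import requests

resp = requests.post(
    "http://localhost:11434/api/pull",
    json={"model": "qwen3-coder:30b"},  # hypothetical tag; verify first
    stream=True,  # Ollama streams progress as JSON lines
    timeout=None,
)
resp.raise_for_status()
for line in resp.iter_lines():
    if line:
        print(json.loads(line).get("status", ""))
```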

The Truth About Local Models

Local models have become genuinely useful for coding, but they aren't magic. You trade some speed and convenience for privacy and cost savings; configuration takes care, and performance won't match premium cloud APIs.

But for the first time, you can run a capable coding agent entirely on your laptop. That's a real milestone.
