Operating Sypha with Local Models

Run Sypha entirely offline with capable models on your own hardware. Eliminate API costs, keep data on your machine, and work without an internet connection.

Local models have reached the point where they are viable for real development work. This guide covers everything you need to run Sypha with a local model.

Getting Started Quickly

  1. Verify your hardware - 32GB of RAM or more
  2. Select your runtime - LM Studio or Ollama
  3. Obtain Qwen3 Coder 30B - Our recommended model
  4. Adjust settings - Activate compact prompts, configure maximum context
  5. Begin coding - Entirely offline

System Requirements

Your available RAM dictates which models can operate effectively:

RAM       Recommended Model   Quantization   Performance Level
32GB      Qwen3 Coder 30B     4-bit          Entry-level local coding
64GB      Qwen3 Coder 30B     8-bit          Full Sypha features
128GB+    GLM-4.5-Air         4-bit          Cloud-competitive performance

Optimal Model Selection

Top Choice: Qwen3 Coder 30B

Following comprehensive testing, Qwen3 Coder 30B proves to be the most dependable model below 70B parameters for Sypha:

  • 256K native context window - Process complete repositories
  • Robust tool-use capabilities - Dependable command execution
  • Repository-scale understanding - Preserves context throughout files
  • Established reliability - Consistent outputs with Sypha's tool format

File sizes:

  • 4-bit: ~17GB (optimal for 32GB RAM)
  • 8-bit: ~32GB (optimal for 64GB RAM)
  • 16-bit: ~60GB (needs 128GB+ RAM)

Why Smaller Models Fall Short

Most models under 30B parameters (7B-20B) don't work well with Sypha. Typical failure modes:

  • Malformed tool-use output
  • Refusing to execute commands
  • Losing track of conversation context
  • Struggling with complex coding tasks

Available Runtime Platforms

LM Studio

  • Advantages: Intuitive GUI, straightforward model management, integrated server
  • Drawbacks: the GUI adds memory overhead, only one model can be loaded at a time
  • Ideal for: Desktop users seeking simplicity
  • Setup Guide →

Ollama

  • Advantages: Terminal-based operation, reduced memory footprint, automation-friendly
  • Drawbacks: Terminal proficiency needed, manual model handling
  • Ideal for: Advanced users and server implementations
  • Setup Guide →

Essential Configuration

Mandatory Settings

Within Sypha:

  • ✅ Activate "Use Compact Prompt" - Decreases prompt size by 90%
  • ✅ Configure appropriate model in settings
  • ✅ Set Base URL to correspond with your server
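
For reference, the Base URL normally points at the local server's default address (assuming you haven't changed the port). Whether Ollama needs the /v1 suffix depends on whether Sypha talks to its OpenAI-compatible endpoint or its native API:

    LM Studio:  http://localhost:1234/v1
    Ollama:     http://localhost:11434  (or http://localhost:11434/v1 for the OpenAI-compatible endpoint)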

Within LM Studio:

  • Context Length: 262144 (upper limit)
  • KV Cache Quantization: OFF (essential for proper operation)
  • Flash Attention: ON (if your hardware permits)
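
If your LM Studio build ships the lms command-line tool, you can also start the local server from a terminal instead of the GUI; this is a convenience sketch, not a required step:

    # start LM Studio's local server (default port 1234)
    lms server start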

Within Ollama:

  • Configure context window: num_ctx 262144
  • Activate flash attention if supported
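
One way to bake these settings into Ollama is a small Modelfile; the model tag below is an assumption, so substitute whatever Qwen3 Coder build you actually pulled:

    # Modelfile: a variant of the base model with a maxed-out context window
    FROM qwen3-coder:30b
    PARAMETER num_ctx 262144

    # build and run the variant
    ollama create qwen3-coder-maxctx -f Modelfile
    ollama run qwen3-coder-maxctx

    # flash attention is toggled on the server process via an environment variable
    OLLAMA_FLASH_ATTENTION=1 ollama serve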

Quantization Explained

Quantization decreases model precision to accommodate consumer hardware:

Type     Size Reduction   Quality   Use Case
4-bit    ~75%             Good      Most coding tasks, limited RAM
8-bit    ~50%             Better    Professional work, more nuance
16-bit   None             Best      Maximum quality, requires high RAM
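
As a rough sanity check on the file sizes listed earlier: 30B parameters at 4-bit is about 30B × 0.5 bytes ≈ 15GB of weights, and real GGUF files come in a little higher (~17GB) once metadata and the layers kept at higher precision are counted. The same arithmetic at 8-bit (1 byte per parameter) gives ~30GB, in line with the ~32GB figure above.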

Available Model Formats

GGUF (Universal)

  • Compatible with all platforms (Windows, Linux, Mac)
  • Wide-ranging quantization options
  • Enhanced tool compatibility
  • Suggested for most users

MLX (Mac only)

  • Tailored for Apple Silicon (M1/M2/M3)
  • Utilizes Metal and AMX acceleration
  • Accelerated inference on Mac
  • Demands macOS 13+

Expected Performance Characteristics

Typical Behavior

  • Model load time: 10-30 seconds at startup
  • Token generation: 5-20 tokens/second on typical hardware
  • Context processing: slower on large codebases
  • Memory usage: roughly matches the quantized model's file size

Optimization Strategies

  1. Utilize compact prompts - Critical for local inference
  2. Constrain context where feasible - Begin with reduced windows
  3. Select appropriate quantization - Balance quality versus speed
  4. Terminate other applications - Release RAM for the model
  5. Employ SSD storage - Accelerated model loading

Application Scenarios Compared

Situations Favoring Local Models

Optimal for:

  • Development environments without connectivity
  • Projects with privacy requirements
  • Learning experiences without API expenses
  • Unrestricted experimentation
  • Isolated network environments
  • Budget-conscious development

Situations Favoring Cloud Models

☁️ Preferable for:

  • Extensive codebases (>256K tokens)
  • Extended refactoring sessions
  • Teams requiring uniform performance
  • Cutting-edge model features
  • Projects with tight deadlines

Problem Resolution

Frequent Issues & Remedies

"Shell integration unavailable"

  • Change to bash in Sypha Settings → Terminal → Default Terminal Profile
  • Addresses 90% of terminal integration issues

"No connection could be made"

  • Confirm server is operational (LM Studio or Ollama)
  • Validate Base URL corresponds to server address
  • Verify no firewall obstruction exists
  • Standard ports: LM Studio (1234), Ollama (11434)
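
A quick way to confirm the server is actually listening (assuming the default ports above):

    # LM Studio: should return a JSON list of available models
    curl http://localhost:1234/v1/models

    # Ollama: should return the models you have pulled
    curl http://localhost:11434/api/tags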

Delayed or partial responses

  • Expected for local models (5-20 tokens/sec standard)
  • Attempt lower quantization (4-bit rather than 8-bit)
  • Activate compact prompts if inactive
  • Decrease context window dimensions

Model disorientation or failures

  • Confirm KV Cache Quantization is OFF (LM Studio)
  • Verify compact prompts are activated
  • Validate context length configured to maximum
  • Check adequate RAM for quantization

Performance Enhancement

For accelerated inference:

  1. Apply 4-bit quantization
  2. Activate Flash Attention
  3. Decrease context window if unnecessary
  4. Terminate non-essential applications
  5. Utilize NVMe SSD for model storage

For enhanced quality:

  1. Apply 8-bit or superior quantization
  2. Expand context window to maximum
  3. Maintain adequate cooling
  4. Dedicate maximum RAM to model

Advanced Setup

Multi-GPU Configuration

With multiple GPUs available, model layers can be distributed:

  • LM Studio: Automated GPU detection
  • Ollama: Configure num_gpu parameter
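
For Ollama, num_gpu sets how many layers are offloaded to the GPU(s). A minimal sketch, with an intentionally high value so as many layers as fit are offloaded (the model tag is an assumption):

    # Modelfile
    FROM qwen3-coder:30b
    PARAMETER num_gpu 999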

Alternative Models

While Qwen3 Coder 30B is the recommended choice, you can experiment with these:

  • DeepSeek Coder V2
  • Codestral 22B
  • StarCoder2 15B

Note: These alternatives may demand extra configuration and testing.
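
If you want to try them through Ollama, the pulls look roughly like this; exact tag names vary, so check your runtime's model library first:

    ollama pull deepseek-coder-v2
    ollama pull codestral
    ollama pull starcoder2:15b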

Conclusion

Local models integrated with Sypha have become truly practical. Though they can't compete with premium cloud APIs in speed, they deliver total privacy, zero expenses, and offline functionality. With appropriate configuration and suitable hardware, Qwen3 Coder 30B handles the majority of coding tasks effectively.

Success hinges on proper setup: sufficient RAM, accurate configuration, and reasonable expectations. Adhere to this guide, and you'll have a competent coding assistant operating completely on your hardware.
