Hands-Free, Privacy-First Dev Tooling

VibeType: Building a Local-First Voice Coding Companion

VibeType is a Windows-focused desktop assistant that keeps every syllable on-device while translating your voice into code, commands, and real-time status updates. Below is the full story: how it works, why privacy matters, and what comes next.

Origin Story: From Dictation Fatigue to Flow State

VibeType was born out of frustration. Constant context switching between IDEs, browsers, terminals, design files, and ticket queues made “flow” feel mythical. Dictation tools helped, but every option streamed audio to a vendor cloud and returned plain text with little control. We wanted:

  • Local-first processing so source code and recordings never leave the machine by default.
  • Programmable AI behaviors that respond differently when you say “summarize this diff” versus “draft a terminal command.”
  • Hands-free ergonomics for accessibility, injury recovery, or multitasking without reaching for a keyboard.

VibeType’s guiding principle is simple: give developers a voice-first assistant that is as trustworthy as their favorite text editor.

System Architecture Deep Dive

The app is divided into four major layers that can be mixed and matched:

  1. Capture & Transcription
    Global hotkeys wake the microphone, stream audio through low-latency capture buffers, and pipe it to Whisper (or any local engine you choose). Automatic language detection keeps multilingual sessions coherent.
  2. AI Toolkit
    An Ollama-backed brain executes prompt presets like Assistant, Corrector, Summarizer, or Command Runner. External providers (OpenAI, Cohere, Anthropic) can be toggled per profile, but they are always opt-in.
  3. Action Layer
    Results can be injected straight into the focused window, pushed onto the clipboard, broadcast as webhooks, or queued for speech. Spoken “thinking” filler phrases keep users informed while LLM calls finish.
  4. Feedback & Monitoring
    A central speech queue feeds multi-engine TTS outputs, while the performance monitor tracks latency, queue depth, and provider health. Logs stream to rotating files and the GUI overlay for instant triage.

Because each layer is modular, you can run full local stacks, hybrid setups, or remote-only TTS without rewriting the app.
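
To make the modularity concrete, here is a conceptual sketch of the four layers as swappable callables. The class and function names are hypothetical illustrations, not VibeType’s actual code:

# Conceptual sketch: each layer is a callable you can swap independently.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Pipeline:
    capture: Callable[[], bytes]        # 1. hotkey-gated audio capture
    transcribe: Callable[[bytes], str]  # 1. local STT (e.g., Whisper)
    think: Callable[[str], str]         # 2. AI toolkit (prompt preset)
    act: Callable[[str], None]          # 3. inject / clipboard / webhook
    report: Callable[[str], None]       # 4. TTS queue + monitoring

    def run_once(self) -> None:
        text = self.transcribe(self.capture())
        result = self.think(text)
        self.act(result)
        self.report(result)

# Stub wiring: replace any layer without touching the rest.
demo = Pipeline(
    capture=lambda: b"...",             # pretend audio buffer
    transcribe=lambda audio: "summarize this diff",
    think=lambda text: f"[assistant] {text}",
    act=print,                          # stand-in for window injection
    report=lambda msg: None,            # stand-in for speech queue
)
demo.run_once()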

Hands-Free Workflows in the Wild

Here are real workflows power users rely on daily:

Rapid Dictation
Hold the dictation hotkey, speak code or prose, and let VibeType insert it into VS Code, Figma, Confluence, or any focused window.
Clipboard Alchemy
Copy a gnarly stack trace, hit the clipboard hotkey, and receive a concise summary or fix suggestions—spoken aloud and pasted where you need them.
Contextual AI Modes
Switch modes mid-sentence. “Assistant” explains the bug, “Summarizer” produces a TL;DR, and “Corrector” polishes an email reply, all without leaving the microphone session.
Automation Hooks
Profiles, presets, and webhooks let you trigger builds, open pull requests, or update tickets verbally (a minimal receiver sketch follows this list).
Status Overlay
A subtle overlay shows which provider is active, queue length, and whether the MCP server is healthy—perfect for streaming or screen recordings.
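
Because results can be broadcast as webhooks, a tiny local listener is enough to close the loop from spoken command to action. A minimal sketch, assuming a JSON payload with a "command" field; the shape is illustrative, not VibeType’s documented schema:

# Hypothetical local webhook receiver: turns a broadcast VibeType result
# into an action. The "command" field is an assumed payload shape.
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class BuildHook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        if event.get("command") == "run_build":           # assumed field
            subprocess.Popen(["python", "-m", "pytest"])  # example action
        self.send_response(204)
        self.end_headers()

HTTPServer(("127.0.0.1", 8787), BuildHook).serve_forever()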

TTS Innovation: Kokoro, Piper, and ZipVoice

Speech feedback is core to the experience, so VibeType ships with a deep bench of engines:

  • Kokoro TTS — Neural, multi-language voices with automatic language detection. The Misaki G2P stack guarantees stable phonemes for Japanese, English, and beyond.
  • Piper — Lightweight neural voices that run on modest GPUs/CPUs. Ideal for rapid confirmations or long narrations.
  • Windows SAPI & OpenAI — Included for compatibility and quick onboarding when you want native system voices or hosted quality.
  • ZipVoice Cloning — Point at a sample clip, spin up a PyTorch or ONNX pipeline, and speak through that persona instantly. When ZipVoice is active, the MCP layer enforces the curated list of precomputed embeddings so agents can only request supported voices.
  • Voice Blending — Blend up to five Kokoro voices via weight sliders to create custom timbres for narration, assistants, or accessibility use cases.
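
Blending itself is conceptually a weighted average of voice embeddings. A minimal sketch, assuming Kokoro voices load as same-shaped arrays; the file names and .npy format here are hypothetical:

# Weighted voice blending: the weights mirror the GUI sliders.
import numpy as np

def blend_voices(voices: dict[str, float]) -> np.ndarray:
    total = sum(voices.values())
    parts = [np.load(path) * (w / total) for path, w in voices.items()]
    return np.sum(parts, axis=0)

# e.g., 70% of one timbre, 30% of another (hypothetical voice files)
custom = blend_voices({"voices/af_heart.npy": 0.7, "voices/am_adam.npy": 0.3})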

All engines feed the same queue, so overlapping playback never happens, and the GUI gives you one-click interrupt/fallback controls.

MCP Server & Agent Automation

VibeType exposes a lightweight MCP implementation over both HTTP and stdio transports. Highlights:

  • GET /health and GET /metadata for readiness probes and endpoint discovery.
  • POST / (JSON-RPC) for full MCP clients—no redirects, no surprises.
  • POST /speak and /speak_batch to enqueue text for sequential playback.
  • POST /phonemes to access Kokoro’s multilingual phonemizer (useful for lip-sync, SRT files, or animation).
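
A quick smoke test of these endpoints needs nothing beyond Python’s standard library; the /speak payload matches the quickstart example further down:

# Minimal HTTP client for the MCP endpoints listed above.
import json
import urllib.request

BASE = "http://127.0.0.1:9032"

def post(path: str, payload: dict) -> str:
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

print(urllib.request.urlopen(BASE + "/health").read().decode())  # readiness
print(post("/speak", {"text": "Hello from VibeType"}))  # enqueue playback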

The Settings → MCP tab provides Start/Stop/Restart buttons, live log streaming, a test speak button, an auto-start toggle, and ping helpers. Under the hood, manager threads catch log callback exceptions so the server never “fails silently” again.

Privacy, Security, and Reliability

Privacy is not a feature—it’s the default posture:

  • All processing is local unless you explicitly enable an external provider.
  • API keys live in encrypted config storage; nothing is hardcoded in source.
  • Microphone, clipboard, webhook, and network access are user-controlled toggles with clear indicators.
  • Rotating log files (3 × 5 MB) plus optional UI surfacing via configure_logging() keep auditing lightweight (sketched after this list).
  • ZipVoice and other cloning engines respect opt-in voice lists, ensuring embeddings stay private.
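
As a sketch of the rotating setup, assuming configure_logging() wraps the standard library’s RotatingFileHandler (the real implementation may differ):

# Sketch: three rotating files of ~5 MB each, as described above.
import logging
import os
from logging.handlers import RotatingFileHandler

def configure_logging(path: str = "logs/vibetype.log") -> None:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    handler = RotatingFileHandler(path, maxBytes=5 * 1024 * 1024, backupCount=3)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.INFO)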

Performance & Monitoring

The performance monitor records latency per subsystem (capture, LLM, TTS), queue depths, and provider uptime. Metrics surface inside the tray menu and overlay so you instantly know whether Whisper, Ollama, or a TTS engine is the current bottleneck. When something stalls, MCP logs, status toasts, and optional webhooks alert you immediately.
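
A hypothetical sketch of the idea behind per-subsystem timing (not the monitor’s actual structure): wrap each stage in a timing context manager that feeds a bounded history.

# Record recent latencies per subsystem with a bounded deque.
import time
from collections import defaultdict, deque
from contextlib import contextmanager

latencies = defaultdict(lambda: deque(maxlen=100))

@contextmanager
def timed(subsystem: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies[subsystem].append(time.perf_counter() - start)

with timed("tts"):
    time.sleep(0.05)  # stand-in for a synthesis call
print(f"tts avg: {sum(latencies['tts']) / len(latencies['tts']):.3f}s")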

Developer Experience & Customization

Everything is scriptable:

  1. Config Files: JSON files under config/ let you pin providers, tweak TTS weights, and add webhook targets.
  2. Prompt Profiles: Define new AI personas in the Settings UI or by editing the stored prompts dictionary.
  3. Hotkeys: Every action gets its own keybind, including dictation, clipboard processing, AI speak, status overlay, and MCP tests.
  4. Extensible Providers: Add your own AI/TTS provider classes under core/ or piper_tts/ and register them without touching the GUI (a registration sketch follows this list).
  5. MCP & Tests: New regression tests (e.g., tests/test_mcp_batch.py) make it easy to validate automation endpoints before shipping.
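
One way such provider registration could look, as a hypothetical pattern; the actual base class and registry in core/ may differ:

# Hypothetical plug-in registry for TTS providers.
from abc import ABC, abstractmethod

PROVIDERS = {}

class TTSProvider(ABC):
    @abstractmethod
    def speak(self, text: str) -> None: ...

def register(name: str):
    def wrap(cls):
        PROVIDERS[name] = cls
        return cls
    return wrap

@register("echo")
class EchoProvider(TTSProvider):
    def speak(self, text: str) -> None:
        print(f"[echo-tts] {text}")

PROVIDERS["echo"]().speak("registered without touching the GUI")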

Roadmap Highlights

  • Resizable settings window with persistent layout state.
  • Graceful TTS fallback with automatic provider failover.
  • Task runner + file watcher so you can say “run unit tests” or “format project” hands-free.
  • Auto-updater plus portable builds for labs and classrooms.
  • MCP metadata expansions that advertise active ZipVoice embeddings, hardware hints, and agent-safe voice lists.

Quickstart Checklist

git clone https://github.com/NemesisGuy/Vibe-Type.git
cd Vibe-Type
python -m venv .venv
.\.venv\Scripts\activate
pip install -r requirements.txt
python VibeType.py
# Open Settings → configure AI/TTS/MCP → press dictation hotkey

Need MCP access? Start the server from the GUI or run python MCP/vibetts_mcp_server.py, then try:

Invoke-RestMethod http://127.0.0.1:9032/health
Invoke-RestMethod http://127.0.0.1:9032/metadata
Invoke-RestMethod http://127.0.0.1:9032/speak -Method POST `
  -ContentType 'application/json' -Body '{"text":"Hello from VibeType"}'

Call to Action

Whether you are recovering from RSI, building an on-stage demo, or simply craving a calmer workflow, VibeType gives you a private, programmable voice companion. Clone the repo, fire up the MCP server, and let your voice drive development without sacrificing privacy.

Questions or ideas? Open an issue, drop by docs/FEATURES.md, or explore the MCP examples under docs/. We can’t wait to see what you build with hands-free control.