Hands-Free, Privacy-First Dev Tooling
VibeType: Building a Local-First Voice Coding Companion
VibeType is a Windows-focused desktop assistant that keeps every syllable on-device while translating your voice into code, commands, and real-time status updates. Below is the full story—how it works, why privacy matters, and what comes next.
Origin Story: From Dictation Fatigue to Flow State
VibeType was born out of frustration. Constant context switching between IDEs, browsers, terminals, design files, and ticket queues made “flow” feel mythical. Dictation tools helped, but every option streamed audio to a vendor cloud and returned plain text with little control. We wanted:
- Local-first processing so source code and recordings never leave the machine by default.
- Programmable AI behaviors that respond differently when you say “summarize this diff” versus “draft a terminal command.”
- Hands-free ergonomics for accessibility, injury recovery, or multitasking without reaching for a keyboard.
VibeType’s guiding principle is simple: give developers a voice-first assistant that is as trustworthy as their favorite text editor.
System Architecture Deep Dive
The app is divided into four major layers that can be mixed and matched:
- Capture & Transcription
  Global hotkeys wake the microphone, stream audio through low-latency capture buffers, and pipe it to Whisper (or any local engine you choose). Automatic language detection keeps multilingual sessions coherent.
- AI Toolkit
  An Ollama-backed brain executes prompt presets like Assistant, Corrector, Summarizer, or Command Runner. External providers (OpenAI, Cohere, Anthropic) can be toggled per profile, but they are always opt-in.
- Action Layer
  Results can be injected straight into the focused window, pushed onto the clipboard, broadcast as webhooks, or queued for speech. Thinking fillers keep users informed while LLM calls finish.
- Feedback & Monitoring
  A central speech queue feeds multi-engine TTS outputs, while the performance monitor tracks latency, queue depth, and provider health. Logs stream to rotating files and the GUI overlay for instant triage.
Because each layer is modular, you can run full local stacks, hybrid setups, or remote-only TTS without rewriting the app.
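To make the layering concrete, here is a minimal sketch of how the four layers could be composed in Python. The names (Pipeline, transcribe, run_preset, dispatch) are illustrative assumptions, not the actual identifiers in the VibeType codebase:

from dataclasses import dataclass
from typing import Callable

# Hypothetical layer interfaces; the real VibeType classes differ.
@dataclass
class Pipeline:
    transcribe: Callable[[bytes], str]     # Capture & Transcription layer
    run_preset: Callable[[str, str], str]  # AI Toolkit layer (preset name, text)
    dispatch: Callable[[str], None]        # Action Layer (inject/clipboard/webhook/TTS)

    def on_hotkey(self, audio: bytes, preset: str = "Assistant") -> None:
        text = self.transcribe(audio)           # local by default
        result = self.run_preset(preset, text)  # e.g., "Summarizer", "Corrector"
        self.dispatch(result)

# Swapping a layer is just passing a different callable: a fully local
# stack, a hybrid setup, or remote-only TTS, with no pipeline rewrite.
pipeline = Pipeline(
    transcribe=lambda audio: "fix the null check in parser.py",  # stand-in for Whisper
    run_preset=lambda preset, text: f"[{preset}] {text}",        # stand-in for Ollama
    dispatch=print,                                              # stand-in for injection
)
pipeline.on_hotkey(b"", preset="Summarizer")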
Hands-Free Workflows in the Wild
Here are real workflows power users rely on daily:
- Rapid Dictation
- Hold the dictation hotkey, speak code or prose, and let VibeType insert it into VS Code, Figma, Confluence, or any focused window.
- Clipboard Alchemy
- Copy a gnarly stack trace, hit the clipboard hotkey, and receive a concise summary or fix suggestions—spoken aloud and pasted where you need them.
- Contextual AI Modes
- Switch modes mid-sentence. “Assistant” explains the bug, “Summarizer” produces a TL;DR, and “Corrector” writes an email response, all without leaving the microphone session.
- Automation Hooks
- Profiles, presets, and webhooks let you trigger builds, open pull requests, or update tickets verbally (a minimal receiver sketch follows this list).
- Status Overlay
- A subtle overlay shows which provider is active, queue length, and whether the MCP server is healthy—perfect for streaming or screen recordings.
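To give the automation hooks some shape, here is a minimal webhook receiver. The payload layout ({"text": ...}) and the port are assumptions for illustration, not VibeType's documented schema:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class VibeTypeHook(BaseHTTPRequestHandler):
    def do_POST(self):
        # Payload shape is an assumption; check your webhook settings.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        if "run unit tests" in payload.get("text", "").lower():
            print("triggering test run...")  # swap in subprocess.run(["pytest"])
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8808), VibeTypeHook).serve_forever()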
TTS Innovation: Kokoro, Piper, and ZipVoice
Speech feedback is core to the experience, so VibeType ships with a deep bench of engines:
- Kokoro TTS — Neural, multi-language voices with automatic language detection. The Misaki G2P stack guarantees stable phonemes for Japanese, English, and beyond.
- Piper — Lightweight neural voices that run on modest GPUs/CPUs. Ideal for rapid confirmations or long narrations.
- Windows SAPI & OpenAI — Included for compatibility and quick onboarding when you want native system voices or hosted quality.
- ZipVoice Cloning — Point at a sample clip, spin up a PyTorch or ONNX pipeline, and speak through that persona instantly. When ZipVoice is active, the MCP layer enforces the curated list of precomputed embeddings so agents can only request supported voices.
- Voice Blending — Blend up to five Kokoro voices via weight sliders to create custom timbres for narration, assistants, or accessibility use cases.
All engines feed the same queue, so overlapping playback never happens, and the GUI gives you one-click interrupt/fallback controls.
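Voice blending boils down to a weighted combination of voice embeddings. A minimal NumPy sketch, assuming each Kokoro voice is stored as a single embedding vector (the real tensor format may differ):

import numpy as np

def blend_voices(voices: dict[str, np.ndarray], weights: dict[str, float]) -> np.ndarray:
    # Normalize weights so sliders never have to sum to exactly 1.0.
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    blended = np.zeros_like(next(iter(voices.values())), dtype=np.float32)
    for name, weight in weights.items():
        blended += (weight / total) * voices[name].astype(np.float32)
    return blended

# Example: a 70/30 blend of two hypothetical voices.
voices = {"voice_a": np.random.rand(256), "voice_b": np.random.rand(256)}
custom_timbre = blend_voices(voices, {"voice_a": 0.7, "voice_b": 0.3})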
MCP Server & Agent Automation
VibeType exposes a lightweight MCP implementation for HTTP and stdio transports. Highlights:
- GET /health and GET /metadata for readiness probes and endpoint discovery.
- POST / (JSON-RPC) for full MCP clients—no redirects, no surprises.
- POST /speak and /speak_batch to enqueue text for sequential playback.
- POST /phonemes to access Kokoro’s multilingual phonemizer (useful for lip-sync, SRT files, or animation).
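As a quick smoke test of those endpoints from Python, something like the following should work, assuming the third-party requests package and the default port 9032 used in the quickstart below:

import requests

BASE = "http://127.0.0.1:9032"

print(requests.get(f"{BASE}/health").status_code)  # 200 means the server is ready
print(requests.get(f"{BASE}/metadata").json())     # discover available endpoints

# Enqueue text for playback; same payload as the quickstart example below.
requests.post(f"{BASE}/speak", json={"text": "Build finished."})

# /speak_batch accepts multiple items for sequential playback; see
# GET /metadata for its exact payload schema.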
The Settings → MCP tab provides Start/Stop/Restart buttons, live log streaming, a test speak button, an auto-start toggle, and ping helpers. Under the hood, manager threads catch log callback exceptions so the server never “fails silently” again.
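That “fails silently” fix is worth a note: the core trick is simply wrapping user-supplied log callbacks defensively, so an exception in the GUI can never kill a server thread. In sketch form (simplified, not the actual code):

import logging

def safe_emit(callback, message: str) -> None:
    try:
        callback(message)
    except Exception:
        # The callback failed, but the server thread keeps running.
        logging.exception("log callback raised; continuing")

def broken_callback(msg: str) -> None:
    raise RuntimeError("GUI closed mid-stream")

safe_emit(broken_callback, "server started")  # logged, never fatal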
Privacy, Security, and Reliability
Privacy is not a feature—it’s the default posture:
- All processing is local unless you explicitly enable an external provider.
- API keys live in encrypted config storage; nothing is hardcoded in source.
- Microphone, clipboard, webhook, and network access are user-controlled toggles with clear indicators.
- Rotating log files (3 × 5 MB) plus optional UI surfacing via configure_logging() keep auditing lightweight (a sketch follows this list).
- ZipVoice and other cloning engines respect opt-in voice lists, ensuring embeddings stay private.
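That 3 × 5 MB rotation policy maps directly onto Python’s stdlib handler. A minimal sketch of what a configure_logging() helper along these lines could do (the real signature in the codebase may differ):

import logging
from logging.handlers import RotatingFileHandler

def configure_logging(path: str = "vibetype.log") -> None:
    # At most 3 files on disk (active log + 2 backups), 5 MB each.
    handler = RotatingFileHandler(path, maxBytes=5 * 1024 * 1024, backupCount=2)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.INFO)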
Performance & Monitoring
The performance monitor records latency per subsystem (capture, LLM, TTS), queue depths, and provider uptime. Metrics surface inside the tray menu and overlay so you instantly know whether Whisper, Ollama, or a TTS engine is the current bottleneck. When something stalls, MCP logs, status toasts, and optional webhooks alert you immediately.
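Per-subsystem latency tracking of this kind can be as small as a timing context manager. An illustrative sketch (not the project’s actual monitor, which also tracks queue depth and provider uptime):

import time
from collections import defaultdict
from contextlib import contextmanager

class PerfMonitor:
    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def track(self, subsystem: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples[subsystem].append(time.perf_counter() - start)

    def bottleneck(self) -> str:
        # The subsystem with the highest average latency.
        return max(self.samples, key=lambda s: sum(self.samples[s]) / len(self.samples[s]))

monitor = PerfMonitor()
with monitor.track("tts"):
    time.sleep(0.05)  # stand-in for a TTS call
with monitor.track("llm"):
    time.sleep(0.10)  # stand-in for an LLM call
print(monitor.bottleneck())  # -> "llm"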
Developer Experience & Customization
Everything is scriptable:
- Config Files: JSON files under config/ let you pin providers, tweak TTS weights, and add webhook targets.
- Prompt Profiles: Define new AI personas in the Settings UI or by editing the stored prompts dictionary.
- Hotkeys: Every action gets its own keybind, including dictation, clipboard processing, AI speak, status overlay, and MCP tests.
- Extensible Providers: Add your own AI/TTS provider classes under core/ or piper_tts/ and register them without touching the GUI (sketched after this list).
- MCP & Tests: New regression tests (e.g., tests/test_mcp_batch.py) make it easy to validate automation endpoints before shipping.
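To show what the provider extension point might look like, here is a hypothetical skeleton. The base class name and registration hook are assumptions; check the existing classes under core/ and piper_tts/ for the real interface:

from abc import ABC, abstractmethod

class TTSProvider(ABC):  # hypothetical base class name
    name: str

    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Return raw audio for the given text."""

PROVIDERS: dict[str, TTSProvider] = {}

def register(provider: TTSProvider) -> None:
    PROVIDERS[provider.name] = provider

class EchoProvider(TTSProvider):
    name = "echo"

    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")  # stand-in for a real engine call

register(EchoProvider())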
Roadmap Highlights
- Resizable settings window with persistent layout state.
- Graceful TTS fallback with automatic provider failover.
- Task runner + file watcher so you can say “run unit tests” or “format project” hands-free.
- Auto-updater plus portable builds for labs and classrooms.
- MCP metadata expansions that advertise active ZipVoice embeddings, hardware hints, and agent-safe voice lists.
Quickstart Checklist
git clone https://github.com/NemesisGuy/Vibe-Type.git
cd Vibe-Type
python -m venv .venv
.\.venv\Scripts\activate
pip install -r requirements.txt
python VibeType.py
# Open Settings → configure AI/TTS/MCP → press dictation hotkey
Need MCP access? Start the server from the GUI or run python MCP/vibetts_mcp_server.py, then try:
Invoke-RestMethod http://127.0.0.1:9032/health
Invoke-RestMethod http://127.0.0.1:9032/metadata
Invoke-RestMethod http://127.0.0.1:9032/speak -Method POST `
  -ContentType 'application/json' -Body '{"text":"Hello from VibeType"}'
Call to Action
Whether you are recovering from RSI, building an on-stage demo, or simply craving a calmer workflow, VibeType gives you a private, programmable voice companion. Clone the repo, fire up the MCP server, and let your voice drive development without sacrificing privacy.
Questions or ideas? Open an issue, drop by docs/FEATURES.md, or explore the MCP examples under docs/. We can’t wait to see what you build with hands-free control.