- ASR (Qwen3)
- TTS (Qwen3 + CosyVoice, 10 languages)
- Speech-to-Speech (PersonaPlex 7B, full-duplex)
- Speaker Diarization (pyannote + WeSpeaker)
- Voice Activity Detection (Silero, real-time streaming)
- Forced Alignment (word-level timestamps)
No Python, no server, no CoreML — pure Swift through MLX. Models download automatically from HuggingFace on first run. The whole diarization stack is ~32 MB.
Everything is protocol-based and composable — VAD gates ASR, diarization feeds into transcription, embeddings enable speaker verification. Mix and match.
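To give a flavor of what "protocol-based and composable" can look like in practice, here's a minimal Swift sketch. The protocol and type names are illustrative assumptions, not the library's actual API:

```swift
// Hypothetical protocol shapes — illustrative only, not the real API.
protocol VoiceActivityDetector {
    // Returns true if the audio frame contains speech.
    func isSpeech(_ frame: [Float]) -> Bool
}

protocol SpeechRecognizer {
    // Transcribes a buffer of audio samples to text.
    func transcribe(_ audio: [Float]) -> String
}

// Composition: a recognizer that only runs ASR on frames
// the VAD flags as speech, skipping silence entirely.
struct GatedRecognizer: SpeechRecognizer {
    let vad: VoiceActivityDetector
    let asr: SpeechRecognizer

    func transcribe(_ audio: [Float]) -> String {
        vad.isSpeech(audio) ? asr.transcribe(audio) : ""
    }
}
```

Because everything is behind small protocols, swapping a VAD implementation or stacking another stage (say, diarization before transcription) is just another wrapper type.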
Repo: github.com/ivan-digital/qwen3-asr-swift (Apache 2.0)
Blog post with architecture details: blog.ivan.digital
There's a lot of surface area here and contributions are very welcome — whether it's new model ports, iOS integration, performance work, or just filing issues. If you've been wanting to do anything with audio or MLX in Swift, come build with us.