Working with embeddings (RAG, semantic search, clustering, recommendations, etc.) usually means:
- Generate embeddings
- Compute cosine similarity
- Run retrieval
- Hope it "works"
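That baseline loop can be sketched in a few lines of numpy (the corpus and query vectors here are toy values, not real embeddings):

```python
import numpy as np

def cosine_sim(query, docs):
    # Cosine similarity between one query vector and each row of a matrix.
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return d @ q

# Toy "embeddings": 4 documents and a query in a 3-dim space.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])

scores = cosine_sim(query, docs)
top_k = np.argsort(scores)[::-1][:2]  # retrieve the 2 nearest documents
print(top_k)  # → [0 1]
```

Everything downstream (the "hope it works" step) depends on whether those similarity scores actually reflect semantics.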
But then I hit a wall: I couldn't tell why my RAG responses felt off, why retrieval quality was inconsistent, or why clustering results looked weird.
Debugging embeddings was painful.
To solve this, we built an embedding-evaluation CLI tool that audits embedding spaces rather than just generating them.
Instead of guessing whether your vectors make sense, it:
- Detects semantic outliers
- Identifies cluster inconsistencies
- Flags global embedding collapse
- Highlights ambiguous boundary tokens
- Generates heatmaps and cluster visualizations
- Produces structured reports (JSON / Markdown)
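As a rough illustration (not the tool's actual implementation), two of those checks — global collapse and semantic outliers — can be approximated with plain numpy:

```python
import numpy as np

def mean_pairwise_cosine(X):
    # High mean pairwise similarity suggests the space has "collapsed":
    # every vector embeds close to every other vector.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    n = len(X)
    return (sims.sum() - n) / (n * (n - 1))  # exclude the diagonal

def outlier_scores(X):
    # Distance of each vector to the centroid; large values flag outliers.
    centroid = X.mean(axis=0)
    return np.linalg.norm(X - centroid, axis=1)

rng = np.random.default_rng(0)
healthy = rng.normal(size=(100, 64))                          # spread-out embeddings
collapsed = rng.normal(size=(1, 64)) + 0.01 * rng.normal(size=(100, 64))

print(mean_pairwise_cosine(healthy) < 0.1)    # True: low average similarity
print(mean_pairwise_cosine(collapsed) > 0.9)  # True: near-duplicate vectors
```

A real audit layers more signal on top (cluster assignments, boundary tokens, visualizations), but these two statistics already catch the most common failure modes.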
We just released an open-source project for agent-swarm-based stock trading simulations, built and tested by heyneo.so, a fully autonomous ML engineering agent.
The agents coordinate with each other over an asynchronous message bus.
There are 10 agents with distinct roles:
- 3 Analyst Agents → Generate BUY/SELL signals (SMA crossovers, volume trends)
- 4 Trader Agents → Execute trades, manage $250K portfolios each
- 2 Risk Managers → Validate orders, enforce stop-loss rules
- 1 Reporter Agent → Aggregate P&L and generate reports
The simulation bakes capital allocation, risk checks (stop-losses, order blocking), and reporting into the flow. It backtests over ~250 trading days, starts with $1M in fixed capital, and logs metrics like drawdown, blocked orders, and approval rates.
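The SMA-crossover signal used by the analyst agents is a standard technique; here is a minimal sketch (window sizes and prices are illustrative, not the project's configuration):

```python
def sma(prices, window):
    # Simple moving average over the trailing `window` prices.
    return [
        sum(prices[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(prices))
    ]

def crossover_signal(prices, fast=3, slow=5):
    # BUY when the fast SMA crosses above the slow SMA, SELL on the reverse.
    fast_sma = sma(prices, fast)[slow - fast:]  # align both series
    slow_sma = sma(prices, slow)
    signals = []  # indices are relative to the aligned series
    for i in range(1, len(slow_sma)):
        if fast_sma[i] > slow_sma[i] and fast_sma[i - 1] <= slow_sma[i - 1]:
            signals.append((i, "BUY"))
        elif fast_sma[i] < slow_sma[i] and fast_sma[i - 1] >= slow_sma[i - 1]:
            signals.append((i, "SELL"))
    return signals

prices = [10, 9, 8, 7, 8, 9, 11, 13, 12, 10, 8, 7]
print(crossover_signal(prices))  # → [(2, 'BUY'), (6, 'SELL')]
```

In the full system these raw signals would pass through the risk managers (stop-loss checks, order blocking) before any trader agent executes them.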
This project implements Quantization-Aware Training (QAT) for MobileNetV2, enabling deployment on resource-constrained edge devices. Built autonomously by [NEO](https://heyneo.so), the system achieves exceptional model compression while maintaining high accuracy.
Solution Highlights:
- 9.08x Model Compression: 23.5 MB → 2.6 MB (far exceeding the 4x target)
- 77.2% Test Accuracy: only a 3.8% drop from baseline
- Full INT8 Quantization: All weights, activations, and operations
- Edge-Ready: TensorFlow Lite format optimized for deployment
- Single-Command Pipeline: End-to-end automation
Training can also be performed on newer datasets.
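For intuition on where the compression comes from, here is a simplified numpy sketch of symmetric INT8 weight quantization — not the project's actual TFLite pipeline, just the core idea of mapping float32 weights onto 8-bit integers with a per-tensor scale:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric INT8 quantization (commonly used for weights):
    # map [-max|w|, +max|w|] onto [-127, 127] with a single scale.
    scale = float(np.abs(w).max()) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Toy "weight tensor" standing in for a conv layer's weights.
w = np.random.default_rng(1).normal(scale=0.1, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

print(w.nbytes / q.nbytes)   # 4.0: float32 (4 bytes) -> int8 (1 byte)
err = float(np.abs(w - dequantize(q, scale)).max())
print(err < scale)           # True: rounding error under one quantization step
```

The dtype change alone buys 4x; the post's 9.08x figure presumably also reflects serialization and graph optimizations in the TFLite export. QAT's contribution is training the network with this rounding simulated in the forward pass, so accuracy survives the conversion.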
Founder here. I built NEO, an AI agent designed specifically for AI and ML engineering workflows, after repeatedly hitting the same wall with existing tools: they work for short, linear tasks, but fall apart once workflows become long-running, stateful, and feedback-driven.
In real ML work, you don’t just generate code and move on. You explore data, train models, evaluate results, adjust assumptions, rerun experiments, compare metrics, generate artifacts, and iterate; often over hours or days.
Most modern coding agents already go beyond single prompts. They can plan steps, write files, run commands, and react to errors. Where things still break down is when ML workflows become long-running and feedback-heavy. Training jobs, evaluations, retries, metric comparisons, and partial failures are still treated as ephemeral side effects rather than durable state.
Once a workflow spans hours, multiple experiments, or iterative evaluation, you either babysit the agent or restart large parts of the process. Feedback exists, but it is not something the system can reliably resume from.
NEO tries to model ML work the way it actually happens.
It is an AI agent that executes end-to-end ML workflows, not just code generation. Work is broken into explicit execution steps with state, checkpoints, and intermediate results. Feedback from metrics, evaluations, or failures feeds directly into the next step instead of forcing a full restart. You can pause a run, inspect what happened, tweak assumptions, and resume from where it left off.
Here's an example for reference: you might ask NEO to explore a dataset, train a few baseline models, compare their performance, and generate plots and a short report. NEO will load the data, run EDA, train models, evaluate them, notice if something underperforms or fails, adjust, and continue. If training takes an hour and one model crashes at 45 minutes, you do not start over. NEO inspects the failure, fixes it, and resumes.
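NEO's internals aren't public in this post, but the resume-from-checkpoint pattern it describes can be sketched generically (the step names and JSON checkpoint file below are illustrative, not NEO's actual format):

```python
import json
import os

CHECKPOINT = "run_state.json"

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"completed": [], "results": {}}

def save_state(state):
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

def run_pipeline(steps):
    # Each step runs at most once; a crash mid-pipeline loses only the
    # current step, and the next invocation resumes after the last checkpoint.
    state = load_state()
    for name, fn in steps:
        if name in state["completed"]:
            continue  # already done in a previous run
        state["results"][name] = fn(state["results"])
        state["completed"].append(name)
        save_state(state)  # checkpoint after every step
    return state["results"]

steps = [
    ("load_data", lambda r: [1, 2, 3, 4]),
    ("train", lambda r: sum(r["load_data"])),
    ("report", lambda r: f"score={r['train']}"),
]
print(run_pipeline(steps))
```

Re-running the script after a failure skips everything in `completed` and picks up at the first unfinished step — the property that makes hour-long, feedback-driven runs tolerable.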
Here's a free Colab notebook to fine-tune the Gemma 2 2B-it model on MonsterAPI.ai.
To be clear, the heavy lifting happens in a proprietary API, but the idea of this Colab notebook is to make it practical to run hundreds of experiments: the API handles each experiment as requested, while the notebook becomes a single session for trying different models, hyperparameters, and datasets, running evals on all of them, and ending with one final deployment.
Gemma 2 2B from @Google @GoogleDeepMind is quite an interesting model: at only 2 billion parameters, it manages to beat a variant of GPT-3.5 Turbo. That's no small feat. And if a base model is good at the majority of tasks, fine-tuning it for a domain-specific use case provides a massive boost.
And with Monster Tuner, fine-tuning LLMs like Gemma 2 2B takes just a couple of clicks; the tuner comes auto-integrated with optimized frameworks like Unsloth, SDPA, and FA 2 to speed up token processing.
This notebook covers a complete workflow: it formats the provided dataset to match the EOS-token formatting the Gemma 2 2B IT model expects (which yields better fine-tuning results), and includes a token-distribution chart to give better insight into the data.
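For reference, Gemma 2's instruction-tuned models use their own turn markers plus an explicit EOS token; a formatting function along these lines (the field names and the exact helper are illustrative, not the notebook's code) is what such preprocessing typically does:

```python
def format_example(prompt, response, eos="<eos>"):
    # Gemma 2 IT models expect <start_of_turn>/<end_of_turn> markers around
    # each turn; appending EOS teaches the model where a completion ends.
    return (
        f"<start_of_turn>user\n{prompt}<end_of_turn>\n"
        f"<start_of_turn>model\n{response}<end_of_turn>{eos}"
    )

print(format_example("What is 2+2?", "4"))
```

Skipping the EOS token is a classic fine-tuning bug: the model never learns to stop generating.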
We have developed an exhaustive prompt-by-prompt comparison case study between Llama 405B and GPT-4o.
Overall, GPT-4o's answers seemed more acceptable for the use case, but 405B was not far behind; in many areas it even outperformed GPT-4o while staying context-relevant.
I have repeatedly asked this question in the chat opened by clicking "Try 405B model" on the Meta AI website, and it always tells me it has 70 billion parameters.
If that's the case, it might be giving out wrong information while we assume we're still talking to the bigger model.
Not sure if it's just me or if everyone is getting this?
Check out the tool and feel free to share your feedback: https://github.com/dakshjain-1616/Embedding-Evaluator
This is especially useful for:
- RAG pipelines
- Vector DB systems
- Semantic search products
- Embedding model comparisons
- Fine-tuning experiments
It surfaces structural problems in the geometry of your embeddings before they break your system downstream.