This is a great point. We completely agree that high-quality results are essential for adoption; they're basically table stakes for any tool like this to be useful. We've had several versions of this tool that weren't quite "good enough" and never saw any real use. Our latest version seems to be the first to meet the quality threshold for actual work use.
Our method of evaluating quality is not super systematic right now. For this competitive landscape task, we have a "test suite" of ~10 companies, and for each one a tiered set of competitors that should be surfaced: "must-include", "should-include", and "could-include". We run these companies through our tool and others, then look at precision and recall against those competitor sets.
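Concretely, scoring a single test case looks roughly like this. This is just a minimal sketch with made-up company names and a hypothetical score_case helper, treating each tier as a plain set; the must_recall metric is an illustrative extra, not something we necessarily report:

```python
# Sketch: score one company's surfaced competitor list against the tiered expectations.
def score_case(surfaced, must, should, could):
    surfaced = set(surfaced)
    expected = must | should | could          # anything in any tier counts as relevant
    hits = surfaced & expected

    precision = len(hits) / len(surfaced) if surfaced else 0.0
    recall = len(hits) / len(expected) if expected else 0.0
    must_recall = len(surfaced & must) / len(must) if must else 1.0  # the tier that matters most

    return {"precision": precision, "recall": recall, "must_recall": must_recall}

# Example for one (fictional) company in the suite
print(score_case(
    surfaced={"Acme Analytics", "Globex", "Initech", "Totally Irrelevant Co"},
    must={"Acme Analytics", "Globex"},
    should={"Initech"},
    could={"Umbrella Insights"},
))
```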
In terms of errors, our results are a little noisy right now, since we're biased towards being exhaustive rather than selective. There are obviously irrelevant companies in the results that no human would ever have included. Users can fairly easily filter these out by reading the one-sentence overviews of the companies, but it's still not a great UX. We're actively working on this.
I wonder if it's more about convincing yourself that it faithfully follows the same workflow an analyst would follow. It's always possible to miss stuff, so the best a person or a machine can do is be demonstrably methodical, it sounds like... and that is easier to test. Unless there really is some magic tacit step that human analysts perform to get better answers.
Hahaha, that reminds me of Erlich tripping in the desert trying to come up with one-liners for Pied Piper. Yeah, we definitely know our results aren't perfect. We're going for being as exhaustive as possible right now, so results can be noisy. And yes, slide generation is on the very near-term (next few days) roadmap :)