
My worry is that ASR will end up like OCR. If the multimodal large AI system is good enough (latency-wise), the advantage of domain understanding eats the other technologies alive.

In OCR, even when the characters are poorly scanned, the deep domain understanding these large multimodal AIs have lets them infer what the document actually meant: this field must be the order ID, because in the million invoices seen before, the order ID normally sits below the order date, and so on. My worry is that the same thing is going to happen in ASR.


This is both good and bad. Good ASR can often understand low-quality / garbled speech that I could not figure out, but it also "over-corrects" sometimes and replaces correct but low-prior words with incorrect but much more common ones.

With OCR the risk is you get another Xerox[1] incident where all your data looks plausible but is incorrect. Hope you kept the originals!

(This is why for my personal doc scans, I use OCR only for full text search, but retain the original raw scans forever)

[1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...


This is exactly the case today. Multimodal LLMs like gpt-4o-transcribe are way better than traditional ASR, not only because of deeper understanding but because of the ability to actually prompt it with your company's specific terminology, org chart, etc.

For example, if the prompt includes that Caitlin is an accountant and Kaitlyn is an engineer, if you transcribe "Tell Kaitlyn to review my PR" it will know who you're referring to. That's something WER doesn't really capture.

BTW, I built an open-source Mac tool for using gpt-4o-transcribe with an OpenAI API key and custom prompts: https://github.com/corlinp/voibe
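To make the disambiguation idea concrete, here's a rough sketch of packing a team glossary into a priming prompt. The function name, glossary format, and wording are made up for illustration; each transcription API has its own prompt field, and this just shows the kind of string you'd bias the model with:

```python
# Sketch: turning a team glossary into a transcription prompt.
# The glossary contents and prompt format here are invented for
# illustration; the point is to bias the model toward your org's
# spellings (Caitlin vs. Kaitlyn).

def build_transcription_prompt(glossary: dict[str, str]) -> str:
    """Format name -> role pairs into a short priming prompt."""
    lines = [f"- {name}: {role}" for name, role in glossary.items()]
    return "People who may be mentioned:\n" + "\n".join(lines)

prompt = build_transcription_prompt({
    "Caitlin": "accountant",
    "Kaitlyn": "engineer",
})
print(prompt)
```

You'd then pass a string like this in the prompt field of whichever transcription API you're calling.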


Many ASR models already support prompts/adding your own terminology. This one doesn't, but full LLMs, especially such expensive ones, aren't needed for that.

Why are you 'worried' about it? Shouldn't we strive for better technology even if it means some will 'lose'?

"Better" isn't just about increasing benchmark numbers. Often, it's more important that a system fails safely than how often it fails. Automatic speech recognition that guesses when the input is unclear will occasionally be right and therefore have a lower word error rate, but if it's important that the output be correct, it might be better to insert "[unintelligible]" and have a human double-check.

It's better in terms of WER. It's not better in terms of not making shit up that sounds plausible.

Probably the answer is simply to tweak the metric so it's a bit smarter than WER: allow "unclear" output, which is penalised less than an actually incorrect answer. I'd be surprised if nobody has done that.
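A rough sketch of what such a metric could look like (the "[unclear]" token and the 0.5 penalty are arbitrary choices for illustration, not an established standard):

```python
# WER variant where the model may output "[unclear]" and be penalized
# half as much as an outright wrong word. Standard WER is the edit
# distance between reference and hypothesis divided by reference length;
# here the substitution cost depends on what was substituted.

def soft_wer(reference: list[str], hypothesis: list[str],
             unclear: str = "[unclear]", unclear_cost: float = 0.5) -> float:
    """Edit-distance WER with a reduced cost for honest 'unclear' tokens."""
    m, n = len(reference), len(hypothesis)
    # d[i][j] = min cost to align reference[:i] with hypothesis[:j]
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)                        # deletions
    for j in range(n + 1):
        d[0][j] = float(j)                        # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if reference[i - 1] == hypothesis[j - 1]:
                sub = 0.0                         # exact match
            elif hypothesis[j - 1] == unclear:
                sub = unclear_cost                # honest uncertainty, cheaper
            else:
                sub = 1.0                         # confident but wrong
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n] / max(m, 1)

ref = "tell kaitlyn to review my pr".split()
print(soft_wer(ref, "tell caitlin to review my pr".split()))    # confident wrong guess
print(soft_wer(ref, "tell [unclear] to review my pr".split()))  # honest "unclear"
```

Under this metric, a system that flags what it couldn't hear scores better than one that guesses wrong with equal frequency, which is the failure-mode distinction plain WER can't see.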


Ideally, you'd be able to specify exactly what you want: do you want filled pauses written out ("aaah", "umm")? Do you want a transcription of the disfluencies, restarts, etc., or just a cleaned-up version?

ASR has already proved its usefulness. Dictation tools are a prime example. Ever since Whisper came out, running ASR models locally suddenly became a thing. It opened up so many variants:

https://superwhisper.com

https://carelesswhisper.app

https://macwhisper.com


For quite a long time there will be a greater advantage to local processing for STT than for text-to-text chat, or even OCR. Being able to do STT on the device that owns the microphone means that the bandwidth off that device can be dramatically reduced, if it's even necessary for the task at hand.

This turned out to be a bug. https://x.com/om_patel5/status/2038754906715066444?s=20

One reddit user reverse engineered the binary and found that it was a cache invalidation issue.

They are doing some hidden string replacement if the Claude Code conversation talks about billing or tokens. It looks like that invalidates the cache at that point.

If that string appears anywhere in the conversation history, I think the starting text is replaced and your entire cache rebuilds from scratch.

So, nothing devious, just a bug.
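A toy model of why an early rewrite would nuke the whole cache (the chunking and hashing here are invented; this is not how Anthropic's cache actually works, just the prefix-matching idea):

```python
# Toy illustration of prefix caching: each chunk's cache key covers
# everything before it, so editing text early in the conversation
# changes every downstream key and forces a full rebuild.
# (Chunk boundaries and hashing scheme are made up for illustration.)

import hashlib

def prefix_keys(chunks: list[str]) -> list[str]:
    """Cache key per chunk = hash of the conversation up to and including it."""
    keys, running = [], hashlib.sha256()
    for chunk in chunks:
        running.update(chunk.encode())
        keys.append(running.copy().hexdigest()[:12])
    return keys

before = ["system prompt", "turn 1", "turn 2", "turn 3"]
after = ["system prompt (rewritten)", "turn 1", "turn 2", "turn 3"]

hits = sum(a == b for a, b in zip(prefix_keys(before), prefix_keys(after)))
print(f"cache hits after editing the first chunk: {hits} of {len(before)}")
```

Even though three of the four chunks are byte-identical, none of them hit the cache, because every key depends on the mutated prefix.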


I'm not sure this is the issue. I asked Claude Code a simple question yesterday. No sub agents. No web fetches. Relatively small context. Outside of peak hours. Burned 8% of my Max 5x 5hr usage limit. I've never seen anything like this before, even when the cache is cold.

> BUG 2: every time you use --resume, your entire conversation cache rebuilds from scratch. one resume on a large conversation costs $0.15 that should cost near zero.

I use it with an API key, so I can use /cost. When I did a resume, it showed the cost from what I thought was the first go. I don't think it's clear what the difference is between API key and subscription, but am I to believe that simply resuming cost me $5? The UI really makes it look like that was the original $5.


You have to actually send something

Nothing devious, but is Anthropic crediting users? In a sense, this is _like_ stealing from your customer, if they paid for something they never got.

Not seeing any quota returned on my Pro account. My weekly usage went up to 20% in about one hour yesterday before I panicked and stopped the task. It was outside of the prime hours too which are supposed to run up your quota at a slower rate.

Outside of prime hours is the normal rate. Prime hours are at a faster rate, as of about two weeks ago.

Your linked bug is a cherry-pick of the worst-case scenario for the first request after a resume.

While it should be fixed, this isn't the same usage issue everyone is complaining about.


That bug would only affect a conversation where that magic string is mentioned, which shouldn't be common.

I guess so, but for people working on the billing section of a project, or even people who include things like "add billing capability" in their CLAUDE.md, it might be an issue, I think.

Anecdotally when Claude was error 500'ing a few days ago, its retries would never succeed, but cancelling and retrying manually worked most of the time.

It looks like that is a summary and a screenshot of https://old.reddit.com/r/ClaudeAI/comments/1s7mkn3/psa_claud... ?


[flagged]


Whoa. Is Claude coming in here and generating responses about itself?

https://stopsloppypasta.ai/en/


Yep, I was going to say: this is just bad design. This kind of approach is inherently fragile; by mixing things together you are unavoidably destroying information in some sense.

Defaulting to "latest" should be caught by every static code scanner. How many times has this issue been raised?

Seriously? Don't they want their system to succeed? I can't think of a better way of alienating the target customer than this.

The best part was Doom running over the AT Protocol. Jetstream is a bit patchy, but running Doom - I would never have thought it possible.

Have you read Mike Masnick's piece? https://www.techdirt.com/2026/03/25/ai-might-be-our-best-sho...

It actually argues the complete opposite, and I liked that quite a bit: that AI allows us to get the open web back, in a way.


This has been my issue for a long time. AI CANNOT ever act as an emotional crutch. This is something companies develop for engagement, and I believe it is actively harmful in the long run.
