georgemandis's comments

>I do not agree that slice() should operate on extended grapheme clusters. Don’t lump the grapheme cluster/scalar value split in with the sins of UTF-16 and its unreliable code point/code unit split.

Yeah, I think that's fair. I didn't really think this through as I was writing it.

I'm not even so sure "ending up with nonsense" here is the worst outcome. It might be unavoidable with this approach and if that had been the only problem this bug might have been less memorable.

The real problem, which I didn't articulate or emphasize particularly well, was that these invalid surrogate pairs were getting passed into `encodeURIComponent` somewhere deep in the stack, and it was choking catastrophically on them. That was the "real" bug at the end of the day, but the invalid surrogate pairs and the way they were getting created along the way were a fun journey to untangle.
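For anyone who wants to see the failure mode in isolation, here is a minimal sketch. It is not the code from our stack: `truncateByCodePoints` and `encodes` are hypothetical helpers made up for illustration. It shows how a code-unit `slice()` can split a pair, and how slicing by code point avoids that particular problem.

```javascript
// "\uD83E\uDD20" is the cowboy emoji: one code point, two UTF-16 code units.
const text = "hi\uD83E\uDD20";

// slice() counts code units, so this cuts the pair in half
// and leaves a lone high surrogate at the end.
const broken = text.slice(0, 3); // "hi\uD83E"

// Hypothetical helper: truncate by code points instead of code units.
function truncateByCodePoints(str, maxCodePoints) {
  // Array.from iterates a string by code point, so pairs stay together.
  return Array.from(str).slice(0, maxCodePoints).join("");
}

const intact = truncateByCodePoints(text, 3); // "hi" plus the whole emoji

// Hypothetical helper: reports whether encodeURIComponent accepts the string.
function encodes(str) {
  try {
    encodeURIComponent(str);
    return true;
  } catch (err) {
    return false; // a URIError is thrown for lone surrogates
  }
}

console.log(encodes(broken), encodes(intact));
```

Note this only keeps surrogate pairs together. It can still split a multi-code-point grapheme cluster, which is the separate question raised in the quote above.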


You are reminding me we also circled an issue at one point where a backend system in Python needed to agree with the client (JavaScript) on the character count of a piece of content. Another place Intl.Segmenter would've helped.

If I'm remembering correctly, we briefly explored a solution where we told Python "This is a UTF-16LE encoded string" so the count would match, but I think we learned/realized the endianness is actually dictated by the client's machine (Going from memory here). Ultimately we just changed the solution so the client was the source of truth about lengths and counts.
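To make the mismatch concrete, here is a rough sketch of the three "lengths" a client can report for the same string. `countLengths` is a hypothetical helper, not something we shipped. The code point count is the one that lines up with `len()` on a Python 3 `str`, while `.length` counts UTF-16 code units.

```javascript
// Hypothetical helper: three different answers to "how long is this string?"
function countLengths(str) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return {
    // UTF-16 code units: what String.prototype.length reports.
    codeUnits: str.length,
    // Code points: Array.from iterates by code point.
    codePoints: Array.from(str).length,
    // Extended grapheme clusters: closest to what a user perceives as characters.
    graphemes: Array.from(segmenter.segment(str)).length,
  };
}

console.log(countLengths("\uD83E\uDD20")); // cowboy emoji
console.log(countLengths("abc"));
```

For plain ASCII all three agree, which is part of why this kind of bug hides for so long.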

These threads are surfacing all kinds of things I forgot about and didn't add in that blog post. Maybe I need to write another, haha.


The language handled it fine. It will generally just show replacement characters (�) for combos that don't map to anything.

It was really `encodeURIComponent` that didn't handle it gracefully.

If you just type this into the console (surrogate pair for cowboy smiley face emoji), you see it encodes it ("%F0%9F%A4%A0"):

encodeURIComponent("\uD83E\uDD20")

If you give it an invalid surrogate pair, it will throw an actual error:

encodeURIComponent("\uDD20\uD83E")
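If you'd rather not have that throw deep in a stack, one option is to replace lone surrogates with U+FFFD before encoding. This is a sketch under that assumption, not what we actually did: `safeEncodeURIComponent` is a hypothetical name. Newer runtimes also have `String.prototype.toWellFormed()`, which does the same replacement; the regex here is a fallback for environments without it.

```javascript
// Matches a high surrogate not followed by a low one,
// or a low surrogate not preceded by a high one.
// No "u" flag on purpose, so the pattern works on individual code units.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;

// Hypothetical helper: never throws on lone surrogates.
function safeEncodeURIComponent(str) {
  // Swap each lone surrogate for the replacement character U+FFFD.
  const wellFormed = str.replace(LONE_SURROGATE, "\uFFFD");
  return encodeURIComponent(wellFormed);
}

console.log(safeEncodeURIComponent("\uD83E\uDD20")); // valid pair, encoded as usual
console.log(safeEncodeURIComponent("\uDD20\uD83E")); // reversed pair, two replacement characters
```

Whether silently substituting U+FFFD is acceptable depends on the data. For anything where the original bytes matter, failing loudly may be the better choice.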


No, the language did not handle it fine. It allowed an invalid Unicode string to exist. This is basically a UTF-16 affliction—nothing that does UTF-16 validates, whereas almost everything that does UTF-8 does validate. encodeURIComponent deals with UTF-8, so of course it throws.


I'm realizing `encodeURIComponent` is actually part of the ECMA spec! I thought it was something provided by the browser like `window` or `document`. I withdraw my "the language handled it fine" comment, haha.

Before I'd looked that up I was going to say: I feel like "don't allow an invalid Unicode string to exist at all" is a separate/bigger problem from "handling it fine" when they do get created. To the extent I can hand JavaScript an invalid combination of code units in a variety of other scenarios, returning a � felt fine.

e.g.

// valid
String.fromCodePoint(0xd83e, 0xdd20)

// invalid, but "�" is ... fine?
String.fromCodePoint(0xdd20, 0xd83e)


In Rust, an invalid Unicode string simply cannot exist (* unless you use unsafe, but all bets are off then). An important part of this is that the code unit, the scalar value and the string are three different types (u8, char, str). Iteration must decide if it wants to go by code unit or by scalar value (… or by extended grapheme cluster, but that’s not provided in std).

JavaScript’s problems start with not having separate code unit or scalar value types. Sequences of UTF-16 code units, individual UTF-16 code units and scalar values all use the type string. (Code unit and scalar value also both use number in some contexts.)
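A small sketch of that overlap, using hypothetical helpers `codeUnits` and `codePoints`: the same string can be walked by code unit or by code point, and in both cases the pieces come back as plain strings or plain numbers, with nothing in the type to say which one you have.

```javascript
// Hypothetical helper: walk the string by UTF-16 code unit.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i)); // a number
  }
  return units;
}

// Hypothetical helper: walk the string by code point.
// Caveat: a lone surrogate is yielded as-is, so these are code points,
// not guaranteed scalar values.
function codePoints(str) {
  const points = [];
  for (const ch of str) {
    points.push(ch.codePointAt(0)); // also a number
  }
  return points;
}

const cowboy = "\uD83E\uDD20";
console.log(codeUnits(cowboy).map((n) => n.toString(16)));
console.log(codePoints(cowboy).map((n) => n.toString(16)));
console.log(typeof cowboy[0], typeof cowboy); // both are just "string"
```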

The first step to fixing JavaScript’s bad semantics would be separating the code unit and scalar value types. If you did that… the changes required to support strict strings are perhaps surprisingly small. Even migrating to UTF-8 semantics is not very hard then.

Unfortunately, JavaScript seems very determined to do stupid things and allow stupid things and then do more stupid things with the stupid things it foolishly allowed.


My recollection (that I didn't add to the story): I don't think Intl.Segmenter had great browser support then (2022). Even if it had, it still wasn't a quick/obvious fix for our problem, given where it was occurring in our stack. But I do remember looking at it then.


Just noticed this is getting some traffic! It's a little buried in the post, but I made an interactive tool for exploring surrogate pairs as part of this:

- https://george.mand.is/invalid-surrogate-pairs/

I thought it was something that's easier to play with and feel than necessarily just read about.


Definitely in the same spirit!

Clearly the next thing we need to test is removing all the vowels from words, or something like that :)


I had this same thought and won't pretend my fear was rational, haha.

One thing that I thought was fairly clear in my write-up but feels a little lost in the comments: I didn't just try this with whisper. I tried it with their newer gpt-4o-transcription model, which seems considerably faster. There's no way to run that one locally.


I kind of want to take a more proper poke at this but focus more on summarization accuracy over word-for-word accuracy, though I see the value in both.

I'm actually curious, if I run transcriptions back-to-back-to-back on the exact same audio, how much variance should I expect?

Maybe I'll try three approaches:

- A straight diff comparison (I know a lot of people are calling for this, but I really think this is less useful than it sounds)

- A "variance within the model" test running it multiple times against the same audio, tracking how much it varies between runs

- An LLM analysis assessing if the primary points from a talk were captured and summarized at 1x, 2x, 3x, 4x runs (I think this is far more useful and interesting)
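For the first two approaches, one common way to put a number on it is word error rate: word-level edit distance divided by the reference length. The sketch below is a generic version of that metric, not something from the post. `words`, `wordEditDistance`, and `wordErrorRate` are hypothetical helpers, and the tokenizer is a crude English-only assumption.

```javascript
// Crude tokenizer (assumption: English text; lowercases and drops punctuation).
function words(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Levenshtein distance over word arrays, keeping one row at a time.
function wordEditDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution or match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Word error rate of `hypothesis` relative to `reference`.
function wordErrorRate(reference, hypothesis) {
  const ref = words(reference);
  const hyp = words(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;
  return wordEditDistance(ref, hyp) / ref.length;
}

console.log(wordErrorRate("the cat sat", "the dog sat"));
```

Running this between two transcripts of the same audio at 1x would give a baseline for run-to-run variance. Comparing 1x against 2x or 3x would then show how much of the difference goes beyond that baseline.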


Hahaha. Okay, okay... I will watch it now ;)

(Thanks for your good sense of humor)


I like that your post deliberately gets to the point first and then (optionally) expands later, I think it's a good and generally underutilized format. I often advise people to structure their emails in the same way, e.g. first just cutting to the chase with the specific ask, then giving more context optionally below.

It's not my intention to bloat information or delivery but I also don't super know how to follow this format especially in this kind of talk. Because it's not so much about relaying specific information (like your final script here), but more as a collection of prompts back to the audience as things to think about.

My companion tweet to this video on X had a brief TLDR/Summary included where I tried, but I didn't super think it was very reflective of the talk, it was more about topics covered.

Anyway, I am overall a big fan of doing more compute at the "creation time" to compress other people's time during "consumption time" and I think it's the respectful and kind thing to do.


I watched your talk. There are so many more interesting ideas in there that resonated with me that the summary (unsurprisingly) skipped over. I'm glad I watched it!

LLMs as the operating system, the way you interface with vibe-coding (smaller chunks) and the idea that maybe we haven't found the "GUI for AI" yet are all things I've pondered and discussed with people. You articulated them well.

I think some formats, like a talk, don't lend themselves easily to meaningful summaries. It's about giving the audience things to think about, to your point. The whole of the storytelling is more than the sum of its parts, which is why we still do it.

My post is, at the end of the day, really more about a neat trick to optimize transcriptions. This particular video might be a great example of why you may not always want to do that :)

Anyway, thanks for the time and thanks for the talk!


> I often advise people to structure their emails [..]

I frequently do the same, and eventually someone sent me this HBR article summarizing the concept nicely as "bottom line up front". It's a good primer for those interested.

https://hbr.org/2016/11/how-to-write-email-with-military-pre...


Interesting! At $0.02 to $0.04 an hour I don't suspect you've been hunting for optimizations, but I wonder if this "speed up the audio" trick would save you even more.

> We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube

Doesn't YouTube do this for you automatically these days within a day or so?


> Doesn't YouTube do this for you automatically these days within a day or so?

Oh yeah, we do a check first and use youtube-transcript-api if there's an automatic one available:

https://github.com/jdepoix/youtube-transcript-api

The tool usually detects them within like ~5 mins of being uploaded though, so usually none are available yet. Then it'll send the summaries to our internal Slack channel for our editors, in case there's anything interesting to 'follow up on' from the meeting.

Probably would be a good idea to add a delay to it and wait for the automatic ones though :)


> I wonder if this "speed up the audio" trick would save you even more.

At this point you'll need to at least check how much running ffmpeg costs. Probably less than $0.01 per hour of audio (approximate savings) but still.


> Doesn't YouTube do this for you automatically these days within a day or so?

Last time I checked, I think the Google auto-captions were noticeably worse quality than whisper, but maybe that has changed.

