Hacker News | sid-the-kid's comments

Never heard of InWorld. Pretty impressive.


ooof. You saw the Chinese text. Yup, that's super annoying. We are trying to squash that hallucination.

Thanks for the feedback! That's helpful!


The Chinese text happened last night in your main chat agent widget: the cartoon woman professing to be in a town in Brazil with a lemon tree on her cupboard. She claimed it was a test of subtitling, then admitted it wasn't.

BTW, she gives helpful instructions like "/imagine" whatever, but the instructions only seem to work about 50% of the time. Meaning, try the same command or variants a few times, and it works about half of them. She never did shift out of the Aussie accent, though.

She came up with a remarkably fanciful explanation of why, as a Brazilian, she sounded Aussie, and why imagining a native accent like she said would work didn't...

I was shocked when "/imagine face left turn to the side" did actually work: the agent went into side profile, precisely as natural as the original front-facing avatar.

All in all, by far the best agent experience I've played with!


So glad you enjoyed it! We've been able to significantly reduce those text hallucinations with a few tricks, but it seems they haven't been fully squashed. The /imagine command only works with the image at the moment, but we'll think about ways to tie that into the personality and voice. Thanks for the feedback!


Thank you! We are considering releasing an open-source version of the model. Somebody will do it soon; it might as well be us. We are mostly concerned with the additional overhead of releasing and then supporting it. So, TBD.


Overhead? None. Your real concern is: will potential customers run the model themselves and skip us?

The answer is no, because you would eventually release a subpar model, not your SOTA model.

Also, people don't have the infrastructure to run this at scale (100-500 concurrent users); at best they can run it for 1-2 concurrent users.

This could be a good way for people to test it, then use your infra.

Ah, but you do have an online demo, so you might think that is enough. WRONG.


Good question! Software gets democratized so fast that I am sure others will implement similar approaches soon. And, to be clear, some of our "speed upgrades" are pieced together from recent DiT papers. I do think getting everything running on a single GPU at this resolution and speed is totally new (as far as I have seen).

I think people will just copy it, and we just need to continue moving as fast as we can. I do think that a bit of a revolution is happening right now in real-time video diffusion models. There have been so many great papers published in that area in the last 6 months. My guess is that many DiT models will be real time within 1 year.


> I do think getting everything running on a single GPU at this resolution and speed is totally new

Thanks, it seemed to be the case that this was really something new, but HN tends to be circumspect, so I wanted to check. It's an interesting space and I try to stay current, but everything is moving so fast. I was pretty sure I hadn't seen anyone do that, though. It's a huge achievement to do it first and make it work for real like this! So well done!


One thing that is interesting: LLM pipelines have been highly optimized for speed (since speed is directly related to cost for companies). That is just not true for real-time DiTs. So, there is still lots of low-hanging fruit for how we (and others) can make things faster and better.


Curious about the memory bandwidth constraints here. 20B parameters at 20fps seems like it would saturate the bandwidth of a single GPU unless you are running int4. I assume this requires an H100?


Yep, the model is running on Hopper architecture. Anything less was not sufficient in our experiments.
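The bandwidth question above can be checked with back-of-envelope arithmetic. This is my own sketch, not the authors' numbers: it assumes every weight is read from HBM once per denoising step and one step per frame, both simplifying assumptions.

```python
# Back-of-envelope memory-bandwidth estimate for a 20B-parameter
# diffusion model generating video at 20 fps.

PARAMS = 20e9   # 20B parameters (from the comment above)
FPS = 20        # target frame rate (from the comment above)

def required_bandwidth_gb_s(bytes_per_param: float, steps_per_frame: int = 1) -> float:
    """GB/s of weight traffic needed to sustain the target frame rate,
    assuming all weights are streamed from memory once per step."""
    bytes_per_step = PARAMS * bytes_per_param
    return bytes_per_step * steps_per_frame * FPS / 1e9

fp16_bw = required_bandwidth_gb_s(2.0)   # fp16: 2 bytes/param -> 800 GB/s
int4_bw = required_bandwidth_gb_s(0.5)   # int4: 0.5 bytes/param -> 200 GB/s
print(fp16_bw, int4_bw)
```

At fp16 that is roughly 800 GB/s of weight reads per second, which fits within an H100's HBM3 bandwidth (~3.35 TB/s) but leaves limited headroom once there are multiple denoising steps per frame, consistent with the reply that anything below Hopper wasn't sufficient.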


It's a fair concern. But we don't know r0fl, and we are not astroturfing.

Even I am surprised by how many openly positive comments we are getting. It's not been our experience in the past.


Thank you! Yes, right now we are using Qwen for the LLM. They also released a TTS model that we have not tried yet, which is supposed to be very fast.


And, just like that, Max Headroom is back: https://lemonslice.com/try/agent_ccb102bdfc1fcb30


That... is not Max Headroom.


Can you help us make him? What's the right voice? https://lemonslice.com/hn



I wonder how it would come across with the right voice. We're focused on building out the video layer tech, but at the end of the day, the voice is also pretty important for a positive experience.


1) Yes on Max Headroom. We are on it. 2) It already is real time...?


Whoops! I mistook the "You're about to speak with an AI." progress bar for processing delay.


I wonder if we should give the UI a more familiar interface (e.g. "the call is ringing") to avoid this confusion?

It's a normal mp4 video that loops initially (the "welcome message"), and then, as soon as you send the bot a message, we connect you to a GPU and the call becomes interactive. Connecting to the GPU takes about 10s.
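The flow just described can be sketched as a small state machine. All class and method names here are hypothetical; only the looping-mp4 → connecting → interactive sequence and the ~10s figure come from the comment.

```python
from enum import Enum, auto

class SessionState(Enum):
    LOOPING_WELCOME = auto()   # plain mp4 on repeat, no GPU attached
    CONNECTING = auto()        # first message received, GPU spinning up (~10s)
    INTERACTIVE = auto()       # live, real-time generation

class AvatarSession:
    """Hypothetical sketch of the described session lifecycle."""

    def __init__(self):
        self.state = SessionState.LOOPING_WELCOME

    def on_first_message(self, connect_gpu):
        """Leave the looping mp4 on the user's first message."""
        self.state = SessionState.CONNECTING
        connect_gpu()              # blocks while a GPU is allocated
        self.state = SessionState.INTERACTIVE

session = AvatarSession()
session.on_first_message(connect_gpu=lambda: None)  # stubbed for the sketch
```

Surfacing the CONNECTING state explicitly (e.g. as a "ringing" animation) is exactly the familiar-interface idea floated above.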


Makes sense. The init should be about 10s, but after that it should be real time. TBH, this is probably a common confusion, so thanks for calling it out.


Fix deployed! This is why it's good to launch on Hacker News. Thanks for the tip.


Nice one - thanks :)


Glad we found somebody who likes it as much as us! BTW, the biggest thing we are working to improve is the speed of the response. I think we can make that much faster.

