
Our models support a full 2+ minutes of coherent generation, but generating a couple of verses at a time through "continue" gives good results: you can keep picking the continuations that sound best!


We’re US based but come from pretty much across the globe and my wife is Punjabi, so yeah that’s the origin of the name :)


Really impressed with this site! I used it to generate several plausible/catchy songs in German for a pizza Christmas party I'm having tomorrow.

https://app.suno.ai/song/650b54ec-b349-447c-9219-d461e4b5282... https://app.suno.ai/song/0742dd0a-a52f-4491-b61b-5289b533d1b... I am entertained that it made the mistake of turning a direction ("put sleigh bells in the background") into the lyric, with a backing singer saying "Klingelingeling".

I think the low 'hit' rate might be turning off some people, but I'm happy to audition through 10 songs to find a single good one.

I hadn't been following the state of AI audio generation, but this definitely feels like a breakthrough moment for me, just as image generation and ChatGPT did.

I'm a video-game developer, so I can imagine using this now and then for quick jam games.

Anyway, cool cool stuff!


hahah wow! cool :-)

PS: OT, I am reading this Bark thing (https://github.com/suno-ai/bark). Can I run it locally on a 2015 MacBook with 8GB RAM?


yeah, sometimes there are definitely artifacts. technically they can be removed pretty easily with another model (like the denoiser from FB), but for now we wanted to keep it simple and learn to control these things better through prompt engineering. when using a high-quality input prompt, it generally continues with high quality.


At least in the last example, with the man and woman and the expensive oat milk, the background noise seemed to fit a likely public conversation scenario. I wasn't sure if it was accidental or not.


Hey, one of the Suno founders/creators of Bark here. Thanks for all the comments; we love seeing how we can improve things in the future. At Suno we work on audio foundation models, creating speech, music, sound effects, etc.

Text to speech was a natural playground for us to share with the community and get some feedback. Given that this model is a full GPT model, the text input is merely a guidance and the model can technically create any audio from scratch even without input text, aka hallucinations or audio continuation.

When used as a TTS model, it’s very different from the awesome high quality TTS models already available. It produces a wider range of audio – that could be a high quality studio recording of an actor or the same text leading to two people shouting in an argument at a noisy bar. Excited to see what the community can build and what we can learn for future products.

Please reach out with any feedback, or if you're interested in working on this: bark@suno.ai


This tech will be used by crooks to automate attacks. Generate the language using GPT-4 and the audio using Bark, and then start making phone calls. Because it’s open source, all you need is GPUs. This is not a criticism. I’m impressed and grateful for the openness. Everyone needs to wake up and recognize that these attacks are coming at us essentially right now.


yawn who cares? If it’s an issue, let law enforcement handle it. Everything has nefarious uses. Humanity marches on.


As technology gets more powerful more quickly, and the rule of law becomes more and more unable to prevent societal damage, this response becomes woefully inadequate.

For reference, look at how the societal damage of social networks has been handled: too little, too late. Same goes for RentTech.

But, I don't know the solution. The common computer has become so powerful that we cannot simply rely on inaccessible materials to prevent the danger of overly-powerful tech spreading too fast, as we do with bioweapons or traditional WMDs.

Fight tech with more tech, I suppose.


You'll care when it affects you.

Do we want to go back to New York in the '90s? Petty theft everywhere?


The technology already exists; shutting down a company and pretending it doesn't won't solve the problem.


Like kitchen knives, which are used to end unhappy relationships. Is this really an argument we should be having? Wouldn't you feel silly presenting such an argument to a household knife maker?


That comparison holds up better if you imagine a world without knives or sharp objects of any kind. Now you can suddenly do tremendous harm by wielding a pointy stick. I don't think it's reaching to point out the dangers you just introduced.

With great power comes .... ? Profit?


Except that world used to exist, back in the stone age, and we're all far better off now because we didn't choose to live in fear of the misuse of powerful tools.


I wouldn’t want to be around the first few guys with pointy weapons, but I am sure you would be fine.


I'd recommend encoding sub-audible tracking symbols in synthesized speech that include GPS coordinates, IP address, timestamp, and country of origin. We already do this in bootleg movies, so similar methods could be applied to synthesized speech.


It's open source, and even if it wasn't, it would likely be pretty easy to remove those (not that you would get GPS) from an audio file.


Hmm, I don't know about that. The data rate used by civilian GPS (L1 C/A) is only 50 bps. The symbols are normally spread over a couple MHz of bandwidth to make it possible to recover at levels below the thermal noise floor. I see no reason why the same thing couldn't be done at baseband, adding an imperceptible bit of extra noise to an audio signal.

Of course, you wouldn't encode real-time navigation data, but a small block of identifying text. Either way, though, someone without a copy of the spreading code isn't going to notice it or decode it. Given enough redundancy in both the time and frequency domains, removing it wouldn't be easy either.
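As a toy illustration of the spread-spectrum idea above (nothing Bark actually does): each payload bit is spread over a long run of keyed pseudo-random "chips" added at low amplitude, then recovered by correlating against the same keyed sequence. All parameters here are made up for the demo; a real watermark would use far lower amplitudes plus psychoacoustic shaping.

```python
import numpy as np

# Toy direct-sequence spread-spectrum watermark. One bit per CHIP_LEN samples;
# someone without the key sees only a little extra noise.
CHIP_LEN = 4000

def embed(audio, bits, key, alpha=0.01):
    """Add +/- alpha * chips per bit; alpha is exaggerated for the demo."""
    rng = np.random.default_rng(key)
    out = audio.copy()
    for i, b in enumerate(bits):
        chips = rng.standard_normal(CHIP_LEN)
        s = i * CHIP_LEN
        out[s:s + CHIP_LEN] += alpha * (1.0 if b else -1.0) * chips
    return out

def recover(audio, n_bits, key):
    """Correlate each segment with the keyed chips; the sign gives the bit."""
    rng = np.random.default_rng(key)
    bits = []
    for i in range(n_bits):
        chips = rng.standard_normal(CHIP_LEN)
        s = i * CHIP_LEN
        bits.append(1 if float(audio[s:s + CHIP_LEN] @ chips) > 0 else 0)
    return bits

# Hide 16 identifier bits in ~1.3 s of noise-like "speech" at 48 kHz.
carrier = np.random.default_rng(0).standard_normal(16 * CHIP_LEN) * 0.1
payload = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
marked = embed(carrier, payload, key=42)
```

Because the chips are spread over thousands of samples, the per-bit correlation stands several standard deviations above the background, which is also why stripping it without the key is harder than simple filtering.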


The real problem is that bad actors would simply encode some other person's coordinates/metadata into the recordings they produce, and we'll have been trained by then to blindly accept the presence of these markers as strong evidence of guilt.


Nothing stops you from using closed source solutions for that.


How are the voices determined? Is there an option or is it just random/based on the prompts like "WOMAN"?


Amazing work so far! Do you have any sense about how difficult it would be to enable M1/M2 or CoreML support?


thanks! the model itself is a pretty vanilla gpt model based heavily on karpathy's nanogpt, so it should not need too many bells and whistles to get running on specific architectures. that said, i have very little experience with platform-specific development, so i would looove some help from the community :)
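Since the model is plain PyTorch, a first step on Apple silicon is usually just selecting the Metal ("mps") backend when it is present. This is a generic sketch, not code from the Bark repo, and I haven't verified Bark end to end on M1/M2:

```python
def pick_device():
    """Best-effort device choice for a vanilla PyTorch GPT (illustrative only)."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # The Apple-silicon Metal backend shipped in PyTorch 1.12+.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

device = pick_device()
# model.to(device) would then run the same checkpoint on the M1/M2 GPU.
```

Ops that the mps backend doesn't implement yet would still need a CPU fallback (PyTorch's PYTORCH_ENABLE_MPS_FALLBACK=1 environment variable covers many of these).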


Could you stick a torch.compile in the inference and training code, maybe gated behind a flag? This should significantly help AMD/NVIDIA performance (and probably other vendors soon).

PyTorch themselves used nanoGPT training as demo for this: https://pytorch.org/blog/accelerating-large-language-models/


A serious nod to Karpathy here. They could have chosen any other Transformer architecture, but chose perhaps the most reachable one - in the literal sense.


Would the same apply to a GGML port, or are the architectures too different?


I like the emphasis tags; it's something you don't see with a lot of these transformer models. Things like [laughs] make a lot of sense. I could see hundreds or possibly thousands of emphasis-style tags being added to support a vast array of intonations in human speech, e.g. [yells], [shouts], [cries], [crying], [whispers], [sarcasm], etc.
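As a toy sketch of handling such tags on the input side: since Bark's cues are learned from data rather than an enumerated API, the whitelist below is purely illustrative, but a helper like this could warn when a prompt uses a tag the model likely hasn't seen.

```python
import re

# Illustrative whitelist only; Bark's supported cues (e.g. [laughs], [sighs],
# [music]) come from its training data, not a fixed list like this.
KNOWN_TAGS = {"laughs", "laughter", "sighs", "gasps", "clears throat", "music"}

def unknown_prompt_tags(text):
    """Return bracketed cue tags in a prompt that are not in the known set."""
    return [t for t in re.findall(r"\[([^\]]+)\]", text)
            if t.lower() not in KNOWN_TAGS]

unknown_prompt_tags("Well [laughs] that's rich! [yells] Get out!")
```

Unrecognized tags don't error out in a model like this; they just get read (or mangled) as ordinary text, which is why flagging them up front is useful.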



history prompts are just unconditionally generated TTS from the same model. any of those can be used as history, but for convenience 10 are provided for each language (to generate things with consistent voices)


So the history prompts are collections of text/audio pairs?


history is semantic, coarse and fine tokens. so essentially the same thing that's getting generated, just used as an input before the generation
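Concretely, a saved history prompt is roughly three token arrays bundled together. The key names below mirror the .npz presets shipped with Bark, but the shapes and vocabulary sizes are placeholders for illustration, not the real ones:

```python
import io
import numpy as np

# A history prompt bundles the semantic, coarse, and fine token streams from a
# previous generation, so a "voice" is just a reusable prefix of tokens.
semantic = np.random.default_rng(1).integers(0, 10_000, size=257)
coarse = np.random.default_rng(2).integers(0, 1_024, size=(2, 600))
fine = np.random.default_rng(3).integers(0, 1_024, size=(8, 600))

buf = io.BytesIO()  # stand-in for e.g. "my_speaker.npz" on disk
np.savez(buf, semantic_prompt=semantic, coarse_prompt=coarse, fine_prompt=fine)
buf.seek(0)

hist = np.load(buf)
```

Loading such a bundle and passing it as the history prompt is what makes successive generations come out in a consistent voice.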


So how do you clone an existing speaker's voice? That's the part I don't get.


it's more meant to show code switching. more examples here: https://suno-ai.notion.site/Bark-Examples-5edae8b02a604b54a4...


Thanks :) Let us know if you come across interesting findings!


Thanks! The model learns a lot from unsupervised (as well as supervised) audio, so technically low-quality and high-quality audio are both just as natural to the model as music, background sounds, or really anything else, including echoes or bad microphones :) it will be interesting to learn how to control for these things, either through prompting or other switches during training/inference.


There is no sign-up (Microsoft account or similar) necessary, the broadcast link is much shorter and easier to share, and it's open source :)

