It would be nice if you could post if the actual data matches your reconstruction—now that you have it in hand. Would help us not worry about the data provenance and focus on the result you found.
We're not sure whether the actual data exactly matches our reconstruction, but one of the authors pointed out to us that we can exactly reproduce their scaling law if we make the same mistake they made when fitting it to the data.
What they did was take the mean of the loss values across datapoints instead of summing them, and they used L-BFGS-B with the default tolerance settings, so the optimizer terminated early. We can reproduce their results by repeating this mistake.
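The described failure mode is easy to demonstrate with toy data (a hypothetical power-law form and synthetic data, not the paper's actual setup): averaging divides both the objective and its gradient by n, so L-BFGS-B's default ftol/gtol stopping tests can fire before the fit has converged.

```python
import numpy as np
from scipy.optimize import minimize

# Toy illustration: fit loss(N) = a * N^-alpha + c, once with per-point
# squared errors summed and once averaged. The averaged objective (and its
# gradient) is n times smaller, so L-BFGS-B's default tolerances can
# trigger early termination. Data and functional form are made up here.
rng = np.random.default_rng(0)
N = np.logspace(6, 9, 200)
loss = 2.0 * N ** -0.34 + 1.7 + rng.normal(0, 0.01, N.size)

def objective(params, reduce):
    a, alpha, c = params
    resid = (a * N ** -alpha + c - loss) ** 2
    return resid.sum() if reduce == "sum" else resid.mean()

x0 = [1.0, 0.1, 0.0]
bounds = [(1e-6, None), (0.0, 2.0), (None, None)]
fit_sum = minimize(objective, x0, args=("sum",), method="L-BFGS-B", bounds=bounds)
fit_mean = minimize(objective, x0, args=("mean",), method="L-BFGS-B", bounds=bounds)
print("sum-objective fit: ", fit_sum.x)
print("mean-objective fit:", fit_mean.x)
```

Tightening `ftol`/`gtol` (or simply summing) removes the discrepancy, which is consistent with the early-termination explanation.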
* 3 trillion tokens; starts from the BPE tiktoken cl100k base vocab, augmented with Chinese; numbers split into digits; final vocab 152k.
* RoPE - rotary positional embedding
* context length 2048
* qwen-14b perf percentages: 66.3 MMLU(5), 72.1 CEval(5), 61.8 GSM8K(8), 24.8 MATH(4), 32.3 HumanEval(0), 40.8 MBPP(3), 53.4 BBH(3); beats LLaMA2-13B on all, but behind LLaMA2-70B on all except CEval, MATH and HumanEval (somewhat surprising)
* code-qwen-{7B,14B}
* additional 90B code tokens over base
* context length 8192, flash attention
* 14B perf: humaneval 66.4, mbpp 52.4; ok, but not stellar (similar numbers as OSS wizardcoder-py, and lower than gpt-3.5)
* math-qwen-{7B,14B}-chat
* math instructional dataset
* context length 1024
* 14B perf: GSM8K 69.8, MATH 24.2, Math401 85.0, Math23K 78.4 (substantially better than OSS models in the same weight class (WizardMath and GAIRMath-Abel) on MATH, but in the same ballpark on GSM8K -- surprising). Math23K is Chinese grade-school math; Math401 tests arithmetic ability.
* comprehensive automatic evaluation in Appendix A.2.1 pg 36 (based on OpenCompass'23)
* chat format:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hello! How can I assist you today?<|im_end|>
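A tiny helper for producing that layout (my own sketch of the ChatML-style format shown above, not official Qwen code):

```python
# Build a prompt in the <|im_start|>/<|im_end|> chat format shown above,
# ending with an open assistant turn for the model to complete.
def to_chatml(messages):
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    return "\n".join(parts) + "\n<|im_start|>assistant\n"

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```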
Notes from a quick read of the paper at https://arxiv.org/abs/2302.10866. The popsci title is overreaching; this is a drop-in subquadratic replacement for attention. Could be promising, but it remains to be seen whether it gets adopted in practice. skybrian (https://news.ycombinator.com/item?id=35657983) points out a new blog post by the authors, and a previous discussion of an older (March 28th) blog post. Takeaways:
* In standard transformer attention, cost scales quadratically with sequence length, which restricts model context. This work presents an exact subquadratic operator that allows scaling to larger contexts (100k+).
* They introduce an operator called the "Hyena hierarchy", a recurrence over two subquadratic operations: long convolution and element-wise multiplicative gating. Sections 3.1-3.3 define the recurrences, matrices, and filters. Importantly, this is a drop-in replacement for attention.
* Longer context: 100x speedup over FlashAttention at 64k context (if we view FlashAttention as a non-approximate engineering optimization, then this work improves algorithmically and gains an order of magnitude over that). Associative recall, i.e., just pulling data, shows improvements: experiments at 137k context and vocab sizes of 10-40 (unsure why they have bad recall on shorter sequences with larger vocabs, but they still outperform others).
* Comparisons (on relatively small models, but hoping to show a pattern) with RWKV (an attention-free model, trained on 332B tokens) and GPTNeo (trained on 300B tokens), with Hyena trained on 137B tokens. Models are 125M-355M in size. (Section 4.3)
* On SuperGLUE, zero-shot and 3-shot accuracy is in the same ballpark as GPTNeo (although technically they underperform a bit zero-shot and overperform a bit 3-shot). (Tables 4.5 and 4.6)
* Because they can support large (e.g., 100k+) context, they can do image classification. They report ballpark comparable against others. (Table 4.7)
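Out of curiosity, the two primitives can be sketched like this (my own toy reconstruction from the description above, not the authors' code; the real Hyena also uses implicitly parameterized filters and learned input projections):

```python
import numpy as np

# The two subquadratic building blocks a Hyena recurrence composes:
# a sequence-length ("long") convolution done in O(L log L) via FFT,
# and an element-wise multiplicative gate.

def long_conv(u, h):
    """Causal convolution of signal u with filter h, zero-padded to
    avoid circular wraparound from the FFT."""
    L = u.shape[-1]
    fft_len = 2 * L
    out = np.fft.irfft(np.fft.rfft(u, fft_len) * np.fft.rfft(h, fft_len), fft_len)
    return out[..., :L]

def hyena_order2(v, x1, x2, h1, h2):
    """Order-2 Hyena-style recurrence: alternate long-conv and gating."""
    z = x1 * long_conv(v, h1)   # filter, then element-wise gate
    z = x2 * long_conv(z, h2)
    return z

L = 1024
rng = np.random.default_rng(0)
v, x1, x2 = rng.normal(size=(3, L))       # value and two gating projections
h1, h2 = rng.normal(size=(2, L)) / L      # two length-L filters
y = hyena_order2(v, x1, x2, h1, h2)
```

The point of the FFT route is that both steps cost O(L log L), versus O(L^2) for materializing an attention matrix.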
Might have misread some takeaways; happy to be corrected.
Congrats on the launch. I think you should share some technical details for a more substantial pitch. You are using the OSS BigCode effort and "The Stack" [1, 2] (as you say in another comment), which is great.
A few questions that might help an enterprise customer: How big is your base model? Where did you find more datasets (maybe just a hint would be sufficient)? Are you using SantaCoder [3]? Anything you can say about your fine-tuning that makes it special? Totally on board with you that HumanEval/MBPP are not great benchmarks for real world, and do you have a suggested alternative to help me see the value?
The calculus for an enterprise customer might be: "We could fine tune a 6B model on our internal code and internal benchmarks (say with a month of work, a few thousand in compute, 2 people on task), but I'd rather buy an off-the-shelf solution like codecomplete.ai. They give us XYZ benefits." Articulate the XYZ for a technical decision maker who will be your target audience.
Great questions. We want to keep some of our technical details closer to the chest, so I won't go into the specific technologies we're using here.
I will expand a bit on fine-tuning. It's really hard to get this right, and the iteration speed is slow. Of course these companies can build their own, but we want to save them a lot of headache.
So far, we haven't found any off-the-shelf open source base model that works super well for code completions. We've augmented models with a huge amount of data in order to see our current performance, and we ran into a lot of pain along the way.
* All variants were trained on 1T - 1.4T tokens, which is good relative to their sizes by the Chinchilla metric. Code is 4.5% of the training data (similar to others). [Table 2]
* They note the GPU hours as 82,432 (7B model) to 1,022,362 (65B model). [Table 15] GPU hour rates will vary, but let's give a range of $1 to $4. The 7B model would have cost ~$82-329k and the 65B something in the range of ~$1-4M. They also note their total time spent for all models: "we used 2048 A100-80GB for a period of approximately 5 months" [sec 6, pg 10]
* 65B model's performance is broadly comparable to PALM-540B. Not a small feat, but also could indicate the benefits of good model-vs-token size ratios [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (multitask language understanding) compared to PALM-540B and Chinchilla-70B is smaller fraction of books and academic training data.
* Math and code tasks: Math tasks they are substantially worse than Minerva (comparing their 65B to Minerva 62B; they hands down fail against Minerva 540B) [Table 7]. Code tasks they are broadly competitive with PALM-540B (HumanEval and MBPP evals) [Table 8]
* Surprising that instruction fine tuning takes such a small part of the paper (sec 4, pg. 7)
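Spelling out that cost arithmetic (GPU-hour figures from Table 15; the $1-$4/hour range is an assumption, since actual A100 rates vary by provider):

```python
# Back-of-the-envelope training-cost range from the Table 15 GPU-hour
# figures, at an assumed $1-$4 per A100-hour.
gpu_hours = {"7B": 82_432, "65B": 1_022_362}
cost_range = {m: (h * 1, h * 4) for m, h in gpu_hours.items()}
for model, (lo, hi) in cost_range.items():
    print(f"{model}: ${lo:,} - ${hi:,}")
```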
I hate when people don't include an estimate of the training done before the final hyperparameters are found, as that's usually the most costly part of the whole process.
It's always just "yes, we trained it for this long", but they never mention the tens or even hundreds of runs before they finalized the model parameters and architecture -.-
> we used 2048 A100-80GB for a period of approximately 5 months
Do we know how much total energy a human consumes from birth to 20 yo? Something like 2000 calories integrated over 20 years. How does it compare to the GPUs above?
Wolfram Alpha:
- human - 17 MWh ((2000 calories per day) over 20 years in MWh)
- GPUs - 3000 MWh ((2048 * 400) W over 5 months in MWh)
We still have the edge.
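Redoing that arithmetic explicitly in MWh (same assumptions as above: 2000 kcal/day per human, ~400 W per A100):

```python
# Compare a human's 20-year food-energy intake with 2048 A100s running
# for 5 months. Assumptions: 2000 kcal/day per human, ~400 W per GPU.
KCAL_TO_J = 4184
human_mwh = 2000 * KCAL_TO_J * 365 * 20 / 3.6e9   # joules -> MWh (3.6e9 J/MWh)
gpu_mwh = 2048 * 400 * 5 * 30 * 24 / 1e6          # watts * hours -> MWh
print(round(human_mwh), round(gpu_mwh))           # ~17 vs ~2949
```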
LOL, I'm being downvoted, I wonder why. Some don't like the question.
You have to include our evolutionary history too. A considerable amount of our sophisticated behavior doesn't require specific training, as it is encoded in our genetic and epigenetic systems. We aren't starting from zero.
Then you would need to include our history in the GPU calculation too. GPUs require evolutionary bootstrapping - they didn't materialize alongside the first few hydrogen atoms post Big Bang.
Depends on what you're doing. A human is much smarter than one of these models, but the model has approximate knowledge of orders of magnitude more things. And the energy costs per word of output are a lot closer.
A thing to keep in mind is that 1 MWh of raw calories takes much more than 1 MWh to produce (fuel for tractors, inefficiency of meat etc). The GPU energy is also easier to make renewable.
I did an extremely rough calculation recently that the training of GPT-3 is comparable to one transatlantic flight (all passengers combined) in terms of emissions, very depending on the energy mix of course.
That's the entire problem. There's so much more energy that goes into a modern human beyond just what they eat. Beyond physical items you've listed like clothing there's also education and healthcare. Those two institutions are critical in making a modern human and they both have their own dependency chains of physical resource, energy, and the input of even more humans.
I remember, in one of their old guidebooks, a lot of struggle to keep their 64-machine (512-GPU) cluster running; this was probably 4x the machines and 4x the number of cluster dropouts.
At CentML, we profiled GPU utilization on a large AI/ML research institute's cluster. It was in the 10%-45% range, mostly around 10%. We then offered them software optimizers (which do not affect model accuracy) to get the GPUs to 90% utilization.
90% sustained utilization is quite amazing, and 10% is shockingly typical. I am quite skeptical that this holds for training and very large datasets, of the sort where data placement comes into play, but if so, congratulations, and I hope things go well for you.
A lot of it appears to be non-streaming approaches to data distribution resulting in actual job behavior that looks a lot more like stage-process-clear batch jobs than what you'd want to hide the latency of data moves.
>* 65B model's performance is broadly comparable to PALM-540B. Not a small feat, but also could indicate the benefits of good model-vs-token size ratios [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (multitask language understanding) compared to PALM-540B and Chinchilla-70B is smaller fraction of books and academic training data.*
What do you mean by this? The OpenAI papers talk roughly about model performance scaling with parameters. Does this show otherwise?
Umm... so does OpenAI. In fact, this is OpenAI's discovery from [1]:
> Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would therefore be far more sample efficient than one might expect based on training small models to convergence, with data requirements growing very slowly as D ∼ C^0.27 with training compute. (Section 6)
> We have also tested our models on a set of additional text data distributions. The test loss on these datasets as a function of model size is shown in Figure 8; in all cases the models were trained only on the WebText2 dataset. We see that the loss on these other data distributions improves smoothly with model size, in direct parallel with the improvement on WebText2. We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence. We also observe no dependence on model depth (see Appendix D.8)
This is the old scaling laws paper. The scaling laws in it turned out to be wrong and superseded by the Chinchilla DeepMind paper: https://arxiv.org/abs/2203.15556
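For scale, the Chinchilla result boils down to a rule of thumb of roughly 20 training tokens per parameter (my approximation of the paper's fitted law, not an exact statement):

```python
# Rough Chinchilla rule of thumb: compute-optimal training uses about
# 20 tokens per parameter, far more data per parameter than the earlier
# Kaplan-et-al. scaling laws implied.
def chinchilla_optimal_tokens(n_params: float) -> float:
    return 20.0 * n_params

# A 65B-parameter model wants ~1.3T tokens, which is roughly the
# 1T-1.4T range the LLaMA models were actually trained on.
print(chinchilla_optimal_tokens(65e9) / 1e12)
```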
By "parameters" they probably mean float32s, and 65B of those is ~0.25 TB of data - more than enough to memorize a 1.5T-token sequence of "tokens" (3-letter triplets?). This raises the question: are these models better than a fuzzy hash table?
Yes and no. Information theoretically, tokens are pretty well compressed, and you can't get another 6x losslessly.
Moreover, anything even kind of looking like a hash table in the input/output space is ruled out by the observed facts that the models frequently respond sensibly to samples crafted not to be in the training set, and that they take into account many long-range dependencies (i.e., the hash table would have to be exponentially larger than it is to match the model's performance).
That said, they are just statistical party tricks. The magic happens because the lookup tables are in a latent space. That's why you can drop in garbage like "uberworldchefinatormichelingodfoodpleasureorgasmmaestro" when asking for recipes and food recommendations and get an experience planets apart from queries excluding the nonsense phrases. The model is just pulling together some token associations, and throwing in the right tokens can take advantage of those in situations where a thinking person would barely be able to parse what you're asking.
Your question feels like it has a motive though. What are you really asking?
LLMs need a baseline to compare with. I suspect that when they get compared with a fuzzy hash table of a similar size (that returns a range of probabilities), their performance will become unimpressive.
You can just directly calculate what would happen. To respond to novel words (which these demonstrably do) it needs to be equivalent to a character-wise hash table, and to be the same size as LLaMA you can do lookups on around 4 characters (and you have to deal with the data sparsity in constructing many of those tuples). If you want worse output but a better hash table on the output that remains, you could hash words or common words and get contexts of up to a few words rather than a few letters.
LLMs can track mid-range dependencies though. Consider the following input
> Translate the phrase "the lazy brown fox jumped over the thorny brambles" into French, write the translation, and then write the second through fourth words of that translation.
Looking at any one word of the output you need to track many of the input words to get it correct, and the relative positions of those necessary input words is not consistent from one output word to the next. ChatGPT solves the task flawlessly (aside from its habit of explaining what it's doing before doing it). Any hash table solution, at a minimum, would need a complicated heuristic for determining which words/characters to look up.
Doing so brings us back closer to the state of language models before transformers. You had a lot of hand-tuned features, formal grammars, complicated orders of operations, expert lookup tables, and whatnot. Performance was still much, much worse than what we're getting now with deep learning.
None of that is to say that philosophically we're doing anything more than mishmashing probabilities or that something better doesn't exist, but without significant innovation rule-guided fuzzy hash tables aren't it.
The fuzzy hash table would use 8192-token-long sequences as keys, and when asked to fetch a key, it would find the nearest keys and return that distribution. The internal representation of this hash table is a cloud of points in an 8192 × sizeof(token)-dimensional space.
The procedure for constructing this table would just be taking all 1.5 trillion subsequences, each 8192 tokens long, and inserting them: table[seq8192] = token8193 (the next token). Arranging this data efficiently to allow fast lookups is the problem.
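A toy version of that construction (my own sketch with a tiny context length and corpus; the real thing would need approximate nearest-neighbor indexing at scale):

```python
import numpy as np
from collections import Counter

# "Fuzzy hash table" baseline as described: store (context -> next token)
# pairs, and at query time return the empirical next-token distribution of
# the k nearest stored contexts. Nearness here is Hamming distance over
# token ids, and the context length is cut to 8 for the demo.
class FuzzyNextTokenTable:
    def __init__(self, k=3):
        self.k, self.keys, self.vals = k, [], []

    def insert(self, context, next_token):
        self.keys.append(np.asarray(context))
        self.vals.append(next_token)

    def lookup(self, context):
        q = np.asarray(context)
        dists = [int((key != q).sum()) for key in self.keys]
        nearest = np.argsort(dists)[: self.k]
        counts = Counter(self.vals[i] for i in nearest)
        total = sum(counts.values())
        return {tok: c / total for tok, c in counts.items()}

# Insert every length-8 subsequence of a toy corpus with its next token.
corpus = [1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 4, 1, 2, 3, 4]
CTX = 8
table = FuzzyNextTokenTable(k=3)
for i in range(len(corpus) - CTX):
    table.insert(corpus[i : i + CTX], corpus[i + CTX])

dist = table.lookup(corpus[-CTX:])
print(dist)
```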
Edit: I missed this on the first pass, but I'm totally lost as to where 1.5T comes from. Even if you only have two tokens there are vastly more 8192-length subsequences than that (something like 2^8151.5 times more), and if we're just trying to replicate the same space as something like GPT3.5 or LLaMA then you only get on the order of 0.065T to 0.175T entries to play with, much less when you consider that you have a full probability distribution to store (divide by your unique token count, and again by at least 2 if we store at least IEEE f16 probabilities).
There are lots of interpretations. I actually like KNN for a lot of tasks. My gut says that it still wouldn't perform well here (and for the record, there are efficient data structures for the idea you're describing unless you have some nonstandard modifications, so "arranging the data efficiently to allow fast lookups" is definitely not the core problem), but I admittedly don't have proof of that yet.
For some intuition, imagine the following tasks:
> Repeat the following phrase exactly twice: "sdflhasdflhasdf"
> Repeat the following phrase exactly twice: "sdflhasdflhasdg"
Your fuzzy dictionary or geospatial map can't possibly have enough keys to distinguish the requests (or if it distinguishes those, you can adversarially select different keyboard mashes), and so the result, no matter what it is, would have the same probability distribution for both prompts. Since the desired results are different, at least one of them would have some unavoidable wrongness.
The GPT family, on the other hand, has few issues with random phrase duplication since positional information is something it explicitly considers and is capable of prioritizing over other token information.
* "We propose to extract memorized images by generating many times with the same prompt and flagging cases where many of the generations are the same."
* "- Diffusion models memorize more than GANs - Outlier images are memorized more - Existing privacy-preserving methods largely fail"
* "Stable Diffusion is small relative to its training set (2GB of weights and many TB of data). So, while memorization is rare by design, future (larger) diffusion models will memorize more."
* "It only memorizes a very small subset of the images that it trains on."
* "our goal is to show that models can output training images when generating in the same fashion that normal users do."
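The flagging heuristic in the first quote can be sketched like this (my own toy reconstruction on synthetic 8x8 "images"; the paper's actual pipeline and thresholds differ):

```python
import numpy as np

# Sample many generations for one prompt and flag the prompt as likely
# memorized when many samples are near-duplicates of each other (here:
# small mean absolute pixel distance). Thresholds are made up for the demo.
def flag_memorized(samples, dist_thresh=0.05, count_thresh=5):
    n = len(samples)
    dup_counts = np.zeros(n, dtype=int)
    for i in range(n):
        for j in range(n):
            if i != j and np.abs(samples[i] - samples[j]).mean() < dist_thresh:
                dup_counts[i] += 1
    return bool(dup_counts.max() >= count_thresh)

rng = np.random.default_rng(0)
# 10 near-identical generations (a memorized image plus tiny noise)
memorized = np.stack([0.5 + rng.normal(0, 0.01, (8, 8)) for _ in range(10)])
# 10 unrelated generations
novel = rng.uniform(size=(10, 8, 8))
print(flag_memorized(memorized), flag_memorized(novel))
```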
> * "It only memorizes a very small subset of the images that it trains on."
An interesting question here would be: why does it memorise these images over others? Can the other images still be synthesised with loss via a suitable prompt? If so, are the memorised images important for this? Can this set be reduced further?