It would be nice if you could post if the actual data matches your reconstruction—now that you have it in hand. Would help us not worry about the data provenance and focus on the result you found.
We're not sure whether the actual data exactly matches our reconstruction, but one of the authors pointed out to us that we can exactly reproduce their scaling law if we make the same mistake they made when fitting it to the data.
What they did was take the mean of the loss values across datapoints instead of summing them, and they used L-BFGS-B with the default tolerance settings, so the optimizer terminated early. We can reproduce their results by repeating this mistake.
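The described failure mode is easy to demonstrate with toy data (a hypothetical power-law form and synthetic data, not the paper's actual setup): averaging divides both the objective and its gradient by n, so L-BFGS-B's default ftol/gtol stopping tests can fire before the fit has converged.

```python
import numpy as np
from scipy.optimize import minimize

# Toy illustration: fit loss(N) = a * N^-alpha + c, once with per-point
# squared errors summed and once averaged. The averaged objective (and its
# gradient) is n times smaller, so L-BFGS-B's default tolerances can
# trigger early termination. Data and functional form are made up here.
rng = np.random.default_rng(0)
N = np.logspace(6, 9, 200)
loss = 2.0 * N ** -0.34 + 1.7 + rng.normal(0, 0.01, N.size)

def objective(params, reduce):
    a, alpha, c = params
    resid = (a * N ** -alpha + c - loss) ** 2
    return resid.sum() if reduce == "sum" else resid.mean()

x0 = [1.0, 0.1, 0.0]
bounds = [(1e-6, None), (0.0, 2.0), (None, None)]
fit_sum = minimize(objective, x0, args=("sum",), method="L-BFGS-B", bounds=bounds)
fit_mean = minimize(objective, x0, args=("mean",), method="L-BFGS-B", bounds=bounds)
print("sum-objective fit: ", fit_sum.x)
print("mean-objective fit:", fit_mean.x)
```

Tightening `ftol`/`gtol` (or simply summing) removes the discrepancy, which is consistent with the early-termination explanation.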
* 3 trillion tokens; starts from the BPE tiktoken cl100k base vocab, augmented with Chinese; numbers split into digits; final vocab 152k.
* RoPE - rotary positional embedding
* context length 2048
* qwen-14b perf percentages: 66.3 MMLU(5), 72.1 CEval(5), 61.8 GSM8K(8), 24.8 MATH(4), 32.3 HumanEval(0), 40.8 MBPP(3), 53.4 BBH(3); beats LLaMA2-13B on all, but behind LLaMA2-70B on all except CEval, MATH and HumanEval (somewhat surprising)
* code-qwen-{7B,14B}
* additional 90B code tokens over base
* context length 8192, flash attention
* 14B perf: humaneval 66.4, mbpp 52.4; ok, but not stellar (similar numbers as OSS wizardcoder-py, and lower than gpt-3.5)
* math-qwen-{7B,14B}-chat
* math instructional dataset
* context length 1024
* 14B perf: GSM8K 69.8, MATH 24.2, Math401 85.0, Math23K 78.4 (substantially better than OSS models in the same weight class (WizardMath and GAIRMath-Abel) on MATH, but in the same ballpark on GSM8K -- surprising). Math23K is Chinese grade-school math; Math401 tests arithmetic ability.
* comprehensive automatic evaluation in Appendix A.2.1 pg 36 (based on OpenCompass'23)
* chat format:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hello! How can I assist you today?<|im_end|>
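A tiny helper for producing that layout (my own sketch of the ChatML-style format shown above, not official Qwen code):

```python
# Build a prompt in the <|im_start|>/<|im_end|> chat format shown above,
# ending with an open assistant turn for the model to complete.
def to_chatml(messages):
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    return "\n".join(parts) + "\n<|im_start|>assistant\n"

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```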
Notes from a quick read of the paper at https://arxiv.org/abs/2302.10866. The popsci title is overreaching; this is a drop-in subquadratic replacement for attention. Could be promising, but it remains to be seen whether it gets adopted in practice. skybrian (https://news.ycombinator.com/item?id=35657983) points out a new blog post by the authors, and a previous discussion of an older (March 28th) blog post. Takeaways:
* In standard transformer attention, cost scales quadratically with sequence length, which restricts model context. This work presents an exact subquadratic operator that allows scaling to larger contexts (100k+).
* They introduce an operator called the "Hyena hierarchy", a recurrence over two subquadratic operations: long convolution and element-wise multiplicative gating. Sections 3.1-3.3 define the recurrences, matrices, and filters. Importantly, this is a drop-in replacement for attention.
* Longer context: 100x speedup over FlashAttention at 64k context (if we view FlashAttention as a non-approximate engineering optimization, then this work improves algorithmically and gains an order of magnitude over that). Associative recall, i.e., just pulling data, shows improvements: experiments at 137k context and vocab sizes of 10-40 (unsure why they have bad recall on shorter sequences with larger vocabs, but they still outperform others).
* Comparisons (on relatively small models, but hoping to show a pattern) with RWKV (an attention-free model, trained on 332B tokens) and GPTNeo (trained on 300B tokens), with Hyena trained on 137B tokens. Models are 125M-355M in size. (Section 4.3)
* On SuperGLUE, zero-shot and 3-shot accuracy is in the same ballpark as GPTNeo (although technically they underperform a bit zero-shot and overperform a bit 3-shot). (Tables 4.5 and 4.6)
* Because they can support large (e.g., 100k+) context, they can do image classification. They report ballpark comparable against others. (Table 4.7)
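Out of curiosity, the two primitives can be sketched like this (my own toy reconstruction from the description above, not the authors' code; the real Hyena also uses implicitly parameterized filters and learned input projections):

```python
import numpy as np

# The two subquadratic building blocks a Hyena recurrence composes:
# a sequence-length ("long") convolution done in O(L log L) via FFT,
# and an element-wise multiplicative gate.

def long_conv(u, h):
    """Causal convolution of signal u with filter h, zero-padded to
    avoid circular wraparound from the FFT."""
    L = u.shape[-1]
    fft_len = 2 * L
    out = np.fft.irfft(np.fft.rfft(u, fft_len) * np.fft.rfft(h, fft_len), fft_len)
    return out[..., :L]

def hyena_order2(v, x1, x2, h1, h2):
    """Order-2 Hyena-style recurrence: alternate long-conv and gating."""
    z = x1 * long_conv(v, h1)   # filter, then element-wise gate
    z = x2 * long_conv(z, h2)
    return z

L = 1024
rng = np.random.default_rng(0)
v, x1, x2 = rng.normal(size=(3, L))       # value and two gating projections
h1, h2 = rng.normal(size=(2, L)) / L      # two length-L filters
y = hyena_order2(v, x1, x2, h1, h2)
```

The point of the FFT route is that both steps cost O(L log L), versus O(L^2) for materializing an attention matrix.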
Might have misread some takeaways; happy to be corrected.
Congrats on the launch. I think you should share some technical details for a more substantial pitch. You are using the OSS BigCode effort and "The Stack" [1, 2] (as you say in another comment), which is great.
A few questions that might help an enterprise customer: How big is your base model? Where did you find more datasets (maybe just a hint would be sufficient)? Are you using SantaCoder [3]? Anything you can say about your fine-tuning that makes it special? Totally on board with you that HumanEval/MBPP are not great benchmarks for real world, and do you have a suggested alternative to help me see the value?
The calculus for an enterprise customer might be: "We could fine tune a 6B model on our internal code and internal benchmarks (say with a month of work, a few thousand in compute, 2 people on task), but I'd rather buy an off-the-shelf solution like codecomplete.ai. They give us XYZ benefits." Articulate the XYZ for a technical decision maker who will be your target audience.
Great questions. We want to keep some of our technical details closer to the chest, so I won't go into the specific technologies we're using here.
I will expand a bit on fine-tuning. It's really hard to get this right, and the iteration speed is slow. Of course these companies can build their own, but we want to save them a lot of headache.
So far, we haven't found any off-the-shelf open source base model that works super well for code completions. We've augmented models with a huge amount of data in order to see our current performance, and we ran into a lot of pain along the way.
* All variants were trained on 1T - 1.4T tokens, which is good relative to their sizes by the Chinchilla metric. Code is 4.5% of the training data (similar to others). [Table 2]
* They note the GPU hours as 82,432 (7B model) to 1,022,362 (65B model). [Table 15] GPU hour rates will vary, but let's give a range of $1 to $4. The 7B model would have cost ~$82-329k and the 65B something in the range of ~$1-4M. They also note their total time spent for all models: "we used 2048 A100-80GB for a period of approximately 5 months" [sec 6, pg 10]
* 65B model's performance is broadly comparable to PALM-540B. Not a small feat, but also could indicate the benefits of good model-vs-token size ratios [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (multitask language understanding) compared to PALM-540B and Chinchilla-70B is smaller fraction of books and academic training data.
* Math and code tasks: Math tasks they are substantially worse than Minerva (comparing their 65B to Minerva 62B; they hands down fail against Minerva 540B) [Table 7]. Code tasks they are broadly competitive with PALM-540B (HumanEval and MBPP evals) [Table 8]
* Surprising that instruction fine tuning takes such a small part of the paper (sec 4, pg. 7)
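Spelling out that cost arithmetic (GPU-hour figures from Table 15; the $1-$4/hour range is an assumption, since actual A100 rates vary by provider):

```python
# Back-of-the-envelope training-cost range from the Table 15 GPU-hour
# figures, at an assumed $1-$4 per A100-hour.
gpu_hours = {"7B": 82_432, "65B": 1_022_362}
cost_range = {m: (h * 1, h * 4) for m, h in gpu_hours.items()}
for model, (lo, hi) in cost_range.items():
    print(f"{model}: ${lo:,} - ${hi:,}")
```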
I hate when people don't include an estimate of the training done before the final hyperparameters are found, as that's usually the most costly part of the whole process.
It's always just "yes, we trained it for this long", but they never mention the tens or even hundreds of runs before they finalized the model parameters and architecture -.-
> we used 2048 A100-80GB for a period of approximately 5 months
Do we know how much total energy a human consumes from birth to 20 yo? Something like 2000 calories integrated over 20 years. How does it compare to the GPUs above?
Wolfram Alpha:
- human - 17 MWh ((2000 calories per day) over 20 years in MWh)
- GPUs - 3000 MWh ((2048 * 400) W over 5 months in MWh)
We still have the edge.
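Redoing that arithmetic explicitly in MWh (same assumptions as above: 2000 kcal/day per human, ~400 W per A100):

```python
# Compare a human's 20-year food-energy intake with 2048 A100s running
# for 5 months. Assumptions: 2000 kcal/day per human, ~400 W per GPU.
KCAL_TO_J = 4184
human_mwh = 2000 * KCAL_TO_J * 365 * 20 / 3.6e9   # joules -> MWh (3.6e9 J/MWh)
gpu_mwh = 2048 * 400 * 5 * 30 * 24 / 1e6          # watts * hours -> MWh
print(round(human_mwh), round(gpu_mwh))           # ~17 vs ~2949
```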
LOL, I'm being downvoted, I wonder why. Some don't like the question.
You have to include our evolutionary history too. A considerable amount of our sophisticated behavior doesn't require specific training, as it is encoded in our genetic and epigenetic systems. We aren't starting from zero.
Then you would need to include our history in the GPU calculation too. GPUs require evolutionary bootstrapping - they didn't materialize alongside the first few hydrogen atoms post Big Bang.
Depends on what you're doing. A human is much smarter than one of these models, but the model has approximate knowledge of orders of magnitude more things. And the energy costs per word of output are a lot closer.
A thing to keep in mind is that 1 MWh of raw calories takes much more than 1 MWh to produce (fuel for tractors, inefficiency of meat etc). The GPU energy is also easier to make renewable.
I did an extremely rough calculation recently that the training of GPT-3 is comparable to one transatlantic flight (all passengers combined) in terms of emissions, very depending on the energy mix of course.
That's the entire problem. There's so much more energy that goes into a modern human beyond just what they eat. Beyond physical items you've listed like clothing there's also education and healthcare. Those two institutions are critical in making a modern human and they both have their own dependency chains of physical resource, energy, and the input of even more humans.
I remember, in one of their old guidebooks, a lot of struggle to keep their 64-machine (512-GPU) cluster running; this was probably 4x the machines and 4x the number of cluster dropouts.
At CentML, we profiled GPU utilization on a large AI/ML research institute's cluster. It was in the 10%-45% range, mostly around 10%. We then offered them software optimizers (which do not affect model accuracy) to get the GPUs to 90% utilization.
90% sustained utilization is quite amazing, and 10% is shockingly typical. I am quite skeptical that this holds for training and very large datasets, of the sort where data placement comes into play, but if so, congratulations, and I hope things go well for you.
A lot of it appears to be non-streaming approaches to data distribution resulting in actual job behavior that looks a lot more like stage-process-clear batch jobs than what you'd want to hide the latency of data moves.
>* 65B model's performance is broadly comparable to PALM-540B. Not a small feat, but also could indicate the benefits of good model-vs-token size ratios [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (multitask language understanding) compared to PALM-540B and Chinchilla-70B is smaller fraction of books and academic training data.*
What do you mean by this? The OpenAI papers talk roughly about model performance scaling with parameters. Does this show otherwise?
Umm... so does OpenAI. In fact, this is OpenAI's discovery from [1]:
> Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would therefore be far more sample efficient than one might expect based on training small models to convergence, with data requirements growing very slowly as D ∼ C^0.27 with training compute. (Section 6)
> We have also tested our models on a set of additional text data distributions. The test loss on these datasets as a function of model size is shown in Figure 8; in all cases the models were trained only on the WebText2 dataset. We see that the loss on these other data distributions improves smoothly with model size, in direct parallel with the improvement on WebText2. We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence. We also observe no dependence on model depth (see Appendix D.8)
This is the old scaling laws paper. The scaling laws in it turned out to be wrong and superseded by the Chinchilla DeepMind paper: https://arxiv.org/abs/2203.15556
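For scale, the Chinchilla result boils down to a rule of thumb of roughly 20 training tokens per parameter (my approximation of the paper's fitted law, not an exact statement):

```python
# Rough Chinchilla rule of thumb: compute-optimal training uses about
# 20 tokens per parameter, far more data per parameter than the earlier
# Kaplan-et-al. scaling laws implied.
def chinchilla_optimal_tokens(n_params: float) -> float:
    return 20.0 * n_params

# A 65B-parameter model wants ~1.3T tokens, which is roughly the
# 1T-1.4T range the LLaMA models were actually trained on.
print(chinchilla_optimal_tokens(65e9) / 1e12)
```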
By "parameters" they probably mean float32s, and 65B of those is ~0.25 TB of data - more than enough to memorize a 1.5T-token sequence of "tokens" (3-letter triplets?). This raises the question: are these models better than a fuzzy hash table?
Yes and no. Information theoretically, tokens are pretty well compressed, and you can't get another 6x losslessly.
Moreover, anything even kind of looking like a hash table in the input/output space is ruled out by the observed facts that the models frequently respond sensibly to samples crafted not to be in the training set, and that they take into account many long-range dependencies (i.e., the hash table would have to be exponentially larger than it is to match the model's performance).
That said, they are just statistical party tricks. The magic happens because the lookup tables are in a latent space. That's why you can drop in garbage like "uberworldchefinatormichelingodfoodpleasureorgasmmaestro" when asking for recipes and food recommendations and get an experience planets apart from queries excluding the nonsense phrases. The model is just pulling together some token associations, and throwing in the right tokens can take advantage of those in situations where a thinking person would barely be able to parse what you're asking.
Your question feels like it has a motive though. What are you really asking?
LLMs need a baseline to compare with. I suspect that when they get compared with a fuzzy hash table of a similar size (that returns a range of probabilities), their performance will become unimpressive.
You can just directly calculate what would happen. To respond to novel words (which these demonstrably do) it needs to be equivalent to a character-wise hash table, and to be the same size as LLaMA you can do lookups on around 4 characters (and you have to deal with the data sparsity in constructing many of those tuples). If you want worse output but a better hash table on the output that remains, you could hash words or common words and get contexts of up to a few words rather than a few letters.
LLMs can track mid-range dependencies though. Consider the following input
> Translate the phrase "the lazy brown fox jumped over the thorny brambles" into French, write the translation, and then write the second through fourth words of that translation.
Looking at any one word of the output you need to track many of the input words to get it correct, and the relative positions of those necessary input words is not consistent from one output word to the next. ChatGPT solves the task flawlessly (aside from its habit of explaining what it's doing before doing it). Any hash table solution, at a minimum, would need a complicated heuristic for determining which words/characters to look up.
Doing so brings us back closer to the state of language models before transformers. You had a lot of hand-tuned features, formal grammars, complicated orders of operations, expert lookup tables, and whatnot. Performance was still much, much worse than what we're getting now with deep learning.
None of that is to say that philosophically we're doing anything more than mishmashing probabilities or that something better doesn't exist, but without significant innovation rule-guided fuzzy hash tables aren't it.
The fuzzy hash table would use 8192-token-long sequences as keys, and when asked to fetch a key, it would find the nearest keys and return that distribution. The internal representation of this hash table is a cloud of points in an 8192 × sizeof(token)-dimensional space.
The procedure for constructing this table would just be taking all 1.5 trillion subsequences, each 8192 tokens long, and inserting them: table[seq8192] = token8193 (the next token). Arranging this data efficiently to allow fast lookups is the problem.
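A toy version of that construction (my own sketch with a tiny context length and corpus; the real thing would need approximate nearest-neighbor indexing at scale):

```python
import numpy as np
from collections import Counter

# "Fuzzy hash table" baseline as described: store (context -> next token)
# pairs, and at query time return the empirical next-token distribution of
# the k nearest stored contexts. Nearness here is Hamming distance over
# token ids, and the context length is cut to 8 for the demo.
class FuzzyNextTokenTable:
    def __init__(self, k=3):
        self.k, self.keys, self.vals = k, [], []

    def insert(self, context, next_token):
        self.keys.append(np.asarray(context))
        self.vals.append(next_token)

    def lookup(self, context):
        q = np.asarray(context)
        dists = [int((key != q).sum()) for key in self.keys]
        nearest = np.argsort(dists)[: self.k]
        counts = Counter(self.vals[i] for i in nearest)
        total = sum(counts.values())
        return {tok: c / total for tok, c in counts.items()}

# Insert every length-8 subsequence of a toy corpus with its next token.
corpus = [1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 4, 1, 2, 3, 4]
CTX = 8
table = FuzzyNextTokenTable(k=3)
for i in range(len(corpus) - CTX):
    table.insert(corpus[i : i + CTX], corpus[i + CTX])

dist = table.lookup(corpus[-CTX:])
print(dist)
```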
Edit: I missed this on the first pass, but I'm totally lost as to where 1.5T comes from. Even if you only have two tokens there are vastly more 8192-length subsequences than that (something like 2^8151.5 times more), and if we're just trying to replicate the same space as something like GPT3.5 or LLaMA then you only get on the order of 0.065T to 0.175T entries to play with, much less when you consider that you have a full probability distribution to store (divide by your unique token count, and again by at least 2 if we store at least IEEE f16 probabilities).
There are lots of interpretations. I actually like KNN for a lot of tasks. My gut says that it still wouldn't perform well here (and for the record, there are efficient data structures for the idea you're describing unless you have some nonstandard modifications, so "arranging the data efficiently to allow fast lookups" is definitely not the core problem), but I admittedly don't have proof of that yet.
For some intuition, imagine the following tasks:
> Repeat the following phrase exactly twice: "sdflhasdflhasdf"
> Repeat the following phrase exactly twice: "sdflhasdflhasdg"
Your fuzzy dictionary or geospatial map can't possibly have enough keys to distinguish the requests (or if it distinguishes those, you can adversarially select different keyboard mashes), and so the result, no matter what it is, would have the same probability distribution for both prompts. Since the desired results are different, at least one of them would have some unavoidable wrongness.
The GPT family, on the other hand, has few issues with random phrase duplication since positional information is something it explicitly considers and is capable of prioritizing over other token information.
* "We propose to extract memorized images by generating many times with the same prompt and flagging cases where many of the generations are the same."
* "- Diffusion models memorize more than GANs - Outlier images are memorized more - Existing privacy-preserving methods largely fail"
* "Stable Diffusion is small relative to its training set (2GB of weights and many TB of data). So, while memorization is rare by design, future (larger) diffusion models will memorize more."
* "It only memorizes a very small subset of the images that it trains on."
* "our goal is to show that models can output training images when generating in the same fashion that normal users do."
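The flagging heuristic in the first quote can be sketched like this (my own toy reconstruction on synthetic 8x8 "images"; the paper's actual pipeline and thresholds differ):

```python
import numpy as np

# Sample many generations for one prompt and flag the prompt as likely
# memorized when many samples are near-duplicates of each other (here:
# small mean absolute pixel distance). Thresholds are made up for the demo.
def flag_memorized(samples, dist_thresh=0.05, count_thresh=5):
    n = len(samples)
    dup_counts = np.zeros(n, dtype=int)
    for i in range(n):
        for j in range(n):
            if i != j and np.abs(samples[i] - samples[j]).mean() < dist_thresh:
                dup_counts[i] += 1
    return bool(dup_counts.max() >= count_thresh)

rng = np.random.default_rng(0)
# 10 near-identical generations (a memorized image plus tiny noise)
memorized = np.stack([0.5 + rng.normal(0, 0.01, (8, 8)) for _ in range(10)])
# 10 unrelated generations
novel = rng.uniform(size=(10, 8, 8))
print(flag_memorized(memorized), flag_memorized(novel))
```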
> * "It only memorizes a very small subset of the images that it trains on."
An interesting question here would be: why does it memorise these images over others? Can the other images still be synthesised with loss via a suitable prompt? If so, are the memorised images important for this? Can this set be reduced further?