I've read the ALBERT paper extensively[0] and agree with your point. We can't boil down years of research effort into just 4 bullet points.
The intention in the above blog post isn't to suggest that we discard detailed research papers. It's just thinking about additional media we can use to make papers more accessible (good examples are distill.pub and paperswithcode). As mentioned in other threads below, PDF can be a limiting medium.
I seem to find myself in the minority, but I don't think distill.pub is a particularly ideal model for publicizing research.
distill.pub heavily favors fancy and interactive visualization over actually meaningful research content. This is not to say that the research publicized on distill.pub is not meaningful, but that it is biased to research that can have fancy visualizations. So you end up seeing a lot of tweakable plots, image-augmentations, and attention weights visualizations. It is also further biased towards research groups that have the resources to create a range of D3 plots with sliders, carved out of actual research time.
For instance, I don't think BERT could ever make it into a distill.pub post. Despite completely upending the NLP field over the last 2 years, it has no fancy plots, multi-headed self-attention is too messy to visualize, and its setup is dead simple. You could maybe have one gif explaining how masked language modeling works. The best presentation of the significance of BERT is "here is a table of results showing BERT handily beating every other hand-tweaked implementation for every non-generation NLP task we could find with a dead-simple fine-tuning regime, and all it had was masked language modeling."
To give another example: I think it's one of the reasons why a lot of junior researchers spend time trying to extract structure from attention and self-attention mechanisms. As someone who's spent some time looking into this topic, you'll find a ton of one-off analysis papers, and next to no insights that actually inform the field (other than super-trivial observations like "tokens tend to attend to themselves and adjacent tokens").
Oh for sure. PDF is tough for so many reasons. Remember that article about the Apple programmer trying to implement the "fit text to screen width" thing for PDF a couple months back? PDF is sooooooooo challenging as a medium. Even something that reads and looks identical, but is different under the hood, could be a big improvement, apparently (I don't actually know how PDF works under the hood other than hearsay of "it's difficult"). In the spirit of Chesterton's fence, maybe not though.
I totally agree that additional media could be good. I got caught up on the "most papers could be compressed to < 50 lines" line and misunderstood the premise you were presenting.
'term of art' means that there's a specific meaning inside a particular sphere of discussion for this term (which has its normal sense elsewhere).
To accuse a research group of shingling means that you think that that group releases a lot of papers one after the other which have a lot of overlap between them and back-cite each other, and that this is done to artificially boost publication count and citation count to make the group look prominent.
Started it to scratch my own itch. I'm a visual learner and learn best when the math is explained along with visualization and intuition. So I started this blog to share explanations of the latest ML research using diagrams/code/analogies and linking them to the associated math.
Yeah, since BERT has been trained specifically on the task of predicting randomly masked words, the resulting augmented sentence should sound natural compared to something like Word2Vec.
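The idea can be sketched in a few lines. The `predict_masked` function below is a hypothetical stand-in for a real masked language model (e.g. BERT behind a fill-mask API), which I've stubbed out with fixed predictions just to show the shape of the pipeline:

```python
import random

def predict_masked(tokens, mask_index):
    # Hypothetical stand-in for a masked LM: a real model (e.g. BERT)
    # would return context-appropriate candidates for the [MASK] slot.
    return {"great": 0.6, "nice": 0.3, "good": 0.1}

def mlm_augment(sentence, rng=random.Random(0)):
    """Mask one random word and replace it with the model's top prediction."""
    tokens = sentence.split()
    i = rng.randrange(len(tokens))
    masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
    predictions = predict_masked(masked, i)
    tokens[i] = max(predictions, key=predictions.get)
    return " ".join(tokens)

augmented = mlm_augment("this movie was good")
```

Because the replacement is conditioned on the full sentence context, the output tends to stay fluent, unlike swapping in a nearest neighbor from static Word2Vec embeddings.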
Interesting, thank you for sharing that. It reminds me of this approach called "PIRL" by Facebook AI. They framed the problem to learn invariant representations. You might find it interesting.
Agree with you on this. For text data, there was a paper called "UDA" (https://arxiv.org/abs/1904.12848) that did some work in this direction.
They augmented text by using back-translation. The basic idea is you take text in English, translate it to some other language, say French, and then translate the French text back to English. Usually you get back an English sentence that is different from the original English sentence but has the same meaning.
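The round trip looks something like this. The two `translate_*` functions are hypothetical placeholders; in practice you'd call an actual machine-translation model or service for each direction:

```python
def translate_en_to_fr(text):
    # Placeholder for a real EN->FR translation model.
    return "<fr> " + text

def translate_fr_to_en(text):
    # Placeholder for a real FR->EN model; a real round trip
    # would return a paraphrase, not the exact original.
    return text.replace("<fr> ", "")

def backtranslate(sentence):
    """Augment a sentence by translating EN -> FR -> EN."""
    return translate_fr_to_en(translate_en_to_fr(sentence))

augmented = backtranslate("The quick brown fox jumps over the lazy dog.")
```

With real translation models, the lossiness of the round trip is the point: it produces a natural-sounding paraphrase you can add to the training set.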
Another approach they use to augment is to randomly replace stopwords/low TF-IDF words (intuitively, very frequent, uninformative words like "a", "an", "the") with random words.
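Here's a rough sketch of that second idea. UDA scores words by TF-IDF; for simplicity this toy version scores by IDF alone over a tiny made-up corpus, so a word appearing in every document (like "the") gets score 0 and is swapped out:

```python
import math
import random

corpus = [
    "the cat sat on the mat",
    "the dog ate the food",
    "the bird flew over the house",
]
docs = [doc.split() for doc in corpus]
vocab = sorted({w for doc in docs for w in doc})

# Inverse document frequency: a word present in every document
# (like "the") scores 0; rarer words score higher.
idf = {w: math.log(len(docs) / sum(w in doc for doc in docs)) for w in vocab}

def augment(sentence, threshold=0.2, rng=random.Random(0)):
    """Replace low-IDF (frequent, uninformative) words with random
    informative words from the vocabulary."""
    informative = [w for w in vocab if idf[w] >= threshold]
    return " ".join(
        rng.choice(informative) if idf[w] < threshold else w
        for w in sentence.split()
    )

augmented = augment("the cat sat on the mat")
```

The intuition is that corrupting uninformative words changes the surface form of the sentence while leaving the label-relevant content words intact.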
You can find implementations of UDA on GitHub and try them out.
I am learning these existing image semi-supervised techniques right now, and the plan is to do research on how we can transfer those ideas to text data. Let's see how it goes.
Wanted to clarify that this is a summary article of the paper. I wrote it to help out people who might not have the mathematical rigor and research background to understand research papers but would benefit from an intuitive explanation.
I would argue that folks like you translating the heavy science into comprehensible ideas to those less deep into the field are doing just as much to advance science as the authors of these papers.
Seriously, this is fantastic work and I cannot compliment you enough on it.
The novelty is in applying 2 perturbations to available unlabeled images and using them as part of training. This is different from what you are describing, which is applying augmentations to labeled images to increase data size.
My immediate question was "how do you use unlabeled images for training?" But then I decided to read the paper :) The answer is:
Two different perturbations to the same image should have the same predicted label by the model, even if it doesn't know what the correct label is. That information can be used in the training.
What if the model's prediction is wrong with high confidence? What if the cat is labeled as a dog for both perturbations? Then wouldn't the system train against the wrong label?
Nope, because of the way it works. In the beginning, when the model is being trained on the labeled data, it will make many mistakes, so its confidence for either cat or dog will be low. In that case, unlabeled data are not used at all.
As training progresses, the model becomes better on the labeled data, so it can start predicting with high confidence on unlabeled images that are trivial, similar-looking, or from the same distribution as the labeled data. So unlabeled images gradually start being used as part of training, and more and more unlabeled data get added as training continues.
The mathematics of the combined loss function and the curriculum-learning part of the paper cover this.
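A toy sketch of the confidence-masking idea, not the paper's exact formulation (the real method also ramps the threshold over training as a curriculum, and the loss terms differ in detail): the consistency loss for an unlabeled example only counts once the model's prediction on it is confident enough.

```python
def combined_loss(supervised_loss, unlabeled_probs, consistency_losses,
                  confidence_threshold=0.9, weight=1.0):
    """Supervised loss plus consistency loss, where the consistency term
    is kept only for unlabeled examples the model is confident about."""
    kept = [
        loss
        for probs, loss in zip(unlabeled_probs, consistency_losses)
        if max(probs) >= confidence_threshold
    ]
    unsup = sum(kept) / len(kept) if kept else 0.0
    return supervised_loss + weight * unsup

# Early in training: all predictions are low-confidence, so the
# unlabeled examples contribute nothing.
early = combined_loss(2.0, [[0.4, 0.6], [0.5, 0.5]], [0.3, 0.5])
# Later: one prediction is confident, so its consistency loss counts.
later = combined_loss(0.5, [[0.95, 0.05], [0.5, 0.5]], [0.3, 0.5])
```

This is what produces the gradual-onboarding behavior described above: as confidence rises, more unlabeled examples pass the mask and start contributing to the gradient.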
[0] https://amitness.com/2020/02/albert-visual-summary/