Some of dplyr's elegance comes from the flexible evaluation mechanism in R, whereby mutate(data, col1+col2) works because the second argument is evaluated in an environment enriched with the columns of data. Python eschews this kind of macro-like extension because, my guess, tampering with evaluation makes a lot of other things complicated (for instance, you can no longer substitute arguments with their values). I think the author of dplyr himself has in later work promoted the use of the ~ operator to explicitly block evaluation of an argument and at least make these departures from regular evaluation explicit. That means dplyr is ahead for interactive use, but for programming you have to switch to a separate API (the underscore "verbs"), and that makes the transition from interactive work to coding a bit steeper. It's all trade-offs, and I am not saying that I know better than either the pandas or dplyr authors.
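For comparison, pandas gets part of the way there without touching evaluation: assign takes callables that receive the frame, and eval parses a small expression mini-language. A minimal sketch (the column names are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})

    # Callable-based: the lambda receives the frame, so it composes in chains.
    out1 = df.assign(col3=lambda d: d.col1 + d.col2)

    # String-based: pandas parses the expression against the frame's columns.
    out2 = df.eval("col3 = col1 + col2")

    print(out1.equals(out2))  # True

Neither needs macro-like tricks, at the cost of either a lambda's noise or a string that tooling can't check.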
As to ggplot, if you believe the future of statistical graphics is in-browser and interactive, you should take a look at Altair for Python (I myself created a small extension to it called altair_recipes). It's based on Vega, like ggplot's anointed (but not quite ready) successor ggvis, and it uses the grammar of graphics (or an interpretation thereof) like ggplot, with extensions for interaction. Simpler than D3 by most accounts.
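To give a taste of the API, here is a minimal sketch (made-up data; interval selection is one of Altair's interaction primitives):

    import altair as alt
    import pandas as pd

    # Made-up data for illustration.
    df = pd.DataFrame({"x": range(10), "y": [v * v for v in range(10)]})

    brush = alt.selection_interval()  # drag-to-select on the chart

    chart = (
        alt.Chart(df)
        .mark_point()
        .encode(
            x="x:Q",
            y="y:Q",
            color=alt.condition(brush, alt.value("steelblue"), alt.value("lightgray")),
        )
        .add_selection(brush)
    )

    chart.save("chart.html")  # renders in the browser via Vega-Lite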
Check out survivorship bias. The name refers to deadly scenarios, but what is fundamental to it is a bias in reporting that depends on the outcome of interest. The concept was first hammered out from the observation that warplanes never got hit in the engines -- well, the ones that made it back, at least. Some people thought the areas with the most damage should be protected. Wald thought otherwise, and it's he who is remembered to this day. "Moments of whimsy" are like anti-aircraft ammo: they hit everywhere, randomly, but we only report the ones that are associated with a life change.
The central limit theorem holds in the limit as the number of variables in the sum approaches infinity. In the finite world, the article explains how it actually plays out. The article is saying: the sum of lognormals is not normal. You are saying: take enough of them and it is normal. The article is still more accurate than your reasoning for 30 stories. From the Wikipedia entry for the central limit theorem: "As an approximation for a finite number of observations, it provides a reasonable approximation only when close to the peak of the normal distribution; it requires a very large number of observations to stretch into the tails". To produce a 95% confidence interval, you have to upper-bound the tails. All methodologies that are based on sums of subtask estimates are not evidence-based. But we already knew software methodologies are not evidence-based, didn't we?
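A quick simulation makes the tail problem concrete (numpy; the lognormal parameters are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n_tasks, n_sims = 30, 200_000

    # Each simulated "project" is the sum of 30 lognormal task durations.
    sums = rng.lognormal(mean=0.0, sigma=1.0, size=(n_sims, n_tasks)).sum(axis=1)

    # Normal approximation suggested by the CLT: match mean and variance.
    mu, sd = sums.mean(), sums.std()
    for p, z in [(0.95, 1.645), (0.99, 2.326)]:
        print(f"{p:.0%} quantile  normal approx: {mu + z * sd:6.1f}"
              f"  empirical: {np.quantile(sums, p):6.1f}")

    # The gap widens the further you go into the tail: the approximation is
    # decent near the peak and increasingly optimistic where deadlines die.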
> You are saying: take enough of them and it is normal.
This doesn't completely undermine your point, but that isn't what they are saying, I think. I read it as saying, via the CLT, that the estimate of the mean of those distributions is normally distributed and centered on [the mean you are actually interested in]. Tails are perhaps somewhat of a red herring here, because you don't really care about them unless you are specifically trying to evaluate the worst-case-but-really-unlikely scenario.
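That reading is easy to check by simulation (numpy; same made-up lognormal tasks as in the comment above):

    import numpy as np

    rng = np.random.default_rng(1)
    # 10,000 experiments, each estimating the mean of 30 lognormal tasks.
    sample_means = rng.lognormal(0.0, 1.0, size=(10_000, 30)).mean(axis=1)

    true_mean = np.exp(0.5)  # E[lognormal(mu=0, sigma=1)] = exp(mu + sigma^2 / 2)
    print(f"true mean: {true_mean:.3f}")
    print(f"average of sample means: {sample_means.mean():.3f}")

    # The sample means cluster (approximately normally) around the true mean,
    # even though each individual task duration is heavily skewed.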
Yes, that is correct. It's been a very long time since I studied statistics, so I'm not sure whether the variance of a mean has the same confidence interval as the mean itself. I suspect not. So you would indeed need a very large number of samples to get good error bars. It's a good point which I hadn't really considered. However, it will never really get that far anyway, because hopefully you'll intervene before the long tail hits you.
I think those really long tails are more of a problem when you are working with "features" that are much longer. If you have 1-day stories and you've been working on one story for a whole week, you know you have a massive problem. It's time to back up and see if there is a way to break it up, or to do it differently.
If you have a feature that is scoped at a month, by the time you get to 5 months you have so much capital invested in the original plan that it's very hard (politically) to say, "Nope... this isn't working out. Let's try something else." Of course, it is very hard to get your organisation to plan at a 1-day level of granularity.
Other decentralized techs had promise but were later re-centralized (FM radio is the first example that comes to mind). It seems to me techies think that the natural tendency toward monopoly in a capitalist economy can be countered by distributed protocols, and I think the evidence is against that. FM radio -> Clear Channel. TCP/IP -> NAT. SMTP/IMAP -> Gmail, Outlook, Yahoo. HTTP -> the web giants. XMPP -> Facebook Messenger. The idea that monopolies are obsolete by the time they form is preposterous. Recommended reading: "The Curse of Bigness" by Tim Wu.
Besides bundles, the duplication between categories and labels hurts. Labels are second-class (they can't have a tab, for instance) and categories are not user-generated. Lack of abstraction and orthogonality, a.k.a. "tacked on".
Bundles are like thematic inboxes: I can go into work mode, family mode, or news mode, stay there for a while, and look only at things that have not been taken care of yet. If you try to replicate that with a filter, Inbox actively prevents it: a search like label:something AND in:inbox won't show anything. Even Apple Mail allows that. There's also the de-emphasis of read/unread status: you can archive without reading more than the subject, and it doesn't show up in all the unread counts and so on. When unread counts are so much in your face, it's hard to ignore them.
While his admission concerns mostly the theory of priming, the problem is not specific to it but a methodological error. In my view, all his research should be looked at with suspicion and with a view to having his experiments replicated independently sooner rather than later. This has been a problem throughout social psychology and other branches of science, so it's nothing specific to Kahneman; he's just the best known. Which brings up the next thought: I always look at information from the epistemic point of view. Where does this knowledge come from: rational thought, experiment, experience, faith? I just notice, without judgement, how eager some of the posters here are to accept a theory or a philosophy even when it has already been debunked, or the evidence is flimsy, or maybe it's formulated in ways that are not even "debunkable". And there is a pragmatic view that if it feels right and it helps, why not.
It's good to look at past experiments that failed, but conditions are ever-changing: ad click-through is down, ad-blocker use is up, more media outlets are closing. Eventually it will be advertorial content or subscriptions, or some other awful end-state like that. Maybe on the way there, there is a point where an alternative model that failed in the past, or a new one, becomes possible. I like the pay-upfront, free-returns model, blocking only abusers. It works for goods where returns have a lot of friction; it should work for content.
It's simple: you retrain the model several times, removing one sample at a time (leave-one-out cross-validation), and repeat until there are no more instances of the erratic behavior. Sorry, I couldn't resist the attempt at humor, but also consider it an argument by contradiction that what you are asking is not possible. It's like asking which tennis lesson caused me to hit a forehand over the tramlines. The answer: the one I didn't have time for.