A lot of people are going to dump on this billion-dollar valuation, but we don't have the terms. Presumably there is a liquidation preference. According to Crunchbase [1], the total capital raised is about $160 million including this round, so the investors don't need this company to be worth $1 billion to come out fine. Sam Altman made the point a few days ago that a lot of these late-stage private financings are kind of debt-like [2, around paragraphs 8-9 and footnote #2], with the result that the valuations aren't really meaningful.
I wouldn't be so down on vocational training for software... It is a field with an unusually high requirement for ongoing learning. And when you have to learn some new tech, you usually have to do it now, and in place, not next September in the nearest college town. Given that, I think there is a pretty clear need for training that's delivered where you are and when you need it, and traditional schools won't be able to do that.
Oh. To be clear: I think ongoing training is great, especially given the pace of change. And things like automated evaluation of code actually work pretty well in MOOCs, in my experience. (Unlike peer grading.) However, the argument for platforms like Udacity hasn't generally been that they're a great advancement for vocational training, but that they disrupt higher ed generally. And that's probably a different valuation metric.
That depends on how sensitive higher ed is to competition from vocational training for things like programmers, right? Something like Udacity isn't going to replace a traditional CS degree, but something more like an IT degree could just be high-end vocational training offered by a traditional four-year institution, and as college costs rise, it could start to eat into the lower end.
I feel that the ongoing learning requirements of software lead to developers becoming VERY good at learning new software... which means they don't need training, because they can self-teach. At least that has been my experience.
How are people who are starting from zero on this supposed to understand it in 60 days? This disclosure is only a little more than a transparency fig leaf.
By the usual means that complicated issues are explained to the public: professional journalists poring over it for clauses of public concern. How many people actually read through the ACA, or the myriad budgets and semi-budgets passed in the US over the past decade with public-disclosure periods much shorter than this one?
The ACA wasn't negotiated in secret. Budgets are normally continuing resolutions; they only negotiate the delta from previous years, which is usually small. And again, budget negotiations don't involve years of secret negotiations.
I'm sorry, continuing resolutions are now our ideal? And even then, the full process is on a similar timescale to just the public disclosure period of this treaty.
Same with respect to full budgets - they're long, they're complicated, but the period during which they're publicly available is on a similar timescale to this 60-day period.
>>I'm sorry, continuing resolutions are now our ideal?
I did not say that continuing resolutions are an ideal. My point was that although budgets are complex, only a relatively small part of them changes from one cycle to the next. People don't have to digest the whole budget, because they already know what was in it before. They only have to understand the changes.
The point about the late-stage investments being not-really-equity is a great point. So my question is: why do financial journalists just about always miss it when they write about tech unicorns?
In the post, PG states that First Round's study is evidence of gender bias in VC financing. But footnote [2] is important: Uber was excluded as an outlier. Now...excluding Uber is reasonable (it is sort of an outlier), but so is not excluding it (it was a company that First Round invested in). When the conclusion from a data analysis depends on which way you go on something like this - which of two reasonable alternatives you pick - then the results are fragile and they don't really support either conclusion very well.
I've played around with this some, and the recognition isn't perfect, but I'm very impressed with how well it can pick up a new pattern from even just one example.
It seems that the main issue here is that with pre-registration, study authors have to pick a single measure of primary benefit at the outset, whereas before, they might have made that choice after getting the results back. The original study is at PLOS ONE, and it is not a difficult read [1]. From that source:
>>Prior to 2000, investigators had a greater opportunity to measure a range of variables and to select the most successful outcomes when reporting their results... Among the 25 preregistered trials published in 2000 or later, 12 reported significant, positive effects for cardiovascular-related variables other than the primary outcome.
That is, in most cases, there are large effects for some outcome, and if they get to choose the primary outcome after looking at some results, they could have been cherry-picking the outcome variables.
> It seems that the main issue here is that with pre-registration, study authors have to pick a single measure of primary benefit at the outset, whereas before, they might have made that choice after getting results back
Right. And then you'd have to use proper statistical reasoning for that state of affairs. Which nobody ever does, because they're not statisticians, it's complicated, and it would reduce the chance of 'statistical significance'.
So they just use a standard calculation of statistical significance -- which is based on the assumption that you have picked a single hypothesis in advance and then done your test. So it's completely invalid to use it how everyone typically does.
Imagine you flip a coin 50 times. Then you see, okay, did I ever get 10 heads in a row? Nope? Okay, how about 5 heads followed by 5 tails? Nope. Okay.... try a couple dozen other things, oh, look, I got exactly 3 tails followed by exactly 3 heads followed by exactly 3 tails again! Let's run my test of statistical significance to see if that was just chance, or is likely significant -- oh hey, it's significant, this is likely a magic coin not random at all!
Nope. If you test everything you can think of, _something_ will come up as 'statistically significant', but it isn't really. Those tests of statistical significance -- which calculate how likely it is that the results you got happened by random chance vs. an actual correlation likely to be repeatable -- are no longer valid if you go hunting for significance like that.
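The inflation described above is easy to demonstrate with a short simulation. Here is a sketch (the numbers -- 20 tests per "study", a 0.05 threshold -- are illustrative assumptions, not from any cited study): every hypothesis is tested on pure noise, yet most studies find at least one "significant" result.

```python
import random

random.seed(1)

# Simulate "studies" that each test 20 independent hypotheses on pure
# noise. Under the null hypothesis, each test's p-value is uniform on
# [0, 1], so each test has a 5% chance of landing below 0.05 by luck.
TRIALS = 20_000
TESTS_PER_STUDY = 20
ALPHA = 0.05

# Count studies where at least one of the 20 null tests looks "significant".
false_positive_studies = sum(
    any(random.random() < ALPHA for _ in range(TESTS_PER_STUDY))
    for _ in range(TRIALS)
)

print(f"empirical rate:          {false_positive_studies / TRIALS:.3f}")
print(f"closed form 1-(1-a)^n:   {1 - (1 - ALPHA) ** TESTS_PER_STUDY:.3f}")
```

The empirical rate lands near the closed-form value of about 0.64: even though nothing real is going on, roughly two-thirds of these hypothetical studies would report a "significant" finding.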
if they get to choose the primary outcome after looking at some results, they could have been cherry-picking the outcome variables
Not just could, would. Choosing your hypothesis after you run the experiment is (or at least should be) a cardinal sin in science for good reason. At the standard p-value cutoff of .05, even when there's absolutely no effect going on the probability of getting a spurious positive result when you do n comparisons is equal to 1 - (.95^n).
So that 5% chance of a type I error if you only look at one test statistic jumps to 40% if you look at ten, and to 72% if you look at 25.
I have a question. Would I be on firm statistical footing if I started looking for effects after the study, as long as I choose a p-value such that (1-((1-p)^n) < 0.05?
In other words, I run a study to determine if jellybeans cause acne [1]. The result is inconclusive (p > .05). Now -- after the results are collected -- I wish to check the correlation between color and acne. There are 20 colors. Would it be statistically sound to "correct" for my cherry-picking by setting p = 0.0025? That would result in an overall false-positive probability of 1 - (1 - 0.0025)^20 ≈ 0.049, just under 0.05.
> Would I be on firm statistical footing if I started looking for effects after the study, as long as I choose a p-value such that (1-((1-p)^n) < 0.05?
No. For one thing, how do you actually know n? If in your analysis you can make c independent binary choices, n would be on the order of 2^c. For most reasonable sequences of choices, n would be impractically large. And if you think you know n, how would you convince everyone else that your value of n is reasonable and trustworthy?
The xkcd case (which is solved by applying the Bonferroni correction) is special because it's a single known test run across 20 experiments, so n is known and the correction is straightforward.
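For the 20-color case, the two standard fixes give nearly the same per-test threshold; this quick check (plain arithmetic, no study data assumed) shows why the 0.0025 figure above works out:

```python
# Family-wise error rate we want across all 20 jellybean-color tests.
alpha, n = 0.05, 20

# Bonferroni: split alpha evenly across the n tests (conservative).
bonferroni = alpha / n

# Sidak: solve 1 - (1 - p)^n = alpha exactly for the per-test p.
sidak = 1 - (1 - alpha) ** (1 / n)

print(f"Bonferroni per-test p: {bonferroni:.5f}")   # 0.00250
print(f"Sidak per-test p:      {sidak:.5f}")        # 0.00256
print(f"FWER with Bonferroni:  {1 - (1 - bonferroni) ** n:.4f}")
```

With the Bonferroni threshold the family-wise error rate comes out to about 0.0488, just under the desired 0.05; the exact Šidák solution is only slightly less strict. Neither helps, though, when n itself is unknown, which is the objection above.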
There is a good paper about this garden of forking paths by Gelman and Loken [1].
> At the standard p-value cutoff of .05, even when there's absolutely no effect going on the probability of getting a spurious positive result when you do n comparisons is equal to 1 - (.95^n).
That's only the probability of getting a positive result by chance; the total probability of getting a positive result even if there is no effect is much higher.
"Probability of getting a positive result by chance" and "probability of getting a positive result when there is no effect" are the same thing. This is what the p-value measures. (The full phrase is generally a combination of the two: "probability of getting a positive result by chance when there is no effect"; but that's so long that people shorten it.)
There are lots of reasons you could get a false positive result that have nothing to do with chance: e.g. non-representative sample, miscalibrated equipment, biased methodology, researcher fraud, flawed statistical analysis, etc.
Indeed, but the comment that you quoted was referring to the case where the methodology and p-values are sound, and the only issue is testing multiple hypotheses without correction.
Cardinals and sin are the realm of religion. Not sure why it's being brought up here.
Choosing a hypothesis after the experiment is run is perfectly valid as long as your experiment is valid for that hypothesis. Besides, you would always run a new experiment anyway.
So perhaps it's worthwhile to trot out the idea of exploratory vs. confirmatory research.
In exploratory research you collect a bunch of data, and then mine it for interesting associations that might merit further study. It's an essential part of the scientific process, but anything you find from doing it needs to be treated as extremely tentative because it's liable to produce spurious results at least as often as it finds genuine effects.
But that's not what this paper's talking about. It's talking about experiments that are being used to support the approval of new treatments and drugs. That's confirmatory research. In that realm you absolutely must paint the target on the wall before you throw your darts.
Those are ways of looking at it, but experiments have definite structure, and that structure can be exploited to create experiments that potentially reveal more information than others would.
If the experiment is constructed correctly, it can support validation of multiple hypotheses.
I would completely believe that most experimentation being performed currently is not structured to make further exploitation possible.
I should have clarified that statement as I don't think it's quite correct.
It should read:
I would completely believe that most experimentation being performed currently is not structured to make further exploitation, of the type desired/wished, possible.
There are almost certainly facts available that are not discovered/discussed from past experiments. Many of them are likely trivial and/or not what researchers would wish or hope that their data could tell them. However, they can still be mined.
In any case though, you would still re-run experiments to further validate/reject the hypotheses. That is simply basic science.
Yes, you can always choose any hypothesis you want. It's largely irrelevant.
Every experiment will support analysis through a set of hypotheses. Just because you didn't select all of those hypotheses before the experiment ran doesn't mean you can't select it after the experiment.
Imagine that an experiment has been run, but you do not know the results (or even what was done). Now you select a hypothesis; if the experiment required to validate that hypothesis is the same as what was run previously, you can look at and use the results.
A hypothesis is like running a query against a database. Many queries are valid, even though the data may not have changed.
>Otherwise, why are you re-running it?
Science requires it. Doctrine from one-off experimentation is religion (hard to dump).
It's fine if you run a separate experiment to justify the hypothesis. If you choose the hypothesis after conducting the experiment, then the p-values you obtain for that hypothesis are invalid.
To be fair, Viagra was originally being developed as a high blood pressure medication. They decided to switch to erectile dysfunction when they found out why study participants were hoarding the pills.
Pre-registration would not hinder them doing that discovery, but they would have had to do another registration and study to get it to the market with the new use. They would likely still be able to use the outcome of the original study to assess the safety of the drug.
Which actually makes an important but subtle point: just because the rate of positive effects identified went down doesn't mean the effects that would otherwise have been identified were all false. It just means we don't know. There may be extremely strong evidence of effects that would overcome any amount of multiple-testing correction, but they still aren't allowed to use it. It might take years or even decades for them to do a follow-up study to validate the result, which means a significant number of people are harmed by not having access to the drug in the interim. Just playing devil's advocate to make the point that it's not a given that we're getting a better overall outcome by being this stringent.
> [...], which means a significant number of people are harmed by not having access to the drug in the interim.
Yes, that might happen. The more likely outcome though is that we are saved from a lot of drugs that don't work any better than chance (or even worse).
Fair enough... but the issue here is about findings of an effect versus no effect. It's about statistical significance. In order for single-comparison p-values and the like to be valid, there has to be a single comparison. There is a way to do 'any-of-k' testing, but the required effect sizes get larger.
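To make "the required effect sizes get larger" concrete, here is a sketch using a simple two-sided z-test (an illustrative setup I'm assuming, not anything from the thread): as you Bonferroni-correct the per-test alpha for k tests, the critical z-value you must clear grows.

```python
from statistics import NormalDist

def critical_z(alpha: float) -> float:
    """Two-sided critical z-value for a given per-test alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)

family_alpha = 0.05
for k in (1, 5, 20):
    per_test = family_alpha / k  # Bonferroni-corrected per-test threshold
    print(f"k={k:2d}  per-test alpha={per_test:.4f}  critical z={critical_z(per_test):.2f}")
```

The thresholds come out to roughly z ≈ 1.96, 2.58, and 3.02 for k = 1, 5, and 20: a real effect has to be that much larger (or the sample that much bigger) to clear the corrected bar.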
The blog post seems to be getting modified at this moment. When I first saw it, it didn't have anything about the misclassifications, but that has been added now.
[1] https://www.crunchbase.com/organization/udacity#/entity
[2] http://blog.samaltman.com/the-tech-bust-of-2015