An odd example, as fireflies are still pretty big in the places they have always been, aren't they? I know when I get to visit my childhood states, they are still there. Similar for cicadas and other bugs of my youth that I didn't realize were far more local than I expected.
It was just a recently notable example. Even as of 2-3 years ago I used to see them a decent amount. They're a highly visible marker of an insect population that is dropping like a rock.
They're also a beautiful creature that I could imagine wishing a child of mine could experience the same way I did, which better illustrates the tragedy of the damage we're doing to the planet.
I'm assuming you still live in the same place? My understanding the last time I took a dive on this is that the numbers are going down, but not in any way that is going to see them gone. You will need to go to where they are, though. And, alas, the PNW is not a place to find them.
I confess a sad assumption that bot traffic is far higher than we have admitted for a long time. Though, maybe we would see different stats specifically for social media sites, where bots astroturf like counts? It certainly feels like we have known for a long time that bots were a larger share of ad viewing than ad companies wanted to admit.
I don't understand what difference bots make. For me, a website (the public part) is a storefront. People walk down the street and see what's inside — that's the purpose. If something should not be available immediately, that's the private part of the store.
I've been monitoring bot traffic on digital platforms for over 10 years. Sure, the crawler share is growing, some even with malicious intentions, and those I detect and block.
I disagree that this pain is worth the cost of making real people spend their life on verification.
For ad views, the concern is specifically that people pay for clicks and views. That that can be so heavily influenced by bot traffic greatly undermines their value.
Same general idea goes for any of the algorithmically driven platforms. The algorithms are ostensibly intended to surface organically discovered things by watching how people interact with things. That they are so susceptible to distortion through bot farms should be a lot more acknowledged than it is. People trust them far more than they should.
There is also a general cost of running things concern. It isn't like it is completely free to execute on bot traffic.
For ads, I believe this must be a problem for ad platform owners.
If the digital platform's storefront is their business, they could afford to spend some budget on bot detection. Bots still come from data center networks, sometimes render pages incompletely, request resources in bulk, and show enough patterns to be flagged internally.
If we look at a medium website, most random crawlers will come from Amazon, Microsoft, DigitalOcean, Hetzner, OVH, and a few other DC networks — these can be blocked easily without harming real users. The rest can be detected and cleaned up, even manually.
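To make the "block DC networks" idea concrete, here is a minimal sketch of checking a visitor address against a list of network ranges. The ranges below are reserved documentation prefixes standing in for real provider ranges, which aren't listed here, and the name `is_datacenter_ip` is just an illustration, not anyone's actual setup.

```python
import ipaddress

# Hypothetical blocklist. These are reserved documentation ranges
# (RFC 5737 / RFC 3849) used as placeholders; real data center
# prefixes would have to come from the providers' published lists.
DATACENTER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_datacenter_ip(address: str) -> bool:
    """Return True if the address falls inside any listed network."""
    ip = ipaddress.ip_address(address)
    # Membership is False when the IP version differs from the network's,
    # so IPv4 and IPv6 entries can share one list.
    return any(ip in network for network in DATACENTER_NETWORKS)


if __name__ == "__main__":
    for candidate in ("192.0.2.17", "203.0.113.5", "2001:db8::1"):
        print(candidate, is_datacenter_ip(candidate))
```

A linear scan like this is fine for a handful of ranges; with thousands of prefixes you would likely want a prefix tree or to do the check at the firewall instead.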
The math is simple: 20,000 visits a day at 15 seconds each = ~83 hours a day lost watching a Cloudflare logo, just because someone doesn't want to dig into the logs. I don't buy it.
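Spelling that arithmetic out, using only the two numbers above (20,000 visits and a 15-second challenge per visit); the function name is made up for the example.

```python
def hours_lost_per_day(visits_per_day: int, seconds_per_challenge: float) -> float:
    """Total human time spent waiting on a challenge page, in hours per day."""
    total_seconds = visits_per_day * seconds_per_challenge
    return total_seconds / 3600  # 3600 seconds in an hour


if __name__ == "__main__":
    # 20,000 visits * 15 s = 300,000 s, which is about 83.3 hours
    print(round(hours_lost_per_day(20_000, 15), 1))
```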
Largely agreed, though I think you are likely underestimating how hard this is to detect. In particular, it is true that many bots can be hosted in data centers, but it is somewhat trivial to launder that traffic through other sources. Malware, in particular, is what I have in mind. Maybe I'm wrong and that has largely gone away?
There is also a bit of mixed incentives. Yes, it is the ad platform that is getting abused. But it is also the ad platform that is charging people based on abused practices.
And it isn't like this is completely made up. Just look at how Facebook killed a ton of people during the "pivot to video" programs. I don't know all of the details, as I was thankfully not in any of the involved industries, but my understanding is it is fairly well documented.
Edit: I changed an "isn't" to "is." I think I was trying to reword at one point, but left it in a way that is opposite what I meant.
For efficiently-hosted sites with little media it's not too bad. E.g. hosting a static site just doesn't cost much, even if you're hammered occasionally.
That's extremely far from all sites though. It's probably safe to say it's a severe minority, particularly when you ignore personal / non-profit-bringing sites. Tons of small and large sites run stuff like poorly-written wordpress or ruby on rails or thousands of microservices doing god knows what. A major increase in request volume on those can easily mean significant increases in hosting charges (e.g. small-% on big, many multiples on small) or significant effort in optimizing (which is expensive too).
The website I mentioned has over 15k webpages and ~200 GB of media, and yet we monitor bots manually and only block them if they're pulling 5k requests in a row. Malicious URLs and multiple 404s are blocked by default. HEAD requests are rejected.
Even on a very bad day, the server's page load time doesn't go over 1s.
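For anyone wondering what a "5k requests in a row" rule might look like, here is a rough sketch. It assumes "in a row" means an uninterrupted run of log entries from the same client, which is one reading and not necessarily how that site actually does it; `clients_over_run_limit` is a hypothetical name.

```python
from typing import Iterable, Optional, Set


def clients_over_run_limit(client_ids: Iterable[str], limit: int = 5000) -> Set[str]:
    """Flag clients that appear `limit` or more times consecutively.

    `client_ids` is the sequence of client identifiers (e.g. IPs)
    in the order the requests were logged.
    """
    flagged: Set[str] = set()
    previous: Optional[str] = None
    run_length = 0
    for client_id in client_ids:
        if client_id == previous:
            run_length += 1
        else:
            # A different client interrupts the run, so start counting again.
            previous = client_id
            run_length = 1
        if run_length >= limit:
            flagged.add(client_id)
    return flagged


if __name__ == "__main__":
    log = ["a"] * 3 + ["b"] + ["a"]
    print(clients_over_run_limit(log, limit=3))
```

On a busy site other visitors would interrupt almost any run, so a sliding time window per client is probably closer to what you'd want in practice.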
However, it seems like I'm indeed looking at the problem through the wrong prism, as what I've seen from the comments suggests that the initial issue is performance, and the bots are what uncover it.
I think a good chunk of it is bot-induced performance problems, yea. Whether that's compute or transfer. And advertisement costs.
Optimization is very very much not a solved problem though, just look at basically all software ever written - it's written for an optimization priority and to a price point (whether commercial $$ or via personal time), and that target's value to its users has shifted rather dramatically.
This is really interesting. I indeed looked at this problem from the wrong perspective.
I'm working on an open-source tool that could be useful for bot detection, but I'm still not confident that anyone would deploy it on-prem and take on the setup/maintenance instead of just routing traffic through the cloud.
I think you'd definitely find some interest, e.g. anyone that intentionally avoids "the cloud" will want something local. Honestly I assume there are some of these already, monitoring apache/nginx/etc logs. Anubis is arguably similar and has been exploding lately, for example, though I'm not sure if it auto-updates its rules at all: https://github.com/TecharoHQ/anubis
As to if it'd get enough interest: yea no idea at all. I wish you luck tho! Clearly there's a need for this kind of thing.
Our team develops a risk-based analytics system that we also use for bot detection. From our perspective, bots shouldn't be blindly blocked, but rather properly monitored and blocked only when necessary. Here is a live demo (1) to give you a general idea.
When most of your server capacity is going to answering the scrapers it matters. It's not that the stuff is hidden, it's that storefront being flooded with 10x as many customers as the fire code allows. And some of them go around asking your employees mindless questions. (Small forum I help moderate: we were getting hammered with what was probably some sort of AI that was taking search queries and feeding them into the forum search. Search is now registered users only.)
> When most of your server capacity is going to answering the scrapers it matters
I've been dealing with the web since the previous century and still haven't managed to build a website that could be hurt by scrapers visiting it.
If you went through the logs, you'd probably see that these bots are on a single IP or subnet, which can be easily detected and blocked instead of closing off search to non-registered users.
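A quick sketch of that kind of log check: count requests per subnet and look at the top offenders. It assumes IPv4 addresses already extracted from the log and groups them by /24, which is an arbitrary choice for the example; `requests_per_subnet` is a made-up helper name.

```python
import ipaddress
from collections import Counter
from typing import Iterable


def requests_per_subnet(addresses: Iterable[str], prefix_len: int = 24) -> Counter:
    """Count requests per enclosing subnet (IPv4, /24 by default)."""
    counts: Counter = Counter()
    for address in addresses:
        # strict=False lets us pass a host address and get its network back.
        network = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
        counts[str(network)] += 1
    return counts


if __name__ == "__main__":
    # Placeholder addresses from reserved documentation ranges.
    sample = ["192.0.2.1", "192.0.2.200", "198.51.100.7"]
    for subnet, hits in requests_per_subnet(sample).most_common():
        print(subnet, hits)
```

This only catches the single-IP or single-subnet case described above; traffic spread across many unrelated networks won't stand out in this count.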
Well, the fun thing is that no one knows how much traffic of what kind they are getting when they use Cloudflare.
You get the numbers that Cloudflare tells you, but who knows if you can trust their stats after their CEO is apparently cherry-picking data to shape their product narrative?
That's the same CEO who just went on a wild, tone-deaf layoff justification, classifying human employees into roles of either a builder, seller, or measurer and saying he wants to get rid of everyone that "measures" the business...
I wouldn't trust a single thing coming out of his mouth.
If it helps, I have found the attitude that writing is mostly for the writer to be healthy in continuing the practice. And it largely tracks with how I feel I have had better understanding of things I have documented than those I have not.
I had a good laugh when Haiku's thinking summarization referred to mayor Mamdani as a, quote, "known anti-Zionist." :-) Probably a good thing to remember is that the value added in RLHF is not partly biased, or biased, but itself bias.
(Context: I asked it to write fake Reddit comments, because I was curious about how realistic they could be. The colorful phrase occurred during its reasoning about the requested subjects.)
In English, the word "known" is generally placed in sentences like, "known sympathizer," more often than in "known Democrat." Compare, "suspected," contrast the more neutral, "is an."
When a researcher discovers that smoking is damaging to the lungs, do they need to provide a solution that allows people to smoke without damaging their lungs? Would their inability to provide a solution take anything away from the research?
Acute would imply that we should flat out stop. Chronic would imply looking for plans to work on it. Acute and chronic would imply that we should both stop and take action to address damages.
If you’re referring to a solution to large datasets not being auditable, she actually did provide one. Something to do with data sheets for these training data sets, similar to those provided for hardware components. At least, if my memory serves me.
I was more irked by the "diversity of teams developing these" concern. Which feels like a benign enough concern, but not one where you can just stop progress.
Worse, I think it is a ridiculously safe bet that the US was home to the most diverse teams you could get for this sort of work. Asking the good faith participants to stop participating would have decreased the stated goal.
If the criticism can't distill up from "bad things could happen", it just isn't useful to keep paying people to come up with that kind of critique.
And it isn't like we stopped paying attention to these concerns, is it? Nor were they completely blindsiding us at the time. The question was largely of what to do about them.
The question was also whether large-scale utilization of LLMs (and also the prerequisite increased training processes) should proceed before these issues were addressed. Clearly, we collectively answered "yes" without any actual reasoning (and arguably, without any collective decision making either).
This feels incoherent. I'm game to agree that there were and are poor decisions being made. But are you proposing that we could have stopped all progress until these vague concerns were addressed?
For some of the concerns, like language understanding, I can't bring myself to think that many of the experts out there were doing any better than these models can do today. Quite the contrary.
And do you think that that would not have been counter to the concern over diversity of teams working on it?
Or concerns over bias going away by having the US attempt to abstain? Good luck with that. It sucks, but China and Russia should stand as stark examples that it turns out you can take strong control over the internet.
It’s pretty common in the security world to have a red team and a blue team. There is overlap in the skillset for both, but there are good reasons to have separate people develop each team, and we wouldn’t expect people to have a talent for both.
Ideally, we like it if the red team can suggest solutions, but that’s not always their job or expertise. Within that context, I’ve rarely if ever heard anyone express the sentiment you are expressing: that a really good red team person isn’t useful if they can’t fix the holes they find.
This is borderline silly, though. It is clumsy to start. But so is walking. As is running. Have you seen people start out on bicycles? What about writing? Talking?
That is to say, all things start out clumsy. And people that are good at it no longer feel that it is clumsy. Which is why a lot of people that have been working with this for any time just don't think about this much.
If a tool is clumsy, we try to improve it; that has been the case since the first stone artifacts created a million years ago.
Do you think that the (sort of) tree-based affordances that most modern code editors do support, like autoindentation and brace pairing/enclosing, are silly too? What about some slightly more advanced features, like the AST-based "extend selection" and "move statement up/down" features in JetBrains IDEs?
Or do you think that the status quo just somehow happens to be exactly right and going any further would be silly?
It would be silly strictly for how strongly worded it is. I should also say that there is nothing wrong with being silly. Someone may actually come up with something some day that meaningfully changes us here.
That is, I am not disagreeing that it can be a little bit clunky. But, a lot of the power that experienced users have in reading code is specifically that they have built a bit of automaticity in reading it. That is, the clunky aspects of fixing it are something you pretty much have to do. You just build speed at doing it automatically.
So, the status quo is to use the helper functions that you want to use. But usually after you get the experience in the clunky phase.
Holy crap is that an amusing/depressing video. Assuming the financial shenanigans outlined in it are even partially accurate, how the heck is this getting allowed?
When it comes to dealing with the abuse of power by those who hold power, the question is not "who's allowing them to do this?", it's "who's going to stop them?".
I'm torn. On the one hand, this is not too uncommon of a problem to run into. On the other, poor practices from coworkers are not going to go away thanks to a language filter.
So, the question will come down to which causes more grief, people abusing this convention, or people that overly use the language features that combat it? It is the standard optimization question between poor practices and enforcement that you have in any question of enforcement.
I would be delighted if we could get some empirical data on this.