Yes, I have worked with these products at a moderately large scale. I have also developed a wiki-like CMS from scratch myself, with templating, macros, and everything. The issues I saw would occur at any scale above "tiny". Essentially as soon as scale-out is needed for a dynamic site, caching also becomes mandatory.
Caching is not well understood at all. People think they understand it, but probably don't actually know most of the pitfalls, especially in the face of failure in the general case.
With Wikimedia, many of the issues are not super important. E.g.: transaction integrity is not relevant. The occasional 404 or eventual (in)consistency problem is also not a big deal.
Similarly, the user-based rendering is also fairly easy to handle. The vast majority of the content (the rendered wiki text) is identical-ish between users. The headers, footers, CSS, etc. do change, but this can be handled on the client side in a variety of ways, typically via JavaScript. Even with mostly static content, headers can be altered on the way out "at the edge" by a CDN or CDN-like system.
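A minimal sketch of what I mean, assuming the served HTML is byte-identical for everyone and only the chrome gets filled in client-side (the endpoint name and element id are made up for illustration):

```typescript
// Hypothetical: the static page is identical for every user; this script
// fills in the user-specific header after load.
async function personalizeHeader(): Promise<void> {
  const res = await fetch("/api/session", { credentials: "include" });
  if (!res.ok) return; // anonymous visitor: keep the generic static header

  const session: { username: string } = await res.json();
  const slot = document.getElementById("user-menu");
  if (slot) {
    slot.textContent = session.username; // only this bit differs per user
  }
}

document.addEventListener("DOMContentLoaded", () => {
  void personalizeHeader();
});
```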
This is my point: If you're going to cache things in something like a CDN, then you're 90% of the way there to static content anyway! Take the leap and go the whole way to get the benefits. Some things need to be done 100%, otherwise it's like being almost pregnant.
Some random examples of static vs dynamic problems/comparisons:
- You mentioned templating: In the static world, you can regenerate all static pages against the new template asynchronously, at whatever slow "batch process" rate you desire. In a dynamic system, a template change would typically invalidate the entire cache and blow your CPU budget instantly, dragging the synchronous rendering path down into molasses. This can be managed, but it's complex and difficult. Code changes can similarly result in either mixed/corrupt cache content or instant CPU spikes. At least with static content you can manage the rollout by content instead of by server. A trick you can pull with static content is to pre-generate the updated pages side by side and swap instantly (there's a rough sketch of this after the list). This is impossible with dynamic content generation: you either get mixed content as the caches slowly expire, or an instant load spike.
- Variations such as mobile/non-mobile pages: These add to cache pressure and can cause sudden performance cliffs, where the working set growing from 90% of your cache capacity to 110% of it turns the overflow into misses and produces dramatic spikes in origin load. With static content, your "utilisation" is known in advance and changes slowly. Everything is served at the same speed, always. In fact, with systems like S3, throughput goes up as your data volume increases because of the way the sharding works.
- Memcache and the like are hilarious to me. These days the "standard" is to have layers upon layers of caching to paper over the fundamental bottleneck of the database tier. SiteCore has so many layers of caching that I lost count. Is it ten? Eleven maybe? Whatever. The point is that keeping all of that straight in your head as a web developer is so difficult that SiteCore keeps all caching off by default, murdering performance for most sites most of the time. It's just too "difficult" to have it on by default because devs would lose their minds. Just providing devs with access to shared Redis clusters or CDNs for purge operations is an access-control task all by itself. In large enterprises this is almost never done, and so it becomes a tradeoff between cache TTLs and freshness/consistency. I've lost count of the number of times I've heard some dev tell users to "clear their browser cache" as the "fix".
- Cache purging: you can hide a fundamental performance problem under a layer of caching for years, have it grow to monumental proportions, and then blow up your production site for days while everything slowly recovers from 1,000% load. This has caused several large-scale outages that have hit headlines.
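To make the "pre-generate and swap" trick concrete, here's a rough sketch, assuming the web server serves whatever directory a "current" symlink points at (paths and names are purely illustrative):

```typescript
// Node sketch: build the new pages into their own directory, then atomically
// repoint the live symlink. Readers see either the old tree or the new one,
// never a mix.
import { rmSync, symlinkSync, renameSync } from "node:fs";

function publishBuild(newBuildDir: string, liveLink = "/srv/site/current"): void {
  const staging = `${liveLink}.next`;

  rmSync(staging, { force: true });   // clean up any leftover staging link
  symlinkSync(newBuildDir, staging);  // point a temp symlink at the new pages
  renameSync(staging, liveLink);      // rename(2) over the live link is atomic
}

// e.g. after the batch regeneration finishes:
publishBuild("/srv/site/builds/2024-05-01T12-00");
```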
Look at it this way: Wikipedia is something like 99% read-only access and 1% write access. With a static hosting model, the VMs would only need to be scaled to handle the write throughput, not the read throughput. The content could be put on cloud storage like S3 and that's it. The whole site could be hosted on a handful of small VMs kept around just for HA/DR!
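Back-of-envelope, with purely illustrative numbers (not measured Wikipedia figures), to show why the write path is the only thing left to size:

```typescript
// Reads come straight off object storage / CDN; only writes hit app servers.
const pageViewsPerDay = 250_000_000; // assumed read volume
const writeFraction = 0.01;          // "~99% read, ~1% write"
const secondsPerDay = 86_400;

const writesPerSecond = (pageViewsPerDay * writeFraction) / secondsPerDay;
console.log(`~${writesPerSecond.toFixed(0)} writes/sec to provision app servers for`);
// ≈ 29 writes/sec on average; even with a generous peak multiplier that is
// "a handful of small VMs" territory, which is the point.
```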
Which RFC? The RFC moving it to IANA in 2012? It's been in development in some way since the 80s [1], the current timezone names in it are from the 90s [2], and it was definitely already the standard set of timezone definitions when I started using Linux in the 00s.
It might help to think about time-series databases from the requirements they're addressing. "Time-Series Database Requirements" [0] is a good summary of the problem space.
Is anyone working on, let's name it, CookHub already? So essentially GitHub for recipes. The nerd in me wants to fork recipes and share my small adaptations with the community.
I see things going the same direction. I've always been pretty anti-dashboard because of the lack of long-term utility. If you want to make data-driven decisions, you need to spend the time codifying the decision making process and automating that. Automated actioning off data is far more impactful than automated visualization of data.
When it comes to business stakeholders, the biggest obstacle is trust. If you're not a data person, or a developer, decision logic being gated behind code is a scary black box. "If I can't control it, I can't trust it". That feeling gets even worse when we build systems that are controlled by AI or ML.
I think we need more solutions for data teams that allow business stakeholders to take part in the automated workflow deployment. Let them see how things are connected together. Let them verify that the decisions are being made correctly every day. Let them tweak levers so they have a say in how things are working. That's the only way to move beyond the current environment where everyone wants a dashboard, but no one looks at it.
That's a big problem I'm aiming to solve right now.
> English Wikipedia gets about 250M page views per day, which is only about 3,000 per second. If they used an efficient language instead of PHP, that's well within the capability of a single modern server!
This type of comment betrays a complete lack of knowledge about how large scale internet systems operate (which is something I've worked on for the last 10+ years) and comes across as incredibly naive. I can immediately point out huge flaws in your thinking:
* You're assuming that traffic is constant throughout the day, i.e. that you can divide the daily total by 86,400 seconds and provision for that mean rate rather than for the peak (there's a rough back-of-envelope sketch after this list)
* You're ignoring the problem of unpredictable hotspots (single articles suddenly becoming orders of magnitude more popular than the median in relatively unpredictable and spiky ways; think World Cup final or major earthquake)
* You're not reserving any safety margins for unexpected traffic growth, so you'd likely run into cascading failures
* You're assuming that the entirety of Wikipedia, including all the history and such, can be held in RAM. Or, if not, that a single server has enough drives that spindle capacity won't be a problem.
* You're not even considering networking/bandwidth costs. How many network cards would your server need?
* You're ignoring any problems with replication/redundancy (e.g., no backups?) so your site would be a nightmare in reliability terms. Of course, once you do that, you'll need to reason about consistency problems.
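To put rough numbers on the first point above (the multipliers here are assumptions for illustration, not measured Wikipedia traffic):

```typescript
// "250M/day ≈ 3,000/sec" is the mean; you provision for the peak.
const pageViewsPerDay = 250_000_000;
const meanRps = pageViewsPerDay / 86_400; // ≈ 2,900 req/sec averaged over the day

const diurnalPeakFactor = 2.5; // busiest hour vs. daily mean (assumed)
const eventSpikeFactor = 4;    // breaking news / World Cup hotspot (assumed)
const safetyMargin = 1.5;      // headroom so one bad minute doesn't cascade (assumed)

const provisionFor = meanRps * diurnalPeakFactor * eventSpikeFactor * safetyMargin;
console.log(`mean ≈ ${meanRps.toFixed(0)} rps, provision for ≈ ${provisionFor.toFixed(0)} rps`);
// The "one efficient server" estimate quietly assumed the mean, not this number.
```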
I think your logical fallacy is that because you can't understand something (that running Wikipedia could cost so much), it must mean that the thing you can't understand isn't true. Instead, I'd suggest that you'd get further by focusing on discovering the limits of your understanding and genuinely asking yourself: "How could it be that it costs so much? What may I be missing?"
Former Paddle customer, don't believe the hype. Everything is duct tape and half assed with the API, the checkout process is happy path or bust, and you're going to have a hard time migrating to a different processor.
Stripe + TaxJar was cheaper and easier to implement and maintain.
> For example, I once joined a team maintaining a system that was drowning in bugs. There were something like two thousand open bug reports. Nothing was tagged, categorized, or prioritized. The team couldn’t agree on which issues to tackle
> I spent almost three weeks in that room, and emerged with every bug report reviewed, tagged, categorized, and prioritized.
Honestly, this is one of those traps a team can fall into, where nobody feels empowered to ignore the rest of the business for 3 weeks to put the bell on the cat. The only person without deliverables and due dates is the new hire. And it takes a special kind of new hire to have the expertise to parachute in, recognize that work needs to be done, and then do it with little supervision.
But he's right in general, that you can get some surprising things done by just putting in the time and focus. Which is why it's so utterly toxic that corporate America runs on an interrupt-driven system, with meetings sprinkled carelessly across engineer calendars.
The "Queue Management for Inbound Content" section is exactly what I really want as well, so much that I've just started building it instead of waiting and hoping.
For me, this is the lynchpin:
> "all content I think my future self would appreciate me consuming." (emphasis mine)
Current content feeds are optimized for engagement (i.e. advertisement load) and thus won't conceive of a "future self", only what your current self will look at and click on right now.
I think that content feeds need to incorporate goal-orientation and move away from a right-here-right-now orientation. Anyone wanting to do anything difficult needs to optimize their information diet over a very long time-scale, like years, so content-feed tools should be aware of human-scale timelines (e.g. high school, college, career, parenting).
Humans thrive on learning and growth but so many platforms choose to see their users as merely inputs to an ad-delivery optimization system.
> I have little visibility into required time investment and foundational context until I’ve opened it and started thinking about it.
This is another thing that really annoys me about our current media ecosystem, and is really also a symptom of not properly conceiving of a person's personal development over time and a person's changing needs over time.
---
To look at this blog post from 1,000 feet up, I'd say that Jonathan is unfortunately deprived of these tools because our media software ecosystem is madly building things for users who want to look at things and not think; such platforms are heavily consumeristic and thus fantastic for advertising revenue and monetization generally (e.g. Instagram, Snapchat, TikTok, Pinterest, Canva).