If you transfer large objects, H2 on the backend will increase transfer costs (due to framing). If you deal with many moderate or small objects, however, H2 can improve CPU usage for both the LB and the backend server, because parsing is cheaper and multiple messages can be merged into a single packet on the wire, reducing the number of syscalls. Normally it's just a matter of enabling H2 on both sides and running some tests. Be careful not to mix too many clients over one backend connection, though, if you don't want a slow client to limit the others' transfer speed or even cause head-of-line blocking! Typically supporting ~10 streams per backend connection improves things quite a bit over H1 for regular sites.


It's amazing how people who have visibly never dealt with high loads can instantly become vehement against those reporting a real issue.

The case where ports are quickly exhausted is with long-lived connections, typically WebSocket. And with properly tuned servers, the 64k-port limit per server is reached very quickly. I've seen several times the case where admins had to add multiple IP addresses to their servers just to hack around the limit, declaring each of them in the LB as if they were distinct servers. Also, even if Linux is now smart enough to try to pick a random port that's valid for your tuple, once your ports are exhausted the connect() system call can cost quite a lot, because it performs multiple tries until it finds one that works. That's precisely what IP_BIND_ADDRESS_NO_PORT improves, by letting the port be chosen at the last moment.
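For illustration, here's a minimal sketch of how that option is typically used (Linux >= 4.2; connect_from() is a hypothetical helper and error handling is omitted):

    /* Bind to a source IP without reserving a port: the kernel picks the
     * port at connect() time, when it only has to be unique per 4-tuple
     * rather than globally. */
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int connect_from(struct in_addr src, const struct sockaddr_in *backend)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        struct sockaddr_in local;

        setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one));

        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_addr   = src;
        local.sin_port   = 0;  /* no port reserved at bind() time */
        bind(fd, (struct sockaddr *)&local, sizeof(local));

        return connect(fd, (const struct sockaddr *)backend,
                       sizeof(*backend)) ? -1 : fd;
    }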

H2 lets you work around all this more elegantly by simply multiplexing multiple client streams onto a single connection. And that's very welcome with WebSocket, since each stream usually carries little traffic. The network also sees far fewer packets, since many small messages can be merged into a single packet. So there are cases where it's better.

Another often overlooked point is that cancelling a download over H1 means breaking the connection. Over H2 you keep the connection open, since you simply send an RST_STREAM frame for that stream within the connection. The difference is important on the frontend when clients abort downloads multiple times per browsing session (you save redoing the TLS setup), but it can also make a difference on the backend, because quite often an aborted transfer on the front will also abort an H1 connection on the back, and then that's much less fun for your backend servers.


> It's amazing how people having visibly never dealt with high loads

I've built multiple systems at 1M+ r/s and Tb+ scale.

> The case where ports are quickly exhausted is with long connections, typically WebSocket

Yes, HTTP2 is great for websockets. I was never advocating against it. The comment I was replying to was under the false assumption that you needed an outbound backend connection for every incoming connection. All of his concerns are solved problems in any modern open source load balancer. See https://www.haproxy.com/blog/http-keep-alive-pipelining-mult... ;)


But it's the same for other long sessions such as slow downloads and git clones. Sites concerned about the number of source ports are not those serving just favicon.ico and bullet.png, but mainly those dealing with long transfers.

Also there's a cascade effect on large sites: as long as your servers respond fast, everything's OK. Then suddenly a database experiences a hiccup, everything saturates, and once you reach the situation where the LB has all of its ports in use, it can take a while to recover because connect() becomes much slower (I've observed delays of up to 50ms!). At this point there's no hope of recovering in a sane time, because the excess connections are not even served by the servers; they sit in the system's accept queue, so they keep a port busy, which slows down connect(), which means even more connections are needed for other incoming requests. If the LB is not properly sized and tuned, you'd rather just kill it to get rid of all the connections at once, wait a second or two for the RST storm to calm down, and start again.

H2 can avoid that, at the expense of the other issues I mentioned in another response above (i.e. don't multiplex too much to the servers, 5-10 streams max, to avoid the risk of inter-client head-of-line blocking). But H2 also comes with higher transfer costs than H1 for large objects, due to framing.


I suspect it might feel indecent to tell others you suffer when you're both free and rich, and it's difficult for them to figure out what's wrong with you.

Instead, people in such a position should probably go out and join associations which distribute food to those who need it. At least they'll see that they're doing something good to improve others' condition, and they would probably feel better.


What you're describing is for TCP. On TCP you can perform a write(64kB) and watch the stack send it as 1460-byte segments. On UDP, if you write(64kB) you'll get a single 64kB datagram composed of 45 fragments. Needless to say, it suffices for any one of them to be lost in a buffer somewhere for the whole datagram never to be received and for all of it to be retransmitted by the application layer.

GSO on UDP allows the application to send a large chunk of data, indicating the MTU to be applied, and lets the kernel pass it down the stack as-is, until the lowest layer that can split it (network stack, driver or hardware). In this case that layer builds packets, not fragments: on the wire there really are independent datagrams with different IP IDs. So if any of them is lost, the others are still received, and the application can focus on retransmitting only the missing one(s). In terms of route lookups it's as efficient as fragmentation (since there's a single lookup), but it ensures that what is sent over the wire is usable all along the chain, at a much lower cost than having the application send each datagram individually.
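A minimal sketch of what this looks like with Linux's UDP_SEGMENT option (Linux >= 4.18; send_gso() is an illustrative name and error handling is omitted):

    /* UDP GSO: hand the kernel one large buffer plus a segment size; it
     * gets split into independent datagrams (own IP IDs), not fragments. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/udp.h>
    #ifndef UDP_SEGMENT
    #define UDP_SEGMENT 103        /* from linux/udp.h on older libcs */
    #endif

    ssize_t send_gso(int fd, const void *buf, size_t len, uint16_t seg_size)
    {
        struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
        char ctrl[CMSG_SPACE(sizeof(seg_size))];
        struct msghdr msg = { 0 };
        struct cmsghdr *cm;

        msg.msg_iov        = &iov;
        msg.msg_iovlen     = 1;
        msg.msg_control    = ctrl;
        msg.msg_controllen = sizeof(ctrl);

        cm = CMSG_FIRSTHDR(&msg);
        cm->cmsg_level = SOL_UDP;
        cm->cmsg_type  = UDP_SEGMENT;
        cm->cmsg_len   = CMSG_LEN(sizeof(seg_size));
        memcpy(CMSG_DATA(cm), &seg_size, sizeof(seg_size));

        return sendmsg(fd, &msg, 0);  /* emits ceil(len/seg_size) datagrams */
    }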


Nowadays the vast majority of CVEs have nothing to do with security; they're just Curriculum Vitae Enhancers, i.e. a student finding that "with my discovery, if A, B, C and D were granted, I could possibly gain some privileges", despite A/B/C/D being mutually exclusive. Sorting out that garbage is the everyday job of any security person. So what the kernel does is not worse at all.


There's still the problem of sending to multiple destinations: OK sendmmsg() can send multiple datagrams, but for a given socket. When you have small windows (thank you, Cubic), you'll just send a few datagrams this way and won't save much.


> There's still the problem of sending to multiple destinations: OK sendmmsg() can send multiple datagrams, but for a given socket.

Hmm? sendmsg takes the destination address in the `struct msghdr` structure, and sendmmsg takes an array of those structures.

At the same time, the discussion of efficiency is about UDP vs. TCP. TCP writes are per socket, to the connected peer, and so UDP has the upper hand here. The concerns were about how TCP allows giving a large buffer to the kernel in a single write that then gets sliced into smaller packets automatically, vs. having to slice it in userspace and call send more often, which sendmmsg solves.
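For illustration, a minimal sketch of that fan-out on one unconnected UDP socket (send_to_many() and the fixed array bound are assumptions of the sketch; error handling omitted):

    /* One syscall, several destinations: each mmsghdr carries its own
     * msg_name, so a single unconnected socket can fan datagrams out. */
    #define _GNU_SOURCE             /* for sendmmsg() */
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int send_to_many(int fd, struct sockaddr_in *dst, struct iovec *payload,
                     unsigned int n)
    {
        struct mmsghdr msgs[64];    /* assume n <= 64 for this sketch */
        unsigned int i;

        for (i = 0; i < n; i++) {
            memset(&msgs[i], 0, sizeof(msgs[i]));
            msgs[i].msg_hdr.msg_name    = &dst[i];   /* per-message dest */
            msgs[i].msg_hdr.msg_namelen = sizeof(dst[i]);
            msgs[i].msg_hdr.msg_iov     = &payload[i];
            msgs[i].msg_hdr.msg_iovlen  = 1;
        }
        return sendmmsg(fd, msgs, n, 0);  /* messages sent, or -1 */
    }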

(You can of course do single-syscall or even zero-syscall "send to many" with io_uring for any socket type, but that's a different discussion.)


> > There's still the problem of sending to multiple destinations: OK sendmmsg() can send multiple datagrams, but for a given socket.

> Hmm? sendmsg takes the destination address in the `struct msghdr` structure, and sendmmsg takes an array of those structures.

But that's still pointless on a connected socket. And if you're not using connected sockets, you're performing a destination lookup for each and every datagram you send. It also means you're running with small buffers by default (the 212kB default buffer per socket is shared among all your destinations, no longer per destination). Thus you normally want to use connected sockets when dealing with UDP in environments with performance requirements.
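For illustration, a minimal sketch of the connected-socket approach (udp_connect() is a hypothetical helper):

    #include <sys/socket.h>
    #include <netinet/in.h>

    /* connect() on a UDP socket fixes the 4-tuple: the destination/route
     * lookup is done once here instead of on every datagram, and the
     * socket's buffers serve this single peer. */
    int udp_connect(const struct sockaddr_in *peer)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (connect(fd, (const struct sockaddr *)peer, sizeof(*peer)) < 0)
            return -1;
        return fd;  /* plain send()/recv() now work, lookup-free */
    }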


The default UDP buffers of 212kB are indeed a big problem for every client at the moment. You can optimize your server as much as you want: all your clients will experience losses if they pause for half a millisecond to redraw a tab or update an image, just because the UDP buffers can only store so few packets. That's among the things that must urgently change if we want UDP to start working well on end-user devices.
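Applications can at least ask for more themselves (a sketch; the 4MB figure is an arbitrary example, and the kernel silently caps the value at net.core.rmem_max unless that is raised too):

    #include <sys/socket.h>

    /* Grow a UDP socket's receive buffer beyond the ~212kB default
     * (net.core.rmem_default). */
    int grow_rcvbuf(int fd)
    {
        int size = 4 * 1024 * 1024;  /* illustrative: 4MB */

        return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));
    }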


Something that nobody seems to be talking about here is the congestion control algorithm, which is the real problem. Cubic doesn't like losses. At all. In the kernel, pacing is implemented to minimise losses, allowing Cubic to work acceptably for TCP, but if the network is slightly lossy, performance is terrible anyway. QUIC strongly recommends implementing pacing, but pacing is less easy to implement accurately in userland, where you have to cross a whole chain of components, than at the queue level in the kernel.

Most QUIC implementations use different variations around the protocol to make it behave significantly better, such as preserving the last metrics when facing a loss so that, in case it was only a reordering, they can be restored, etc. The article should have compared different server-side implementations with different settings. We're used to seeing ratios of 1:20 in some transatlantic tests.

And testing a BBR-enabled QUIC implementation shows tremendous gains compared to TCP with Cubic. Ratios of 1:10 are not uncommon with moderate latency (100ms) and losses (1-3%).

At least what QUIC is making clear is that if TCP has worked so poorly for a very long time (remember that the reason for QUIC was that it was impossible to fix TCP everywhere), it's in large part due to congestion control algorithms. Since those were implemented in the kernel by people carefully reading academic papers that never consider reality, only in-lab measurements, such algorithms behave pretty poorly in front of the real internet, where jitter, reordering, losses, duplicates etc. are normal. QUIC allowed many developers to put their fingers in the algorithms, adjust some thresholds and mechanisms, and we're seeing things improve fast (it could have improved faster if OpenSSL hadn't decided to play against QUIC a few years ago by refusing to implement the API everyone needed, forcing people to rely on locally built SSL libs to use QUIC). I'm pretty sure that within 2-3 years we'll see some of the QUIC improvements ported to TCP, just because QUIC is a great playground for experimenting with algorithms that for four decades had been the reserved territory of just a few people who denied the net as it is and worked for the net as they dreamed it.

Look at this for example, it summarizes it all: https://huitema.wordpress.com/2019/11/11/implementing-cubic-...


Not surprised. These animals are fascinating. We're not even sure we've caught everything in their language; maybe it's not just sound-based, and the way they shake their trunk and ears, or the way they dance, counts a lot as well. I've long wondered whether some of them have developed some form of religion, or even whether they've overcome that need, leaving it to humans only.

Another study already showed that some monkeys have distinct vocal words for a tiger and an eagle, and use them to make the whole group go up or down the tree depending on where the threat comes from. Elephants, being more complex animals that also live in groups, are quite likely to have an even more elaborate language.


I ran some tests on phi-3 and mistral-7b, and it's not very hard to teach them to use tools, even though they were not designed for this. It turns out these models obey their instructions quite well, and when you explain to them that if they need to look up data on the net or perform a calculation they must formulate the request with a specific syntax, they do a pretty good job. You just have to enable reverse prompting so that evaluation stops after their request, your tools do the job (or you simulate it manually), and their task continues.

