The Problem with Generalizations: A Response to "The Problem with Benchmarks" by raindog308
I was going to post this in the Rants category, but would rather not confine my thoughts on this to only LES members (the Rants category requires you to be signed in to view the threads).
YABS – and many others like it over the years – attempts to produce a meaningful report to judge or grade the VM. It reports CPU type and other configuration information, then runs various disk, network, and CPU tests to inform the user if the VPS service he’s just bought is good, bad, or middling. But does it really?
I stumbled upon a blog post recently from raindog308 on the Other Green Blog and was amused to see YABS called out. Raindog states that YABS (and other benchmark scripts/tests like it) may be lacking in its ability to "produce a meaningful report to judge or grade the VM". Some of the reasons given for discrediting it, and the proposed alternatives, had me scratching my head. I notice that raindog has been hard at work lately pumping up LEB with good content. But is he really?
I'm going to cherry pick some quotes and arguments to reply to below -
It’s valid to check CPU and disk performance for outliers. We’ve all seen overcrowded nodes. Hopefully, network performance is checked prior to purchase through test files and Looking Glass.
I'd argue that not all providers have readily available test files for download and/or a LG. It can also be misleading when hosts simply link to their upstream's test files, or host their LG on a different machine/hardware that may not have the same usage patterns and port speeds one would see in the end-user's VM. However, the point is noted: do some due diligence and research the provider a bit, as that's certainly important.
I'd also argue that iperf (which YABS uses for the network tests) is much more capable than a simple test file/LG. If all you care about is a quick, single-threaded, single-direction HTTP download, then sure, use the test file to your heart's content. BUT if you actually care about overall capacity and throughput to different areas of the world in BOTH directions (upload + download), then a multi-threaded, bi-directional iperf test can be much more telling of overall performance.
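To make that concrete, here's a minimal sketch of the two approaches (the hostnames and port below are placeholders, not the actual YABS server list):

```bash
# One stream, one direction, one location -- the typical "test file" check:
wget -O /dev/null https://speedtest.example.com/1GB.test

# Multi-stream, bi-directional iperf3 against a (hypothetical) public iperf server:
iperf3 -c iperf.example.net -p 5201 -P 8      # 8 parallel streams, VM uploading
iperf3 -c iperf.example.net -p 5201 -P 8 -R   # reverse mode: VM downloading
```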
But other than excluding ancient CPUs and floppy-drive-level performance, is the user really likely to notice a difference day-in and day-out between a 3.3Ghz CPU and a 3.4Ghz one? Particularly since any operation touches many often-virtualized subsystems.
I actually laughed out loud at this comment. I guess I didn't realize that people use benchmarking scripts/tools to differentiate between a "3.3Ghz CPU and a 3.4Ghz one"... (Narrator: "they don't").
Providers have different ways that they fill up their nodes -- overselling CPU, disk space, network capacity, etc. is, more often than not, mandatory to keep prices low. Most (all?) providers are doing this in some form, and most of the time the end-user is none the wiser, as long as the ratios are done right and the resources are there to meet their workload.
A benchmarking script/tool does help identify cases where the provider's nodes are oversubscribed to excess, and it's immediately obvious when disk speeds, network speeds, or CPU performance are drastically lower than they should be for the advertised hardware. Could this be a fluke and resolve itself just a few minutes later? Certainly possible. On the flip side, could the performance of a system with good benchmark results devolve into complete garbage minutes/hours/days after the test is run? Certainly possible as well. Multiple runs of a benchmark tool spread out over the course of hours, days, or weeks can help identify whether either of these cases is true.
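As a rough illustration (not something YABS does for you), spreading a few runs over a day and keeping timestamped logs only takes a couple of lines, using the usual curl one-liner:

```bash
# Run the benchmark four times, roughly 6 hours apart, and keep the logs
# so you can compare trends instead of judging from a single snapshot.
for run in 1 2 3 4; do
    curl -sL https://yabs.sh | bash > "yabs-$(date +%Y%m%d-%H%M).log" 2>&1
    sleep 6h
done
```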
On a personal note, I've seen dozens of instances where customers of various providers post their benchmark results and voice concerns about system performance. A large percentage of the time, the provider is then able to rectify the issues presented by fixing hardware problems, identifying abusers on the same node that are impacting performance, etc. From this, patterns start to emerge: you can see which providers take criticism (via posts containing low-performing benchmarks) and use it to improve their services and ensure their customers are happy with the resources they paid for. Other trends help identify providers to avoid, where consistently low network speeds, CPU scores, etc. go unaddressed, indicating unhealthy overselling. But I digress...
If I could write a benchmark suite, here is what I would like it to report.
Here we get a rapid-fire list of unquantifiable metrics that would be in raindog's ideal benchmarking suite:
- Reliability
- Problems
- Support
- Security
- "Moldy Oldies" (outdated VM templates)
- "Previous Residents" (previous owners of IPs)
- "My Neighbors" (anybody doing shitcoin mining on the same node)
Raindog realizes it'd be hard to get at these metrics -
Unfortunately, all of these things are impossible to quantify in a shell script.
They'd be impossible to quantify by any means (shell script or fortune teller)... Almost all of the above metrics are subject to personal opinions and preferences. Some of them can be investigated by other means -- reliability: one could check out a public status page for that provider (if available); problems: one could search for public threads of people having issues and note how the provider responds to/resolves them; moldy oldies: a simple message/pre-sales ticket to the provider could alleviate that concern.
Anyways, the above metrics are highly subjective and are of varying importance to prospective buyers (someone might not give a shit about support response times or whether their neighbor is having a party on their VM).
But what's something that everyone is actually concerned about? How the advertised VM actually performs.
And how do we assess system performance in a non-subjective manner? With benchmarking tests.
If you ask me which providers I recommend, the benchmarks that result from VMs on their nodes are not likely to factor into my response. Rather, I’ll point to the provider’s history and these “unquantifiables” as major considerations on which providers to choose.
That's great and, in fact, I somewhat agree here. Being an active member of the low end community, I have the luxury of knowing which providers run a tight ship and care about their customers and which ones don't, based on their track record. But not everyone has the time to dig through thousands of threads to assess a provider's support response or reliability, and not everyone in the low end community has been around long enough to develop opinions and differentiate between "good" and "bad" providers.
I also found it highly amusing that a related post on the same page links to another post regarding a new benchmarking series by jsg. That post is also written by raindog, and the framing there is a bit different: benchmarks aren't represented as entirely useless or lacking in their ability to "produce a meaningful report to judge or grade the VM". Both posts talk about the limitations of benchmarking tools, so I'm not really sure what happened in the few months between them, but I'd just like to note the change in tone.
My main point in posting this "response" (read: rant) is that benchmark tests aren't and shouldn't be the all-in-one source for determining if a provider and a VM are right for you. I don't advertise the YABS project in that manner. In the description of the tool I even state that YABS is "just yet another bench script to add to your arsenal." So I'm not really sure of the intent of raindog's blog post. Should users not be happy when they score a sweet $7/year server that has top-notch performance and post a corresponding benchmark showing how sexy it is? Should users not test their system to see if they are actually getting the advertised performance that they are expecting? Should users not use benchmarking tools as a means of debugging and bringing a provider's attention to any resource issues? Those "unquantifiables" that are mentioned certainly won't help you out there.
This response is now much longer than the original blog post that I'm responding to, so I'll stop here.
Happy to hear anyone else's thoughts or face the pitch forks on this one.
Humble janitor of LES
Proud papa of YABS
Comments
haha, good you vented a bit here.
disclaimer: I did not read the original post/article.
but as you essentially built yabs only for me ... exactly how I ordered it ... very on point so that everybody can understand and use it, I obviously have to add something here ;-)
I consider YABS perfect for what it is and does, and if you had a top-users statistic I would be very high up (around BF, based on the DL numbers you posted somewhere, I even calculated my usage of it to be in the lower single-digit percent area, which is... a lot)
I have a hotkey set in xshell for it, sorry for the traffic mate :-P
however, I think it is a matter of expectations and of using the right tool for what one wants to achieve. so obviously this also comes down to the individual use case, and that might have been a rather different one for raindog given the message he wanted to convey with his post.
people trying to see a difference between a 3.3GHz and a 3.4GHz virtual core of a different CPU at a different provider are doing it wrong for sure. but as you already pointed out, that's nothing YABS or another benchmark can help with. it simply isn't the tool for it. same goes for network or disk performance.
a lot of people still struggle to understand fio results for instance (which is ok, it is not a single number and a bit more complex) and might misinterpret them - however, that's again nothing the benchmark can help with.
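for the curious, a plain fio run of the kind these scripts typically wrap looks roughly like this (illustrative parameters, not necessarily what YABS uses) - what you compare are the read/write IOPS and bandwidth figures, not one single score:

```bash
# 50/50 random read/write at 4k blocks on a small test file, direct I/O to bypass the page cache
fio --name=randrw-4k --filename=fio-test.tmp --size=512M \
    --rw=randrw --rwmixread=50 --bs=4k --ioengine=libaio --direct=1 \
    --iodepth=64 --numjobs=2 --runtime=30 --time_based --group_reporting
rm -f fio-test.tmp
```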
BUT, if you spend a moment to try and understand a few technical aspects, YABS is very good at providing comparable results and therefore of course adds important information when comparing providers, especially with recurring tests of different product ranges and by different users.
and because of the ease of use it quickly turned into a kind of widespread addiction in the LE* world, and therefore people posting their results helped build a base others can use and rely on to put their own results into a bigger picture. just like the geekbench comparison, just with a decentralized database :-P
however, that might not be all the information you want or need when comparing - and most of the people creating benchmarks do not claim that anyway for what they offer. you also never did.
as said above I did not read the original blog post and probably won't anyway. so I can only guess that raindog wanted to reach the rather common goal of telling people to be careful about benchmark results because they can be misleading, especially for those who do not understand the possible differences from real-world performance or what to make of all these numbers.
yet it seems he got lost in the details and wasn't able to find easy/real alternatives himself.
TL;DR:
exactly this. you can't measure support or whatnot with a benchmark and probably never will, so what's the point.
however, I'm happy he apparently did not propose that complicated, unreadable jsg bench as an alternative. cheers 🍻
Totally agree here. While benchmarks might not be the best at identifying whether a machine is good for a particular use case, they are good at identifying hugely oversold nodes (cough Racknerd) where performance differs drastically from the advertised product.
I get why you are calling it out but at the same time, I wouldn't bother about it. @raindog308 is nothing more than a compliant donkey now that he is on the payroll over there. As for JSG, just a complete bellend that believes he is more important to LET than he actually is. Other than the odd supporter he is generally seen as, well, a bellend.
I understand why you felt compelled to reply, and your reply is convincing, but I see raindog308's post as a "let me push out another post quickly because I need to publish n posts per month in order to earn the $$ that the boss is paying me"-style post -- or something like this.
If he had reflected a bit more and framed his points differently, his post wouldn't have sounded as provocative. Already the title "The problem with benchmarks" was misleading: something like "The limitation of benchmarks" or "Benchmarks don't tell the whole story" would have been more accurate.
If pressed, I suspect that he would agree with your reply.
"A single swap file or partition may be up to 128 MB in size. [...] [I]f you need 256 MB of swap, you can create two 128-MB swap partitions." (M. Welsh & L. Kaufman, Running Linux, 2e, 1996, p. 49)
There is a topic on the OGF where jsg explains why his benchmark script is better than the rest.
I don't remember the details. It was a long post and since I don't visit anymore, I can't find the link to the topic.
I wouldn't care much about what is written about your script. Take it for what it is: a free mention on a blog. Perhaps there was a link to your script? Then it's still a win for you.
Well, what you need is a community based continuous benchmark service.
Means, you get a simple script that does a few lightweight tests and checks specific parameters to see if the node is kinda oversold ... you get the point.
You get enough people to install it and we get a giant insight into who is the shittiest provider in the entire universe.
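Something like this as the agent, maybe (a totally made-up sketch; the reporting endpoint is hypothetical):

```bash
#!/usr/bin/env bash
# Lightweight, repeatable checks a community agent could run on a schedule.

# CPU steal: sample /proc/stat twice and diff -- sustained high steal suggests a starved node.
read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat
sleep 5
read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat
total=$(( (u2+n2+s2+i2+w2+q2+sq2+st2) - (u1+n1+s1+i1+w1+q1+sq1+st1) ))
steal_pct=$(awk -v s=$((st2-st1)) -v t="$total" 'BEGIN { printf "%.1f", t ? 100*s/t : 0 }')

# Crude sequential write burst, small enough to run often without hammering the node.
write_speed=$(dd if=/dev/zero of=burst.tmp bs=1M count=256 conv=fdatasync 2>&1 \
              | awk '/copied/ { print $(NF-1), $NF }')
rm -f burst.tmp

echo "$(hostname) steal=${steal_pct}% write=${write_speed}"
# curl -s -X POST https://bench.example.org/report -d "..."   # hypothetical central collector
```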
I am drunk and had too many ideas for this now, it's too late.
Going to fucking build this.
jsg has made crappy comments about almost everyone's benchmarking scripts, including YABS. I just think he's a narcissist, always engaging in virtual pissing contests. I mean, really, his bench results are a pain in the arse to look at. I use YABS to check consistency across the VPSes I have and it works well enough.
I think it was last year that jsg gave me grief because I asked him how his BM script was different compared to others. He gave me a Try-It-Yourself or get lost sermon.
I let it be.
I think the latest one was in the Contabo benchmark thread.
Outside of the hosting world, benchmarking is used all the time to make generalised observations about the quality, performance, or accuracy of an object.
We can all cite edge cases where benchmarking may not be particularly useful or important (i.e. the 3.3GHz vs. 3.4GHz example), but the reality is most people using YABS just want to know if they've bought a server that performs like a Fiat or a Ferrari. For that, YABS works well, especially in a market segment with such variation in quality and overselling between providers.
Benchmarking doesn't tell the whole story of a VM... a node... a provider... or a datacenter! That only comes with time, incidents happening, and how said incidents are handled. A benchmark is purely one piece of the puzzle in determining whether a host is, as we all say, "prem".
Additionally, benchmarks are good for only one thing: directly comparing performance to another host on which the same benchmark has been run! Simple as that. It's all about relativity. There is no "the single core performance on this host is really bad", there is only "the single core performance on this host compared to this other host is really bad".
Anyways, just my two thoughts...
Is jsg the guy who always runs like 50 different disk benchmarks? I vaguely remember something like that.
yabs is the first benchmark script I've found that approximates my personal methodology.
And it's written in a scripting language dumb cavemen like me can understand.
Thanks for the love
You've been a big proponent of YABS since the beginning and helped me immensely with understanding fio (how to use it, the right tests to perform, how to interpret the results, etc.). So cheers, my friend.
All I shall comment is, "Look how they massacred my boy"
Yeah, agreed. The general message behind the post seemed to be that "benchmarks are trash and not helpful, so disregard them entirely and instead focus on these unmeasurable things that I deem important." Like, I get it... benchmarks aren't the only factor to evaluate providers/VMs, but that doesn't make them useless.
Yeah, I didn't take it as a personal attack or anything like that. Just thought the arguments against using a benchmarking tool like my script were... well... rather lame. The referrals to the GitHub page were actually what made me notice the blog post in the first place, as I doubt many of us here are checking LEB on a regular basis.
Oof. I'll be crying myself to sleep tonight
From one dumb caveman to another dumb caveman, thanks
TIL GitHub repos show referrers
You shouldn’t be allowed near vodka and redbull. And definitely not both of them at the same time.
I am checking LEB on a regular basis, and 99.999% of the time Racknerd is on the first page; that percentage is even higher than some provider uptimes.
I don't touch redbull, coffee addict tho.
I swear it was only a small sip, but then this is how Hostballs was born, so...
Exactly, benchmarks should not be the only thing buyers look at. There are many other important factors, but we need more benches, not fewer, to track consistency.
Dustin and Biloh have history though, it's called money. Dustin is smart enough to throw a good chunk each month into Biloh's pocket in return for non-sinking threads, regular LEB/YouTube mentions, a stickied giveaway thread, and so on.
I don't begrudge him it, he knows how to market, although he slips up at times; it's hard to maintain lies when you are telling them all of the time.
Practice makes a man perfect?
There is nothing wrong with partnerships like the one Biloh & Dustin have. What is wrong is not disclosing it & trying to pretend that it does not exist. Given Biloh's history, one should at least never trust any hosting company, board/forum, or blog owned by him.