@willie said:
Thanks @Virmach and much sympathy. Maybe you can bring on someone to help with this stuff. I guess I'll just have to wait for the non-Ryzen node to be sorted. My Ryzen VM is reachable and responds on port 22, but the ssh host key changed and I can no longer log into it, which makes it sound as if it's been reinstalled. I didn't try the VNC console.
It could be an IP conflict or IP change. If it continues let me know.
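One quick way to tell a genuine host key change (which would suggest a reinstall) apart from SSH simply landing on a different machine after an IP conflict; the address below is a placeholder, substitute the VM's real IP:
# grab whatever host key the VM's IP is presenting right now
ssh-keyscan 203.0.113.10 > new_hostkey
# print its fingerprint(s)
ssh-keygen -lf new_hostkey
# print the fingerprint of the cached known_hosts entry for the same address
ssh-keygen -l -F 203.0.113.10
If the fingerprints differ but the login banner and data look like your old system, an IP conflict or IP change is the more likely culprit; if the box looks freshly installed, it probably was.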
We've brought people on; we actually have someone helping, but unfortunately they're not at the level we need them to be to really make an impact. There's also someone from OGF I've been meaning to hire for multiple months now, but I basically have no time left to even go through the onboarding process. The most recent issue I described, though, wasn't a lack-of-time problem or anything like that: it's QuadraNet breaking our transfers with a nullroute, plus another DC being slow with a hands request. It looks like it's finally about to get done for all servers except one, which I may have to revert.
SolusVM has been migrated to either an Epyc or a Ryzen server, I don't remember which. We got it probably a year ago at this point and never used it until now. Let's see if this improves anything or if we're still stuck with PHP/MySQL bottlenecks.
@VirMach said:
The most recent issue I described though wasn't a lack of time issue or anything like that, it's a QuadraNet breaking our transfers with nullroute...
Ahh yes, the famous, premium, VEST Anti-DDoS technology! Was renting a dedi with them to use as a Plex server a few years back. Was using GDrive as storage at the time and every scan for new media would trigger their "protection" and null route my IP for 24 hours. Graphs they had showed network traffic at like 5 Gbps sustained despite being on a 1G port... it just didn't make any sense. Left after the third or fourth null route and moved on to greener pastures.
I was already expecting their DDoS protection to be pretty bad, as in an attack would leak through. Never would I have guessed that it's so good it doesn't let ANYTHING through. They've solved the universal problem of denial-of-service attacks by beating the attacker to the punch. Truly remarkable.
Honestly, my only complaint in all this is that they don't just offer 10Gbps unmetered traffic. After all, they'll never actually have to serve it, since anything above 30MB/s counts as a denial-of-service attack.
@AlwaysSkint said:
This ain't good: been down & up, then down again..
From Client Area, when trying to see why ATLZ is down.
Could Not Resolve Host: Solusvm.virmach.com
Some weird DNS issue that just went away. I don't remember which provider we set up for DNS, but it's possible it briefly had a burst of connection issues on that route. Unrelated to SolusVM; it's WHMCS, and it couldn't resolve Google either.
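A quick way to separate a local resolver hiccup from a problem with the zone itself, assuming shell access on the WHMCS host; these are just standard tools:
# roughly what the system resolver (the path WHMCS/PHP uses) returns
getent hosts solusvm.virmach.com
# the configured resolver queried directly
dig +short solusvm.virmach.com
# the same query against a public resolver, for comparison
dig +short solusvm.virmach.com @8.8.8.8
# which resolvers the host is actually pointed at
cat /etc/resolv.conf
If only the first two fail, the configured resolver is the weak link; if all of them fail, the zone or upstream DNS is the problem.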
ATLZ still not opening though - too late for me now. Will check again in the morning.
Hope that this isn't the 3rd IP change - though looks as if the node has issues; sorry can't recall which one.
I noticed a new option in "Switch IP" when I logged into my friend's account. I think it will be useful for them, but I fear that many tickets will be opened...
Only if people give their friends their account password.
@Virmach That's both my ATLZ (a nameserver, 149.57.205.xxx) and DLSZ VPS (149.57.208.xxx) inaccessible in SolusVM. [Edit] Both are completely down. Open a Priority ticket for ATL, so that you can trace the node?
Seattle finally got the switch configuration changes we requested. SEAZ004 and SEAZ008 are having some issues with it (indirectly), so we're waiting on a hands request for that; otherwise, Seattle networking is finally decent.
ATLZ007 had a disk issue that's been fixed, as did SEAZ010.
@AlwaysSkint said: Open a Priority ticket for ATL, so that you can trace the node?
Tickets are unfortunately pretty much useless at this point, and we're handling everything completely independently of them for the time being. I still try to keep an eye out for anyone who mentions something unique where our intervention is required, such as data not having been moved over during a migration, but those are a needle in a haystack.
@Emmet said:
My VMs keep being rebooted... Severe packet loss happens most days.
Will there be compensation account credits? Or should I just be thankful if my BF deals don't end up deadpooling?
That's very vague, so I'm unable to even figure out whether you're in a location facing problems, such as Seattle, or whether it's unique to your service. We'll allow people to request credits as long as they're very clear and concise when making the request. So, for example, if you said:
My VMs keep being rebooted... Severe packet loss happens most days.
Will there be compensation account credits?
we'd probably just have to close the ticket, especially if we get a ton of requests like that.
A credit request probably needs to include:
Name of the node/service
What issue you faced
The duration of the issue
A reference to any tickets reporting it, or to the network status page, monitoring, or anything else showing the issue
I'd love to make it easy on everyone and pretty much provide SLA credit for everyone affected at every level, but that would basically be all of our revenue for the whole month, at a time when we're already spending a lot on new hardware, new colocation services, and lots of emergency hands fees, working extra long hours, and not charging anyone anything extra for the upgrade, while we've actively stopped selling most new services to focus on the transition.
In most cases, if you're already not paying full price for a plan on our website and are on a special, we immediately lose money just from the request being put in, but of course it's your right.
We just kindly ask that people keep the reality of the situation in mind and consider showing some mercy where possible.
Anyway, that's not to say I haven't been thinking about it and trying to figure out the best way to make everyone happy and make it up to people, within the realm of possibility and without them having to open a ticket for SLA credit or service extensions.
@Mastodon said:
I'm supposed to be in NL. Still no luck getting the service online. I access the VPS through VNC.
Node: AMSD027
Main IP pings: false
Node Online: false
Service online: online
Operating System: linux-ubuntu-20.04-x86_64-minimal-latest-v2
Service Status: Active
GW / IP in VPS matches SolusVM.
(Sent a ticket on this, didn't get a response however.)
Any info and/or remedy?
Cheers!
All I can immediately tell you is that it's not a node-wide issue: only one single person's service is "offline", and many others are using networking and racking up bandwidth usage, with their services pinging. You might have to install a new OS, if possible, for the easiest fix, or make manual modifications, and/or try the Reconfigure Networking button maybe 3-4 times every 30 minutes just in case it's not going through properly.
There is technically a small chance of an IP conflict. We have scripts for that, but they're not perfect; we run several rounds of them, over and over, going through services that don't ping and reconfiguring them.
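For anyone in the same spot, a minimal sketch of what those manual modifications can look like from the VNC console. It assumes the interface is eth0 and uses placeholder addresses; the real IP, prefix, and gateway are the ones shown in SolusVM, and these commands don't persist across reboots (on Ubuntu 20.04 the permanent fix goes in netplan):
# replace 203.0.113.10/24 and 203.0.113.1 with the values from SolusVM
ip addr flush dev eth0
ip addr add 203.0.113.10/24 dev eth0
ip link set eth0 up
ip route add default via 203.0.113.1
# quick reachability check
ping -c 3 8.8.8.8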
I will be idling 2 or 3 more services next year, but the exchange rate will increase the price by about 20-25%. I should have added account credit in advance.
That said, they are far cheaper and I am happy, even considering the downtime due to the migration.
Tokyo is finally getting its long-overdue network overhaul. We had a good discussion about it and I want to kind of share what we believe is occurring right now and why.
It turns out pretty much every network engineer I've spoken with agrees that large VLANs were not necessarily the issue. I know that when we previously discussed this, everyone pretty much immediately pointed to that, and since we were initially skeptical of larger VLANs and are now going back to smaller ones, it might appear that it was the only issue all along, so I just wanted to clarify what's actually going on.
Large VLANs are fine and have, at this point, been independently tested not to cause issues. No collisions or ARP issues are generated by the large VLANs, confirmed. The switches are also more than capable of handling it.
The NICs are also not necessarily the issue. We've had multiple hosts confirm they've used the same integrated NIC with large VLANs without any issues.
The motherboard and everything else are also fine, even at higher density per server, but the caveat here is that the original host who told us they use larger VLANs didn't mention they had dedicated NICs.
All of our settings/configuration are also correct, and nothing negative resulted from any custom changes on our end. In fact, those changes are probably what's allowed us, for the most part, to have a relatively functional network.
So, basically:
Large VLAN, fine.
Motherboard/Ryzen and high VM quantity, fine.
NIC, fine.
Where the problem gets created is the combination of the NIC with high VM quantities with larger VLANs. Then everything goes out the door and the problem appears. We still have not found the proper solution, and I'm sure it exists, but the easy cop-out of treating the symptoms is splitting the VLANs. So we're not actually solving the problem, we're just avoiding it. There are multiple ways we could avoid it, and the route we've selected is splitting up the VLANs. We could also have mixed up VMs per node so that there's a lower quantity per node (but larger VMs, so not necessarily lower usage). We could also have gone with a different dedicated NIC, and we could probably have utilized the other 1-3 ports per server to balance the traffic. All of these would just stop the situation that leads to the problem from arising, but we decided to split VLANs.
The good news is that we've pretty much confirmed it's not a VLAN size issue, so some locations can still use larger VLANs, which allows for greater flexibility when it comes to migrating within the same location, keeping your IP address, and all the other benefits that come with that. For higher-quantity locations, such as Los Angeles, Seattle, Tokyo, and probably San Jose, it makes more sense to split, at least until we find the correct solution.
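For anyone who wants to sanity-check the ARP angle on their own node or VM, the kernel's neighbour table and its limits are easy to inspect; these are generic Linux commands, not anything VirMach-specific:
# how many IPv4 neighbour (ARP) entries are currently tracked
ip -4 neigh show | wc -l
# the kernel's neighbour table garbage-collection thresholds; gc_thresh3 is the hard cap
sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3
If the entry count sits nowhere near gc_thresh3, the "large VLAN overflows the ARP table" theory doesn't hold for that box, which matches what's described above.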
It so happens that my nameserver is on that node. I got impatient after a few days, so reinstalled a due to be cancelled VM on ATLZ005. Sod's Law: just as I fully commissioned it, ATLZ007 came back online. Och well, I had nothing better to do.
[Note to senile self: must remember to reset nameserver DNS entries.]
@Virmach; a minor aside..
My titchy 256MB RAM Dallas VM is being a 'mare to reinstall. Primarily down to ISO/template availability, though I suspect 'modern' distros are just too bloated for it - previously ran Debian 8.
DALZ007 has just one (TinyCore) ISO available in Solus (i.e. no netboot.xyz) and I'm unable to get any Client Area (legacy) or Solus (Ryzen) templates to boot. Either the VM doesn't start at all, or I get the commonly mentioned "no CD or HDD boot" problem.
Additionally, with Rescue Mode:
Failed to connect to server (code: 1011, reason: Failed to connect to downstream server)
[I was looking to transfer this VM out to another member, but not in this broken state.]
DALZ007, as well as a few others, has been waiting probably nearly two weeks for networking to set up the VLAN properly. The way the teams function and how they do networking is not ideal in my opinion, and we only have issues with this one partner. Others set it all up at the beginning, but for some reason this one seems to have a set of what I presume to be independent contractors who just do their own thing. I'm not going to name-drop the provider, but obviously, based on previous information, we know who Dallas is with right now. Nor am I trying to be negative towards them or complain.
After all, at least for locations where we have our own switch, it's our fault for not just managing it ourselves; it's just something we don't have time to do right now. Dallas is one of the locations where we don't have our own switch, though.
It's just weird to me that the networking guy seems not to have access to the port maps and has to request them from the non-networking guy, who then essentially does only what he's asked. That's an OK way to operate, but it creates situations where servers get racked and the switch configuration doesn't get done, and it's happened multiple times. I just didn't catch this one early on because, for some reason, this location was set up differently from the others, in that the public IP VLAN appears to be separate from the... other public IP VLAN? Which also doesn't make sense, because these aren't set up to even support tagged traffic, so what's more likely is that these blocks never got added to the VLAN in the first place? I don't know. Waiting to hear back.
As for template syncs, I've probably done 20 syncs so far, and each one fails spectacularly for one reason or another. The way it's set up, if even one single node has a problem with it, the sync just freezes up permanently; there's no timeout set. That's SolusVM's coding, and there's nothing I can do other than start it over and over. Plus it doesn't even move on to the next sync, so I can't segment it. So one node with temporary network problems: stuck forever. One node nullrouted due to an incorrect DDoS setting at a provider: stuck forever.
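This is not how SolusVM runs its syncs internally, just a sketch of the kind of per-node timeout being wished for here: bound each node's step so one unreachable or nullrouted node can't wedge the whole run. The node names are examples from this thread, and sync_templates_to_node.sh is a hypothetical placeholder for whatever actually pushes templates to a node:
# give each node at most 10 minutes, log failures, and keep going
for node in TYOC028 TYOC030 SEAZ010; do
    timeout 600 ./sync_templates_to_node.sh "$node" || echo "$node: sync failed or timed out" >> sync_failures.log
done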
Comments
My VPS IP also has this issue (LAXA018: 149.57.135.x).
So far no change on mine; it's unreachable right now, and the client area showed "node is locked" an hour or so ago.
Please just focus on getting the TYOC028/TYOC030 nodes back up; being down for days or weeks is unacceptable.
09:29:04 up 29 days, 20:51
Resellers have those anyway.
It's nice to see their billing system is still up and running, while my VM still isn't.
Renewals, yes. But they stopped accepting new orders a few weeks ago until the issues are resolved.
Large VLAN without ARP overhead: use the ip neigh command to set a static MAC address for the gateway.
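A concrete version of that suggestion, with a placeholder gateway address and MAC; the real values would come from the VPS's network details in SolusVM:
# pin the gateway's MAC so the VM never has to ARP for it
ip neigh replace 203.0.113.1 lladdr aa:bb:cc:dd:ee:ff dev eth0 nud permanent
# confirm the entry shows up as PERMANENT
ip neigh show to 203.0.113.1
This only helps the VM-to-gateway direction and doesn't persist across reboots, so it's a workaround rather than a fix for whatever is going on upstream.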