This article has been updated as of May 23rd to share new information.
What happened?
On Thursday, May 16th, there was a systems failure that caused a critical set of our machines to power cycle and go offline. Normally, when one machine is taken offline in this manner, duplicate controller machines that perform the same tasks continue to maintain operations, and the affected machines would, under normal conditions, begin to automatically restore themselves.
Unfortunately, all but one of those duplicate control machines were affected simultaneously, and the remaining control machine was unable to maintain system stability. Communication within the cluster was broken, and the controllers that should have restored themselves were unable to do so. As a result, our web-api, server management dashboard, and game clusters were unreachable by players, leading to an extended service outage.
On Thursday morning we took action immediately (within 3 minutes) and brought in all team members to investigate the outage and begin restoring services. We also brought in additional DevOps engineers so that we could pursue several resolution strategies simultaneously. The working solution was to use the remaining running controller machine as a template to manually recreate the others. This manual rebuilding process took a while because every piece of the configuration had to be exact, down to the operating system, drives, and networking layers, not just the Minecraft software we use.
At the same time, we had to be careful to keep other machines in a healthy state and to slowly bring all systems back in sync with each other to prevent data loss or system instability that would cause us to start the process over. On Wednesday, May 22nd, we felt confident that we could restore some services for public access, and we brought the Minehut network game clusters, web-api, and server management dashboard back online.
As of Thursday, May 23rd, services have limited capacity as we continue to monitor and scale up the systems. This slow roll-out is to ensure that we do not overwhelm the cluster and cause another outage, and is expected to continue for several days. During this time period, some services may be slower than normal, including the server startup process which can take several minutes to complete. We will let the community know when all services are fully operational at 100% capacity.
How much longer will the downtime last?
Our services are back online! There are limits in place for server startups that we will be steadily increasing over the following days. We want to ensure that everything remains healthy and stable.
You can see more updates on our Discord server in the #status channel.
What are you doing to prevent this from happening again?
While no tech company can guarantee that they will never have an outage (even Google, Microsoft, and Twitter experience outages sometimes), we are taking steps to prevent this particular scenario from happening again. At this time we are actively monitoring the systems and maintaining fail-safes to keep the existing cluster stable. We are delaying most new feature development temporarily while we focus on maintaining stability.
We also have a separate team of engineers rebuilding our entire cluster in an isolated environment with entirely fresh and updated systems which we will migrate to later this year. Finally, we created a backup system we're calling "Libby" (a little lobby) which can at minimum allow players to join our lobbies and connect to a few trusted external partner servers in our community in the event of any future downtime.
Making our services more stable is the GamerSafer and Minehut team's highest priority, and some other new feature development will remain on hold while we focus on this.
Is Minehut shutting down?
No. We plan on keeping Minehut running for years to come; this was a temporary outage.
Was this an attack?
No, this was not an attack on our systems. It was a mechanical issue on the physical machines.
Will I get a refund for my rank or server plan?
We understand that losing time on your servers was not in your plans; it wasn’t in ours either. We will be issuing a pro-rated (x2) refund for all plans and rank subscriptions that were active during the downtime.
To calculate the refund amount, the cost of the plan/rank will be divided by the number of days in the plan to determine the cost per day. For each day we were down, we will refund double that daily cost.
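As a purely hypothetical example (these figures are for illustration only and are not actual plan prices): if a plan cost $15 for a 30-day period, the daily cost would be $0.50, so each day of downtime would be refunded at $1.00, and seven days of downtime would add up to a $7.00 refund on that plan.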
All refunds should be issued by May 31 and will be automatically applied to your account. If you do not receive your refund by this date, please contact the Support Team.
Will I get a refund for my ads?
Depending on the length of the outage, we will either issue a pro-rated refund for all active campaigns or extend their duration.
Are my server files safe?
Yes, the server files are saved in a different location from the machines we are working on. There is a slight possibility that servers that were online on those specific machines could experience a minor rollback. Because those servers were shut down abruptly, they would be restored from their most recent auto-save.
If you discover any data loss, do not panic, but shut down your server and contact Support immediately. Do not try starting your server again until you have heard from Support. We typically have backups we can restore, but if the server is started multiple times it can overwrite those backups.
I received an email that my servers were going to be deleted, but I can’t log in. Will I still have time to keep my server?
Yes, once our servers are back online, you can log in to your account and keep your server or download your files. No servers have been deleted yet, and no accounts are planned for deletion. For more information about the server deletion announcements, see our FAQ here.