US-EAST-4 Data Centre Issue

Resolved

We believe all workloads are in a stable state and we haven't noticed any issues over the last few hours.

We may reach out to individual customers to resolve specific issues, such as DNS problems, that might have occurred during project migrations.

Thanks again for your patience while we've worked through this.

Recovering

We've just emailed the following message to impacted project members and are including it here for wider transparency:


Hi,

Matt and Joe here with an update: the US-EAST-4 cluster is now powered on again, and we’re seeing projects resuming normal service. We’re in the process of monitoring projects and checking everything is behaving as normal.

For projects that opted to migrate to another cluster during the outage, you should now find a manual backup for each of your project environments on the Production/Staging > Backups page in the Servd dashboard. It contains the original database data from before the outage, and will be dated July 13th with a time of somewhere between 6:30-7:00. If you wish to replace your project’s current database data with the original data, please follow these steps to restore the backup. Note that this will wipe any content changes, new entries or commerce orders that have been created in your project since it was migrated away from US-EAST-4.

For projects that opted to remain in US-EAST-4 during the outage, your site should now be functioning normally, but please reply to this email if you’re continuing to see issues.

Once we’ve managed to get the full facts around the root cause of the data center fire, we’ll be publishing a post on our blog.

As you can imagine, this has been a hugely stressful and upsetting incident for many people, including ourselves, and I want to thank everyone for their patience, compassion and extremely supportive messages that have been consistent throughout. It’s been powering us through.

If you have any other questions, concerns or follow-ups you’d like to discuss with us, please reply to this email and we’ll get back to you as soon as possible.

Recovering

All projects in the US-EAST-4 cluster are now online and appear to have resumed normal service. Let us know via live chat or support@servd.host if you're still experiencing issues.

Identified

The cluster has regained network accessibility and we're working on resolving a few remaining issues with non-migrated projects.

Identified

Services within the data centre are beginning to come back online.

Our first action will be to spin down any scheduled tasks (cron jobs) which may be running for any migrated projects, before creating database backups which can be subsequently imported into migrated project databases.

Identified

From Civo, the hosting provider for US-EAST-4:

"We have received an update from the data center operator.

They have completed the full site inspection with the fire marshal and the electrical inspector and utility power has been restored to the site.

They are now working to restore critical systems with the primary electrical equipment for the site powered up. Concurrently, they are beginning work to bring the mechanical plant online. Additional engineers from other facilities are on site this morning to expedite site turn up.

They plan on data halls, and customer equipment, including Civo’s, to be powered up in the late afternoon EDT (evening UTC).

We will look to update here by 21:00 UTC with details."

Identified

We've received a further update from the data centre:


Our site inspection this morning went well and we have been granted authorization to restore utility power to the site and are currently working on re-energizing utility power to the facility. Our onsite team is working with the fire marshal and electrical inspectors, ensuring electrical system safety as we prepare to bring utility power back to the site.

Once that is completed, we will work towards bringing up our critical infrastructure systems. This will take approximately 5 hours.

While we are working on that, we will also be working on our fire/life safety systems as we need to replace some smoke detectors and have a full inspection of the fire system prior to allowing customers to enter the facility.

We will be sending out hourly updates as we make progress on bringing the facility back online.

Identified

While we wait for the data center's safety audit at 9am EDT (13:00 UTC), we're continuing to migrate projects out of the US-EAST-4 cluster.

If you'd like your project moved to a healthier cluster, please reach out via live chat or support@servd.host as soon as possible and we will provide immediate assistance migrating your project.

Thank you to everyone for your patience and cooperation. We're extremely sorry about this incident, and really appreciate the kind words from everyone who has been in touch thus far.

Identified

We have had a further update from the data center:

"We have just finished the meeting with the fire marshal, electrical inspectors, and our onsite management. They have asked us to clean additional spaces and have also asked us to replace some components of the fire system. They have set a time to come back and review these requests at 9am EDT (13:00UTC) Wednesday. We are working to comply completely with these new requests with these vendors and are bringing in additional cleaning personnel onsite to make the fire marshal’s deadline.

In preparation for being able to allow clients onsite, the fire marshal has stated that we need to perform a full test of the fire/life safety systems which will be done after utility power has been restored and fire system components replaced. We have these vendors standing by for this work tomorrow. Assuming that all goes as planned, the earliest that power would be restored to servers would be late in the day Wednesday."

--

We are extremely sorry and frustrated about this further delay. As this outage now looks likely to continue for another 24 hours, we'll be reaching out to the remaining users with projects hosted on US-EAST-4 with the aim of migrating them out of that cluster and into a healthier one.

If you're impacted, please reach out via live chat or support@servd.host as soon as possible and we will provide immediate assistance migrating your project.

Identified

The cluster currently remains offline, and we've not yet received an update from the data center manager on the current timeline. When we do, we'll post it as soon as possible.


As mentioned in our last update, we can restore any impacted projects within a different cluster using the project's most recent backup if desired. This would require an update to DNS records to point traffic towards the new location for the project.

Let us know via live chat or support@servd.host if you'd like to move ahead with a transfer.
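For anyone who does go ahead with a transfer, the DNS change mentioned above can be sanity-checked from your own machine. The following is a minimal sketch only, not part of Servd's tooling: the function names, the hostname and the IP address in the usage note are all hypothetical placeholders, and the real target address for a migrated project would come from Servd support.

```python
import socket


def resolved_addresses(hostname, port=443):
    """Return the set of IP addresses the local resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return {info[4][0] for info in infos}


def dns_cutover_complete(hostname, expected_ips, resolver=resolved_addresses):
    """Return True only if every address returned for hostname is an expected one.

    expected_ips is an assumption supplied by the caller: the address(es)
    of the project's new location. A lookup failure counts as "not complete".
    """
    try:
        current = resolver(hostname)
    except socket.gaierror:
        return False
    return bool(current) and current <= set(expected_ips)
```

Calling something like dns_cutover_complete("www.example.com", {"203.0.113.10"}) would return True only once the local resolver hands back the new address alone. Cached records can linger until their TTL expires, so a False result shortly after updating DNS is not necessarily a problem.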

Identified

The latest update we have received from the data center management:

The cleaning of the UPS room is still ongoing and progressing well. This is due to be completed in time for a site inspection by the local fire services at 2PM EDT (18:00 UTC).

The data center management expects that the site will pass the requirements to start the powering-on of the site after this inspection.

We will update here once we hear the results of the meeting, and expect to post the next update by 19:00 UTC.
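The EDT and UTC times quoted across these updates can be cross-checked with a few lines of Python. This is a small sketch assuming a fixed UTC-4 offset for EDT; the function name is illustrative, and the calendar date used is an arbitrary placeholder, since only the clock time matters for the conversion.

```python
from datetime import datetime, timedelta, timezone

# EDT is four hours behind UTC.
EDT = timezone(timedelta(hours=-4), "EDT")


def edt_to_utc(hour, minute=0):
    """Convert a wall-clock time in EDT to the equivalent UTC datetime."""
    # The date is a placeholder (assumption); it does not affect the offset.
    local = datetime(2000, 7, 1, hour, minute, tzinfo=EDT)
    return local.astimezone(timezone.utc)


for label, hour in [("8am", 8), ("9am", 9), ("2PM", 14)]:
    print(f"{label} EDT = {edt_to_utc(hour):%H:%M} UTC")
```

The three results, 12:00, 13:00 and 18:00 UTC, match the figures given in the updates.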

Identified

Our latest update from the data centre was received at 13:30 UTC and is as follows:

The process of preparing the UPS room for re-powering the facility after the fire suppression has been ongoing through the night and is still continuing. This is expected to take at least another 3 hours.

Once complete, local regulations require the approval of the fire marshal to begin the next stage in powering up the data center.


As mentioned in our last update, we can restore any impacted projects within a different cluster using the project's most recent backup if desired. This would require an update to DNS records to point traffic towards the new location for the project.

Let us know via live chat or support@servd.host if you'd like to move ahead with a transfer.

Identified

We've not received any further updates on the status of the data centre recovery. We'll keep you posted as soon as we do.


As mentioned in our last update, we can restore any impacted projects within a different cluster using the project's most recent backup if desired. This would require an update to DNS records to point traffic towards the new location for the project.

Let us know via live chat or support@servd.host if you'd like to move ahead with a transfer.

Identified

We've received another update from the data centre:

The fire marshal's procedures require UPS facilities to be cleaned before the data center can be re-powered. This work has begun, but they estimate the process will take several hours before any steps can be taken to bring facilities back online.

The data center management expects to issue another update at 8am EDT (12:00 UTC).


As mentioned in our last update, we can restore any impacted projects within a different cluster using the project's most recent backup if desired. This would require an update to DNS records to point traffic towards the new location for the project.

Let us know via live chat or support@servd.host if you'd like to move ahead with a transfer.

Identified

We have an update from the data centre:

An isolated fire in a single UPS in a dedicated electrical room was detected and put out by the automated fire suppression system. The local fire department was called and, on arriving at the scene, cut the power to the building per NEC guidelines.

Datacenter electricians are on site and awaiting permission from the fire department to access the building, perform any repair work on the UPS systems and restore main power to the building.

Once the electrical work is complete, the HVACs will be started to cool the facility. Once temperatures are within the required SLAs, a phased power on will be carried out to data halls.

We do not have a firm ETA for this process to be completed.


As mentioned in our last update, we can restore any impacted projects within a different cluster using the project's most recent backup if desired. This would require an update to DNS records to point traffic towards the new location for the project.

Let us know via live chat or support@servd.host if you'd like to move ahead with a transfer.

Identified

We still don't have a firm ETA for restoring full availability in our US-EAST-4 cluster. However, we can restore any impacted projects within a different cluster using the project's most recent backup if desired. This would require an update to DNS records to point traffic towards the new location for the project.

Let us know via live chat or support@servd.host if you'd like to move ahead with a transfer.

Identified

The data centre is still experiencing power issues resulting in inconsistent operation of networking infrastructure and servers. The problem is actively being worked on on-site.

Identified

We've been informed that the root cause of the problem was a power fluctuation within the DC, which caused networking devices to reboot. Engineers are currently working on getting everything back online.

Identified

Our DC providers have acknowledged the issue and are working on a resolution. We'll update here with an ETA or resolution once we have more details.

Investigating

We're investigating an issue with the data centre hosting our US-EAST-4 cluster, which is causing a loss of traffic for this cluster.

Affected components
  • Clusters
    • US-EAST-4