Unhealthy container replacement now in Galaxy

Peter Wagner
Published in Meteor Blog
2 min read · Jun 1, 2017


The health checking system is an important component of Galaxy’s reliability. If a container stops responding to requests, Galaxy automatically routes around the container until it recovers. Many Galaxy customers have requested a solution for containers that do not automatically recover. Such failures can result from an oversized request, a race condition, an infinite loop, and so on.

Today, we’re happy to announce automatic termination and replacement of unhealthy containers on Galaxy.

How it works

If a previously healthy container fails all health checks for 5 minutes, Galaxy will terminate the container and launch a replacement, with no user intervention required. Freshly launched containers that have never passed an initial health check get 10 minutes before replacement, to allow for cache warming.

Unhealthy container replacement in action.
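
The replacement rule described above can be sketched in a few lines of Python. This is only an illustration of the two thresholds mentioned in this post, not Galaxy's actual implementation: the names `ContainerStatus`, `ever_healthy`, `unhealthy_for`, and `should_replace` are hypothetical, and treating "exactly at the threshold" as replaceable is an assumption.

```python
from dataclasses import dataclass
from datetime import timedelta

# Thresholds as stated in the post; the constant names are hypothetical.
HEALTHY_GRACE = timedelta(minutes=5)    # container had passed a health check before
STARTUP_GRACE = timedelta(minutes=10)   # container has never passed a health check


@dataclass
class ContainerStatus:
    """Hypothetical view of one container's health history."""
    ever_healthy: bool          # has it ever passed a health check?
    unhealthy_for: timedelta    # time spent continuously failing all health checks


def should_replace(status: ContainerStatus) -> bool:
    """Return True once the container has been unhealthy past its grace period."""
    limit = HEALTHY_GRACE if status.ever_healthy else STARTUP_GRACE
    # Assumption: the container becomes replaceable as soon as the limit is reached.
    return status.unhealthy_for >= limit
```

Under these assumptions, a container that was once healthy and has failed checks for 6 minutes would be replaced, while a brand-new container still warming its cache after 6 minutes would be left alone.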

This update also exposes the time a container has spent in its current state, so you know exactly when Galaxy will be replacing it. This information can also be used to detect containers that became unhealthy and then recovered: such a container has been in the “Running” state for less time than has passed since it was launched.

A container that hasn’t been relaunched… yet.
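
The "unhealthy, then recovered" check described above comes down to comparing two durations. Here is a minimal sketch, assuming you have read both values off the dashboard yourself; no Galaxy API is implied. The function name `recovered_since_launch` and the `startup_allowance` parameter are hypothetical, and the allowance exists only to ignore the normal gap between launch and the first passing health check.

```python
from datetime import timedelta


def recovered_since_launch(
    time_since_launch: timedelta,
    time_in_running_state: timedelta,
    startup_allowance: timedelta = timedelta(0),
) -> bool:
    """Return True if the container re-entered "Running" some time after launch.

    A container that has been healthy since it started has spent (almost) its
    whole life in the "Running" state; a noticeably shorter time in that state
    suggests it went unhealthy at some point and later recovered.
    """
    return time_in_running_state + startup_allowance < time_since_launch
```

For example, a container launched 3 hours ago that has been “Running” for only 20 minutes would be flagged, while one that has been “Running” for the full 3 hours would not.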

How to enable it

This feature is now live for any new applications created on Galaxy. You can opt in an existing application by using the “Enable automatic replacement of unhealthy containers” toggle on its “Settings” page on the Galaxy dashboard.

On June 8, 2017, we’ll enable unhealthy container replacement for all existing Galaxy applications. If you’re running an application that is expected to drop user requests for more than 5 minutes (i.e., a “worker”), or you want to retain the ability to debug unhealthy containers before they’re replaced, contact support and we can leave this setting disabled for you.
