July 27th Chinook Scheduled Outage

On July 27th there will be maintenance on the $CENTER1 file system to address slowdowns and errors in file operations. The system will need to be taken offline to apply a patch that we anticipate will resolve these issues. Jobs that are scheduled to run during this downtime will stay in the queue and run after the downtime reservation ends. The Chinook HPC cluster and the Linux workstations will be affected by this outage.

Chinook00 Reboot

Chinook00 is being rebooted on July 12, 2017 at 3pm AKDT. This should be a brief outage, and logins during the time chinook00.alaska.edu is offline should be redirected to the other login nodes.

Web Services

Due to continued system troubles from the June 28 unplanned power outage, some web services may be experiencing glitches. RCS is currently troubleshooting to restore services.

RCS System Outage

As of 8am on June 29th, RCS systems are steadily coming back online. It is currently unknown when all systems will be fully operational, and restoration may extend past the previous estimate of 9am on June 29th.

We will distribute further notifications as we assess our systems and can give a concrete estimate of when each system will be back online.

RCS System Outage

There was an unplanned power outage in the UAF Butro Data Center this morning. OIT and Facilities Services have replaced the critical equipment.

This was a hard power failure and Research Computing Systems (RCS) is currently assessing the impacts to our hardware and services.
The network on the UAF campus has been restored, and all RCS HPC, storage, and web services are planned to be back online by 9 AM AKDT, June 29, 2017.

We will distribute notifications as more information is available.

Unplanned Chinook Outage

The $CENTER1 Lustre filesystem became temporarily unavailable to the Chinook compute nodes on May 11th around 3pm AKDT, causing some submitted jobs to fail immediately. To resolve this issue, the job partitions were taken down, and any submitted jobs were held in the queue until the partitions were brought back online.

Any jobs that were in the process of running during that timeframe should be unaffected.

