Manuel Bravo, Post-doctoral Researcher, IMDEA Software Institute
Transient resources (resources with transient availability, offered by cloud providers at a discounted price) present an opportunity to reduce the operational costs of running jobs in the cloud. Unfortunately, key problems emerge when one attempts to use transient resources to reduce the cost of running time-constrained jobs. Previous works fail to address these problems: they either do not offer significant savings or miss termination deadlines. First, transient resources can be evicted, which requires the job to be restarted (even if not from scratch). This may lead provisioning policies to fall back to expensive on-demand configurations more often than desirable, or even to miss deadlines. Second, when a job is restarted, the new configuration can differ from the previous one, which might make eviction recovery costly.
In this talk, we present Hourglass, a system that addresses these issues by combining two novel techniques: a slack-aware provisioning strategy that selects configurations considering the remaining time before the job’s termination deadline, and a fast reload mechanism to quickly recover from evictions. By switching to an on-demand configuration when (but only if) the target deadline is at risk of not being met, we are able to obtain significant cost savings while always meeting the deadlines. Our results show that, unlike previous work, Hourglass reduces operating costs by on the order of 60-70% while guaranteeing that deadlines are met.
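The slack-aware idea can be illustrated with a minimal sketch. This is not the Hourglass implementation: the function name, its parameters, and the single-threshold rule are assumptions made for illustration only. Slack is taken here to be the time left before the deadline minus the estimated time to finish on an on-demand configuration, and a transient configuration is chosen only while that slack can absorb the time one eviction might waste.

```python
def pick_configuration(time_to_deadline_s: float,
                       on_demand_finish_s: float,
                       eviction_loss_s: float) -> str:
    """Hypothetical slack-aware choice between transient and on-demand.

    time_to_deadline_s: seconds left until the job's termination deadline.
    on_demand_finish_s: estimated seconds to finish the remaining work on an
        on-demand configuration, including the time to reload state.
    eviction_loss_s: assumed worst-case seconds wasted if a transient
        configuration is evicted (lost progress plus reload).
    """
    # Slack: how much time could be lost while still finishing on on-demand.
    slack_s = time_to_deadline_s - on_demand_finish_s
    # Risk a transient configuration only if one eviction fits in the slack.
    if slack_s >= eviction_loss_s:
        return "transient"
    return "on-demand"


if __name__ == "__main__":
    # Plenty of slack: 3600 - 1800 = 1800 s, which covers a 600 s loss.
    print(pick_configuration(3600, 1800, 600))
    # Little slack: 2000 - 1800 = 200 s, which does not cover a 600 s loss.
    print(pick_configuration(2000, 1800, 600))
```

In this sketch the decision would be re-evaluated whenever the job restarts or at regular intervals, so the job drifts toward on-demand resources only as the deadline approaches.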