Cloud service providers face Meltdown and Spectre
Cloud service providers are facing a crisis at the start of 2018, as newly discovered but long-standing security vulnerabilities emerge at the CPU level across their extensive data center infrastructure estates. The hardware-based Meltdown and Spectre vulnerabilities create the possibility of sensitive data, including passwords, customer credit card details and intellectual property, being stolen from shared cloud infrastructure, such as containers sharing one kernel and virtual machines accessing the same physical memory of a host machine. They also affect other devices with the flawed CPUs, including PCs and smartphones. The vulnerabilities affect chips from Intel, AMD and ARM, though much of the focus has been on Intel, given its reach and the fact that both Meltdown and Spectre have been proven to affect its products. Leaks of the vulnerabilities, which had been under embargo, forced early public disclosure on 3 January, ahead of the planned and coordinated release on 9 January. The processor vendors had been informed six months earlier, after the vulnerabilities were discovered by researchers led by Google’s Project Zero team. This is standard protocol: it allows enough time for the issues to be assessed and for workarounds to be developed and made ready for release, to prevent exploitation by hackers and other bad actors.
Security vulnerabilities emerge on a regular basis and are usually fixed effectively by software updates and patches. The industry has faced security crises before, notably Heartbleed in 2014, after a bad update created vulnerabilities in the OpenSSL encryption software used by most of the leading websites. The CPU vulnerabilities, by contrast, have always been present, due to a flaw in the architecture. So far, there have been no reports of information being compromised, though it is difficult to know for sure. But the primary concern with Meltdown and Spectre is that updating the microcode used by the CPUs is unlikely to fix the problems. Modifying operating systems and other applications is currently the only option. This only mitigates the underlying hardware vulnerability, reducing the security risk and the chance of compromise. The only way to fix it completely is to replace all CPUs that have the inherent design flaw. For cloud providers, this would mean a huge expense at a time when their infrastructure capital expenditure is growing faster than revenue. Expect them to demand significant compensation from Intel, and take legal action if terms cannot be agreed. But not replacing CPUs will affect system performance to some degree, especially for workloads running on older CPUs, as the required operating system updates are significant.
The scale of the problem is immense. Meltdown potentially affects all Intel processors produced over the last two decades, including all x86 CPUs developed since 2013 that are installed in all the data centers operated by cloud providers. The vulnerability was first exposed in Intel CPUs that use a technique to accelerate performance, called out-of-order execution, in which the processor runs instructions ahead of time and can leave traces of data from the operating system’s protected kernel memory in its caches. The Meltdown vulnerability is reportedly easier to exploit than Spectre, but emergency operating system updates for Linux, macOS and Windows are already available to mitigate the risk by moving the kernel into its own isolated virtual address space. Further patches are expected. The Spectre vulnerability is reportedly harder to exploit, but is inherent in more processors, including AMD and ARM designs as well as those from Intel. Spectre allows user-mode applications to extract information from other processes running on the same CPU by sidestepping address space isolation, which is meant to provide most of an operating system’s security. This is an issue primarily affecting cloud service providers. Solutions to mitigate Spectre are more complicated to implement: they require applications to be rewritten or patched to prevent attack from other programs, as well as future CPU microcode updates.
Intel faces the biggest consequences of the vulnerabilities, due to its near-monopoly, especially in CPUs used by the ‘super seven cloud builders’, namely Amazon, Microsoft, Google, Facebook, Alibaba, Baidu and Tencent, which get priority on supply and early access to the latest products. The super seven have the biggest financial and security exposure to the vulnerabilities. Collectively, they manage millions of Intel-based servers, which run their digital applications, including ecommerce, social networks and other consumer services, as well as the cloud infrastructure and software services that business customers increasingly depend on. The top three alone buy more servers each quarter than either Dell EMC or HPE sells. Intel has prospered from the super seven’s huge capital expenditure outlay over the last five years, which has helped it change its business from being PC-centric to more data-centric. Intel’s Data Center Group grew 7% in Q3 2017 to US$4.9 billion, which accounted for 30% of total revenue. Its cloud service provider business grew 24%, highlighting the importance of this sector for growth as the PC market stagnates. Overall, cloud and communications service providers represent approximately 60% of the Data Center Group’s business, up from 35% in 2013. Investor reaction to news of the disclosure was negative, as expected, with Intel’s share price falling 4%. In contrast, AMD’s share price rose 10% while Nvidia’s was up 7%.
The long-term financial ramifications for Intel are unclear. But it is likely the super seven and other leading cloud builders will look to speed up efforts to diversify their supply of CPUs by increasing investment in alternatives. They have had the last six months to consider their options. This will clearly benefit AMD, which launched its EPYC 7000 processor last year to challenge Intel in the data center sector, and has already gained traction with Baidu and Tencent. Cloud builders have had little choice of supplier over the last decade, though Microsoft announced plans in March last year to develop ARM-based servers to reduce dependency on Intel. Expect AMD to announce at least one new super seven win in the first half of this year due to Intel’s problems. The disclosure will accelerate the super seven’s plans to replace existing CPUs in their data centers, especially older designs, whose performance will be most affected by the vital software updates. This could be a positive for Intel, if customers remain loyal and move to the latest Xeon Scalable processors. But pricing negotiations will be tough. Another scenario is that the super seven will assess their long-term CPU roadmap and take greater control of development. Acquiring their own CPU technology may be a compelling option, given their increasing expertise in designing hardware to meet their specific data center performance and cost requirements. Startups that have developed new AI-centric architectures, such as Graphcore and Knupath, may be targets.
Cloud providers’ PR teams will play an important role in managing the impact of Meltdown and Spectre, especially in maintaining customer confidence. AWS, Microsoft and Google have been quick to reassure customers that they are protected and not vulnerable. They also stressed that no customer had reported being compromised, and indicated that service performance will not be affected, though some downtime is expected as instances need to be rebooted. None will want to be seen as weaker or less capable than the others in dealing with this crisis. In fact, the cloud service providers have created a perception that there is no crisis and it is business as usual. But few details of the measures undertaken so far have been released, though providers do have to be careful not to give away too much information that hackers could exploit. Nevertheless, customers will be confused as to what is safe and what is not. Consequently, they will be assessing their own vulnerabilities and level of exposure, in particular by auditing and tiering the sensitivity and importance of information that resides beyond their control in the cloud, especially on shared infrastructure. They will pay more attention to which CPU is being used and, as part of updated risk management strategies, will increasingly demand those proven not to have the inherent design flaws. The vulnerabilities will also give further credence to adopting hybrid strategies and keeping sensitive information under greater control. This requires more granular data policy control and some data to be brought back onto customers’ own infrastructure. On-premises server and storage sales should benefit.
To date, businesses have not been deterred from migrating to the cloud, despite potential security risks, as well as compliance and regulatory concerns. The promise of instant, on-demand scalable infrastructure to drive innovation has prevailed. But this will change with evolving regulations. Regulators around the world, charged with ensuring data compliance and dealing with breaches, will be paying close attention to these events. They will be considering whether existing guidelines and laws need modifying, and whether new ones are required. Regulators in China, for example, will use this vulnerability to drive further investment in CPUs developed and built in-country by local vendors. One possible scenario is that regulators could start stipulating which CPUs can and cannot be used for processing customer data, as well as whether certain data can be processed and stored on shared or dedicated infrastructure. Governments will also be more concerned about the cloud services their departments and agencies use. Many governments already have frameworks in place, which will be tightened. In Europe, the upcoming GDPR, which comes into effect on 25 May, places the responsibility for protecting customer data on both businesses and cloud service providers. Certainly, regulators will take a negative view if a future data breach is directly linked to a failure to protect against a known vulnerability, or to the continued use of infrastructure with a known vulnerability.
Performance is another key customer concern. Any decrease will result in higher costs if customers are billed by the second and if workloads have to be moved to more expensive virtual machines due to higher CPU use rates. The alternative is for application code to be optimized, which will be disruptive. Nevertheless, customers are being advised to update their runtime environments as well as their applications for each cloud service. All customers will suffer some inconvenience, irrespective of which cloud service provider they use. Expect an increase in demand for higher-performance CPUs, as well as more dedicated cloud infrastructure, as the risk of shared infrastructure becomes too high. If costs rise too much, customers will look to move to a cheaper option or move back to their own infrastructure. Cloud builders will be forced to build new infrastructure or replace existing infrastructure faster than planned to meet these changing dynamics. This will add to their existing capital expenditure challenges. The big three have the resources to help customers that are experiencing performance issues and to absorb rising costs. They have had the last six months to prepare fixes and put in place contingency measures. But for smaller, more local tier-two and tier-three cloud service providers, the situation is unclear. They do not have the same level of resources in engineering, security, customer service and PR to deal with the situation quickly or efficiently enough. Many would not have known about the vulnerabilities until the public disclosure. Some will establish consortiums to work together and share resources. But smaller cloud service providers could lose business if customers lose confidence and levels of performance noticeably decline. This will only strengthen the big three’s position.
CPU vulnerabilities will present significant and long-term threats to cloud service providers. Irrespective of the measures they deploy, much of the security and vulnerability mitigation will depend on third parties. Highly capable state-sponsored hackers will no doubt be turning their attention to developing malware to target the vulnerabilities, especially Spectre. The vulnerabilities have been inherent but undetected for years. New variants of Meltdown and Spectre, as well as other vulnerabilities, are likely to be discovered over the coming years, adding further concerns about using shared infrastructure. Attacks will focus on the rushed patches and software updates. Headlines claim this is the biggest threat to face the cloud industry. The same was said in 2014 about Heartbleed and the use of open source software. Heartbleed was dealt with and has largely been forgotten. Meltdown and Spectre will be in the headlines over the coming weeks, but with many push and pull factors emerging in terms of cloud migration, the overall net effect is that most businesses will not be deterred from their current plans. This will change if a major compromise is found. But the latest threats may force many to think more carefully about what data is processed and stored in the cloud, especially on shared infrastructure. On-premises computing vendors, including Dell, HPE and Lenovo, should begin a major marketing campaign explaining that Spectre and Meltdown represent no risk to customers that have kept their computing in-house.