Tests show how much Meltdown fixes will hit Linux system performance
Now that the initial shock of the Spectre and Meltdown chip vulnerabilities has died down, the focus is very much on getting the problems sorted. As has already been noted, there is concern about the performance impact the fixes will bring.
Intel has been eager to downplay any suggestion of a major slowdown, but the exact performance hit will vary from system to system depending on the tasks being performed. Brendan Gregg -- a Netflix engineer whose work involves large-scale cloud computing performance -- has run tests to gauge the impact the patches will have on Linux systems, concluding that "patches that workaround Meltdown introduce the largest kernel performance regressions I've ever seen."
See also:
- Intel releases updated Spectre and Meltdown patches for Skylake systems
- Intel tells customers to stop installing Meltdown/Spectre patches due to 'unpredictable' reboot issues
- Intel promises transparency as Meltdown patch causes reboot problems with Broadwell and Haswell chips
- Intel releases benchmark results detailing Meltdown patch performance slowdown
Gregg has created a "simple microbenchmark" (for which he has provided the source code) which he used to test five factors he believes are key. He is looking specifically at the Linux kernel page table isolation (KPTI) patches, but points out that there are three other layers of overhead to consider as well (Intel microcode updates, cloud provider hypervisor changes, and retpoline compiler changes).
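To give a sense of what such a test looks like, here is a minimal sketch of a syscall-rate microbenchmark in C. It is not Gregg's published code -- his source is linked from his post -- and the choice of getppid() and the iteration count are purely illustrative; the idea is simply to time a tight loop of cheap syscalls so the same run can be compared with KPTI enabled and disabled.

```c
/*
 * Minimal sketch of a syscall-rate microbenchmark (illustrative only,
 * not Gregg's published code). It times a tight loop of cheap syscalls
 * so the per-call cost can be compared with KPTI on and off.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 10000000UL   /* number of syscalls to issue; arbitrary */

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);

    for (unsigned long i = 0; i < ITERATIONS; i++) {
        /* getppid() is a near-trivial syscall, so the loop cost is
         * dominated by kernel entry/exit -- exactly what KPTI makes
         * more expensive. */
        getppid();
    }

    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec) +
                  (end.tv_nsec - start.tv_nsec) / 1e9;

    printf("%lu syscalls in %.3f s (%.0f ns/call, %.0f calls/sec)\n",
           ITERATIONS, secs,
           secs / ITERATIONS * 1e9,
           ITERATIONS / secs);
    return 0;
}
```

Running the same binary on a patched kernel with and without KPTI (for example, by booting with the kernel's nopti parameter) gives two per-syscall costs to compare.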
He summarizes the five factors involved in KPTI overhead as follows:
- Syscall rate: there are overheads relative to the syscall rate, although high rates are needed for this to be noticeable. At 50k syscalls/sec per CPU the overhead may be two percent, and it climbs as the syscall rate increases. At my employer (Netflix), high rates are unusual in the cloud, with some exceptions (databases).
- Context switches: these add overheads similar to the syscall rate, and I think the context switch rate can simply be added to the syscall rate for the following estimations.
- Page fault rate: adds a little more overhead as well, for high rates.
- Working set size (hot data): more than 10 Mbytes will cost additional overhead due to TLB flushing. This can turn a one percent overhead (syscall cycles alone) into a seven percent overhead (see the sketch after this list). This overhead can be reduced by A) pcid, available in Linux 4.14, and B) huge pages.
- Cache access pattern: the overheads are exacerbated by certain access patterns that switch from caching well to caching a little less well. Worst case, this can add an additional 10 percent overhead, taking (say) the seven percent overhead to 17 percent.
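As a companion to the working set size factor above, here is a rough C sketch (again illustrative rather than Gregg's code) that interleaves a cheap syscall with reads over a buffer of configurable size. Under KPTI without PCID, every kernel entry flushes the TLB, so runs with a working set well beyond 10 Mbytes would be expected to regress more than the syscall loop alone; the buffer sizes and iteration counts here are arbitrary.

```c
/*
 * Sketch of the working-set factor described above: interleave a cheap
 * syscall with reads over a configurable buffer. With KPTI and no PCID,
 * each kernel entry/exit flushes the TLB, so larger working sets should
 * show a bigger regression than the syscall loop alone.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* Working set size in megabytes, e.g. compare runs with 1 and 64. */
    size_t mb = (argc > 1) ? strtoul(argv[1], NULL, 10) : 16;
    size_t len = mb * 1024 * 1024;

    unsigned char *buf = malloc(len);
    if (!buf) {
        perror("malloc");
        return 1;
    }
    memset(buf, 1, len);            /* fault the pages in up front */

    unsigned long sum = 0;
    for (int round = 0; round < 1000; round++) {
        getppid();                  /* kernel entry: TLB flush under KPTI without PCID */
        for (size_t i = 0; i < len; i += 64)
            sum += buf[i];          /* walk the working set, one read per cache line */
    }

    printf("working set %zu MB, checksum %lu\n", mb, sum);
    free(buf);
    return 0;
}
```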
Gregg's findings suggest that Linux systems can expect to see a slowdown of between 0.1 and six percent. For small-scale tasks, this will be fairly insignificant, but the same cannot be said for bigger tasks or large-scale Linux systems. It's not all bad news, though. Intel has stressed that it should be possible to mitigate any performance hit, and Gregg tends to agree. "I'm expecting we'll take that down to less than 2 percent with tuning," he adds.
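One of the mitigations flagged in the list above is PCID support, which Linux 4.14 can use to avoid full TLB flushes under KPTI. A quick way to check whether a processor advertises the feature is to look for the "pcid" flag in /proc/cpuinfo, as in this small helper (an illustrative snippet, not part of Gregg's tooling):

```c
/*
 * Check whether the CPU advertises PCID, which lets Linux 4.14+ avoid
 * full TLB flushes under KPTI. Scans /proc/cpuinfo for the "pcid" flag;
 * illustrative helper only.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) {
        perror("/proc/cpuinfo");
        return 1;
    }

    char line[4096];
    int found = 0;
    while (fgets(line, sizeof(line), f)) {
        /* CPU feature flags appear on lines starting with "flags". */
        if (strncmp(line, "flags", 5) == 0 && strstr(line, " pcid")) {
            found = 1;
            break;
        }
    }
    fclose(f);

    puts(found ? "pcid supported: KPTI can skip full TLB flushes"
               : "pcid not reported: expect higher KPTI overhead");
    return found ? 0 : 1;
}
```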
For a detailed breakdown of Gregg's findings, take a look at his blog post.