Date: Mon, 23 Feb 2015 07:35:53 +0100
From: Vojtech Pavlik
To: Andrew Morton
Cc: Jiri Kosina, Ingo Molnar, Josh Poimboeuf, Peter Zijlstra,
    Seth Jennings, linux-kernel@vger.kernel.org, Linus Torvalds,
    Arjan van de Ven, Thomas Gleixner, Borislav Petkov,
    live-patching@vger.kernel.org
Subject: Re: live kernel upgrades (was: live kernel patching design)
Message-ID: <20150223063552.GA3675@suse.com>
In-Reply-To: <20150222150148.3c566837.akpm@linux-foundation.org>

On Sun, Feb 22, 2015 at 03:01:48PM -0800, Andrew Morton wrote:
> On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina wrote:
>
> > But if you ask the folks who are hungry for live bug patching, they
> > wouldn't care.
> >
> > You mentioned "10 seconds", that's more or less equal to infinity to them.
>
> 10 seconds outage is unacceptable, but we're running our service on a
> single machine with no failover.  Who is doing this??

This is the most common argument that's raised when live patching is
discussed. "Why do we need live patching when we have redundancy?"

People who are asking for live patching typically do have failover in
place, but prefer not to use it when they don't have to.

In many cases, the failover just can't be made transparent to the
outside world and there is a short outage. Examples would be legacy
applications which can't run in an active-active cluster and need to be
restarted on failover. Or trading systems, where the calculations must
be strictly serialized and response times are counted in tens of
microseconds.

Another use case is large HPC clusters, where all nodes have to run
carefully synchronized. Once one node gets behind in a calculation
cycle, the others have to wait for its results and the efficiency of the
whole cluster goes down. There are people who run realtime on them for
that reason. Dumping all data and restarting the HPC cluster takes a lot
of time, and many nodes (out of tens of thousands) may not come back up,
making the restore from media difficult. Doing a rolling upgrade causes
the nodes to stall one by one for 10+ seconds, which, times 10k nodes,
is a long time, too.

And even the case where you have a perfect setup with everything
redundant and with instant failover does benefit from live patching.
Since you have to plan for failure, you have to plan for failure while
patching, too. With live patching you need 2 servers minimum (or N+1),
without it you need 3 (or N+2), as one will be offline during the
upgrade process.

10 seconds of outage may be acceptable in a disaster scenario. Not
necessarily for a regular update scenario.

The value of live patching is in near-zero disruption.

-- 
Vojtech Pavlik
Director SUSE Labs