From: ebiederm@xmission.com (Eric W. Biederman)
To: Andrey Vagin
Cc: Dmitry Safonov, linux-kernel@vger.kernel.org,
    Dmitry Safonov <0x7f454c46@gmail.com>, Adrian Reber, Andy Lutomirski,
    Christian Brauner, Cyrill Gorcunov, "H. Peter Anvin", Ingo Molnar,
    Jeff Dike, Oleg Nesterov, Pavel Emelianov, Shuah Khan, Thomas Gleixner,
    containers@lists.linux-foundation.org, criu@openvz.org,
    linux-api@vger.kernel.org, x86@kernel.org, Alexey Dobriyan,
    linux-kselftest@vger.kernel.org
Subject: Re: [RFC 00/20] ns: Introduce Time Namespace
Date: Wed, 26 Sep 2018 19:36:29 +0200
Message-ID: <87zhw4rwiq.fsf@xmission.com>
In-Reply-To: <20180925014150.GA6302@outlook.office365.com>
    (Andrey Vagin's message of "Tue, 25 Sep 2018 01:42:02 +0000")
References: <20180919205037.9574-1-dima@arista.com>
    <874lej6nny.fsf@xmission.com>
    <20180924205119.GA14833@outlook.office365.com>
    <874leezh8n.fsf@xmission.com>
    <20180925014150.GA6302@outlook.office365.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Andrey Vagin writes:

> On Tue, Sep 25, 2018 at 12:02:32AM +0200, Eric W. Biederman wrote:
>> Andrey Vagin writes:
>>
>> > On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
>> >> Dmitry Safonov writes:
>> >>
>> >> > Discussions around time virtualization are there for a long time.
>> >> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
>> >> > From that time, the topic appears on and off in various discussions.
>> >> >
>> >> > There are two main use cases for time namespaces:
>> >> > 1. change date and time inside a container;
>> >> > 2. adjust clocks for a container restored from a checkpoint.
>> >> >
>> >> > “It seems like this might be one of the last major obstacles keeping
>> >> > migration from being used in production systems, given that not all
>> >> > containers and connections can be migrated as long as a time dependency
>> >> > is capable of messing it up.” (by github.com/dav-ell)
>> >> >
>> >> > The kernel provides access to several clocks: CLOCK_REALTIME,
>> >> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. The last two clocks are monotonic, but
>> >> > the start points for them are not defined and are different for each
>> >> > running system. When a container is migrated from one node to another,
>> >> > all clocks have to be restored into consistent states; in other words,
>> >> > they have to continue running from the same points where they have been
>> >> > dumped.
>> >> >
>> >> > The main idea behind this patch set is adding per-namespace offsets for
>> >> > system clocks. When a process in a non-root time namespace requests
>> >> > time of a clock, a namespace offset is added to the current value of
>> >> > this clock on a host and the sum is returned.
>> >> >
>> >> > All offsets are placed on a separate page, this allows us to map it as
>> >> > part of vvar into user processes and use offsets from vdso calls.
>> >> >
>> >> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
>> >> > clocks.
>> >> >
>> >> > Questions to discuss:
>> >> >
>> >> > * Clone flags exhaustion. Currently there is only one unused clone flag
>> >> > bit left, and it may be worth to use it to extend arguments of the clone
>> >> > system call.
>> >> >
>> >> > * Realtime clock implementation details:
>> >> > Is having a simple offset enough?
>> >> > What to do when date and time is changed on the host?
>> >> > Is there a need to adjust vfs modification and creation times?
>> >> > Implementation for adjtime() syscall.
>> >>
>> >> Overall I support this effort. In my quick skim this code looked good.
>> >
>> > Hi Eric,
>> >
>> > Thank you for the feedback.
>> >
>> >>
>> >> My feeling is that we need to be able to support running ntpd and
>> >> support one namespace doing Google's smoothing of leap seconds while
>> >> another namespace takes the leap second.
>> >>
>> >> What I was imagining when I was last thinking about this was one
>> >> instance of struct timekeeper aka tk_core per time namespace. That
>> >> structure already keeps offsets for all of the various clocks from
>> >> the kernel internal time sources. What would be needed would be to
>> >> pass in an appropriate time namespace pointer.
>> >>
>> >> I could be completely wrong as I have not taken the time to completely
>> >> trace through the code. Have you looked at pushing the time namespace
>> >> down as far as tk_core?
>> >>
>> >> What I think would be the big advantage (besides ntp working) is that
>> >> the bulk of the code could be reused. Allowing testing of the kernel's
>> >> time code by setting up a new time namespace. So a person in production
>> >> could setup a time namespace with the time set ahead a little bit and
>> >> be able to verify that the kernel handles the upcoming leap second
>> >> properly.
>> >>
>> >
>> > It is an interesting idea, but I have a few questions:
>> >
>> > 1. Does it mean that timekeeping_update() will be called for each
>> > namespace? This function is called periodically, it updates times on the
>> > timekeeper structure, updates vsyscall_gtod_data, etc. What will be an
>> > overhead of this?
>>
>> I don't know if periodically is a proper characterization. There may be
>> a code path that does that. But from what I can see timekeeping_update
>> is the guts of settimeofday (and a few related functions).
>>
>> So it appears to make sense for timekeeping_update to be per namespace.
>>
>> Hmm. Looking at what is updated in the vsyscall_gtod_data it does
>> look like you would have to periodically update things, but I don't know
>> how big that period would be.
>> As long as the period is reasonably large,
>> or the time namespaces were sufficiently desynchronized it should not
>> be a problem. But that is the class of problem that could make
>> my ideal impractical if there is measurable overhead.
>>
>> Where were you seeing timekeeping_update being called periodically?
>
> timekeeping_update() is called HZ times per-second:
>
> [ 67.912858] timekeeping_update.cold.26+0x5/0xa
> [ 67.913332] timekeeping_advance+0x361/0x5c0
> [ 67.913857] ? tick_sched_do_timer+0x55/0x70
> [ 67.914409] ? tick_sched_do_timer+0x70/0x70
> [ 67.914947] tick_sched_do_timer+0x55/0x70
> [ 67.915505] tick_sched_timer+0x27/0x70
> [ 67.916042] __hrtimer_run_queues+0x10f/0x440
> [ 67.916639] hrtimer_interrupt+0x100/0x220
> [ 67.917305] smp_apic_timer_interrupt+0x79/0x220
> [ 67.918030] apic_timer_interrupt+0xf/0x20

Interesting.

Reading the code the calling sequence there is:
tick_sched_do_timer
   tick_do_update_jiffies64
      update_wall_time
         timekeeping_advance
            timekeeping_update

If I read that properly under the right nohz circumstances that update
can be delayed indefinitely.

So I think we could prototype a time namespace that was per
timekeeping_update and just had update_wall_time iterate through
all of the time namespaces.

I don't think the naive version would scale to very many time
namespaces.

At the same time using the techniques from the nohz work and a little
smarts I expect we could get the code to scale.

I think this direction is definitely worth exploring.

My experience with namespaces is that if we don't get the advanced
features working there is little to no interest from the core developers
of the code, and the namespaces don't solve additional problems. Which
makes the namespace a hard sell. Especially when it does not solve
problems the developers of the subsystem have.

The advantage of timekeeping_update per time namespace is that it allows
different lengths of seconds per time namespace.
Which allows testing ntp and the kernel in interesting ways while still
having a working production configuration on the same system.

Eric
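
For readers following the thread, the two designs discussed above can be contrasted with a toy userspace model. This is only an illustrative sketch in Python, not kernel code and not taken from the patch set: the names TimeNamespace, Timekeeper, ns_clock_gettime, offset_for_restore and ns_per_tick are all hypothetical. The first half models the RFC's approach (a fixed per-namespace offset added to the host clock); the second half models the naive version of a per-namespace timekeeper, where one tick walks every namespace and each namespace may use a different length of second.

```python
from dataclasses import dataclass

CLOCK_MONOTONIC = "CLOCK_MONOTONIC"
CLOCK_BOOTTIME = "CLOCK_BOOTTIME"


@dataclass
class TimeNamespace:
    """Hypothetical per-namespace clock offsets, in nanoseconds."""
    monotonic_offset_ns: int = 0
    boottime_offset_ns: int = 0


def ns_clock_gettime(ns: TimeNamespace, clock: str, host_value_ns: int) -> int:
    """Offset model: host clock value plus the namespace offset."""
    if clock == CLOCK_MONOTONIC:
        return host_value_ns + ns.monotonic_offset_ns
    if clock == CLOCK_BOOTTIME:
        return host_value_ns + ns.boottime_offset_ns
    raise ValueError(f"no namespace offset modelled for {clock}")


def offset_for_restore(dumped_value_ns: int, host_value_ns: int) -> int:
    """Offset that makes a restored clock continue from its dumped value."""
    return dumped_value_ns - host_value_ns


@dataclass
class Timekeeper:
    """Hypothetical stand-in for one timekeeper instance per namespace."""
    realtime_ns: int
    ns_per_tick: int  # how far this namespace's clock advances per host tick


def update_wall_time(timekeepers: list) -> None:
    """Naive per-tick update: cost grows linearly with namespace count."""
    for tk in timekeepers:
        tk.realtime_ns += tk.ns_per_tick
```

In the offset model, a container dumped at a monotonic value of 5000 ns and restored on a host whose clock reads 1000 ns gets an offset of 4000 ns, and every later read is the host value plus that offset. In the timekeeper model, giving one namespace a slightly smaller ns_per_tick makes its clock run slow relative to the host, which is the "different lengths of seconds" property mentioned above; the loop in update_wall_time is also where the scaling concern comes from.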