From: ebiederm@xmission.com (Eric W. Biederman)
To: Dmitry Safonov
Cc: linux-kernel@vger.kernel.org, Dmitry Safonov <0x7f454c46@gmail.com>,
    Adrian Reber, Andrei Vagin, Andy Lutomirski, Christian Brauner,
    Cyrill Gorcunov, "H. Peter Anvin", Ingo Molnar, Jeff Dike,
    Oleg Nesterov, Pavel Emelyanov, Shuah Khan, Thomas Gleixner,
    containers@lists.linux-foundation.org, criu@openvz.org,
    linux-api@vger.kernel.org, x86@kernel.org, Alexey Dobriyan,
    linux-kselftest@vger.kernel.org
References: <20180919205037.9574-1-dima@arista.com>
Date: Fri, 21 Sep 2018 14:27:29 +0200
In-Reply-To: <20180919205037.9574-1-dima@arista.com> (Dmitry Safonov's message of "Wed, 19 Sep 2018 21:50:17 +0100")
Message-ID: <874lej6nny.fsf@xmission.com>
Subject: Re: [RFC 00/20] ns: Introduce Time Namespace

Dmitry Safonov writes:

> Discussions around time virtualization are there for a long time.
> The first attempt to implement time namespace was in 2006 by Jeff Dike.
> From that time, the topic appears on and off in various discussions.
>
> There are two main use cases for time namespaces:
> 1. change date and time inside a container;
> 2. adjust clocks for a container restored from a checkpoint.
>
> “It seems like this might be one of the last major obstacles keeping
> migration from being used in production systems, given that not all
> containers and connections can be migrated as long as a time dependency
> is capable of messing it up.” (by github.com/dav-ell)
>
> The kernel provides access to several clocks: CLOCK_REALTIME,
> CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonic, but the
> start points for them are not defined and are different for each running
> system. When a container is migrated from one node to another, all
> clocks have to be restored into consistent states; in other words, they
> have to continue running from the same points where they have been
> dumped.
>
> The main idea behind this patch set is adding per-namespace offsets for
> system clocks. When a process in a non-root time namespace requests
> time of a clock, a namespace offset is added to the current value of
> this clock on a host and the sum is returned.
>
> All offsets are placed on a separate page, this allows us to map it as
> part of vvar into user processes and use offsets from vdso calls.
>
> Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> clocks.
>
> Questions to discuss:
>
> * Clone flags exhaustion. Currently there is only one unused clone flag
> bit left, and it may be worth to use it to extend arguments of the clone
> system call.
>
> * Realtime clock implementation details:
> Is having a simple offset enough?
> What to do when date and time is changed on the host?
> Is there a need to adjust vfs modification and creation times?
> Implementation for adjtime() syscall.

Overall I support this effort. In my quick skim this code looked good.
My feeling is that we need to be able to support running ntpd, and to
support one namespace doing Google's smoothing of leap seconds while
another namespace takes the leap second.
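
To make the offset scheme in the quoted cover letter concrete, here is a
minimal userspace sketch. It is not code from the patch set. The names
timens_offsets_sketch, timespec_add_offset and ns_clock_gettime_sketch
are made up for illustration, and it assumes offsets are stored as
normalized timespecs (tv_nsec in [0, 1e9), with negative offsets carried
in tv_sec). Only CLOCK_MONOTONIC and CLOCK_BOOTTIME are shifted, matching
what the cover letter says is implemented so far:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for CLOCK_BOOTTIME on glibc */
#endif
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical per-namespace offsets; the layout is an assumption. */
struct timens_offsets_sketch {
	struct timespec monotonic;
	struct timespec boottime;
};

/*
 * Add a namespace offset to a host clock reading. Both inputs must be
 * normalized, so the nanosecond sum can overflow by at most one second.
 */
static struct timespec timespec_add_offset(struct timespec host,
					   struct timespec off)
{
	struct timespec out;

	assert(host.tv_nsec >= 0 && host.tv_nsec < NSEC_PER_SEC);
	assert(off.tv_nsec >= 0 && off.tv_nsec < NSEC_PER_SEC);

	out.tv_sec = host.tv_sec + off.tv_sec;
	out.tv_nsec = host.tv_nsec + off.tv_nsec;
	if (out.tv_nsec >= NSEC_PER_SEC) {
		out.tv_nsec -= NSEC_PER_SEC;
		out.tv_sec++;
	}
	return out;
}

/* Read a host clock and return it as seen from the "namespace". */
static int ns_clock_gettime_sketch(clockid_t id,
				   const struct timens_offsets_sketch *ns,
				   struct timespec *out)
{
	struct timespec host;

	if (clock_gettime(id, &host) != 0)
		return -1;

	switch (id) {
	case CLOCK_MONOTONIC:
		*out = timespec_add_offset(host, ns->monotonic);
		break;
	case CLOCK_BOOTTIME:
		*out = timespec_add_offset(host, ns->boottime);
		break;
	default:
		/* e.g. CLOCK_REALTIME: no offset applied in this sketch */
		*out = host;
		break;
	}
	return 0;
}
```

A restored container would fill in the two offsets once (dumped clock
value minus the new host's clock value) and every later read goes
through ns_clock_gettime_sketch().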

What I was imagining when I was last thinking about this was one
instance of struct timekeeper aka tk_core per time namespace. That
structure already keeps offsets for all of the various clocks from the
kernel internal time sources. What would be needed would be to pass in
an appropriate time namespace pointer.

I could be completely wrong as I have not taken the time to completely
trace through the code. Have you looked at pushing the time namespace
down as far as tk_core?

What I think would be the big advantage (besides ntp working) is that
the bulk of the code could be reused, allowing testing of the kernel's
time code by setting up a new time namespace. So a person in production
could set up a time namespace with the time set ahead a little bit and
be able to verify that the kernel handles the upcoming leap second
properly.

I don't know about the vfs. I think the danger is being able to write
dates in the future or in the past. It appears that utimes(2) and
utimensat(2) already allow this except for status change. So it is
possible we simply don't care. I seem to remember that what nfs does is
take the time stamp from the host writing to the file.

I think the guide for filesystem timestamps should be to first ensure we
don't introduce security issues, and then do what distributed
filesystems do when dealing with hosts with different clocks.

Given those two guidelines above I don't think there is a need to change
timestamps the way the user namespace changes uid when displayed.

As for the hardware like the real time clock we definitely should not
let a root in a time namespace change it. We might even be able to get
away with leaving the real time clock out of the time namespace. If not
we need to be very careful how the real time clock is abstracted. I
would start by leaving the real time clock hardware out of the time
namespace and see if there is any part of userspace that cares.
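
As a quick illustration of the point about utimes(2) and utimensat(2):
the owner of a file can already set its access and modification times to
an arbitrary value, while the status change time is stamped by the
kernel. A minimal sketch using futimens(3), the fd-based sibling of
utimensat(2). The helper name set_and_read_times and the temp file
template are made up for illustration:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Create a scratch file, set its atime and mtime to `when`, and read
 * the resulting inode times back into *st. Returns 0 on success.
 */
static int set_and_read_times(time_t when, struct stat *st)
{
	char path[] = "/tmp/timens-sketch-XXXXXX";
	struct timespec ts[2] = {
		{ .tv_sec = when, .tv_nsec = 0 },	/* atime */
		{ .tv_sec = when, .tv_nsec = 0 },	/* mtime */
	};
	int fd, ret;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);	/* the open fd keeps the inode alive */

	ret = futimens(fd, ts);
	if (ret == 0)
		ret = fstat(fd, st);

	close(fd);
	return ret;
}
```

Calling it with a date in the past or in the future should leave
st_mtime and st_atime equal to that date while st_ctime stays at the
current time, assuming the filesystem can represent the timestamp.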

Eric

> Cc: Dmitry Safonov <0x7f454c46@gmail.com>
> Cc: Adrian Reber
> Cc: Andrei Vagin
> Cc: Andy Lutomirski
> Cc: Christian Brauner
> Cc: Cyrill Gorcunov
> Cc: "Eric W. Biederman"
> Cc: "H. Peter Anvin"
> Cc: Ingo Molnar
> Cc: Jeff Dike
> Cc: Oleg Nesterov
> Cc: Pavel Emelyanov
> Cc: Shuah Khan
> Cc: Thomas Gleixner
> Cc: containers@lists.linux-foundation.org
> Cc: criu@openvz.org
> Cc: linux-api@vger.kernel.org
> Cc: x86@kernel.org
>
> Andrei Vagin (12):
>   ns: Introduce Time Namespace
>   timens: Add timens_offsets
>   timens: Introduce CLOCK_MONOTONIC offsets
>   timens: Introduce CLOCK_BOOTTIME offset
>   timerfd/timens: Take into account ns clock offsets
>   kernel: Take into account timens clock offsets in clock_nanosleep
>   x86/vdso/timens: Add offsets page in vvar
>   x86/vdso: Use set_normalized_timespec() to avoid 32 bit overflow
>   posix-timers/timens: Take into account clock offsets
>   selftest/timens: Add test for timerfd
>   selftest/timens: Add test for clock_nanosleep
>   timens/selftest: Add timer offsets test
>
> Dmitry Safonov (8):
>   timens: Shift /proc/uptime
>   x86/vdso: Restrict splitting vvar vma
>   x86/vdso: Purge timens page on setns()/unshare()/clone()
>   x86/vdso: Look for vvar vma to purge timens page
>   timens: Add align for timens_offsets
>   timens: Optimize zero-offsets
>   selftest: Add Time Namespace test for supported clocks
>   timens/selftest: Add procfs selftest
>
>  arch/Kconfig                                     |   5 +
>  arch/x86/Kconfig                                 |   1 +
>  arch/x86/entry/vdso/vclock_gettime.c             |  52 +++++
>  arch/x86/entry/vdso/vdso-layout.lds.S            |   9 +-
>  arch/x86/entry/vdso/vdso2c.c                     |   3 +
>  arch/x86/entry/vdso/vma.c                        |  67 +++++++
>  arch/x86/include/asm/vdso.h                      |   2 +
>  fs/proc/namespaces.c                             |   3 +
>  fs/proc/uptime.c                                 |   3 +
>  fs/timerfd.c                                     |  16 +-
>  include/linux/nsproxy.h                          |   1 +
>  include/linux/proc_ns.h                          |   1 +
>  include/linux/time_namespace.h                   |  72 +++++++
>  include/linux/timens_offsets.h                   |  25 +++
>  include/linux/user_namespace.h                   |   1 +
>  include/uapi/linux/sched.h                       |   1 +
>  init/Kconfig                                     |   8 +
>  kernel/Makefile                                  |   1 +
>  kernel/fork.c                                    |   3 +-
>  kernel/nsproxy.c                                 |  19 +-
>  kernel/time/hrtimer.c                            |   8 +
>  kernel/time/posix-timers.c                       |  89 ++++++++-
>  kernel/time/posix-timers.h                       |   2 +
>  kernel/time_namespace.c                          | 230 +++++++++++++++++++++++
>  tools/testing/selftests/timens/.gitignore        |   5 +
>  tools/testing/selftests/timens/Makefile          |   6 +
>  tools/testing/selftests/timens/clock_nanosleep.c |  98 ++++++++++
>  tools/testing/selftests/timens/config            |   1 +
>  tools/testing/selftests/timens/log.h             |  21 +++
>  tools/testing/selftests/timens/procfs.c          | 145 ++++++++++++++
>  tools/testing/selftests/timens/timens.c          | 196 +++++++++++++++++++
>  tools/testing/selftests/timens/timer.c           |  95 ++++++++++
>  tools/testing/selftests/timens/timerfd.c         |  96 ++++++++++
>  33 files changed, 1272 insertions(+), 13 deletions(-)
>  create mode 100644 include/linux/time_namespace.h
>  create mode 100644 include/linux/timens_offsets.h
>  create mode 100644 kernel/time_namespace.c
>  create mode 100644 tools/testing/selftests/timens/.gitignore
>  create mode 100644 tools/testing/selftests/timens/Makefile
>  create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c
>  create mode 100644 tools/testing/selftests/timens/config
>  create mode 100644 tools/testing/selftests/timens/log.h
>  create mode 100644 tools/testing/selftests/timens/procfs.c
>  create mode 100644 tools/testing/selftests/timens/timens.c
>  create mode 100644 tools/testing/selftests/timens/timer.c
>  create mode 100644 tools/testing/selftests/timens/timerfd.c