From: ebiederm@xmission.com (Eric W. Biederman)
To: Thomas Gleixner
Cc: Andrey Vagin, Dmitry Safonov, "linux-kernel@vger.kernel.org",
    Dmitry Safonov <0x7f454c46@gmail.com>, Adrian Reber, Andy Lutomirski,
    Christian Brauner, Cyrill Gorcunov, "H. Peter Anvin", Ingo Molnar,
    Jeff Dike, Oleg Nesterov, Pavel Emelianov, Shuah Khan,
    "containers@lists.linux-foundation.org", "criu@openvz.org",
    "linux-api@vger.kernel.org", "x86@kernel.org", Alexey Dobriyan,
    "linux-kselftest@vger.kernel.org"
References: <20180919205037.9574-1-dima@arista.com> <874lej6nny.fsf@xmission.com>
    <20180924205119.GA14833@outlook.office365.com> <874leezh8n.fsf@xmission.com>
    <20180925014150.GA6302@outlook.office365.com> <87zhw4rwiq.fsf@xmission.com>
Date: Fri, 28 Sep 2018 19:03:22 +0200
In-Reply-To: (Thomas Gleixner's message of "Thu, 27 Sep 2018 23:30:09 +0200 (CEST)")
Message-ID: <87mus1ftb9.fsf@xmission.com>
Subject: Re: [RFC 00/20] ns: Introduce Time Namespace
X-Mailing-List: linux-kernel@vger.kernel.org

Thomas Gleixner writes:

> On Wed, 26 Sep 2018, Eric W. Biederman wrote:
>> Reading the code the calling sequence there is:
>> tick_sched_do_timer
>>    tick_do_update_jiffies64
>>       update_wall_time
>>          timekeeping_advance
>>             timekeeping_update
>>
>> If I read that properly under the right nohz circumstances that update
>> can be delayed indefinitely.
>>
>> So I think we could prototype a time namespace that was per
>> timekeeping_update and just had update_wall_time iterate through
>> all of the time namespaces.
>
> Please don't go there. timekeeping_update() is already heavy and walking
> through a gazillion of namespaces will just make it horrible,
>
>> I don't think the naive version would scale to very many time
>> namespaces.
>
> :)
>
>> At the same time using the techniques from the nohz work and a little
>> smarts I expect we could get the code to scale.
>
> You'd need to invoke the update when the namespace is switched in and
> hasn't been updated since the last tick happened. That might be doable, but
> you also need to take the wraparound constraints of the underlying
> clocksources into account, which again can cause walking all namespaces
> when they are all idle long enough.

The wraparound constraints being how long it takes before the time
sources wrap around, so that you have to read them at least once per
wraparound? I have not dug deeply enough into the code to see that yet.

> From there it becomes hairy, because it's not only timekeeping,
> i.e. reading time, this is also affecting all timers which are armed from a
> namespace.
>
> That gets really ugly because when you do settimeofday() or adjtimex() for
> a particular namespace, then you have to search for all armed timers of
> that namespace and adjust them.
>
> The original posix timer code had the same issue because it mapped the
> clock realtime timers to the timer wheel so any setting of the clock caused
> a full walk of all armed timers, disarming, adjusting and requeueing
> them. That's horrible not only performance wise, it's also a locking
> nightmare of all sorts.
>
> Add time skew via NTP/PTP into the picture and you might have to adjust
> timers as well, because you need to guarantee that they are not expiring
> early.
>
> I haven't looked through Dmitry's patches yet, but I don't see how this
> can work at all without introducing subtle issues all over the place.

Then it sounds like this will take some more digging.

Please pardon me for thinking out loud.

There are one or more time sources that we use to compute the time,
and for each time source we have a conversion from ticks of the time
source to nanoseconds.
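
For illustration only, a tick-to-nanosecond conversion of this kind is
commonly written as a fixed-point multiply and shift. The sketch below
assumes that scheme; the struct and function names are made up for this
mail and are not taken from the kernel sources. The 64-bit product can
overflow if too many ticks pile up between reads, so the tick delta
passed in has to stay small.

```c
#include <stdint.h>

/* Hypothetical per-source scaling factors: ns = (ticks * mult) >> shift. */
struct tick_conv {
    uint32_t mult;   /* fixed-point multiplier */
    uint32_t shift;  /* number of fractional bits in mult */
};

/*
 * Convert a tick delta to nanoseconds.
 * The caller must keep 'ticks' small enough that ticks * mult
 * does not overflow 64 bits.
 */
static uint64_t ticks_to_ns(const struct tick_conv *c, uint64_t ticks)
{
    return (ticks * c->mult) >> c->shift;
}
```

For a 1 MHz source (1000 ns per tick) one possible choice is shift = 10
and mult = 1000 << 10.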
Each time source needs to be sampled at least once per wrap-around and
something incremented so that we don't lose time when looking at that
time source.

There are several clocks presented to userspace and they all share the
same length of second and are all fundamentally offsets from
CLOCK_MONOTONIC.

I see two fundamental driving cases for a time namespace.

1) Migration from one node to another node in a cluster in almost
   real time.

   The problem is that CLOCK_MONOTONIC on different nodes in the cluster
   has no relationship to each other (except a synchronized length of the
   second). So applications that migrate can see CLOCK_MONOTONIC and
   CLOCK_BOOTTIME go backwards.

   This is the truly pressing problem and adding some kind of offset
   sounds like it would be the solution. Possibly by allowing a boot
   time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.

2) Dealing with two separate time management domains. Say a machine
   that needs to deal with both something inside of google, where they
   slew time to avoid leap seconds, and something in the outside world,
   where proper UTC time is kept as an offset from TAI with the
   occasional leap seconds.

   In the latter case it would fundamentally require having seconds of
   different length.

A pure 64bit nanosecond counter is good for over 500 years. So 64bit
variables can be used to hold time, and everything can be converted
from there.

This suggests we can for ticks have two values.
- The number of ticks from the time source.
- The number of times the ticks would have rolled over.

That sounds like it may be a little simplistic as it would require
being very diligent about firing a timer exactly at rollover and not
losing that, but for a handwaving argument it is probably enough to
generate a 64bit tick counter.

If the focus is on a 64bit tick counter then what update_wall_time
has to do is very limited. Just deal with the accounting needed to
cope with tick rollover.
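
To make the two-value idea concrete, here is a minimal sketch. It
assumes a hypothetical 32-bit free-running counter that is sampled at
least once per wrap-around; all names (tick_ext, tick_ext_read,
ns_counter_years) are invented for illustration and nothing here is
existing kernel code. The last helper just checks the arithmetic behind
the lifetime of a 64bit nanosecond counter.

```c
#include <stdint.h>

/* Hypothetical state for one wrapping 32-bit hardware counter. */
struct tick_ext {
    uint32_t last_raw;   /* last raw counter value that was sampled */
    uint32_t rollovers;  /* number of wrap-arounds observed so far */
};

/*
 * Fold a raw 32-bit sample into a 64-bit tick count.
 * Only correct if called at least once per wrap-around period;
 * a missed wrap-around silently loses 2^32 ticks.
 */
static uint64_t tick_ext_read(struct tick_ext *te, uint32_t raw)
{
    if (raw < te->last_raw)   /* counter went backwards: it wrapped */
        te->rollovers++;
    te->last_raw = raw;
    return ((uint64_t)te->rollovers << 32) | raw;
}

/* Whole (365-day) years an unsigned 64-bit nanosecond counter can hold. */
static uint64_t ns_counter_years(void)
{
    const uint64_t ns_per_year = 365ull * 24 * 60 * 60 * 1000000000ull;
    return UINT64_MAX / ns_per_year;
}
```

With these definitions ns_counter_years() evaluates to 584, and a
signed 64-bit value would hold half of that range.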
Getting the actual time looks like it would be as simple as now, with
perhaps an extra addition to account for the number of times the tick
counter has rolled over. With limited precision arithmetic and various
optimizations I don't think it is that simple to implement, but it
feels like it should be very little extra work.

For timers my inclination would be to assume no adjustments to the
current time parameters and set the timer to go off then. If the time
on the appropriate clock has been changed since the timer was set and
the timer is going off early, reschedule so the timer fires at the
appropriate time.

With the above I think it is theoretically possible to build a time
namespace that supports multiple lengths of second, and does not have
much overhead.

Not that I think a final implementation would necessarily look like
what I have described. I just think it is possible with extreme care
to evolve the current code base into something that can efficiently
handle multiple time domains with slightly different lengths of second.

Thomas, does it sound like I am completely out of touch with reality?

It does though sound like it is going to take some serious digging
through the code to understand what everything does and how and why
everything works the way it does. Not something grafted on top with
just a cursory understanding of how the code works.

Eric